
fix: Device casting issues with certain aten operators #1416

Merged: 2 commits into pytorch:master on Nov 14, 2022

Conversation

@gs-olive (Collaborator) commented on Oct 25, 2022:

Description

  • Investigated an issue arising with the BART-base model (https://huggingface.co/facebook/bart-base) where certain tensor inputs to TensorRT were on the CPU, despite users explicitly casting all inputs properly
  • Traced the issue to internally-generated 0D tensors, mask tensors, and operations returning CPU tensors passed between Torch and Torch-TensorRT engines
  • Added lowering passes to ensure these edge cases are handled appropriately, and added a validation check in the runtime to avoid models crashing due to device mismatches
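For context, a minimal libtorch illustration (not code from this PR) of the failure mode the lowering passes address: factory calls like aten::full default to the CPU unless a device is pinned, so a Torch block sandwiched between TensorRT engines can hand an engine a CPU tensor. Shapes and values below are placeholders.

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      // Factory ops default to CPU when no device is specified; inside a
      // partitioned graph this yields a CPU tensor flowing into a TensorRT
      // engine that expects CUDA inputs.
      auto cpu_tensor = torch::full({2, 2}, 1.0);
      std::cout << cpu_tensor.device() << std::endl;  // prints: cpu

      if (torch::cuda::is_available()) {
        // The lowering passes effectively pin such tensors to the target
        // device, analogous to:
        auto cuda_tensor =
            torch::full({2, 2}, 1.0, torch::TensorOptions().device(torch::kCUDA));
        std::cout << cuda_tensor.device() << std::endl;  // prints: cuda:0
      }
      return 0;
    }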

Type of change

  • Bug fix (non-breaking change which fixes an issue)
    • Fixed device location errors relating to specific aten functions and casting
  • New feature (non-breaking change which adds functionality)
    • Added runtime check to ensure all tensor inputs to TorchTRT are on GPU

Checklist:

  • [x] My code follows the style guidelines of this project (you can use the linters)
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas and hacks
  • [x] I have made corresponding changes to the documentation
  • [x] I have added tests to verify my fix or my feature
  • [x] New and existing unit tests pass locally with my changes
  • [x] I have added the relevant labels to my PR so that the relevant reviewers are notified

@github-actions github-actions bot added component: core Issues re: The core compiler component: lowering Issues re: The lowering / preprocessing passes component: runtime labels Oct 25, 2022
@gs-olive gs-olive added the release: v1.3 Tagged to be included in v1.3 label Oct 25, 2022
%false: bool = prim::Constant[value=0]()
%mask_cuda: Tensor = aten::to(%mask, %device, %dtype, %false, %false)
%self_cuda: Tensor = aten::to(%self, %device, %dtype, %false, %false)
%out: Tensor = aten::masked_fill_(%self_cuda, %mask_cuda, %value)
gs-olive (Collaborator, Author) commented:

This could be replaced by aten::masked_fill, for which a converter exists. However, a few bugs arise in edge cases, for example when %value is a float but %self_cuda is an int, and similar scenarios that are not directly handled by the converter and can cause errors in TRT. As a result, I opted for the unimplemented aten::masked_fill_ version with cast tensors in the meantime.
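For readers unfamiliar with the mechanism: lowering passes of this kind are typically registered through torch::jit::SubgraphRewriter. A minimal sketch, reusing the pattern strings shown in this diff (the function name is hypothetical, and the exact pattern text in the PR may differ):

    #include <torch/csrc/jit/ir/ir.h>
    #include <torch/csrc/jit/passes/subgraph_rewrite.h>

    // Hypothetical pass name; rewrites aten::masked_fill_ so that both the
    // self and mask tensors are first moved to CUDA.
    void UnpackAndCastMaskedFillSketch(std::shared_ptr<torch::jit::Graph>& graph) {
      std::string pattern = R"IR(
        graph(%self, %mask, %value):
          %out: Tensor = aten::masked_fill_(%self, %mask, %value)
          return (%out))IR";

      std::string replacement = R"IR(
        graph(%self, %mask, %value):
          %device: Device = prim::Constant[value="cuda"]()
          %dtype: NoneType = prim::Constant()
          %false: bool = prim::Constant[value=0]()
          %mask_cuda: Tensor = aten::to(%mask, %device, %dtype, %false, %false)
          %self_cuda: Tensor = aten::to(%self, %device, %dtype, %false, %false)
          %out: Tensor = aten::masked_fill_(%self_cuda, %mask_cuda, %value)
          return (%out))IR";

      torch::jit::SubgraphRewriter rewriter;
      rewriter.RegisterRewritePattern(pattern, replacement);
      rewriter.runOnGraph(graph);
    }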

@github-actions github-actions bot added the component: tests Issues re: Tests label Oct 26, 2022
@gs-olive gs-olive marked this pull request as ready for review October 26, 2022 18:41
// should be cast to CUDA to avoid device mismatch errors
std::string unpacked_pattern = R"IR(
graph(%self, %mask, %value):
%device: Device = prim::Constant[value="cuda"]()
A collaborator commented:

What happens in the case of multi-GPU systems?

A collaborator replied:

Potentially could add an argument to take the target device.

A collaborator replied:

Take a look at snprintf, and modify core::CompileSpec::lower_info to add a device field which replicates the device info from the external API; then you should be able to determine the target device at lowering time.
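A sketch of what that suggestion might look like; the struct and helper names below are assumptions for illustration, not the actual Torch-TensorRT definitions:

    #include <string>

    // Mirror the external API's device selection into the lowering config so
    // IR patterns can target the user-specified GPU instead of a hard-coded
    // "cuda". Names here are illustrative only.
    struct Device {
      int gpu_id = 0;
    };

    struct LowerInfo {
      Device target_device;
    };

    // Render the device as the string to splice into the IR pattern.
    std::string device_name(const LowerInfo& info) {
      return "cuda:" + std::to_string(info.target_device.gpu_id);
    }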

%false: bool = prim::Constant[value=0]()
%mask_cuda: Tensor = aten::to(%mask, %device, %dtype, %false, %false)
%self_cuda: Tensor = aten::to(%self, %device, %dtype, %false, %false)
%out: Tensor = aten::masked_fill_(%self_cuda, %mask_cuda, %value)
A collaborator commented:

Right now this handles the in-place case. Can we handle the functional case here too?

nvinfer1::DataType promote_types(nvinfer1::DataType type_a, nvinfer1::DataType type_b) {

gs-olive (Author) replied:

Thanks for the suggestion - this actually turns out to be a bug in the aten::masked_fill converter, as it behaves differently than Torch when the types of the input and value are different. Specifically, the converter throws an error whereas Torch just inherits the type of the first argument. I will make a new PR + Test Cases for this, as it is an unrelated bug.
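As an aside, one plausible way to implement the promote_types helper shown in this hunk is to map TensorRT dtypes into ATen, reuse c10::promoteTypes, and map back. The conversion helpers below are simplified assumptions, not the actual Torch-TensorRT utilities:

    #include <ATen/ATen.h>
    #include <NvInfer.h>

    // Simplified nvinfer1 <-> ATen dtype mapping (illustrative; real code
    // should cover every supported dtype and fail loudly on unknown ones).
    at::ScalarType to_aten(nvinfer1::DataType t) {
      switch (t) {
        case nvinfer1::DataType::kFLOAT: return at::kFloat;
        case nvinfer1::DataType::kHALF:  return at::kHalf;
        case nvinfer1::DataType::kINT32: return at::kInt;
        case nvinfer1::DataType::kINT8:  return at::kChar;
        default:                         return at::kBool;
      }
    }

    nvinfer1::DataType to_trt(at::ScalarType t) {
      switch (t) {
        case at::kFloat: return nvinfer1::DataType::kFLOAT;
        case at::kHalf:  return nvinfer1::DataType::kHALF;
        case at::kInt:   return nvinfer1::DataType::kINT32;
        case at::kChar:  return nvinfer1::DataType::kINT8;
        default:         return nvinfer1::DataType::kBOOL;
      }
    }

    // Delegate promotion rules to ATen so TRT behavior matches Torch.
    nvinfer1::DataType promote_types(nvinfer1::DataType a, nvinfer1::DataType b) {
      return to_trt(c10::promoteTypes(to_aten(a), to_aten(b)));
    }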

gs-olive (Author) followed up:

Fixed in PR #1430.

std::string num_to_tensor_clean_pattern = R"IR(
graph(%1: Scalar):
%2: Tensor = prim::NumToTensor(%1)
%device: Device = prim::Constant[value="cuda"]()
A collaborator commented:

See the comment above about using the correct device.

// to avoid device mismatch issues
std::string full_clean_pattern = R"IR(
graph(%1, %2, %3, %4, %5, %6):
%cuda: Device = prim::Constant[value="cuda"]()
A collaborator commented:

""

Resolved review thread (outdated) on core/runtime/execute_engine.cpp
@github-actions github-actions bot added the component: api [C++] Issues re: C++ API label Nov 1, 2022
Resolved review thread (outdated) on cpp/src/compile_spec.cpp
Comment on lines +51 to +62
std::string clean_pattern_part_1 = R"IR(
graph(%1: Scalar):
%2: Tensor = prim::NumToTensor(%1)
%device: Device = prim::Constant[value=")IR";

std::string clean_pattern_part_2 = R"IR("]()
%dtype: NoneType = prim::Constant()
%false: bool = prim::Constant[value=0]()
%3: Tensor = aten::to(%2, %device, %dtype, %false, %false)
return (%3))IR";

auto num_to_tensor_clean_pattern = clean_pattern_part_1 + target_device_name + clean_pattern_part_2;
gs-olive (Author) commented:

Had to use this approach instead of snprintf, because snprintf interprets the % symbols in the IR as format specifiers, which made it difficult to insert the device string.
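A small self-contained demonstration of the pitfall (assumed variable names, not the PR's code): the IR text is full of % tokens such as %device and %1, so snprintf would consume the leading %d of %device as an int conversion specifier, while plain concatenation leaves the tokens untouched:

    #include <iostream>
    #include <string>

    int main() {
      // Splitting the raw IR around the insertion point avoids printf-style
      // formatting entirely; the % tokens in the IR pass through verbatim.
      std::string part1 = R"IR(%device: Device = prim::Constant[value=")IR";
      std::string part2 = R"IR("]())IR";
      std::string target_device_name = "cuda:0";

      std::cout << part1 + target_device_name + part2 << std::endl;
      // prints: %device: Device = prim::Constant[value="cuda:0"]()
      return 0;
    }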

@github-actions github-actions bot added the component: conversion Issues re: Conversion stage label Nov 1, 2022
Resolved review thread (outdated) on core/util/trt_util.h

for (auto& in : inputs) {
in = in.to(torch::Device(target_device));
}
} else {
A collaborator commented:

This doesn't need to be an else; it could just be a second check.

gs-olive (Author) replied:

Updated the else block to just assign the CUDA target device name; the runtime device check is now applied as a second check.
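A minimal sketch of the final structure described above (assumed shape, not the exact runtime code): resolve the target device first, then apply the device check to every input unconditionally.

    #include <torch/torch.h>
    #include <iostream>
    #include <vector>

    // Second-check style: always verify each input's device against the
    // resolved target, moving (with a warning) any tensor that mismatches.
    void ensure_inputs_on_device(std::vector<torch::Tensor>& inputs,
                                 const torch::Device& target_device) {
      for (auto& in : inputs) {
        if (in.device() != target_device) {
          std::cerr << "Input tensor found on " << in.device()
                    << "; moving to " << target_device << std::endl;
          in = in.to(target_device);
        }
      }
    }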

@narendasan (Collaborator):

fixes: #1446

- Investigated issue arising with BART-base model (https://huggingface.co/facebook/bart-base) where certain tensor inputs to TensorRT were on the CPU, despite users explicitly casting all inputs properly
- Traced issue to internally-generated 0D tensors, mask tensors, and operations returning CPU tensors passed between Torch and Torch-TensorRT engines
- Added lowering passes to ensure function edge cases are appropriately handled and tensors are located on the proper device at runtime, and added a validation check in the runtime to avoid models crashing due to device mismatches
- Added testing for lowering passes to ensure output values are accurate
…evice

- Add field to LowerInfo to hold device information
- Update internal Device struct location to allow streamlined imports
- Update BUILD files
- Build strings in lowering phase using user-specified target device
- Update CMakeLists to reflect IR dependency in lowering
- Update runtime device location code to run regardless of whether a switch is required or not.
@narendasan narendasan merged commit 3d84b43 into pytorch:master Nov 14, 2022
Labels

cla signed, component: api [C++], component: conversion, component: core, component: lowering, component: runtime, component: tests, release: v1.3
Development

Successfully merging this pull request may close these issues.

🐛 [Bug] 0-D Tensor input for TensorRT subgraph is usually on cpu, not cuda