
fix: CUDA error 710 bugfix #1424

Merged · 1 commit · Nov 9, 2022

Conversation

@gs-olive (Collaborator) commented Oct 28, 2022

Description

Resolves a CUDA error 710 issue arising when compiling BERT models with 3+ inputs. The issue stems from the role of the third tensor in inference computations. Specifically, as specified in the BERT model code linked here, the third argument, token_type_ids, is of type torch.LongTensor but may only take indices in $\{0, 1\}$. Any value outside this set makes the input invalid.

This becomes problematic when the inputs are, for example, indices into a dictionary or embedding - which seems to be the case here. Specifically, aten::embedding is applied to tensors derived from token_type_ids. The issue traces to one line in the shape_analysis code, previewed below, which initializes a random tensor with integer values in $\{0, \dots, 4\}$.

// shape_analysis.cpp [Line 23, Commit 5f3a5a3]
auto in = at::randint(5, shape, {at::kCUDA}).to(type);

This tensor is run through the module's forward function to determine the shapes of its outputs, causing the compile-time error featured here in the shape analysis code.
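The failure mode can be sketched in plain Python (a hypothetical 2-row table standing in for BERT's token type embedding; in the real model the out-of-range lookup surfaces as CUDA error 710 rather than an IndexError):

```python
# Hypothetical stand-in for aten::embedding: token_type_ids may only index
# a 2-row token-type embedding table, i.e. valid ids are exactly {0, 1}.
embedding_table = [[0.1, 0.2], [0.3, 0.4]]  # num_embeddings = 2

def lookup(ids):
    """Gather rows of the table by id, like an embedding lookup."""
    return [embedding_table[i] for i in ids]

lookup([0, 1, 1, 0])  # valid ids: works

# at::randint(5, ...) can produce ids in {0, ..., 4}; ids >= 2 are invalid
try:
    lookup([0, 1, 3, 2])
except IndexError:
    print("out-of-range id, analogous to CUDA error 710")
```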

I have added a temporary fix by narrowing the range of values generated for random input tensors to $[0, 2)$ instead of $[0, 5)$, and am working on a more robust fix.

Fixes #1418

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • [x] My code follows the style guidelines of this project (You can use the linters)
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas and hacks
  • [x] I have made corresponding changes to the documentation
  • [x] I have added tests to verify my fix or my feature
  • [x] New and existing unit tests pass locally with my changes
  • [x] I have added the relevant labels to my PR so that the relevant reviewers are notified

@narendasan (Collaborator) commented:

@bowang007 Make sure to review this

@narendasan (Collaborator) commented:

From my perspective, I see nothing wrong with sampling between $[0,1)$.

@gs-olive gs-olive added the release: v1.3 Tagged to be included in v1.3 label Nov 1, 2022
@gs-olive gs-olive self-assigned this Nov 1, 2022
- Issue arose when compiling BERT models with 3+ inputs
- Added a temporary fix by narrowing the range of values generated for random input tensors to [0, 2) instead of [0, 5)
- Used random float inputs in the range [0, 2) instead of ints, then cast to the desired type. With regard to bug pytorch#1418, the ultimate effect of this change is that random floats are selected in [0, 2) and then cast to Int, effectively restricting the allowed integers to {0, 1}, as required by the model
- More robust fix to follow

// Make the value range for input tensor a uniform (float) distribution
// over [LoValIncl, HiValExcl), then cast to the desired dtype
auto in = ((HiValExcl - LoValIncl) * at::rand(shape, {at::kCUDA}) + LoValIncl).to(type);
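The float-then-cast sampling above can be sketched in plain Python (LO_VAL_INCL/HI_VAL_EXCL fixed at the current defaults of 0 and 2; `random.random()` stands in for at::rand):

```python
import random

LO_VAL_INCL = 0.0  # lower bound, inclusive (current default)
HI_VAL_EXCL = 2.0  # upper bound, exclusive (current default)

def sample_input(n):
    """Draw n uniform floats in [LO_VAL_INCL, HI_VAL_EXCL), then cast to int.

    Truncating floats drawn from [0, 2) yields only integers from the set
    {0, 1}, matching what token_type_ids requires.
    """
    floats = [(HI_VAL_EXCL - LO_VAL_INCL) * random.random() + LO_VAL_INCL
              for _ in range(n)]
    return [int(f) for f in floats]

samples = sample_input(10_000)
assert set(samples) <= {0, 1}
```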
@gs-olive (Collaborator, Author) commented Nov 3, 2022:

Used float inputs in the range $[LoValIncl, HiValExcl)$, then cast to the desired type to avoid divide-by-zero errors that could arise from selecting only integer random values (even for float tensors). Currently, $LoValIncl = 0$ and $HiValExcl = 2$, but this will be made optionally user-customizable in a later PR, as discussed in RFC #1425.
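A minimal plain-Python sketch of the divide-by-zero motivation (seeded for reproducibility; the division stands in for any op in the model that divides by its input):

```python
import random

random.seed(0)

# Integer-valued random inputs: exact zeros are common, so dividing by the
# input inside the model can raise or produce non-finite values.
int_valued = [float(random.randint(0, 1)) for _ in range(100)]
assert 0.0 in int_valued

# Uniform floats in [0, 2): drawing exactly 0.0 is vanishingly unlikely,
# so the same division is safe.
float_valued = [2.0 * random.random() for _ in range(100)]
assert all(x > 0.0 for x in float_valued)
reciprocals = [1.0 / x for x in float_valued]  # no ZeroDivisionError
```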


This seems a bit hard-coded for this model only, but it will be resolved once the input range is opened to users via RFC #1425.

@bowang007 (Collaborator) left a comment:

LGTM

@gs-olive gs-olive merged commit 1951525 into pytorch:master Nov 9, 2022
@gs-olive gs-olive deleted the cuda_error_bugfix branch November 9, 2022 02:44
Labels: cla signed · component: core (Issues re: The core compiler) · component: partitioning · release: v1.3 (Tagged to be included in v1.3)

Successfully merging this pull request may close these issues:

🐛 [Bug] Encountered cuda 710 error when apply Torch-TensorRT to BERT