torch.matmul output contains some nan value for large size fp16 tensors in V100 GPU #45724
Comments
This is strange... I can reproduce the issue on V100 with PyTorch 1.6. Is FP16 batched matrix multiply buggy for large batch sizes?
Thanks @colesbury. Adding some more information: if the inputs are instead created directly on the GPU with fp16 dtype, as below, about 38% of the torch.matmul outputs are 0, which is not expected, and the problem seems to get even worse. The whole updated repro code:
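A minimal sketch of what that updated repro might look like, assuming the operands are created directly on the GPU in fp16 (the tensor names and the zero-ratio print are illustrative; the shape is taken from the issue description below):

```python
import torch

# Create the operands directly on the GPU in fp16, rather than
# converting from fp32 tensors created on the CPU.
q = torch.randn(13269, 8, 22, 64, dtype=torch.float16, device="cuda")
k = torch.randn(13269, 8, 22, 64, dtype=torch.float16, device="cuda")

out = torch.matmul(q, k.transpose(-1, -2))  # shape [13269, 8, 22, 22]

zeros = (out == 0).sum().item()
print(f"zero items count: {zeros}, ratio: {100.0 * zeros / out.numel()}%")
```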
This looks like a cuBLAS bug. Can you please post the CUDA version your build reports?
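For reference, a snippet along these lines prints the relevant version information (the exact command asked for did not come through; `torch.version.cuda` is an assumption):

```python
import torch

print("PyTorch version:", torch.__version__)
print("CUDA version (build):", torch.version.cuda)
print("CUDA name:", torch.cuda.get_device_name(0))
```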
CUDA name: Tesla V100-SXM2-16GB
Adding more info:
@bbfrog Thanks for reporting this issue. As a workaround, could you try to install the nightly binaries with CUDA11?
THCudaBlas_HgemmStridedBatched correctly sets the computeType. Disclaimer: I did not confirm that the particular failing case actually routes to it.
Thanks @ptrblck very much! I will try out cublas 11.1.0.024 on a dev machine and update with the results. But in our production training environment it may take a long time to upgrade to cublas 11.1.0.024 (I will try to push hard :)), so any other workaround is also appreciated. Thanks!
At the very least we should add an error when we hit this case on cublas 10, for as long as we support cuda 10. We don't think it would be too difficult to split the batch into smaller sizes when we know we have a buggy cublas, but it will be a little tricky to test that we did it correctly (though not too bad: only about 150M of cuda data is needed to trigger the problem).
The surface of the bug:
@ptrblck does this sound right? Can you guys submit a PR throwing an error under those conditions (or, ideally, implementing a workaround that calls batched gemm multiple times)?
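As a rough Python-level illustration of the split-batch workaround discussed above (the real fix would live in the C++/cuBLAS dispatch layer; the chunk size of 65535 is an assumption about where the buggy path kicks in, not a confirmed limit):

```python
import torch

def chunked_bmm(a, b, max_batch=65535):
    # Batched matmul that calls the underlying batched gemm several times,
    # keeping each call's batch count below an assumed safe limit.
    out = torch.empty(a.shape[0], a.shape[1], b.shape[2],
                      dtype=a.dtype, device=a.device)
    for start in range(0, a.shape[0], max_batch):
        end = min(start + max_batch, a.shape[0])
        torch.bmm(a[start:end], b[start:end], out=out[start:end])
    return out
```

For the 4-D shapes in this issue, the leading dimensions would first be flattened into a single batch dimension (13269 * 8 = 106152 batches).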
I checked the code fix that was checked in, and it looks great to me. Thanks!
Issue description
Please see the simple code below. When running on an NVIDIA V100 GPU with randomly generated fp16 tensors of size [13269, 8, 22, 64] as input, the torch.matmul output contains some NaN values, which are not expected.
This problem cannot be reproduced on a P100 or 1080 Ti; it seems related to the fp16 computation kernels on the V100 (both the 16GB and 32GB V100 reproduce it).
This problem can be reproduced with many other large fp16 tensor sizes; the code below is just one example.
This problem can be reproduced on both PyTorch 1.3.1 and PyTorch 1.5.1. I haven't tried other PyTorch versions.
This problem cannot be reproduced on the V100 if fp32 computation is used.
The code stdout when running in P100 or 1080Ti:
CUDA name: GeForce GTX 1080 Ti
nan items count: 0, ratio: 0.0%
The code stdout when running in V100:
CUDA name: Tesla V100-SXM2-16GB
nan items count: 305560, ratio: 0.59473426223678%
nan examples:
index: 8191,7,2,0
Computed attention_scores: nan
Expected attention_scores: 15.4609375
index: 8191,7,11,4
Computed attention_scores: nan
Expected attention_scores: 14.203125
...
Code example
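A minimal reconstruction of the repro from the details above (the fp32 reference pass used for the "Expected attention_scores" values is an assumption) might look like:

```python
import torch

print("CUDA name:", torch.cuda.get_device_name(0))

# Randomly generated fp16 tensors of size [13269, 8, 22, 64].
q = torch.randn(13269, 8, 22, 64).cuda().half()
k = torch.randn(13269, 8, 22, 64).cuda().half()

attention_scores = torch.matmul(q, k.transpose(-1, -2))

nan_mask = torch.isnan(attention_scores)
nan_count = nan_mask.sum().item()
print(f"nan items count: {nan_count}, "
      f"ratio: {100.0 * nan_count / attention_scores.numel()}%")

# Reference computed in fp32 (assumed source of the "Expected" values above).
expected = torch.matmul(q.float(), k.float().transpose(-1, -2))
if nan_count > 0:
    print("nan examples:")
    for idx in nan_mask.nonzero()[:2]:
        i = tuple(idx.tolist())
        print("index:", ",".join(str(d) for d in i))
        print("Computed attention_scores:", attention_scores[i].item())
        print("Expected attention_scores:", expected[i].item())
```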
cc @ezyang @gchanan @zou3519 @csarofeen @ptrblck @ngimel