[BF16] Visit all the type casts from integer to BF16 type for potential accuracy loss #89212
Comments
…loat16 representation of large integer (#89210)

Fixes #88939. The root cause of the issue is that BF16 cannot accurately represent large integer values. In the test case below, `539`, one of the corner pixel indices, is wrongly represented as `540` (from https://github.com/jgong5/pytorch/blob/fc60a1865eafc985217eccc0251f82014041e6a7/aten/src/ATen/native/UpSample.h#L271), which then leads to an out-of-range access with that index. Thanks to @malfet for the investigation and initial fix. I also reported issue #89212 to track the inaccurate integer representation of BF16 that needs to be addressed in other places in PyTorch.

```python
import torch

def test():
    arg_1 = torch.rand([1, 10, 540, 540], dtype=torch.bfloat16).clone()
    res = torch.nn.functional.interpolate(arg_1, 2, mode='bilinear', align_corners=True)

test()
```

Pull Request resolved: #89210
Approved by: https://github.com/malfet
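The representation error is easy to observe directly; a minimal check (added for illustration, not part of the original commit message):

```python
import torch

# 539 needs 10 significant bits, but bfloat16 keeps only 8, so the
# value rounds to the nearest representable number, which is 540.
print(torch.tensor(539.0, dtype=torch.bfloat16).item())  # 540.0
print(torch.tensor(539.0, dtype=torch.float32).item())   # 539.0
```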
Well, float cannot accurately represent large integers either, so IMO a proper fix would be to add some sort of check.
Yes, I just planned to add the check as well. Thanks for the reminder. The reason I didn't add it in the first place is that for the check to trigger, the size would have to be very big, too big to be allocated in practice. I reported this issue mainly for the accuracy concern, BTW.
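For concreteness, a hypothetical sketch of the kind of guard being discussed (`check_representable` is an invented name, not a PyTorch API):

```python
import torch

# Verify that an integer size survives a round trip through the
# compute dtype; if it does not, index math in that dtype is unsafe.
def check_representable(size: int, dtype: torch.dtype) -> None:
    if torch.tensor(float(size), dtype=dtype).item() != size:
        raise ValueError(f"size {size} is not exactly representable in {dtype}")

check_representable(539, torch.bfloat16)  # raises: 539 rounds to 540
```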
…da (#89252)

Improves the fix in #89210. See discussion in #89212 (comment).

Pull Request resolved: #89252
Approved by: https://github.com/mingfeima, https://github.com/weiwangmeta
There are also a lot of accumulations done in bfloat16 in the backward pass.
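A generic sketch of why accumulating in bfloat16 is a concern (illustrative only, not the specific code this comment referenced):

```python
import torch

# Once the running sum reaches 32, the spacing between representable
# bfloat16 values is 0.25, so adding ~0.1 rounds away entirely and
# the sum stops growing.
acc = torch.tensor(0.0, dtype=torch.bfloat16)
for _ in range(10000):
    acc = acc + torch.tensor(0.1, dtype=torch.bfloat16)
print(acc.item())  # stalls at 32.0 instead of reaching ~1000
```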
BF16 is not supported by FractionalMaxPool yet.
### Description
- Fix precision issue for BFloat16 upsampling: #89212
- Improve performance for BFloat16 upsampling.

### Testing
Data type: BFloat16

Single core, contiguous:

mode | scale_factor | shape | backward before / ms | backward after / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 14.47 | 8.34
linear | 2 | [3, 200, 200] | 3.69 | 2.74
bilinear | 2 | [3, 5, 200, 200] | 87.99 | 49.05
trilinear | 2 | [3, 3, 3, 100, 100] | 171.02 | 72.53
bicubic | 2 | [3, 3, 200, 200] | 176.29 | 78

Single core, channels last:

mode | scale_factor | shape | backward before / ms | backward after / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 17.70 | 10.30
linear | 2 | [3, 200, 200] | \ | \
bilinear | 2 | [3, 5, 200, 200] | 50.90 | 18.83
trilinear | 2 | [3, 3, 3, 100, 100] | 121.56 | 42.60
bicubic | 2 | [3, 3, 200, 200] | 179.40 | 80

20 cores, contiguous:

mode | scale_factor | shape | backward before / ms | backward after / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 1.17 | 1.01
linear | 2 | [3, 200, 200] | 0.41 | 0.26
bilinear | 2 | [3, 5, 200, 200] | 7.19 | 4.07
trilinear | 2 | [3, 3, 3, 100, 100] | 21.32 | 9.33
bicubic | 2 | [3, 3, 200, 200] | 178.67 | 10

20 cores, channels last:

mode | scale_factor | shape | backward before / ms | backward after / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 2.25 | 1.55
linear | 2 | [3, 200, 200] | \ | \
bilinear | 2 | [3, 5, 200, 200] | 20.17 | 7.20
trilinear | 2 | [3, 3, 3, 100, 100] | 43.33 | 15.66
bicubic | 2 | [3, 3, 200, 200] | 176.76 | 10

Pull Request resolved: #91169
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/Skylion007
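For reference, a rough sketch of how one of these timings could be reproduced (not the PR's actual benchmark harness; shape and mode taken from the bilinear row above):

```python
import time
import torch

# Time the backward pass of BFloat16 bilinear upsampling on CPU.
x = torch.rand([3, 5, 200, 200], dtype=torch.bfloat16, requires_grad=True)
out = torch.nn.functional.interpolate(x, scale_factor=2, mode='bilinear')
grad = torch.ones_like(out)

start = time.perf_counter()
out.backward(grad)
print(f"backward: {(time.perf_counter() - start) * 1e3:.2f} ms")
```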
It was marked as done.
Issue description
BF16 cannot accurately represent all integer values outside the range of [-255, 255]. When an integer value is cast to the BF16 data type, there is potential accuracy loss. Typically, such a type cast is used in pixel-index calculations in ops such as upsampling and pooling. FP32 is preferred in these cases. See #88939 for a related functionality issue.
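A quick way to see where exact integer representation breaks down (an illustrative sketch, not part of the original report):

```python
import torch

# bfloat16 keeps 8 significant bits (7 stored + 1 implicit), so the
# first positive integer that fails to round-trip is expected to be
# 257: it needs 9 significant bits and rounds to 256.
for i in range(1, 1000):
    if torch.tensor(float(i), dtype=torch.bfloat16).item() != i:
        print(f"first non-representable integer: {i}")
        break
```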
Code example
`area_pixel_compute_scale` (https://github.com/jgong5/pytorch/blob/fc60a1865eafc985217eccc0251f82014041e6a7/aten/src/ATen/native/UpSample.h#L263) is one example, and it is used in many places. There are also direct usages of `static_cast<scalar_t>` from integer to BF16, such as https://github.com/jgong5/pytorch/blob/fc60a1865eafc985217eccc0251f82014041e6a7/aten/src/ATen/native/FractionalMaxPool2d.cpp#L139

cc @VitalyFedyunin @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
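To illustrate the direction being proposed, a Python sketch that mirrors the scale computation while keeping the arithmetic in a type wider than BF16 (`compute_scale` is a simplified, hypothetical stand-in for `area_pixel_compute_scale`):

```python
import torch

# Keep scale/index arithmetic in a wide type (Python float here,
# FP32/double in the kernels), casting only the final values.
def compute_scale(input_size: int, output_size: int, align_corners: bool) -> float:
    if align_corners:
        return (input_size - 1) / (output_size - 1) if output_size > 1 else 0.0
    return input_size / output_size if output_size > 0 else 0.0

exact = compute_scale(540, 1080, align_corners=True)  # 539/1079 ~= 0.49954
# The same arithmetic in bfloat16 effectively computes 540/1080 = 0.5,
# because both 539 and 1079 round to neighboring representable values.
bf16 = (torch.tensor(539.0, dtype=torch.bfloat16)
        / torch.tensor(1079.0, dtype=torch.bfloat16)).item()
print(exact, bf16)
```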