-
Notifications
You must be signed in to change notification settings - Fork 21.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Running torch.sort
can corrupt memory
#111189
Labels
high priority
module: correctness (silent)
issue that returns an incorrect result silently
module: crash
Problem manifests as a hard crash, as opposed to a RuntimeError
module: regression
It used to work, and now it doesn't
module: sorting and selection
triage review
Milestone
Comments
malfet
added a commit
to pytorch/FBGEMM
that referenced
this issue
Oct 13, 2023
Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in `heap-buffer-overflow` if bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count` Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version
Another interesting aspect, is that ASAN in clang-12 fails to capture it, perhaps time to migrate to clang-15 |
malfet
added a commit
that referenced
this issue
Oct 13, 2023
Hopefully it will align with internal system and they will detect heap-overlow access reported in #111189 Also, do not build neither Triton, nor protobuf nor DB dependencies (as they are not neded for ASAN builds/tess)
pytorchmergebot
pushed a commit
that referenced
this issue
Oct 14, 2023
Hopefully it will align with internal system and they will detect heap-overlow access reported in #111189 Also, do not build neither Triton, nor protobuf nor DB dependencies (as they are not needed for ASAN builds/tests) Pull Request resolved: #111218 Approved by: https://github.com/Skylion007
pytorchmergebot
pushed a commit
that referenced
this issue
Oct 14, 2023
If `USE_ASAN` is set, compile FBGEMM with ASAN as well, by setting `USE_SANITIZER` to `address,undefined` This fixes regression in sanitizer coverage introduced by #93147 that change effects of sanitizer from the entire project to just torch libraries, and finally allows one to reliably catch regression reported in #111189 Pull Request resolved: #111266 Approved by: https://github.com/huydhn
yeounoh
pushed a commit
to yeounoh/pytorch
that referenced
this issue
Oct 16, 2023
Hopefully it will align with internal system and they will detect heap-overlow access reported in pytorch#111189 Also, do not build neither Triton, nor protobuf nor DB dependencies (as they are not needed for ASAN builds/tests) Pull Request resolved: pytorch#111218 Approved by: https://github.com/Skylion007
yeounoh
pushed a commit
to yeounoh/pytorch
that referenced
this issue
Oct 16, 2023
If `USE_ASAN` is set, compile FBGEMM with ASAN as well, by setting `USE_SANITIZER` to `address,undefined` This fixes regression in sanitizer coverage introduced by pytorch#93147 that change effects of sanitizer from the entire project to just torch libraries, and finally allows one to reliably catch regression reported in pytorch#111189 Pull Request resolved: pytorch#111266 Approved by: https://github.com/huydhn
yeounoh
pushed a commit
to yeounoh/pytorch
that referenced
this issue
Oct 16, 2023
If `USE_ASAN` is set, compile FBGEMM with ASAN as well, by setting `USE_SANITIZER` to `address,undefined` This fixes regression in sanitizer coverage introduced by pytorch#93147 that change effects of sanitizer from the entire project to just torch libraries, and finally allows one to reliably catch regression reported in pytorch#111189 Pull Request resolved: pytorch#111266 Approved by: https://github.com/huydhn
malfet
added a commit
to pytorch/FBGEMM
that referenced
this issue
Oct 16, 2023
Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in `heap-buffer-overflow` if bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count` Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version
facebook-github-bot
pushed a commit
to pytorch/FBGEMM
that referenced
this issue
Oct 16, 2023
Summary: Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in `heap-buffer-overflow` if bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count` Factor out common code from `RadixSortTest.cc` into `test_tempalte` and add regression test for buffer overflow, which before the test will fail as follows: ``` [ RUN ] cpuKernelTest.raidx_sort_heap_overflow /home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:36: Failure Expected equality of these values: expected_keys Which is: { 2, 3, 5, -1, -1, 2147483647, 2147483647, 2147483647 } keys Which is: { -1, -1, -1, -1, -1, -1, -1, -1 } /home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:37: Failure Expected equality of these values: expected_values Which is: { 1, 4, 6, 7, 8, 2, 3, 5 } values Which is: { 2147483647, 4, 6, 7, 8, 6, 7, 8 } [ FAILED ] cpuKernelTest.raidx_sort_heap_overflow (0 ms) ``` Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version Reviewed By: kit1980, jianyuh Differential Revision: D50256504 Pulled By: malfet
malfet
added a commit
that referenced
this issue
Oct 20, 2023
By updating fbgemm Add regression test for it Fixes #111189
andreigh
pushed a commit
to andreigh/pytorch
that referenced
this issue
Oct 26, 2023
By updating fbgemm submodule Add regression test for it (though it can probably be limited to just CPU as reproducer only works if num_threads is 1) Also, update call-sites to `fbgemm:: GenerateEmbeddingSpMDM` to pass `isbf16` twice, to match API changes introduced in pytorch/FBGEMM#1851 Fixes pytorch#111189 and pytorch#111710 Pull Request resolved: pytorch#111672 Approved by: https://github.com/Skylion007
malfet
added a commit
to pytorch/FBGEMM
that referenced
this issue
Nov 2, 2023
Summary: Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in `heap-buffer-overflow` if bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count` Factor out common code from `RadixSortTest.cc` into `test_tempalte` and add regression test for buffer overflow, which before the test will fail as follows: ``` [ RUN ] cpuKernelTest.raidx_sort_heap_overflow /home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:36: Failure Expected equality of these values: expected_keys Which is: { 2, 3, 5, -1, -1, 2147483647, 2147483647, 2147483647 } keys Which is: { -1, -1, -1, -1, -1, -1, -1, -1 } /home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:37: Failure Expected equality of these values: expected_values Which is: { 1, 4, 6, 7, 8, 2, 3, 5 } values Which is: { 2147483647, 4, 6, 7, 8, 6, 7, 8 } [ FAILED ] cpuKernelTest.raidx_sort_heap_overflow (0 ms) ``` Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version Pull Request resolved: #2075 Reviewed By: kit1980, jianyuh Differential Revision: D50256504 Pulled By: malfet fbshipit-source-id: f805607595e324999cea07dcacdee8317a008221 (cherry picked from commit 70c6e83)
malfet
added a commit
that referenced
this issue
Nov 2, 2023
By updating fbgemm submodule to `pytorch/release/2.1` branch, that contains following two cherry-picks: pytorch/FBGEMM@30f09a2 (formatting) and pytorch/FBGEMM@70c6e83 (actual fix for the regression) Add regression test for it (though it can probably be limited to just CPU as reproducer only works if num_threads is 1) Fixes #111189 Cherry-pick of #111672 into release/2.1 branch, but with more targeted fbgemm updated (cherry picked from commit 03da069)
malfet
added a commit
that referenced
this issue
Nov 3, 2023
By updating fbgemm submodule to `pytorch/release/2.1` branch, that contains following two cherry-picks: - pytorch/FBGEMM@30f09a2 (formatting) - pytorch/FBGEMM@70c6e83 (actual fix for the regression) Add regression test for it (though it can probably be limited to just CPU as reproducer only works if num_threads is 1) Fixes #111189 Cherry-pick of #111672 into release/2.1 branch, but with more targeted fbgemm updated (cherry picked from commit 03da069)
xuhancn
pushed a commit
to xuhancn/pytorch
that referenced
this issue
Nov 7, 2023
By updating fbgemm submodule Add regression test for it (though it can probably be limited to just CPU as reproducer only works if num_threads is 1) Also, update call-sites to `fbgemm:: GenerateEmbeddingSpMDM` to pass `isbf16` twice, to match API changes introduced in pytorch/FBGEMM#1851 Fixes pytorch#111189 and pytorch#111710 Pull Request resolved: pytorch#111672 Approved by: https://github.com/Skylion007
Skylion007
pushed a commit
to Skylion007/pytorch
that referenced
this issue
Nov 14, 2023
By updating fbgemm submodule Add regression test for it (though it can probably be limited to just CPU as reproducer only works if num_threads is 1) Also, update call-sites to `fbgemm:: GenerateEmbeddingSpMDM` to pass `isbf16` twice, to match API changes introduced in pytorch/FBGEMM#1851 Fixes pytorch#111189 and pytorch#111710 Pull Request resolved: pytorch#111672 Approved by: https://github.com/Skylion007
Halmoni100
pushed a commit
to Halmoni100/pytorch
that referenced
this issue
Nov 25, 2023
By updating fbgemm submodule to `pytorch/release/2.1` branch, that contains following two cherry-picks: - pytorch/FBGEMM@30f09a2 (formatting) - pytorch/FBGEMM@70c6e83 (actual fix for the regression) Add regression test for it (though it can probably be limited to just CPU as reproducer only works if num_threads is 1) Fixes pytorch#111189 Cherry-pick of pytorch#111672 into release/2.1 branch, but with more targeted fbgemm updated (cherry picked from commit 03da069)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
high priority
module: correctness (silent)
issue that returns an incorrect result silently
module: crash
Problem manifests as a hard crash, as opposed to a RuntimeError
module: regression
It used to work, and now it doesn't
module: sorting and selection
triage review
馃悰 Describe the bug
Running following code will result in crash/memory corruption (and the cause for S369412):
This is a regression introduced by #100081 that uses
fbgemm::radix_sort_parallel
, that in turn results in out-of-bound writes for arrays that contains a lots of elements starting with0x80
and0x7f
Versions
2.1.0/Nightly
cc @ezyang @gchanan @zou3519 @kadeng
The text was updated successfully, but these errors were encountered: