
Running torch.sort can corrupt memory #111189

Closed
malfet opened this issue Oct 13, 2023 · 1 comment
Assignees
Labels
high priority module: correctness (silent) issue that returns an incorrect result silently module: crash Problem manifests as a hard crash, as opposed to a RuntimeError module: regression It used to work, and now it doesn't module: sorting and selection triage review
Milestone

Comments

@malfet
Contributor

malfet commented Oct 13, 2023

🐛 Describe the bug

Running the following code results in a crash/memory corruption (and is the cause of S369412):

```
import torch

torch.set_num_threads(1)  # the reproducer only triggers with a single thread
x = torch.full((32768,), -1, dtype=torch.int32)
x[:100] = torch.iinfo(x.dtype).max  # first 100 elements become INT32_MAX
uv = x.sort().values.unique()
print(uv)
# The array holds exactly two distinct values, so unique() must return both
assert uv.size(0) == 2
```

This is a regression introduced by #100081, which switched to fbgemm::radix_sort_parallel; that in turn results in out-of-bounds writes for arrays containing many elements whose leading byte is 0x80 or 0x7f
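For context, here is a minimal pure-Python sketch (illustrative only, not fbgemm's actual implementation) of a byte-wise LSD radix sort for signed 32-bit integers. It makes explicit the invariant the bug violated: every scatter offset produced by the prefix sum must stay strictly below the element count. Inputs dominated by values whose top byte is 0x80 or 0x7f (e.g. -1 and INT32_MAX, as in the reproducer above) stress the sign-byte pass, which is where the out-of-bounds write occurred.

```python
def radix_sort_i32(keys):
    """Byte-wise LSD radix sort for signed 32-bit integers (sketch)."""
    n = len(keys)
    for shift in (0, 8, 16, 24):
        # Count how many keys fall into each of the 256 byte buckets.
        hist = [0] * 256
        for k in keys:
            hist[(k >> shift) & 0xFF] += 1
        # On the most-significant-byte pass, buckets 0x80..0xFF hold
        # negative numbers and must come first for signed ordering.
        order = list(range(256)) if shift < 24 else \
            list(range(128, 256)) + list(range(128))
        # Exclusive prefix sum: each bucket's start offset in the output.
        offsets = [0] * 256
        pos = 0
        for b in order:
            offsets[b] = pos
            pos += hist[b]
        # The invariant the fbgemm bug violated: every non-empty bucket's
        # start offset is strictly less than the element count.
        assert all(offsets[b] < n for b in range(256) if hist[b])
        # Stable scatter into the output buffer.
        out = [0] * n
        for k in keys:
            b = (k >> shift) & 0xFF
            out[offsets[b]] = k
            offsets[b] += 1
        keys = out
    return keys
```

With a correct prefix sum the scatter never writes past `out`; storing `element_count` into an offset slot of a non-empty bucket is exactly what turns this loop into a heap overflow.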

Versions

2.1.0/Nightly

cc @ezyang @gchanan @zou3519 @kadeng

@malfet malfet added high priority module: crash Problem manifests as a hard crash, as opposed to a RuntimeError module: regression It used to work, and now it doesn't module: correctness (silent) issue that returns an incorrect result silently module: sorting and selection labels Oct 13, 2023
@malfet malfet added this to the 2.1.1 milestone Oct 13, 2023
malfet added a commit to pytorch/FBGEMM that referenced this issue Oct 13, 2023
Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in `heap-buffer-overflow` if the bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count`)

Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version
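The layout the commit message refers to can be illustrated with a hypothetical sketch (the names mirror the commit message, but the code below is not fbgemm's): each thread owns `RDX_HIST_SIZE` bucket counters, and a combined exclusive prefix sum assigns each (thread, bucket) pair its scatter start offset. After the loop the running `offset` equals `element_count`, so writing it back into the final slot hands a non-empty bucket a start position one past the end of the output buffer.

```python
RDX_HIST_SIZE = 256  # one bucket per byte value

def combine_prefix_sum(histogram, nthreads, element_count):
    """Exclusive prefix sum over per-thread histograms, bucket-major.

    histogram is laid out as histogram[RDX_HIST_SIZE * tid + bucket];
    the result gives each (thread, bucket) pair its scatter start offset.
    """
    histogram_ps = [0] * (RDX_HIST_SIZE * nthreads)
    offset = 0
    for bucket in range(RDX_HIST_SIZE):
        for tid in range(nthreads):
            histogram_ps[RDX_HIST_SIZE * tid + bucket] = offset
            offset += histogram[RDX_HIST_SIZE * tid + bucket]
    # After visiting every bucket the running offset equals the total
    # element count; storing it back into any slot of histogram_ps (as
    # the buggy code did) would point a scatter past the buffer's end.
    assert offset == element_count
    return histogram_ps
```

Every value left in `histogram_ps` for a non-empty bucket is a valid write position; the extra assignment the fix removed overwrote one of them with `element_count` itself.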
@malfet malfet self-assigned this Oct 13, 2023
@malfet
Contributor Author

malfet commented Oct 13, 2023

Another interesting aspect is that ASAN in clang-12 fails to catch it; perhaps it's time to migrate to clang-15

malfet added a commit that referenced this issue Oct 13, 2023
Hopefully it will align with the internal system and they will detect the heap-overflow access reported in #111189
Also, do not build Triton, protobuf, or DB dependencies (as they are not needed for ASAN builds/tests)
pytorchmergebot pushed a commit that referenced this issue Oct 14, 2023
Hopefully it will align with the internal system and they will detect the heap-overflow access reported in #111189. Also, do not build Triton, protobuf, or DB dependencies (as they are not needed for ASAN builds/tests)

Pull Request resolved: #111218
Approved by: https://github.com/Skylion007
pytorchmergebot pushed a commit that referenced this issue Oct 14, 2023
If `USE_ASAN` is set, compile FBGEMM with ASAN as well, by setting `USE_SANITIZER` to `address,undefined`

This fixes a regression in sanitizer coverage introduced by #93147, which narrowed the sanitizer's scope from the entire project to just the torch libraries, and finally allows one to reliably catch the regression reported in #111189

Pull Request resolved: #111266
Approved by: https://github.com/huydhn
yeounoh pushed a commit to yeounoh/pytorch that referenced this issue Oct 16, 2023
Hopefully it will align with the internal system and they will detect the heap-overflow access reported in pytorch#111189. Also, do not build Triton, protobuf, or DB dependencies (as they are not needed for ASAN builds/tests)

Pull Request resolved: pytorch#111218
Approved by: https://github.com/Skylion007
yeounoh pushed a commit to yeounoh/pytorch that referenced this issue Oct 16, 2023
If `USE_ASAN` is set, compile FBGEMM with ASAN as well, by setting `USE_SANITIZER` to `address,undefined`

This fixes a regression in sanitizer coverage introduced by pytorch#93147, which narrowed the sanitizer's scope from the entire project to just the torch libraries, and finally allows one to reliably catch the regression reported in pytorch#111189

Pull Request resolved: pytorch#111266
Approved by: https://github.com/huydhn
malfet added a commit to pytorch/FBGEMM that referenced this issue Oct 16, 2023
Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in `heap-buffer-overflow` if the bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count`)

Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version
facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this issue Oct 16, 2023
Summary:
Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in `heap-buffer-overflow` if the bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count`)

Factor out common code from `RadixSortTest.cc` into a test template and add a regression test for the buffer overflow which, before the fix, fails as follows:
```
[ RUN      ] cpuKernelTest.raidx_sort_heap_overflow
/home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:36: Failure
Expected equality of these values:
  expected_keys
    Which is: { 2, 3, 5, -1, -1, 2147483647, 2147483647, 2147483647 }
  keys
    Which is: { -1, -1, -1, -1, -1, -1, -1, -1 }
/home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:37: Failure
Expected equality of these values:
  expected_values
    Which is: { 1, 4, 6, 7, 8, 2, 3, 5 }
  values
    Which is: { 2147483647, 4, 6, 7, 8, 6, 7, 8 }
[  FAILED  ] cpuKernelTest.raidx_sort_heap_overflow (0 ms)
```

Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version


Reviewed By: kit1980, jianyuh

Differential Revision: D50256504

Pulled By: malfet
@malfet malfet reopened this Oct 19, 2023
malfet added a commit that referenced this issue Oct 20, 2023
By updating fbgemm

Add regression test for it

Fixes #111189
andreigh pushed a commit to andreigh/pytorch that referenced this issue Oct 26, 2023
By updating the fbgemm submodule.
Add a regression test for it (though it can probably be limited to CPU only, as the reproducer only works if num_threads is 1)

Also, update call-sites of `fbgemm::GenerateEmbeddingSpMDM` to pass `isbf16` twice, to match API changes introduced in pytorch/FBGEMM#1851

Fixes pytorch#111189 and pytorch#111710

Pull Request resolved: pytorch#111672
Approved by: https://github.com/Skylion007
malfet added a commit to pytorch/FBGEMM that referenced this issue Nov 2, 2023
Summary:
Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in `heap-buffer-overflow` if the bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count`)

Factor out common code from `RadixSortTest.cc` into a test template and add a regression test for the buffer overflow which, before the fix, fails as follows:
```
[ RUN      ] cpuKernelTest.raidx_sort_heap_overflow
/home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:36: Failure
Expected equality of these values:
  expected_keys
    Which is: { 2, 3, 5, -1, -1, 2147483647, 2147483647, 2147483647 }
  keys
    Which is: { -1, -1, -1, -1, -1, -1, -1, -1 }
/home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:37: Failure
Expected equality of these values:
  expected_values
    Which is: { 1, 4, 6, 7, 8, 2, 3, 5 }
  values
    Which is: { 2147483647, 4, 6, 7, 8, 6, 7, 8 }
[  FAILED  ] cpuKernelTest.raidx_sort_heap_overflow (0 ms)
```

Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version

Pull Request resolved: #2075

Reviewed By: kit1980, jianyuh

Differential Revision: D50256504

Pulled By: malfet

fbshipit-source-id: f805607595e324999cea07dcacdee8317a008221
(cherry picked from commit 70c6e83)
malfet added a commit that referenced this issue Nov 2, 2023
By updating the fbgemm submodule to the `pytorch/release/2.1` branch, which contains the following two cherry-picks:
pytorch/FBGEMM@30f09a2 (formatting)
and pytorch/FBGEMM@70c6e83 (actual fix for the regression)
Add a regression test for it (though it can probably be limited to CPU only, as the reproducer only works if num_threads is 1)

Fixes #111189

Cherry-pick of #111672 into the release/2.1 branch, but with a more targeted fbgemm update

(cherry picked from commit 03da069)
malfet added a commit that referenced this issue Nov 3, 2023
By updating the fbgemm submodule to the `pytorch/release/2.1` branch, which contains the following two cherry-picks:
- pytorch/FBGEMM@30f09a2 (formatting)
- pytorch/FBGEMM@70c6e83 (actual fix for the regression)

Add a regression test for it (though it can probably be limited to CPU only, as the reproducer only works if num_threads is 1)

Fixes #111189

Cherry-pick of #111672 into the release/2.1 branch, but with a more targeted fbgemm update

(cherry picked from commit 03da069)
xuhancn pushed a commit to xuhancn/pytorch that referenced this issue Nov 7, 2023
By updating the fbgemm submodule.
Add a regression test for it (though it can probably be limited to CPU only, as the reproducer only works if num_threads is 1)

Also, update call-sites of `fbgemm::GenerateEmbeddingSpMDM` to pass `isbf16` twice, to match API changes introduced in pytorch/FBGEMM#1851

Fixes pytorch#111189 and pytorch#111710

Pull Request resolved: pytorch#111672
Approved by: https://github.com/Skylion007
Skylion007 pushed a commit to Skylion007/pytorch that referenced this issue Nov 14, 2023
By updating the fbgemm submodule.
Add a regression test for it (though it can probably be limited to CPU only, as the reproducer only works if num_threads is 1)

Also, update call-sites of `fbgemm::GenerateEmbeddingSpMDM` to pass `isbf16` twice, to match API changes introduced in pytorch/FBGEMM#1851

Fixes pytorch#111189 and pytorch#111710

Pull Request resolved: pytorch#111672
Approved by: https://github.com/Skylion007
Halmoni100 pushed a commit to Halmoni100/pytorch that referenced this issue Nov 25, 2023
By updating the fbgemm submodule to the `pytorch/release/2.1` branch, which contains the following two cherry-picks:
- pytorch/FBGEMM@30f09a2 (formatting)
- pytorch/FBGEMM@70c6e83 (actual fix for the regression)

Add a regression test for it (though it can probably be limited to CPU only, as the reproducer only works if num_threads is 1)

Fixes pytorch#111189

Cherry-pick of pytorch#111672 into the release/2.1 branch, but with a more targeted fbgemm update

(cherry picked from commit 03da069)