
Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None and indices are not sorted #45248

Conversation

kurtamohler
Collaborator

@kurtamohler kurtamohler commented Sep 24, 2020

Sorting indices before calling thrust::unique fixes the issue.
Fixes #44792
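
For reference, a minimal sketch of the failure mode (illustrative only, not the exact repro from #44792; tensor values and sizes are arbitrary and a CUDA device is required to run it):

```python
import torch
import torch.nn.functional as F

# Illustrative repro in the spirit of issue #44792: with max_norm set,
# repeated *unsorted* indices could give wrong results on CUDA because
# thrust::unique was applied to an unsorted index list.
weight_cpu = torch.randn(10, 3)
weight_cuda = weight_cpu.clone().cuda()
idx = torch.tensor([7, 1, 7, 3])  # index 7 repeats and the list is unsorted

out_cpu = F.embedding(idx, weight_cpu, max_norm=1.0)
out_cuda = F.embedding(idx.cuda(), weight_cuda, max_norm=1.0)

# Before this fix the CUDA result could silently diverge from the CPU result;
# after sorting the indices prior to thrust::unique the two should agree.
print(torch.allclose(out_cpu, out_cuda.cpu()))
```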


@dr-ci

dr-ci bot commented Sep 24, 2020

💊 CI failures summary and remediations

As of commit 18dbfb8ee4 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 8 ongoing upstream failures:

These were probably caused by upstream breakages that are not fixed yet:



@kurtamohler kurtamohler marked this pull request as ready for review September 24, 2020 05:05
@kurtamohler kurtamohler force-pushed the embedding-nondeterministic-alert-44792 branch 2 times, most recently from 913fff7 to 403fbfd on September 24, 2020 18:39
@zou3519 zou3519 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Sep 25, 2020
@kurtamohler kurtamohler force-pushed the embedding-nondeterministic-alert-44792 branch 2 times, most recently from 7be0437 to e469c3d on September 30, 2020 18:09
@kurtamohler
Collaborator Author

I think this is ready. The CI failures didn't seem to be my fault, but I'm rerunning them.

@codecov

codecov bot commented Oct 1, 2020

Codecov Report

Merging #45248 into master will decrease coverage by 0.00%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #45248      +/-   ##
==========================================
- Coverage   68.28%   68.27%   -0.01%     
==========================================
  Files         410      410              
  Lines       53306    53306              
==========================================
- Hits        36398    36397       -1     
- Misses      16908    16909       +1     
Impacted Files Coverage Δ
torch/testing/_internal/expecttest.py 77.55% <0.00%> (-1.03%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mruberry
Collaborator

mruberry commented Oct 6, 2020

So I'm curious about your thinking here, @kurtamohler,

  • when duplicate entries are present on CUDA, aren't the results just wrong? They may be nondeterministically wrong, but I thought the determinism flag was for algorithms that are correct-ish but whose values may vary (due to slight numerical instability, for example, or because there is a multiplicity of valid results).
  • assuming for the moment that this is a determinism issue, it only triggers when the input has a certain structure. Yet this PR suggests disallowing the behavior on all CUDA inputs, not just the CUDA inputs with this property. If the determinism flag is set in cases like this, should it do the extra work of inspecting the inputs' structure?

@ngimel
Collaborator

ngimel commented Oct 6, 2020

@kurtamohler as the comment here says (https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Embedding.cu#L350-L356), thrust::unique does not work on unsorted inputs, so you should sort the inputs before calling thrust on them. It may be extra work (if there are no repeating indices), but otherwise there's no way to guarantee that incorrect results won't be produced.
After the inputs are sorted, the behavior will no longer be nondeterministic, so it won't be necessary to set the flag.
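
(As an illustration only: thrust::unique, like std::unique, removes only adjacent duplicates, so on an unsorted index list repeated indices can survive the dedup step and the renorm kernel may process the same embedding row more than once. torch.unique_consecutive has the same consecutive-only semantics and shows the point in Python; the actual change is in the CUDA code.)

```python
import torch

idx = torch.tensor([7, 1, 7, 3])

# unique_consecutive mirrors thrust::unique / std::unique semantics: only
# adjacent duplicates are collapsed, so the repeated 7 survives here.
print(torch.unique_consecutive(idx))                     # tensor([7, 1, 7, 3])

# Sorting first makes duplicates adjacent, so the deduplication actually works.
print(torch.unique_consecutive(torch.sort(idx).values))  # tensor([1, 3, 7])
```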

@kurtamohler kurtamohler force-pushed the embedding-nondeterministic-alert-44792 branch from e469c3d to e1f6b16 on October 6, 2020 23:48
@kurtamohler kurtamohler changed the title from "Add nondeterministic alert to torch.nn.Embedding" to "Fix nondeterminism in CUDA torch.nn.Embedding when max_norm is not None" Oct 6, 2020
@kurtamohler kurtamohler changed the title from "Fix nondeterminism in CUDA torch.nn.Embedding when max_norm is not None" to "Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None" Oct 6, 2020
@kurtamohler kurtamohler changed the title from "Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None" to "Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None and indices are not sorted" Oct 6, 2020
@kurtamohler
Collaborator Author

Thanks @ngimel! I've made that change and added a test based on the repro from the issue description.

Review thread on test/test_nn.py (outdated, resolved)
@kurtamohler kurtamohler force-pushed the embedding-nondeterministic-alert-44792 branch from e1f6b16 to c4edc84 on October 8, 2020 14:51
Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 66505b6.

Labels
Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Embedding max_norm wrong results on cuda when repeating indices are present
6 participants