
Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None and indices are not sorted #45248

Conversation

kurtamohler
Collaborator

@kurtamohler kurtamohler commented Sep 24, 2020

Sorting indices before calling thrust::unique fixes the issue.
Fixes #44792
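
For reference, a minimal sketch of the failure mode (illustrative only, not the exact repro from #44792; tensor values and sizes are arbitrary and a CUDA device is required to run it):

```python
import torch
import torch.nn.functional as F

# Illustrative repro in the spirit of issue #44792: with max_norm set,
# repeated *unsorted* indices could give wrong results on CUDA because
# thrust::unique was applied to an unsorted index list.
weight_cpu = torch.randn(10, 3)
weight_cuda = weight_cpu.clone().cuda()
idx = torch.tensor([7, 1, 7, 3])  # index 7 repeats and the list is unsorted

out_cpu = F.embedding(idx, weight_cpu, max_norm=1.0)
out_cuda = F.embedding(idx.cuda(), weight_cuda, max_norm=1.0)

# Before this fix the CUDA result could silently diverge from the CPU result;
# after sorting the indices prior to thrust::unique the two should agree.
print(torch.allclose(out_cpu, out_cuda.cpu()))
```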


@dr-ci

dr-ci bot commented Sep 24, 2020

💊 CI failures summary and remediations

As of commit 18dbfb8ee4 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 8 ongoing upstream failures:

These were probably caused by upstream breakages that are not fixed yet:



@kurtamohler kurtamohler marked this pull request as ready for review September 24, 2020 05:05
@kurtamohler kurtamohler force-pushed the embedding-nondeterministic-alert-44792 branch 2 times, most recently from 913fff7 to 403fbfd on September 24, 2020 18:39
@zou3519 zou3519 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Sep 25, 2020
@kurtamohler kurtamohler force-pushed the embedding-nondeterministic-alert-44792 branch 2 times, most recently from 7be0437 to e469c3d on September 30, 2020 18:09
@kurtamohler
Collaborator Author

I think this is ready. The CI failures didn't seem to be my fault, but I'm rerunning them.

@codecov

codecov bot commented Oct 1, 2020

Codecov Report

Merging #45248 into master will decrease coverage by 0.00%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #45248      +/-   ##
==========================================
- Coverage   68.28%   68.27%   -0.01%     
==========================================
  Files         410      410              
  Lines       53306    53306              
==========================================
- Hits        36398    36397       -1     
- Misses      16908    16909       +1     
Impacted Files Coverage Δ
torch/testing/_internal/expecttest.py 77.55% <0.00%> (-1.03%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mruberry
Collaborator

mruberry commented Oct 6, 2020

So I'm curious about your thinking here, @kurtamohler,

  • when duplicate entries are present on CUDA, aren't the results just wrong? They may be nondeterministically wrong, but I thought the determinism flag was for algorithms that are correct-ish but whose values may vary (due to slight numerical instability, for example, or because there is a multiplicity of valid results).
  • assuming for the moment that this is a determinism issue, it only triggers when the input has a certain structure. Yet this PR suggests disallowing the behavior on all CUDA inputs, not just the CUDA inputs with this property. If the determinism flag is set in cases like this, should it do the extra work of inspecting the inputs' structure?

@ngimel
Collaborator

ngimel commented Oct 6, 2020

@kurtamohler as the comment here says (https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Embedding.cu#L350-L356), thrust::unique does not work on unsorted inputs, so you should sort the inputs before calling thrust on them. It may be extra work (if there are no repeating indices), but otherwise there's no way to guarantee that incorrect results won't be produced.
After the inputs are sorted, the behavior will no longer be nondeterministic, so it won't be necessary to set the flag.
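
(As an illustration only: thrust::unique, like std::unique, removes only adjacent duplicates, so on an unsorted index list repeated indices can survive the dedup step and the renorm kernel may process the same embedding row more than once. torch.unique_consecutive has the same consecutive-only semantics and shows the point in Python; the actual change is in the CUDA code.)

```python
import torch

idx = torch.tensor([7, 1, 7, 3])

# unique_consecutive mirrors thrust::unique / std::unique semantics: only
# adjacent duplicates are collapsed, so the repeated 7 survives here.
print(torch.unique_consecutive(idx))                     # tensor([7, 1, 7, 3])

# Sorting first makes duplicates adjacent, so the deduplication actually works.
print(torch.unique_consecutive(torch.sort(idx).values))  # tensor([1, 3, 7])
```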

@kurtamohler kurtamohler force-pushed the embedding-nondeterministic-alert-44792 branch from e469c3d to e1f6b16 on October 6, 2020 23:48
@kurtamohler kurtamohler changed the title from "Add nondeterministic alert to torch.nn.Embedding" to "Fix nondeterminism in CUDA torch.nn.Embedding when max_norm is not None" Oct 6, 2020
@kurtamohler kurtamohler changed the title from "Fix nondeterminism in CUDA torch.nn.Embedding when max_norm is not None" to "Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None" Oct 6, 2020
@kurtamohler kurtamohler changed the title from "Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None" to "Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None and indices are not sorted" Oct 6, 2020
@kurtamohler
Collaborator Author

Thanks @ngimel! I've made that change and added a test based on the repro from the issue description.

Review thread on test/test_nn.py (outdated, resolved)
@kurtamohler kurtamohler force-pushed the embedding-nondeterministic-alert-44792 branch from e1f6b16 to c4edc84 on October 8, 2020 14:51
Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 66505b6.

Labels
Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Embedding max_norm wrong results on cuda when repeating indices are present
6 participants