
Add nondeterministic alert to index_copy, median CUDA and kthvalue CUDA #46942

Conversation

@kurtamohler (Collaborator) commented Oct 27, 2020

Also fixes an issue where skipped tests did not properly restore the deterministic flag.

Fixes #46743

@dr-ci bot commented Oct 27, 2020

💊 CI failures summary and remediations

As of commit c42a254 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚




@mrshenli mrshenli added the labels module: cuda (Related to torch.cuda, and CUDA support in general) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Oct 28, 2020
@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch from 2b49191 to 5f0265a Compare October 28, 2020 20:30
@ngimel (Collaborator) commented Oct 29, 2020

Quick notes (did not do a full review):

  1. You are changing a generated file; this is not good.
  2. For scatter_add, we warn regardless of whether there are repeating indices, so maybe we should do this for index_copy too: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/ScatterGatherKernel.cu#L453-L460
  3. While I was looking at scatter_add, I noticed that scatter does not warn, and yet with repeating indices scatter produces a nondeterministic result.
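The nondeterminism discussed above comes from duplicate indices: when several source rows target the same destination index, the final value depends on which parallel write lands last, and on a GPU that order is decided by thread scheduling. A minimal pure-Python sketch of this order dependence (an illustration, not the actual CUDA kernel):

```python
def index_copy_sim(dest, indices, source):
    """Sequentially copy source values into dest at the given indices.

    With duplicate indices, the final value at a duplicated index
    depends on iteration order -- the stand-in here for the GPU's
    unspecified thread scheduling.
    """
    out = list(dest)
    for idx, val in zip(indices, source):
        out[idx] = val
    return out

dest = [0, 0, 0]
indices = [1, 1, 2]          # index 1 is duplicated
source = [10, 20, 30]

forward = index_copy_sim(dest, indices, source)
# Reversed write order simulates a different thread scheduling.
backward = index_copy_sim(dest, list(reversed(indices)),
                          list(reversed(source)))

print(forward)   # [0, 20, 30] -- last write to index 1 wins
print(backward)  # [0, 10, 30] -- different order, different result
```

The same mechanism applies to scatter, scatter_add (for non-associative float addition), and index_copy, which is why the alert fires based on the operation rather than on proving a collision occurred.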

@facebook-github-bot (Contributor) commented:
Hi @kurtamohler!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@mruberry (Collaborator) commented:

Hey @kurtamohler! Great to see more nondeterministic behavior caught by this flag. For now @ngimel and I think we should just always error, like we do with other operations. It's an interesting follow-up design question whether we should add these checks to all operations that are nondeterministic when given duplicate indices.

@mruberry (Collaborator) commented:

Follow-up: we should also throw an error if the determinism flag is set and indices are returned for median or kthvalue on CUDA.
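The subtlety in this request is that the CUDA median values are deterministic; only which index gets reported for tied elements is not, so the error should fire only on the indices path. A rough pure-Python sketch of that guard (all names here are hypothetical stand-ins, not the actual ATen C++ code):

```python
_deterministic = False

def set_deterministic(mode):
    """Stand-in for the global determinism flag."""
    global _deterministic
    _deterministic = mode

def cuda_median(values, return_indices=False):
    """Sketch of the proposed guard: raise only when indices are
    requested, because the values are deterministic but the reported
    index of a tied element is not."""
    if return_indices and _deterministic:
        raise RuntimeError(
            "median with indices output is not deterministic on CUDA")
    s = sorted(values)
    med = s[(len(s) - 1) // 2]
    if return_indices:
        return med, values.index(med)
    return med

set_deterministic(True)
print(cuda_median([3, 1, 2]))  # values-only path is allowed -> 2
try:
    cuda_median([3, 1, 2], return_indices=True)
except RuntimeError as e:
    print("raised:", e)
```

This mirrors the behavior mruberry asks for below: not triggered if indices aren't returned, triggered if they are.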

@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch 2 times, most recently from 30ae80e to 8596d36 Compare November 3, 2020 19:48
@kurtamohler (Collaborator, Author) commented:

Tests for adaptive_log_softmax are failing because they depend on index_copy, and torch.set_deterministic(True) is evidently being set somewhere; I'm not yet sure what is setting it.

@kurtamohler (Collaborator, Author) commented:

Looks like the expectedAlertNondeterministic test decorator does not work properly for tests that are skipped. In this case, torch.set_deterministic(True) is called, and the initial state is never restored. This is the cause of the adaptive_log_softmax failures. I'll fix the decorator.
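The bug described here is the classic save/restore pattern missing a try/finally: a skip is raised as an exception (unittest.SkipTest), so the restore line after the test body is never reached. A minimal sketch of the fix, using stand-in flag functions rather than the real torch API:

```python
import unittest

_flag = False

def set_flag(value):
    global _flag
    _flag = value

def get_flag():
    return _flag

def expected_alert(fn):
    """Force the flag on for the test, restoring the previous value
    even if the test body raises -- including unittest.SkipTest."""
    def wrapper(*args, **kwargs):
        previous = get_flag()
        set_flag(True)
        try:
            return fn(*args, **kwargs)
        finally:
            set_flag(previous)  # runs even when the test is skipped
    return wrapper

@expected_alert
def skipped_test():
    raise unittest.SkipTest("no CUDA available")

try:
    skipped_test()
except unittest.SkipTest:
    pass
print(get_flag())  # False -- the flag was restored despite the skip
```

Without the try/finally, the SkipTest exception would propagate past the restore call and the flag would leak into every later test in the process, which is exactly how the unrelated adaptive_log_softmax tests started failing.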

@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch from 8596d36 to 7c99047 Compare November 4, 2020 22:15
@kurtamohler kurtamohler changed the title Add nondeterministic alert if index_copy is given duplicate indices Add nondeterministic alert to index_copy, median CUDA and kthvalue CUDA Nov 4, 2020
@kurtamohler (Collaborator, Author) commented Nov 4, 2020

Barring any additional test failures, I think this is ready for a re-review.

@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch 3 times, most recently from 2da9bf6 to a564999 Compare November 5, 2020 04:41
@mruberry (Collaborator) commented Nov 6, 2020

Hey @kurtamohler, made a few inline suggestions to align the comments and documentation.

Is there a good determinism test that verifies this runtime error will trigger when these functions are called? It'd be good to see it triggered on the function, method, inplace, and out variants. For median, it should not be triggered if indices aren't returned, but triggered if they are.

@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch from a564999 to c83faf3 Compare November 10, 2020 21:44
@codecov bot commented Nov 13, 2020

Codecov Report

Merging #46942 (c6048a1) into master (5cb688b) will increase coverage by 0.23%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##           master   #46942      +/-   ##
==========================================
+ Coverage   80.91%   81.14%   +0.23%     
==========================================
  Files        1855     1838      -17     
  Lines      200241   198605    -1636     
==========================================
- Hits       162021   161164     -857     
+ Misses      38220    37441     -779     

@mruberry (Collaborator) left a comment:

Hey @kurtamohler!

Overall the actual change looks straightforward and good. Just a few questions/comments about the "Python-ness" of the testing code.

@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch 2 times, most recently from 27823d0 to 330ba8a Compare November 17, 2020 19:20
@kurtamohler (Collaborator, Author) commented:

I'm not sure what is causing the pytorch-linux-bionic-rocm3.9-py3.6 failures. They are all due to incorrect values returned by remainder CUDA with float16 and bfloat16. It doesn't seem that my changes would cause this, but I don't see any upstream failures in the CI HUD. Are these flaky tests?

@ngimel (Collaborator) commented Nov 18, 2020

There was an upstream failure for this; rebase and you should be fine.

@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch from 330ba8a to c6048a1 Compare November 18, 2020 23:10
@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch 3 times, most recently from e7386ba to 9f68aaf Compare December 1, 2020 18:59
@kurtamohler (Collaborator, Author) commented:

Had to rebase to fix a conflict. I think this is ready to go.

@@ -3894,6 +3894,10 @@ def merge_dicts(*dicts):
(see :func:`torch.squeeze`), resulting in both the :attr:`values` and
:attr:`indices` tensors having 1 fewer dimension than the :attr:`input` tensor.

.. note::
@mruberry (Collaborator) left an inline comment; @kurtamohler (Collaborator, Author) replied:

Done

@mruberry (Collaborator) commented Dec 2, 2020

Hey @kurtamohler! Had a look, just a few more comments/suggestions.

@@ -408,33 +421,28 @@ def wrapper(*args, **kwargs):
def wrapDeterministicFlagAPITest(fn):
@mruberry (Collaborator) commented:

Please elaborate in this comment that tests using this wrapper need to start a subprocess (how and why), so that their cuBLAS is initialized with the new workspace config.

@kurtamohler (Collaborator, Author) replied:
Added, let me know what you think
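The subprocess requirement exists because cuBLAS reads CUBLAS_WORKSPACE_CONFIG only once, when the library initializes, so a test that needs a different value must set the variable and then launch a fresh interpreter. A sketch of that pattern (the environment variable name and the `:4096:8` value are real cuBLAS settings; the child's body is a stand-in for the actual test):

```python
import os
import subprocess
import sys

env = os.environ.copy()
# cuBLAS reads this only at library initialization, so it must be set
# in the environment of the process that will initialize cuBLAS.
env["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# The child process sees the new config; the current (already
# initialized) process is unaffected.
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUBLAS_WORKSPACE_CONFIG'])"],
    env=env, capture_output=True, text=True, check=True)

print(result.stdout.strip())  # :4096:8
```

In the real wrapper, the child would import torch and run the test body instead of just echoing the variable, ensuring its cuBLAS handle is created under the deterministic workspace configuration.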

@kurtamohler kurtamohler force-pushed the index-copy-nondeterministic-alert branch from 9f68aaf to c42a254 Compare December 2, 2020 21:33
@mruberry (Collaborator) left a comment:

Cool! Thanks Kurt!

@facebook-github-bot (Contributor) left a comment:

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@mruberry merged this pull request in 2cb9204.


Labels
cla signed, Merged, module: cuda (Related to torch.cuda, and CUDA support in general), open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add nondeterministic error for index_copy_ if duplicate indices are given
6 participants