
[v1.5.0] Fix handling of non-finite values in topk (#35253) #35435

Merged (1 commit) on Mar 27, 2020

Conversation

@gchanan (Contributor) commented Mar 25, 2020

Summary:
Fixes #34191

`at::native::radixSelect` essentially uses integer comparison, which imposes a defined ordering on non-finite float values. That ordering isn't compatible with IEEE float comparison, so mixing the two leads to unwritten values in the output.
Pull Request resolved: #35253

Differential Revision: D20645554

Pulled By: ezyang

fbshipit-source-id: 651bcb1742ed67086ec89cc318d862caae65b981
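To make the summary concrete: radix-select kernels compare floats through a bit transform so that unsigned-integer order matches numeric order, which also assigns fixed ranks to -inf, +inf, and NaN, while IEEE `<`/`>` treats NaN as unordered. Below is a minimal Python sketch of that contrast; the `radix_key` helper is hypothetical and only illustrates the standard float-to-unsigned transform, not the actual `at::native::radixSelect` code touched by this PR.

```python
import struct

def radix_key(x: float) -> int:
    """Hypothetical helper: map a float32 bit pattern to an unsigned int
    whose integer order matches numeric order for finite values and also
    gives -inf, +inf, and NaN fixed ranks."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:           # negative float: flip all bits
        return bits ^ 0xFFFFFFFF
    return bits | 0x80000000        # non-negative float: set the sign bit

nan, inf = float("nan"), float("inf")
values = [1.0, -inf, nan, inf, -2.5]

# Integer view: a total order, so NaN lands in a definite slot
# (above +inf for the common positive-NaN bit pattern).
print(sorted(values, key=radix_key))

# IEEE view: NaN is unordered, so every comparison involving it is False.
print(nan < inf, nan > inf, nan == nan)   # False False False
```

An element can therefore be selected under the integer ordering yet never matched by a subsequent IEEE comparison, which lines up with the unwritten output values described in the summary.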

dr-ci bot commented Mar 25, 2020

💊 CircleCI build failures summary and remediations

As of commit 71c44bf (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages (reran 3 jobs to discount flakiness):

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_test (1/4)

Step: "Test" (full log | pattern match details)

Mar 26 01:08:14 ERROR [0.045s]: test_topk_nonfinite_xla (__main__.TestTorchDeviceTypeXLA)
Mar 26 01:04:10 	PyRun_FileExFlags 
Mar 26 01:04:10 	PyRun_SimpleFileExFlags 
Mar 26 01:04:10 	Py_Main 
Mar 26 01:04:10 	main 
Mar 26 01:04:10 	__libc_start_main 
Mar 26 01:04:10 	 
Mar 26 01:04:10 *** End stack trace *** 
Mar 26 01:04:10 Negation, the `-` operator, on a bool tensor is not supported. If you are trying to invert a mask, use the `~` or `logical_not()` operator instead. 
Mar 26 01:08:14 ....s.s..s.ss.sssssss.s..s.........s...s....s.........s.............ssssssssssss...sssss.ss..s....ssss....ss.s.sssssssssssssssssssss.ssssss..ssssssssEs.ssss.s..s.........s............ss. 
Mar 26 01:08:14 ====================================================================== 
Mar 26 01:08:14 ERROR [0.045s]: test_topk_nonfinite_xla (__main__.TestTorchDeviceTypeXLA) 
Mar 26 01:08:14 ---------------------------------------------------------------------- 
Mar 26 01:08:14 Traceback (most recent call last): 
Mar 26 01:08:14   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 208, in instantiated_test 
Mar 26 01:08:14     return test(self, device_arg) 
Mar 26 01:08:14   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 12530, in test_topk_nonfinite 
Mar 26 01:08:14     self.assertEqual(val, expect, allow_inf=True) 
Mar 26 01:08:14   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 363, in assertEqual 
Mar 26 01:08:14     return DeviceTypeTestBase.assertEqual(self, x, y, prec, message, allow_inf, **kwargs) 
Mar 26 01:08:14   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 882, in assertEqual 
Mar 26 01:08:14     assertTensorsEqual(x, y) 
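As an aside on this first failure: the root error in the XLA stack above is PyTorch's bool-tensor negation restriction. A small CPU-only illustration of that message and of the operators the error recommends instead (this is ordinary PyTorch behaviour, not code from this PR):

```python
import torch

mask = torch.tensor([True, False, True])

print(~mask)               # elementwise logical NOT: tensor([False,  True, False])
print(mask.logical_not())  # equivalent to ~mask

try:
    -mask                  # unary minus on a bool tensor is not supported
except RuntimeError as err:
    print(err)             # "Negation, the `-` operator, on a bool tensor is not supported..."
```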

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (2/4)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 2 failures>

Mar 26 01:13:09 AssertionError: 11 not less than or equal to 1e-05 :
Mar 26 01:13:09 ---------------------------------------------------------------------- 
Mar 26 01:13:09 Traceback (most recent call last): 
Mar 26 01:13:09   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper 
Mar 26 01:13:09     self._join_processes(fn) 
Mar 26 01:13:09   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 285, in _join_processes 
Mar 26 01:13:09     self._check_return_codes(elapsed_time) 
Mar 26 01:13:09   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 328, in _check_return_codes 
Mar 26 01:13:09     self.assertEqual(first_process.exitcode, 0) 
Mar 26 01:13:09   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 915, in assertEqual 
Mar 26 01:13:09     super(TestCase, self).assertLessEqual(abs(x - y), prec, message) 
Mar 26 01:13:09 AssertionError: 11 not less than or equal to 1e-05 :  
Mar 26 01:13:09  
Mar 26 01:13:09 ---------------------------------------------------------------------- 
Mar 26 01:13:09 Ran 27 tests in 30.479s 
Mar 26 01:13:09  
Mar 26 01:13:09 FAILED (failures=1) 
Mar 26 01:13:09  
Mar 26 01:13:09 Generating XML reports... 
Mar 26 01:13:09 Traceback (most recent call last): 
Mar 26 01:13:09   File "test/run_test.py", line 674, in <module> 
Mar 26 01:13:09     main() 

See CircleCI build pytorch_macos_10_13_py3_test (3/4)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 2 failures>

Mar 25 18:25:23 AssertionError: 11 not less than or equal to 1e-05 :
Mar 25 18:25:23 ---------------------------------------------------------------------- 
Mar 25 18:25:23 Traceback (most recent call last): 
Mar 25 18:25:23   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper 
Mar 25 18:25:23     self._join_processes(fn) 
Mar 25 18:25:23   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 285, in _join_processes 
Mar 25 18:25:23     self._check_return_codes(elapsed_time) 
Mar 25 18:25:23   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 328, in _check_return_codes 
Mar 25 18:25:23     self.assertEqual(first_process.exitcode, 0) 
Mar 25 18:25:23   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 915, in assertEqual 
Mar 25 18:25:23     super(TestCase, self).assertLessEqual(abs(x - y), prec, message) 
Mar 25 18:25:23 AssertionError: 11 not less than or equal to 1e-05 :  
Mar 25 18:25:23  
Mar 25 18:25:23 ---------------------------------------------------------------------- 
Mar 25 18:25:23 Ran 27 tests in 34.189s 
Mar 25 18:25:23  
Mar 25 18:25:23 FAILED (failures=1) 
Mar 25 18:25:23  
Mar 25 18:25:23 Generating XML reports... 
Mar 25 18:25:24 Traceback (most recent call last): 
Mar 25 18:25:24   File "test/run_test.py", line 674, in <module> 
Mar 25 18:25:24     main() 

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (4/4)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 2 failures>

Mar 26 01:52:57 AssertionError: 11 not less than or equal to 1e-05 :
Mar 26 01:52:57 ---------------------------------------------------------------------- 
Mar 26 01:52:57 Traceback (most recent call last): 
Mar 26 01:52:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper 
Mar 26 01:52:57     self._join_processes(fn) 
Mar 26 01:52:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 285, in _join_processes 
Mar 26 01:52:57     self._check_return_codes(elapsed_time) 
Mar 26 01:52:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 328, in _check_return_codes 
Mar 26 01:52:57     self.assertEqual(first_process.exitcode, 0) 
Mar 26 01:52:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 915, in assertEqual 
Mar 26 01:52:57     super(TestCase, self).assertLessEqual(abs(x - y), prec, message) 
Mar 26 01:52:57 AssertionError: 11 not less than or equal to 1e-05 :  
Mar 26 01:52:57  
Mar 26 01:52:57 ---------------------------------------------------------------------- 
Mar 26 01:52:57 Ran 27 tests in 21.065s 
Mar 26 01:52:57  
Mar 26 01:52:57 FAILED (failures=1) 
Mar 26 01:52:57  
Mar 26 01:52:57 Generating XML reports... 
Mar 26 01:52:57 Traceback (most recent call last): 
Mar 26 01:52:57   File "test/run_test.py", line 674, in <module> 
Mar 26 01:52:57     main() 

This comment was automatically generated by Dr. CI.

@gchanan (Contributor, Author) commented Mar 26, 2020

@ailzhang @dlibenzi I'm trying to get this fix into the 1.5 release, and it looks like it's failing XLA, although the master-branch version doesn't appear to have broken it. From scanning the XLA commits I didn't see any obvious fix that went in. Any ideas?

@dlibenzi (Contributor)

> I'm trying to get this fix into the 1.5 release, and it looks like it's failing XLA, although the master-branch version doesn't appear to have broken it. From scanning the XLA commits I didn't see any obvious fix that went in. Any ideas?

pytorch/xla#1824 might fix that, though we will have to ask @jysohn23 to cherry-pick it into our release branch as well.

@gchanan (Contributor, Author) commented Mar 26, 2020

@dlibenzi thanks for the pointer. Should I go ahead and land this and then @jysohn23 can cherry-pick it on your side? I'm a little unclear on how quickly that can happen.

I guess an alternative is to set up a throwaway XLA branch to point this PR at for testing, so we have some confidence before we cherry-pick.

What do you think?

@ailzhang (Contributor)

@gchanan The XLA release branch has been updated; this PR is good to merge. (Thanks @dlibenzi and @jysohn23!)

@seemethere (Member) left a comment

I think this should be good to go; I don't think the test failures are related to the contents of this PR.

@gchanan (Contributor, Author) commented Mar 26, 2020

@dlibenzi @ailzhang it looks like this is still failing: https://dr.pytorch.org/api/view-log-full?build_id=110804382. Any other ideas?

@ailzhang (Contributor) left a comment

@gchanan All good now :D I forgot that we have separate jobs for build and test, so I had only run the test job...
