Fix type promotion for trace on CPU. #47305

Conversation
Fixes #47127. Ideally this would just use diag and sum (as the CUDA implementation does), but that seems to have performance problems, which I'll link in the GitHub PR.
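For context, a quick sketch of the diag-and-sum formulation the CUDA path uses (illustrative Python, not the actual kernel):

```python
import torch

t = torch.randn(100, 100)

# trace is the sum of the main diagonal, so a diag-and-sum
# formulation computes the same value; going through sum also
# gives the reduction type promotion (e.g. int8 -> int64) for free.
assert torch.allclose(t.trace(), t.diagonal().sum())
```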
Here are some benchmarking results for this vs. the diag-and-sum approach:

[benchmark tables not reproduced here]

So the overhead is much higher, although for big sizes it doesn't matter.
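As a rough sketch of how such a comparison could be run (the harness and shapes here are my own, using torch.utils.benchmark):

```python
import torch
from torch.utils.benchmark import Timer

# Compare the dedicated trace kernel against the diag-and-sum
# formulation: the fixed per-call overhead of diag + sum dominates
# at small sizes and washes out at large ones.
for n in (8, 64, 512, 4096):
    x = torch.randn(n, n)
    for stmt in ("x.trace()", "x.diagonal().sum()"):
        m = Timer(stmt=stmt, globals={"x": x}).blocked_autorange()
        print(f"n={n:5d}  {stmt:20s}  {m.median * 1e6:9.2f} us")
```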
```cpp
  return result;
}

Tensor trace_cpu(const Tensor& self) {
```
nit: The best place to put this might be TriangularOps.cpp (trace_cuda is in TriangularOps.cu). Alternatively, we could move trace_cuda out of TriangularOps.cu for consistency.
The problem is that you need the specialized promotion logic from here. In theory you could split things up, but that doesn't seem worth it given that the promotion logic is really specialized for reductions.
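For reference, the reduction promotion behavior in question, as sum exposes it:

```python
import torch

# Reductions like sum promote integral inputs to int64 by default,
# while floating-point inputs keep their dtype; trace needs to
# follow the same rule, hence the shared promotion logic.
print(torch.ones(3, 3, dtype=torch.uint8).sum().dtype)    # torch.int64
print(torch.ones(3, 3, dtype=torch.float32).sum().dtype)  # torch.float32
```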
```cpp
  });

  return result;
}
```
We're going to have to update test_trace:
Lines 5996 to 6029 in 774b638:

```python
def _test_trace(self, device, dtype, legacy):
    def test(shape):
        tensor = make_tensor(shape, device, dtype, low=-9, high=9)
        diag = tensor.diag()
        if legacy:
            # NB: trace on cpu doesn't do type promotion... #47127
            expected_dtype = dtype
        else:
            expected_dtype = tensor.sum().dtype
        expected_dtype = torch_to_numpy_dtype_dict[expected_dtype]
        result = np.trace(tensor.cpu().numpy(), dtype=expected_dtype)
        expected = torch.tensor(result, device=device)
        self.assertEqual(tensor.trace(), expected)

    shapes = (
        [10, 1],
        [1, 10],
        [100, 100],
        [20, 100],
        [100, 20],
    )
    for shape in shapes:
        test(shape)

@onlyCPU
@dtypes(*torch.testing.get_all_dtypes(include_complex=False, include_bool=False, include_half=False, include_bfloat16=False))
def test_trace_legacy(self, device, dtype):
    self._test_trace(device, dtype, legacy=True)

@onlyCUDA
@dtypes(*torch.testing.get_all_dtypes(include_complex=False, include_bool=False, include_bfloat16=False))
def test_trace(self, device, dtype):
    self._test_trace(device, dtype, legacy=False)
```
Should we mark this as BC-breaking? In older versions of PyTorch (< 1.5), we did not do any type promotion for trace.
Ya, I'll mark it BC-breaking.
test/test_torch.py (outdated):
```python
def test_trace(self, device, dtype):
    def test(shape):
        tensor = make_tensor(shape, device, dtype, low=-9, high=9)
        diag = tensor.diag()
```
The `diag = tensor.diag()` line is bogus (it is unused); can you remove it, please? (Sorry, I was the one who added it, but I am only noticing it now.)
test/test_torch.py (outdated):
```python
        self.assertEqual(expected, result)

def _test_trace(self, device, dtype, legacy):

@dtypes(*torch.testing.get_all_dtypes(include_complex=False, include_bool=False, include_half=False, include_bfloat16=False))
```
We've dropped testing for half on CUDA; is there a good way around this?

The dtypesIfCUDA and dtypesIfCPU decorators look like they handle this (albeit very verbosely):

```python
@dtypesIfCPU(*torch.testing.get_all_dtypes(include_complex=False, include_bool=False, include_half=False, include_bfloat16=False))
@dtypesIfCUDA(*torch.testing.get_all_dtypes(include_complex=False, include_bool=False, include_bfloat16=False))
```
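Applied to the merged test, that might look like this (a hypothetical sketch; the final signature is up to you):

```python
@dtypesIfCPU(*torch.testing.get_all_dtypes(include_complex=False, include_bool=False,
                                           include_half=False, include_bfloat16=False))
@dtypesIfCUDA(*torch.testing.get_all_dtypes(include_complex=False, include_bool=False,
                                            include_bfloat16=False))
def test_trace(self, device, dtype):
    # half is exercised only on CUDA; everything else runs on both devices
    self._test_trace(device, dtype, legacy=False)
```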
Stack from ghstack:
Fixes #47127.
Ideally this would just use diag and sum (as the CUDA implementation does), but that seems to have performance problems, which I'll link in the GitHub PR.
Differential Revision: D24729627
BC-breaking message:
Previously, the CPU variant of trace did not type promote, but the CUDA variant did. Both variants now type promote correctly:

1.7: trace on CPU returns a result in the input dtype (no promotion); trace on CUDA promotes.

1.8: trace promotes like sum on both devices.
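A minimal illustrative sketch of the difference, assuming an integral input dtype (dtypes follow the sum promotion rules discussed above):

```python
import torch

x = torch.ones(3, 3, dtype=torch.int8)

# 1.7: on CPU, trace kept the input dtype (torch.int8),
# while on CUDA it already promoted to torch.int64.

# 1.8: both devices promote the same way sum does.
print(x.trace().dtype)  # torch.int64
print(x.sum().dtype)    # torch.int64 (trace now matches sum)
```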