Migrate renorm to ATen (CPU and CUDA) #59250
Conversation
💊 CI failures summary and remediations

As of commit e4b4f8f (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages:

- pytorch_xla_linux_bionic_py3_6_clang9_test (1/1), Step: "Run tests"
```cpp
if (acc_type != dtype) {
  norm = at::linalg_vector_norm(self, p.toDouble(), reduce_dims,
                                /*keepdim=*/true, /*dtype=*/acc_type);
} else {
  norm = at::linalg_vector_norm(self, p.toDouble(), reduce_dims,
                                /*keepdim=*/true);
}
```
For complex types, the input is cast to `dtype` and the output has type `dtype`, so it has to be complex. However, if I leave the argument out entirely, the result is real. Seems like an odd way for this to work.
cc @kurtamohler for norm `dtype` behavior. Should we disallow specifying a non-complex `dtype` for complex inputs? I believe there was a warning about that at some point. But then it doesn't let users specify a different precision when the input is complex.
I agree that it's odd behavior. The explanation is that the input is converted to `dtype` before the norm is calculated, so if we allowed a non-complex `dtype` arg when the input is complex, we would be discarding the imaginary part. So at the moment we throw an error in that case: https://github.com/pytorch/pytorch/blob/afdfd2288ab/aten/src/ATen/native/LinearAlgebra.cpp#L2271
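As a quick Python-side illustration of that check (the exact error and message come from the C++ code linked above, so treat this as a sketch of the expected behavior rather than a guarantee):

```python
import torch

z = torch.randn(3, dtype=torch.complex64)

# Omitting dtype (or passing a complex dtype) is fine for complex input.
torch.linalg.vector_norm(z, ord=2)

# Requesting a non-complex dtype would silently drop the imaginary part,
# so the check linked above is expected to reject it.
try:
    torch.linalg.vector_norm(z, ord=2, dtype=torch.float32)
except RuntimeError as err:
    print("rejected:", err)
```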
Maybe the result should always be real though, even if `dtype` is complex. For complex outputs, the imaginary part is always going to be 0.
If the result were always real, then `norm(tensor, ..., dtype=dtype)` would behave just like `norm(tensor.to(dtype), ...)`. I think that's reasonable.
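For real inputs that equivalence already holds with `torch.linalg.vector_norm`, the function the new C++ code dispatches to; the open question above is whether the same rule should extend to complex inputs (where the result would be the real value type). A quick illustration with made-up tensors:

```python
import torch

x = torch.randn(3, 4)  # real input

# Passing dtype upcasts the input before the reduction...
a = torch.linalg.vector_norm(x, ord=2, dim=1, dtype=torch.float64)
# ...which matches casting first and omitting dtype.
b = torch.linalg.vector_norm(x.to(torch.float64), ord=2, dim=1)

print(torch.allclose(a, b), a.dtype == b.dtype)  # True True
```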
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```cpp
  norm = at::linalg_vector_norm(
      self, p, reduce_dims, /*keepdim=*/true);
}
auto grad_output = (self.conj() * grad).sum(
```
Out of curiosity, is discarding the imaginary part expected here?
Yes. The norm is real-valued so its gradient should be as well. In fact the gradcheck tests fail if I don't intentionally cast to real.
But `/*dtype=*/c10::toValueType(acc_type)` will always be throwing warnings when the input is converted to real, even though that's what we want. Maybe `(self.conj() * grad).real().contiguous().sum(...)`? (`.contiguous` is optional; on CPU I suspect a discontiguous reduction will be very bad, on the GPU, on the other hand, `.contiguous()` + reduction will be slower than just the reduction, and in any case people don't seem to care about renorm backward performance.)
> Maybe `(self.conj() * grad).real().contiguous().sum(...)`?

The awkward thing is that `at::real` isn't a no-op for real types. It throws an error, so you can't write a one-liner. But if this way creates a warning then I guess there's no ignoring it.
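For the complex-input case the two variants discussed above should produce the same real-valued term; only the cast-based one triggers the discard warning. A small Python sketch (the shapes and reduction dim are made up, and this uses the Python-level API rather than the ATen code in question):

```python
import torch

# Stand-ins for the tensors in the backward formula.
self_ = torch.randn(4, 5, dtype=torch.complex64)
grad = torch.randn(4, 5, dtype=torch.complex64)
reduce_dims = [1]

# Option 1: drop the imaginary part explicitly before reducing.
a = (self_.conj() * grad).real.sum(dim=reduce_dims, keepdim=True)

# Option 2: reduce in complex, then cast to the value type (float32 for
# complex64), roughly what /*dtype=*/c10::toValueType(acc_type) amounts to;
# this is the variant that warns about discarding the imaginary part.
b = (self_.conj() * grad).sum(dim=reduce_dims, keepdim=True).to(torch.float32)

print(torch.allclose(a, b))  # the values agree; only the warning differs
```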
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Resubmit of pytorch#59108, closes pytorch#24754, closes pytorch#24616

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):

| Shape | Dim | Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0 | 11.6 us | 4.2 us | 4.2 us |
| | 1 | 14.3 us | 5.2 us | 5.2 us |
| | 2 | 12.7 us | 4.6 us | 4.6 us |
| (50, 50, 50) | 0 | 330 us | 120 us | 24.4 us |
| | 1 | 350 us | 135 us | 28.2 us |
| | 2 | 417 us | 130 us | 24.4 us |

#### Benchmarks (CUDA)

| Shape | Dim | Before | After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0 | 12.5 us | 12.1 us |
| | 1 | 13.1 us | 12.2 us |
| | 2 | 13.1 us | 11.8 us |
| (50, 50, 50) | 0 | 33.7 us | 11.6 us |
| | 1 | 36.5 us | 15.8 us |
| | 2 | 41.1 us | 15 us |

Pull Request resolved: pytorch#59250
Reviewed By: mruberry
Differential Revision: D28820359
Pulled By: ngimel
fbshipit-source-id: 572486adabac8135d52a9b8700f9d145c2a4ed45
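For readers who want the idea without reading the kernels, here is a rough Python sketch of the approach in the summary (a reference illustration, not the ATen implementation; the small 1e-7 guard mirrors the historical renorm behavior and is an assumption here):

```python
import torch

def renorm_reference(x, p, dim, maxnorm):
    # Norms of every slice along `dim`, computed with vector_norm...
    reduce_dims = [d for d in range(x.dim()) if d != dim]
    norms = torch.linalg.vector_norm(x, ord=p, dim=reduce_dims, keepdim=True)
    # ...turned into per-slice normalization factors (slices already within
    # maxnorm are left untouched)...
    factors = torch.where(norms > maxnorm,
                          maxnorm / (norms + 1e-7),
                          torch.ones_like(norms))
    # ...and applied with a single broadcasted multiply.
    return x * factors

x = torch.randn(50, 50, 50)
print(torch.allclose(renorm_reference(x, p=2, dim=0, maxnorm=1.0),
                     torch.renorm(x, p=2, dim=0, maxnorm=1.0)))
```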