Fix permuted sum precision issue for lower precision on CPU #108559
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108559
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 9f491ad with merge base 3381f28. FLAKY - the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 213911d to ae9d9f8.
Force-pushed from ae9d9f8 to 1604dec.
Force-pushed from 1604dec to 9859973.
Force-pushed from 9859973 to 6f8180f.
What's the performance impact? Do we have a comparison between fp32 and bf16 with the same inputs?
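For context, one quick way to measure that would be something like the following (a sketch, not the benchmark actually run for this PR; shape and iteration count are arbitrary):

import torch
from torch.utils.benchmark import Timer

# Time a reduction over a permuted view in fp32 vs bf16 on CPU.
x32 = torch.randn(128, 128, 128).permute(2, 1, 0)
x16 = x32.to(torch.bfloat16)
for x in (x32, x16):
    t = Timer(stmt="x.sum(dim=(1, 2))", globals={"x": x})
    print(x.dtype, t.timeit(100))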
test/test_reductions.py (Outdated)

def helper(self, shape, reduce_dims, device, dtype):
    permute_list = dim_sequences[len(shape)]
    random.shuffle(permute_list)
Does it make sense to specify the permutation instead of randomizing it, so that we are sure the non-contiguous scenario always happens?
Use permutations instead of random.shuffle.
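A deterministic version of the helper could iterate over all permutations, e.g. (a sketch, assuming a float32 sum is a fair reference; the function name and tolerances are illustrative, not the PR's actual test code):

import itertools
import torch

def check_permuted_sum(shape, reduce_dims, dtype=torch.bfloat16):
    # Exercise every permutation deterministically instead of relying
    # on random.shuffle to hit the non-contiguous layout by chance.
    x = torch.randn(shape, dtype=dtype)
    for perm in itertools.permutations(range(len(shape))):
        ref = x.permute(perm).to(torch.float32).sum(dim=reduce_dims)
        out = x.permute(perm).sum(dim=reduce_dims)
        # Loose tolerances: bf16 carries roughly 3 decimal digits.
        torch.testing.assert_close(out.float(), ref, rtol=1e-2, atol=1e-2)

check_permuted_sum((32, 32, 32), reduce_dims=(0, 1))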
aten/src/ATen/native/ReduceOps.cpp (Outdated)

if (!at::isReducedFloatingType(iter.common_dtype())) {
  return false;
}
if (ndim < 2 || iter.noutputs() != 1) {
It happens with ndim >= 3, so should we check ndim < 3 here?
Thanks for the comment. Fixed.
Force-pushed from 7682040 to 8cc4a4e.
The changes LGTM now; will stamp if there's no performance impact.
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Force-pushed from 4a843e8 to 9be3b64.
Could you please also update the PR description with the root cause of the issue?
@peterbell10 Could you please help review this PR? Thanks.
@mruberry Could you please help review this PR? Thanks.
// See https://github.com/pytorch/pytorch/issues/83149
if (should_use_acc_buffer(iter)) {
  auto tmp_output = at::empty(result.sizes(), result.options().dtype(kFloat));
  at::sum_outf(self.to(ScalarType::Float), opt_dim, keepdim, /*dtype=*/c10::nullopt, tmp_output);
  // Cast the float32 accumulation back to the original reduced dtype.
  result.copy_(tmp_output);
}
Note that in my original comment I suggested adding mixed dtype kernels with half precision as the input and float as the output, like we have on CUDA. This is okay for now though I guess. It just won't perform as well.
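For reference, that float accumulation can already be requested explicitly from Python through sum's dtype argument, which is essentially what the acc-buffer fallback does internally before casting back (a usage sketch; shape and dims are arbitrary):

import torch

x = torch.randn(64, 64, 64, dtype=torch.bfloat16).permute(2, 1, 0)

# Explicit float32 accumulation; the result stays float32.
acc = x.sum(dim=(1, 2), dtype=torch.float32)

# Plain bf16 sum; with this PR it is routed through a float32
# accumulation buffer and then cast back to bf16.
out = x.sum(dim=(1, 2))
print((out.float() - acc).abs().max())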
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #83149

There is a limitation of TensorIterator reductions: a non-permuted input tensor is coalesced down to a 2-d tensor by TensorIterator, whereas the permuted case may remain a >2-d operation (for example, two reduced dimensions plus a non-reduced one). Since the CPU reduction loop of TensorIterator only operates on two dimensions at a time, each 2-d pass stores its partial sums in the lower-precision output, so subsequent passes accumulate on already-truncated values and the error compounds.
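A minimal repro of the discrepancy (shape and reduced dims are illustrative; before this fix the permuted sum drifts noticeably from a float32 reference):

import torch

torch.manual_seed(0)
x = torch.randn(128, 128, 128, dtype=torch.bfloat16)
ref = x.to(torch.float32).sum(dim=(0, 1))

# Contiguous case: coalesced to 2-d, partial sums stay in high precision.
good = x.sum(dim=(0, 1))

# Permuted case: a >2-d reduction; before this fix the partial sums
# were truncated back to bf16 between the 2-d passes.
bad = x.permute(2, 1, 0).sum(dim=(1, 2))

print((good.float() - ref).abs().max())
print((bad.float() - ref).abs().max())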