
Conversation

@CaoE (Collaborator) commented Sep 5, 2023

Fixes #83149
This stems from a limitation of TensorIterator reductions: a non-permuted input tensor is coalesced down to a 2-d tensor by TensorIterator, whereas the permuted case may remain a >2-d operation (for example, two reduced dimensions plus a non-reduced dimension). Since the CPU reduction loop of TensorIterator only operates on two dimensions at a time, the intermediate sums are truncated back to the lower precision between passes.
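For illustration, a minimal sketch of the kind of discrepancy this addresses (the shape, permutation, and reduced dims below are assumptions chosen to hit the >2-d reduction path, not the PR's actual test):

import torch

# Sum a permuted (non-contiguous) bfloat16 tensor over two dims and compare
# against a float32 reference computed on the same permuted layout.
x = torch.rand(13, 356, 257)
ref = x.permute(2, 0, 1).sum((1, 2))             # float32 reference

out = x.bfloat16().permute(2, 0, 1).sum((1, 2))  # permuted bf16 reduction

# Before this fix, the bf16 result could drift well beyond bf16 rounding error
# because intermediate sums were kept in bf16 between the 2-d reduction passes.
print((out.float() - ref).abs().max())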

@pytorch-bot bot commented Sep 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108559

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 9f491ad with merge base 3381f28:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@CaoE CaoE force-pushed the ecao/fix_reduce2 branch 2 times, most recently from 213911d to ae9d9f8 on September 6, 2023 03:46
@CaoE CaoE added the topic: not user facing, module: bfloat16, module: half, ciflow/trunk, and ciflow/mps labels Sep 6, 2023
@CaoE CaoE changed the title from "Fix non-contiguous sum precision issue for BFloat16 on CPU" to "Fix permuted sum precision issue for lower precision on CPU" Sep 6, 2023
@CaoE CaoE requested a review from mingfeima September 6, 2023 03:58
@CaoE CaoE requested a review from jgong5 September 6, 2023 13:07
@jgong5 (Collaborator) left a comment

What's the performance impact? Do we have a comparison between fp32 and bf16 with the same inputs?


def helper(self, shape, reduce_dims, device, dtype):
    permute_list = dim_sequences[len(shape)]
    random.shuffle(permute_list)
Collaborator

Does it make sense to specify the permutation instead of randomizing it, so that we can be sure the non-contiguous scenario always happens?

Collaborator Author (@CaoE)

Use permutations instead of random.shuffle
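A sketch of what that change amounts to (assumed for illustration, not the PR's exact test code): iterate over all dim permutations so the strided, non-contiguous layout is always exercised, and compare each strided reduction against a contiguous copy.

import itertools
import torch

shape = (3, 4, 5)
x = torch.rand(*shape)
for perm in itertools.permutations(range(len(shape))):
    y = x.permute(*perm).to(torch.bfloat16)
    # The strided (permuted) sum should match the same sum on a contiguous copy.
    torch.testing.assert_close(y.sum(0), y.contiguous().sum(0))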

if (!at::isReducedFloatingType(iter.common_dtype())) {
  return false;
}
if (ndim < 2 || iter.noutputs() != 1) {
Collaborator

It happens with ndim >= 3, so should this check be ndim < 3 here?

Collaborator Author (@CaoE)

Thanks for the comment. Fixed.

@jgong5 (Collaborator) left a comment

The changes LGTM now; will stamp if there is no performance impact.

@CaoE CaoE force-pushed the ecao/fix_reduce2 branch from 4b7f769 to 4a843e8 on October 7, 2023 06:24
github-actions bot (Contributor) commented Dec 6, 2023

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 6, 2023
@github-actions github-actions bot closed this Jan 5, 2024
@CaoE CaoE reopened this Feb 1, 2024
@CaoE CaoE force-pushed the ecao/fix_reduce2 branch from 4a843e8 to 9be3b64 on February 1, 2024 09:00
@mingfeima (Collaborator) left a comment

Could you please also update the PR description with the root cause of the issue?

@mingfeima mingfeima marked this pull request as ready for review February 20, 2024 05:36
@mingfeima mingfeima requested a review from mruberry as a code owner February 20, 2024 05:36
@CaoE CaoE requested a review from peterbell10 March 4, 2024 02:54
@CaoE (Collaborator, Author) commented Mar 4, 2024

@peterbell10 Could you please help review this PR? Thanks.

@CaoE (Collaborator, Author) commented Mar 4, 2024

@mruberry Could you please help review this PR? Thanks.

// See https://github.com/pytorch/pytorch/issues/83149
if (should_use_acc_buffer(iter)) {
  auto tmp_output = at::empty(result.sizes(), result.options().dtype(kFloat));
  at::sum_outf(self.to(ScalarType::Float), opt_dim, keepdim, /*dtype=*/c10::nullopt, tmp_output);
Collaborator

Note that in my original comment I suggested adding mixed dtype kernels with half precision as the input and float as the output, like we have on CUDA. This is okay for now though I guess. It just won't perform as well.
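For reference, a Python-level sketch of what the quoted acc-buffer path does (the final copy back into the low-precision result tensor is assumed, as it is not shown in the snippet above): upcast the whole input to float, reduce in float, then cast the result back, as opposed to a mixed-dtype kernel that reads bf16/fp16 and accumulates directly into an fp32 output.

import torch

def sum_with_float_acc_buffer(t, dim, keepdim=False):
    # Materialize a float32 copy of the input, reduce in float32,
    # then cast the reduced result back to the input's dtype.
    tmp = t.float().sum(dim=dim, keepdim=keepdim)
    return tmp.to(t.dtype)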

@CaoE CaoE force-pushed the ecao/fix_reduce2 branch from 9be3b64 to 9f491ad on March 5, 2024 02:20
@CaoE (Collaborator, Author) commented Mar 6, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

ciflow/mps (Run MPS tests, subset of trunk), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: bfloat16, module: half (Related to float16 half-precision floats), open source, Stale, topic: not user facing (topic category)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

bf16 strided tensor wrong calculation

6 participants