[Distributed] add pack-check method for float8_e5m2 #136115

kwen2501 · 2024-09-15T17:22:09Z

Stack from ghstack (oldest at bottom):

Add support for Float8_e5m2, following an algorithm similar to the one used for Float8_e4m3fn (i.e. an overflow check).

Made HasNanFP8x8 a template so that it is extendable based on dtype.
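
The overflow check described above can be sketched in host-side C++ roughly as follows. This is a minimal illustration and not the code in NanCheck.cu: the tag types `E4M3FN` and `E5M2`, the helper `anyByteOverflows`, and the name `HasNanPacked8` are hypothetical stand-ins. The constants follow from the bit layouts: e4m3fn encodes NaN only as S.1111.111, while e5m2 encodes NaN as exponent 11111 with a nonzero mantissa.

```cpp
#include <cstdint>

// Hypothetical tag types standing in for the real FP8 dtypes.
struct E4M3FN {};
struct E5M2 {};

// Primary template is only declared; each dtype supplies its own specialization.
template <typename T>
struct HasNanPacked8;

// Clear the 8 sign bits, add a per-byte bias, and test whether any byte
// overflowed into bit 7. The largest per-byte sum is 0x7F + 0x03 = 0x82,
// so no carry ever spills into the neighbouring byte.
inline bool anyByteOverflows(std::uint64_t pack, std::uint64_t bias) {
  const std::uint64_t magnitudes = pack & 0x7F7F7F7F7F7F7F7FULL;
  const std::uint64_t bumped = magnitudes + bias;
  return (bumped & 0x8080808080808080ULL) != 0;
}

// e4m3fn: the only NaN magnitude is 0x7F, so adding 1 overflows exactly then.
template <>
struct HasNanPacked8<E4M3FN> {
  static bool check(std::uint64_t pack) {
    return anyByteOverflows(pack, 0x0101010101010101ULL);
  }
};

// e5m2: NaN magnitudes are 0x7D..0x7F (0x7C is infinity), so adding 3
// overflows exactly for those three values.
template <>
struct HasNanPacked8<E5M2> {
  static bool check(std::uint64_t pack) {
    return anyByteOverflows(pack, 0x0303030303030303ULL);
  }
};
```

Each call inspects eight FP8 values with one mask, one add, and one test, which is the appeal of the packed form over a per-element `isnan` loop.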

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-09-15T17:22:13Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136115


✅ No Failures

As of commit c907c0c with merge base 0216936:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Skylion007 · 2024-09-15T17:23:41Z

torch/csrc/distributed/c10d/NanCheck.cu

+// We want to check 8 x FP8 simultaneously, hence this template definition.
+template<typename T>
+struct HasNanFP8x8 {
+  // I am a dumb implementation. You should never call in here, unless the check


static_assert(false) to raise a compile error if we call in here?

Or at least can we issue a warning somehow?

kwen2501 · 2024-09-15

I struggled between providing basic functionality so that user code can run without breaking (the current code) and speed, and eventually chose the former :)

But your point seems better: a compile error can force a developer to implement the template if they want to add a new data type to AT_DISPATCH_FLOATING_TYPES_AND4 below.

Hmm, it seems static_assert doesn't work well with template definitions prior to C++23. I may try =delete as suggested here.
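
For readers following the thread, the two options discussed here can be sketched as below. This is an illustration under assumptions, not the change that landed: `E5M2`, `packHasNan`, `AlwaysFalse`, and `PackChecker` are hypothetical names. Option A deletes the primary function template so that only explicit specializations are callable. Option B is the usual pre-C++23 workaround of making the `static_assert` condition depend on the template parameter.

```cpp
#include <cstdint>
#include <type_traits>

struct E5M2 {};  // hypothetical tag for a dtype that has a packed check

// Option A: deleted primary template. Calling it for a dtype without an
// explicit specialization is a compile error ("use of deleted function").
template <typename T>
bool packHasNan(std::uint64_t pack) = delete;

template <>
inline bool packHasNan<E5M2>(std::uint64_t pack) {
  const std::uint64_t bumped =
      (pack & 0x7F7F7F7F7F7F7F7FULL) + 0x0303030303030303ULL;
  return (bumped & 0x8080808080808080ULL) != 0;
}

// Option B: dependent-false static_assert. The assertion only fires when
// the primary template's member is actually instantiated.
template <typename>
struct AlwaysFalse : std::false_type {};

template <typename T>
struct PackChecker {
  static bool check(std::uint64_t) {
    static_assert(AlwaysFalse<T>::value,
                  "PackChecker has no specialization for this dtype");
    return false;
  }
};

template <>
struct PackChecker<E5M2> {
  static bool check(std::uint64_t pack) { return packHasNan<E5M2>(pack); }
};

// Both of these would fail to compile if uncommented:
//   packHasNan<float>(0);
//   PackChecker<float>::check(0);
```

Either form turns a forgotten specialization into a build failure rather than a silent slow path.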

Skylion007 · 2024-09-15

@kwen2501 You could also at least unroll the loop, but the speed gains would be minimal unless the compiler realizes to inline isnan

You could also use the self != self to check for NaN based on the dtype, but I am not sure if that's faster / more portable.

kwen2501 · 2024-09-15

> @kwen2501 You could also at least unroll the loop, but the speed gains would be minimal unless the compiler realizes to inline isnan

Yeah, we tried that. Since the final result is a reduction, i.e.
packHasNan = isnan(byte0) || isnan(byte1) ... || isnan(byte7),
the compiler does not seem quite willing to unroll the loop.

Skylion007 · 2024-09-15

@kwen2501 It doesn't like #pragma unroll?

We could also manually unroll since it's only 8 elements (as painful as that would be).
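
As a rough host-side illustration of what manual unrolling of the 8-element reduction could look like, here is a sketch under assumptions. The per-byte test `isNanE5M2` and both function names are hypothetical, and the choice of bitwise `|` over `||` is a design choice of this sketch, not something proposed in the thread.

```cpp
#include <cstdint>

// Per-byte NaN test for e5m2: magnitudes above 0x7C (infinity) are NaN.
inline bool isNanE5M2(std::uint8_t b) {
  return (b & 0x7F) > 0x7C;
}

// Loop form of the reduction: packHasNan = isnan(byte0) || ... || isnan(byte7).
inline bool packHasNanLoop(const std::uint8_t* bytes) {
  bool result = false;
  for (int i = 0; i < 8; ++i) {
    result = result || isNanE5M2(bytes[i]);
  }
  return result;
}

// Manually unrolled form. Bitwise | evaluates all eight tests without
// short-circuiting, so the expression contains no data-dependent branches.
inline bool packHasNanUnrolled(const std::uint8_t* b) {
  return isNanE5M2(b[0]) | isNanE5M2(b[1]) | isNanE5M2(b[2]) |
         isNanE5M2(b[3]) | isNanE5M2(b[4]) | isNanE5M2(b[5]) |
         isNanE5M2(b[6]) | isNanE5M2(b[7]);
}
```

Both functions return the same result for any input. Whether the unrolled form is actually faster on a GPU would need to be measured.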

ghstack-source-id: 4b11623 Pull Request resolved: #136115

kwen2501 · 2024-09-15T18:41:41Z

@pytorchbot merge

pytorchmergebot · 2024-09-15T18:43:33Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


Add support for Float8_e5m2, following similar algorithm used for Float8_e4m3fn (i.e. overflow check). Made `HasNanFP8x8` a template so that it is extendable based on dtype. Pull Request resolved: pytorch#136115 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#135891, pytorch#135961

[Distributed] add pack-check method for float8_e5m2

bee4aca

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Sep 15, 2024

Skylion007 reviewed Sep 15, 2024

Skylion007 approved these changes Sep 15, 2024

kwen2501 added a commit that referenced this pull request Sep 15, 2024

[Distributed] add pack-check method for float8_e5m2

df62c65

ghstack-source-id: 4b11623 Pull Request resolved: #136115

pytorch-bot bot added the ciflow/trunk label Sep 15, 2024

pytorchmergebot added the merging label Sep 15, 2024

pytorchmergebot added the Merged label Sep 15, 2024

pytorchmergebot closed this in d2207c5 Sep 15, 2024

pytorchmergebot removed the merging label Sep 15, 2024

github-actions bot deleted the gh/kwen2501/61/head branch October 16, 2024 02:05
