Map float8 types to uint8 for allgather #126556

Closed · wants to merge 2 commits

Conversation

@drisspg (Contributor) commented May 17, 2024

Summary

Different take on this one:
#126338

We should probably not allow this mapping for 'compute' ops, e.g. reductions.

Corresponding fp8 PR

pytorch-labs/float8_experimental#263
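
For context, a minimal usage sketch of what this mapping enables (illustrative only, not code from this PR; the process-group setup and the choice of fp8 dtype are assumptions): an fp8 tensor can be passed to all_gather directly, without the previous manual uint8 view.

# Illustrative sketch, not from this PR: all_gather on an fp8 tensor once the
# float8 -> uint8 NCCL mapping is in place. Assumes an NCCL process group has
# already been initialized elsewhere.
import torch
import torch.distributed as dist

def allgather_fp8(t: torch.Tensor) -> list[torch.Tensor]:
    # Any float8 dtype behaves the same here; e4m3fn / e5m2 are just examples.
    assert t.dtype in (torch.float8_e4m3fn, torch.float8_e5m2)
    out = [torch.empty_like(t) for _ in range(dist.get_world_size())]
    # all_gather only moves bytes, so shipping fp8 as ncclUint8 is safe;
    # no arithmetic is performed on the values.
    dist.all_gather(out, t)
    return out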

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot bot commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126556

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4000694 with merge base 6bcf156:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels May 17, 2024
@@ -64,6 +63,10 @@ std::map<at::ScalarType, ncclDataType_t> ncclDataType = {
{at::kLong, ncclInt64},
{at::kHalf, ncclHalf},
{at::kBool, ncclUint8},
{at::kFloat8_e5m2, ncclUint8},
Contributor

I think this is reasonable! This also means we can get rid of the uint8 view when communicating fp8 format?

Contributor Author

Yup, did that here and all dtensor tests passed: pytorch-labs/float8_experimental#263

{at::kFloat8_e5m2, ncclUint8},
{at::kFloat8_e4m3fn, ncclUint8},
{at::kFloat8_e4m3fnuz, ncclUint8},
{at::kFloat8_e5m2fnuz, ncclUint8},
Contributor

Wondering what the concerns are with running a reduction op on uint8 directly?

Contributor Author

I guess it depends on how nccl_reduce works: if only the comms are in uint8 but the actual computation is done in fp8, then it should be fine. That being said, fp8 addition without a scale is not accurate, so maybe this just isn't a problem.

Contributor

I think the point is that nccl_reduction would perform both communication and reduction together when running allreduce or reduce_scatter. So if this is not accurate for fp8, then we should throw an error when hitting a reduction with an fp8 dtype; otherwise users would hit silent correctness issues.

Contributor Author

Yeah, okay, this is what I thought would happen and is incorrect for fp8; I'll add a TORCH_CHECK to the reduce ops.
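
To make the accuracy concern concrete, here is a rough illustration (not from this PR; the element count and values are arbitrary assumptions) of how an unscaled fp8 accumulation drifts:

# Illustrative only: simulate a reduction whose running accumulator is kept in
# fp8 (re-quantized after every add). e4m3 has so few mantissa bits that the
# small increments are eventually swallowed and the running sum stalls.
import torch

x = torch.full((1024,), 0.01)
ref = x.sum()  # fp32 reduction: ~10.24
acc = torch.tensor(0.0).to(torch.float8_e4m3fn)
for v in x.to(torch.float8_e4m3fn):
    acc = (acc.float() + v.float()).to(torch.float8_e4m3fn)
print(ref.item(), acc.float().item())  # the fp8 running sum ends up far below the fp32 result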

@drisspg (Contributor Author) commented May 17, 2024

@wanchaol what is a good place to add tests for this?

@wanchaol (Contributor)

@wanchaol what is a good place to add tests for this?

I think you can add tests in this file https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py
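
For reference, a rough sketch of what such a test could look like (names, sizes, and structure are illustrative assumptions, not the test actually added in this PR):

# Sketch in the spirit of test/distributed/test_c10d_nccl.py; the test-class
# scaffolding (self.rank, self.world_size, NCCL process-group setup) is assumed
# to exist as it does in that file.
import torch
import torch.distributed as dist

def _test_float8_allgather_and_reduce(self):
    device = torch.device("cuda", self.rank)
    t = torch.rand(16, device=device).to(torch.float8_e5m2)
    out = [torch.empty_like(t) for _ in range(self.world_size)]
    dist.all_gather(out, t)  # pure data movement: expected to work via the uint8 mapping
    with self.assertRaises(RuntimeError):
        dist.all_reduce(t.clone())  # reductions on fp8 are expected to be rejected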

@drisspg drisspg requested a review from wanchaol May 17, 2024 22:54
@wanchaol (Contributor) left a comment

Thanks, this sounds good to me! Some more comments about reduce_scatter, otherwise LGTM :)

@@ -3039,6 +3042,9 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::allreduce_sparse(
const AllreduceOptions& opts) {
TORCH_CHECK(tensors.size() == 1, MULTI_DEVICE_ERROR_MSG);
auto tensor = tensors.back();
TORCH_CHECK(
!isFloat8Type(tensor.scalar_type()),
Contributor

shall we also add some checks to reduce_scatter?

Contributor Author

ahh yeah, will add
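
A small sketch of the user-visible effect once the same guard covers reduce_scatter (illustrative assumption; the exact error type and message are not taken from this PR):

# Illustrative only: with the guard in place, reduce_scatter over fp8 inputs
# should raise rather than silently producing an inaccurate fp8 reduction.
import torch
import torch.distributed as dist

def demo_reduce_scatter_guard(rank: int, world_size: int) -> None:
    device = torch.device("cuda", rank)
    inputs = [torch.rand(8, device=device).to(torch.float8_e5m2) for _ in range(world_size)]
    output = torch.empty(8, device=device, dtype=torch.float8_e5m2)
    try:
        dist.reduce_scatter(output, inputs)
    except RuntimeError as e:
        print("rejected as expected:", e)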

@drisspg (Contributor Author) commented May 17, 2024

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label May 17, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

facebook-github-bot pushed a commit to pytorch-labs/float8_experimental that referenced this pull request May 18, 2024
Summary:
Coupled with this: pytorch/pytorch#126556
test everything is passing

Pull Request resolved: #263

Reviewed By: wanchaol

Differential Revision: D57505783

Pulled By: drisspg

fbshipit-source-id: cd928420f559839c63d79bfe7558416fbcfe1d69
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
# Summary
Different take on this one:
pytorch#126338

We should probably not allow this mapping for 'compute' ops, e.g. reductions.

### Corresponding fp8 PR
pytorch-labs/float8_experimental#263

Pull Request resolved: pytorch#126556
Approved by: https://github.com/wanchaol
Labels
ciflow/trunk · Merged · oncall: distributed · release notes: distributed (c10d)