Add Bfloat16 scalar support to gloo backend #113557

Closed · wants to merge 2 commits

Conversation

@anko-intel (Contributor) commented Nov 13, 2023

Support for bfloat16 scalars was missing. When I use the gloo backend,
torch.distributed.init_process_group(backend='gloo')
and run
torch.nn.parallel.DistributedDataParallel(model)
with a model that has bfloat16 features, I receive the following error:
RuntimeError: Invalid scalar type

This change fixes the issue.
c10::BFloat16 defines conversions from/to float, so the calculations for bfloat16 are performed in float.
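
For context, DDP broadcasts module parameters and buffers at construction and allreduces gradients during the backward pass, so an unsupported scalar type in the gloo process group surfaces there. A minimal reproduction along the lines of the description might look like the sketch below; the two-process setup, model shape, and rendezvous port are assumptions, not details taken from this PR.

```python
# Minimal repro sketch (assumptions: 2 CPU processes, a tiny Linear model,
# localhost rendezvous on port 29500). Before this change, the bfloat16
# communication inside DDP over gloo raised "RuntimeError: Invalid scalar type".
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Model cast to bfloat16 so DDP has to broadcast and allreduce bfloat16 tensors.
    model = torch.nn.Linear(8, 8).to(torch.bfloat16)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)

    out = ddp_model(torch.randn(4, 8, dtype=torch.bfloat16))
    out.sum().backward()  # gradient allreduce over the gloo process group

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```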

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu

Bfloat16 uses c10::BFloat16, which defines conversions
from/to float, so calculations are made on floats.
@pytorch-bot bot added the "release notes: distributed (c10d)" label (release notes category) on Nov 13, 2023

pytorch-bot bot commented Nov 13, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113557

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit abb0a4c with merge base 115da02:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Nov 13, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@facebook-github-bot (Contributor):

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@XilunWu (Contributor) left a comment

Thanks for adding BFloat16 support to gloo PG. I'm not sure if this is the best way to support BFloat16 because we already have float16 support in gloo (Half, i.e., gloo::float16).

Review comment on torch/csrc/distributed/c10d/ProcessGroupGloo.cpp (outdated, resolved)
@jgong5 (Collaborator) commented Nov 14, 2023

Thanks for adding BFloat16 support to gloo PG. I'm not sure if this is the best way to support BFloat16 because we already have float16 support in gloo (Half, i.e., gloo::float16).

@XilunWu are you suggesting adding a gloo::bfloat16 type to gloo instead of using c10::BFloat16?

@jgong5 (Collaborator) left a comment

Please add UT.

@XilunWu (Contributor) commented Nov 14, 2023

@jgong5 That seems to be a reasonable call, right? But to quickly unblock, we can merge this PR once the Windows part is fixed, and add bfloat16 to gloo later.

@jgong5 (Collaborator) commented Nov 14, 2023

@jgong5 That seems to be a reasonable call, right? But to quickly unblock, we can merge this PR once the Windows part is fixed, and add bfloat16 to gloo later.

Yes, that sounds reasonable.

@XilunWu (Contributor) commented Nov 14, 2023

Just a reminder that the Windows part has an issue and needs to be fixed.

Fix Windows and add unit tests for bfloat.
@anko-intel (Contributor, Author) commented:

Please add UT.

I added two test cases for bfloat16. I think this will be enough and will not increase the testing time too much.
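
For illustration, a bfloat16 collective test over gloo could look roughly like the sketch below; the test name, tensor shape, and the choice of all_reduce are assumptions and not the actual test cases added by this PR.

```python
# Hedged sketch of a bfloat16 allreduce test over gloo; names and structure
# are illustrative only, not the tests added by this PR.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _allreduce_bfloat16(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Each rank contributes (rank + 1); with 2 ranks the elementwise sum is 3,
    # which is exactly representable in bfloat16.
    t = torch.full((4,), rank + 1, dtype=torch.bfloat16)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert torch.equal(t, torch.full((4,), 3, dtype=torch.bfloat16))

    dist.destroy_process_group()


def test_allreduce_bfloat16():
    mp.spawn(_allreduce_bfloat16, args=(2,), nprocs=2)
```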

@janeyx99 added the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) on Nov 15, 2023
@jgong5 (Collaborator) left a comment

LGTM as long as CI passes.

@anko-intel (Contributor, Author) commented:

@XilunWu can we go forward with this change?

@janeyx99 added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Nov 17, 2023
@facebook-github-bot (Contributor):

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@XilunWu (Contributor) left a comment

LGTM! Really appreciate the effort of adding BFloat16 scalar support!

@XilunWu (Contributor) commented Nov 17, 2023

@pytorchbot merge

@pytorch-bot bot added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) on Nov 17, 2023
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.
