Conversation

hjm-aws (Collaborator) commented Jan 5, 2022

Hi,

I wasn't able to test this on CPU or TPU. I believe all-gather is not lowered on CPU yet, and I am not sure how to acquire a TPU instance to run this. If you can point me to the instructions, it would be much appreciated!

For issue #3138

hjm-aws (Collaborator, Author) commented Jan 5, 2022

The build failure is due to pytorch commit pytorch/pytorch@36db501#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991L9048. I am not sure how to make my PR depend on an earlier pytorch commit.

JackCaoG self-requested a review January 5, 2022 18:13
miladm linked an issue Jan 5, 2022 that may be closed by this pull request
JackCaoG (Collaborator) commented Jan 5, 2022

@hjm-aws FYI, adding a pin like https://github.com/pytorch/xla/pull/3208/files#diff-e71f55b9deb410fad9d5945c4c30878a93369434520520987ac03c59bfa5280d can help you pin the upstream PyTorch PR on CircleCI.
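Assuming the pin follows the usual .torch_pin convention, it is a one-line file at the repo root containing the upstream PyTorch PR reference (a sketch; the PR number below is just a placeholder, the exact format is in the linked example):

    #70123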

JackCaoG (Collaborator) left a comment

Mostly LGTM; I want to get some clarification on the test change.

    print(
-        'Default device {} does not support replication'.format(device),
-        file=sys.stderr)
+        'Default device {} is not a TPU device'.format(device), file=sys.stderr)
Collaborator

Is native all-gather not supported on xla:GPU?

Collaborator Author

I am not sure. Do you have instructions on how to run tests on GPU?

Collaborator

If you rebase this PR, CircleCI should build and run the test on CPU and GPU. I think GPU should work.
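For a local GPU run, something along these lines usually works (a sketch; GPU_NUM_DEVICES and the test path follow the usual xla test conventions and are assumptions here):

    GPU_NUM_DEVICES=2 python test/test_mp_all_gather.py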

Collaborator Author

Done. Sorry for the late response!


    world_size = xm.xrt_world_size()
-   if world_size > 1:
+   if xm.xla_device_hw(device) == 'TPU':
Collaborator

ditto, I think GPU should also be supported.
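i.e., something along the lines of the following (just a sketch of the suggested relaxation, not the final code):

    if xm.xla_device_hw(device) in ('TPU', 'GPU'):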

JackCaoG (Collaborator)
@hjm-aws Can you rebase this PR and remove the TPU hard check? CircleCI should run this time for CPU and GPU.

JackCaoG (Collaborator)
@hjm-aws Thanks. The commit history seems a bit messed up; it brings in commits that were already merged to master.

hjm-aws (Collaborator, Author) commented Jan 17, 2022

@JackCaoG Yes, I thought it was normal (this is the first PR I created). What I did was the following:

  1. Created a fork: https://github.com/hjm-aws/xla/tree/master.
  2. Created a branch on the fork: https://github.com/hjm-aws/xla/tree/all_gather.
  3. Added https://github.com/pytorch/xla as an upstream remote.
  4. Created this PR from https://github.com/hjm-aws/xla/tree/all_gather.
  5. After seeing your rebase request, first in https://github.com/hjm-aws/xla/tree/master I did git pull upstream master. Then in https://github.com/hjm-aws/xla/tree/all_gather I did git rebase master, git pull --rebase, and git push (commands sketched below).
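In commands, roughly (a sketch of what I ran; it assumes upstream points at https://github.com/pytorch/xla and origin at my fork):

    git checkout master
    git pull upstream master
    git checkout all_gather
    git rebase master
    git pull --rebase
    git push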

What is the proper way to rebase this PR?

hjm-aws (Collaborator, Author) commented Jan 17, 2022

@JackCaoG OK, it seems I have cleaned up the commit history. The only garbage left is in the conversation history.

JackCaoG (Collaborator) left a comment

Thanks @hjm-aws!

Successfully merging this pull request may close these issues:

[RFC] Exposing additional XLA collective communication primitives.