Implement all_gather with native XLA primitive. #3275
Conversation
The build failure is due to pytorch commit pytorch/pytorch@36db501#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991L9048. I am not sure how to make my PR depend on an earlier pytorch commit.
@hjm-aws FYI, adding a pin like https://github.com/pytorch/xla/pull/3208/files#diff-e71f55b9deb410fad9d5945c4c30878a93369434520520987ac03c59bfa5280d can help you pin the upstream pytorch PR on CircleCI.
Mostly LGTM, want to get some clarification on the test change.
test/test_mp_all_gather.py
Outdated
      print(
-         'Default device {} does not support replication'.format(device),
-         file=sys.stderr)
+         'Default device {} is not a TPU device'.format(device), file=sys.stderr)
Is native all-gather not supported on xla:GPU?
I am not sure. Do you have instructions on how to run tests on GPU?
If you rebase this PR, CircleCI should build and run the test on CPU and GPU. I think GPU should work.
Done. Sorry for the late response!
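For context, the check under discussion guards a multiprocessing test that roughly follows the shape below. This is a minimal sketch, not the actual file contents of test/test_mp_all_gather.py; the expected-value check is an assumption about what the test verifies, and the printed message mirrors the diff above.

import sys
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
  device = xm.xla_device()
  if xm.xla_device_hw(device) == 'TPU':
    world_size = xm.xrt_world_size()
    # Each replica contributes its ordinal; the gathered result should
    # contain the ordinals of all replicas concatenated along dim 0.
    ordinal = torch.tensor([index], dtype=torch.float).to(device)
    result = xm.all_gather(ordinal, dim=0).cpu()
    expected = torch.arange(0, world_size, dtype=torch.float)
    if not result.allclose(expected):
      print('all_gather returned an unexpected result', file=sys.stderr)
      sys.exit(1)
  else:
    print(
        'Default device {} is not a TPU device'.format(device),
        file=sys.stderr)


if __name__ == '__main__':
  xmp.spawn(_mp_fn, args=())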
test/test_mp_distributed_mm.py
Outdated
      world_size = xm.xrt_world_size()
-     if world_size > 1:
+     if xm.xla_device_hw(device) == 'TPU':
Ditto, I think GPU should also be supported.
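If GPU does turn out to work, one way to act on this suggestion would be to widen the hardware check instead of hard-coding TPU. A hypothetical variant, not part of this PR's diff:

import sys
import torch_xla.core.xla_model as xm

device = xm.xla_device()
# Hypothetical relaxed guard: accept TPU or GPU instead of TPU only,
# and skip the test body on any other hardware.
if xm.xla_device_hw(device) not in ('TPU', 'GPU'):
  print(
      'Default device {} is not a TPU or GPU device'.format(device),
      file=sys.stderr)
  sys.exit(0)
# ... the test body would run here on supported hardware ...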
@hjm-aws Can you rebase this PR and remove the
@hjm-aws Thanks. The commit history seems a bit messed up; it brings in commits that were already merged to master.
@JackCaoG Yes, I thought it was normal (this is the first PR I created). What I did was the following:
What is the proper way to rebase this PR?
@JackCaoG OK, it seems I cleaned up the commit history. The only garbage left is in the conversation history.
Thanks @hjm-aws!
Hi,
I wasn't able to test this on CPU or TPU. I believe all-gather is not lowered on CPU yet, and I am not sure how to acquire a TPU instance to run this. If you can point me to the instructions, it would be much appreciated!
For issue #3138
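For reference, the Python entry point that this change backs with a native XLA all-gather is xm.all_gather. Below is a minimal sketch of its semantics, assuming it runs inside a replica process launched via xmp.spawn on supported hardware; the shapes and values are illustrative only.

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
# Each replica contributes a (2, 3) tensor filled with its ordinal.
value = torch.ones(2, 3, device=device) * xm.get_ordinal()
# With N replicas, the result has shape (2 * N, 3): the per-replica
# tensors concatenated along dim 0, ordered by replica ordinal.
gathered = xm.all_gather(value, dim=0)
print(gathered.shape, gathered.cpu())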