Conversation

hjm-aws (Collaborator) commented Jan 5, 2022

Hi,

I wasn't able to test this on CPU or TPU. I believe all-gather is not lowered on CPU yet, and I am not sure how to acquire a TPU instance to run this. If you can point me to the instructions, it would be much appreciated!

For issue #3138

hjm-aws (Collaborator, Author) commented Jan 5, 2022

The build failure is due to pytorch commit pytorch/pytorch@36db501#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991L9048. I am not sure how to make my PR depend on an earlier pytorch commit.

JackCaoG self-requested a review January 5, 2022 18:13
miladm linked an issue Jan 5, 2022 that may be closed by this pull request
JackCaoG (Collaborator) commented Jan 5, 2022

@hjm-aws FYI, adding a pin like https://github.com/pytorch/xla/pull/3208/files#diff-e71f55b9deb410fad9d5945c4c30878a93369434520520987ac03c59bfa5280d can help you pin the upstream PyTorch PR on CircleCI.
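Assuming the pin follows the usual .torch_pin convention, it is a one-line file at the repo root containing the upstream PyTorch PR reference (a sketch; the PR number below is just a placeholder, the exact format is in the linked example):

    #70123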

JackCaoG (Collaborator) left a comment

Mostly LGTM; I want to get some clarification on the test change.

    print(
-        'Default device {} does not support replication'.format(device),
-        file=sys.stderr)
+        'Default device {} is not a TPU device'.format(device), file=sys.stderr)
Collaborator

Is native all-gather not supported on xla:GPU?

Collaborator Author

I am not sure. Do you have instructions on how to run tests on GPU?

Collaborator

If you rebase this PR, CircleCI should build and run the test on CPU and GPU. I think GPU should work.
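For a local GPU run, something along these lines usually works (a sketch; GPU_NUM_DEVICES and the test path follow the usual xla test conventions and are assumptions here):

    GPU_NUM_DEVICES=2 python test/test_mp_all_gather.py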

Collaborator Author

Done. Sorry for the late response!


    world_size = xm.xrt_world_size()
-   if world_size > 1:
+   if xm.xla_device_hw(device) == 'TPU':
Collaborator

ditto, I think GPU should also be supported.
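i.e., something along the lines of the following (just a sketch of the suggested relaxation, not the final code):

    if xm.xla_device_hw(device) in ('TPU', 'GPU'):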

JackCaoG (Collaborator)
@hjm-aws Can you rebase this PR and remove the TPU hard check? CircleCI should run this time for CPU and GPU.

JackCaoG (Collaborator)
@hjm-aws Thanks. The commit history seems a bit messed up; it brings in commits that were already merged to master.

hjm-aws (Collaborator, Author) commented Jan 17, 2022

@JackCaoG Yes, I thought it was normal (this is the first PR I created). What I did was the following:

  1. Created a fork: https://github.com/hjm-aws/xla/tree/master.
  2. Created a branch on the fork: https://github.com/hjm-aws/xla/tree/all_gather.
  3. Added https://github.com/pytorch/xla as an upstream remote.
  4. Created this PR from https://github.com/hjm-aws/xla/tree/all_gather.
  5. After seeing your rebase request, first in https://github.com/hjm-aws/xla/tree/master I did git pull upstream master. Then in https://github.com/hjm-aws/xla/tree/all_gather I did git rebase master, git pull --rebase, and git push (commands sketched below).
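In commands, roughly (a sketch of what I ran; it assumes upstream points at https://github.com/pytorch/xla and origin at my fork):

    git checkout master
    git pull upstream master
    git checkout all_gather
    git rebase master
    git pull --rebase
    git push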

What is the proper way to rebase this PR?

hjm-aws (Collaborator, Author) commented Jan 17, 2022

@JackCaoG OK, it seems I have cleaned up the commit history. The only garbage left is in the conversation history.

JackCaoG (Collaborator) left a comment

Thanks @hjm-aws!

Successfully merging this pull request may close these issues:

[RFC] Exposing additional XLA collective communication primitives.