Conversation

@JackCaoG (Collaborator) commented Feb 28, 2023

My goal is to make API_GUIDE a good entry point for any first-time pytorch/xla user. We can put more technical details in other docs, but API_GUIDE should give the big picture.

Rendered version: https://github.com/pytorch/xla/blob/JackCaoG/update_API_GUIDE/API_GUIDE.md#running-on-multiple-xla-devices-with-multi-processing.

@JackCaoG JackCaoG force-pushed the JackCaoG/update_API_GUIDE branch from 4f328bf to d56a9fa Compare February 28, 2023 04:00
@cowanmeg (Collaborator) left a comment


Left some small comments and edits.

API_GUIDE.md Outdated
### Running on Multiple XLA Hosts
Multi-host setups differ significantly across accelerators. This doc covers the device-independent parts of multi-host training and uses the TPU + PJRT runtime (currently available in the 1.13 and 2.x releases) as an example.

Let's assume you have the MNIST example from the section above in a `train_mnist_xla.py`. For single-host multi-device training, you would run it like
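The launch command itself is not shown in this excerpt; a minimal sketch, assuming the PJRT runtime on TPU as the quoted doc describes (`PJRT_DEVICE` is the runtime's device selector; `train_mnist_xla.py` is the script named above):

```shell
# Select the PJRT TPU runtime and launch the training script on this host;
# torch_xla will spawn one process per local device.
PJRT_DEVICE=TPU python3 train_mnist_xla.py
```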
Collaborator

I think you should first show a smoke test, e.g. just dumping the real device IDs from each host with `python -c`, before copying a script around.
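The suggested smoke test might look something like this. This is a sketch, not from the PR: `$TPU_NAME` and the zone are placeholders, and it assumes the `torch_xla.core.xla_model.get_xla_supported_devices()` helper is available on each worker:

```shell
# Hypothetical pod smoke test: print the XLA devices visible on every host
# before distributing the real training script. TPU_NAME/zone are placeholders.
gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=us-central2-b --worker=all \
  --command='PJRT_DEVICE=TPU python3 -c "
import torch_xla.core.xla_model as xm
print(xm.get_xla_supported_devices())"'
```

If every worker prints its device list, the runtime is wired up and the actual training script can be copied out and launched the same way.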

Collaborator Author

ok, I can add that

Collaborator Author

I am having some trouble creating a pod to play with these commands. For now I will leave this doc as "how it should work" and direct users to the pod user guide. The major problem is that the user guide still uses XRT.

@alanwaketan (Collaborator) left a comment

LGTM.

@JackCaoG JackCaoG force-pushed the JackCaoG/update_API_GUIDE branch from 50922fe to 21fef5f Compare March 3, 2023 22:46
@JackCaoG JackCaoG merged commit a6d186f into master Mar 6, 2023
JackCaoG added a commit that referenced this pull request Mar 10, 2023
…4706)

* Update API GUIDE to include multi host training and add some colors

* address review comments
JackCaoG added a commit that referenced this pull request Mar 10, 2023
* Update API GUIDE to include multi host training and add some colors (#4706)

* Update API GUIDE to include multi host training and add some colors

* address review comments

* Update README (#4734)

* Update README

* update user guide section title

* Add public readme for torchdynamo (#4744)

* Add public readme for torchdynamo

* Update index file
mateuszlewko pushed a commit that referenced this pull request Mar 15, 2023
…4706)

* Update API GUIDE to include multi host training and add some colors

* address review comments


6 participants