Update API GUIDE to include multi host training and add some colors #4706
4f328bf to d56a9fa
  
Left some small comments and edits.
        
          
API_GUIDE.md (outdated)
          
        
### Running on Multiple XLA Hosts

Multi-host setup for different accelerators can be very different. This doc will talk about the device independent bits of multi-host training and will use the TPU + PJRT runtime(currently available on 1.13 and 2.x releases) as an example.

Let's assume you have the above mnist example from above section in a `train_mnist_xla.py`. If it is a single host multi device training, you would run it like
I think you should first show a smoke test, e.g. just dumping the real device IDs from each host with `python -c`, before copying a script around.
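A minimal sketch of what such a smoke test could look like, not the version that ended up in the guide. It assumes `torch_xla` is installed on each host and that `xm.get_xla_supported_devices()` is available in the installed release; `list_xla_devices` and `format_device_report` are hypothetical helper names used only for illustration.

```python
import socket


def list_xla_devices():
    """Return the XLA device strings visible to this host.

    Assumption: torch_xla exposes xm.get_xla_supported_devices().
    Returns an empty list if torch_xla is not importable.
    """
    try:
        import torch_xla.core.xla_model as xm
    except ImportError:
        return []
    # Some releases return None when no device is found, so normalize to a list.
    return list(xm.get_xla_supported_devices() or [])


def format_device_report(hostname, devices):
    """Build a one-line, per-host summary that is easy to compare across hosts."""
    if not devices:
        return f"{hostname}: no XLA devices found"
    return f"{hostname}: {len(devices)} device(s): {', '.join(devices)}"


if __name__ == "__main__":
    print(format_device_report(socket.gethostname(), list_xla_devices()))
```

Running this on every host before launching training gives one line per host, so a host that sees no devices stands out immediately. The same check can be collapsed into a `python -c` one-liner that imports `torch_xla.core.xla_model` and prints the device list.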
ok, I can add that
I am having some trouble creating a pod to play with these commands. For now I will leave this doc as "How it should work" and direct users to the pod user guide. The major problem now is that the user guide still uses XRT.
LGTM.
50922fe to 21fef5f
  
    …4706) * Update API GUIDE to include multi host training and add some colors * address review comments
* Update API GUIDE to include multi host training and add some colors (#4706)
  * Update API GUIDE to include multi host training and add some colors
  * address review comments
* Update README (#4734)
  * Update README
  * update user guide section title
* Add public readme for torchdynamo (#4744)
  * Add public readme for torchdynamo
  * Update index file
My goal is to make `API_GUIDE` a good entry point for any first-time PyTorch/XLA user. We can put more technical details in other docs, but `API_GUIDE` should give the big picture.

Rendered version: https://github.com/pytorch/xla/blob/JackCaoG/update_API_GUIDE/API_GUIDE.md#running-on-multiple-xla-devices-with-multi-processing