Commit cf7995d

Jack cao g/r20 backportdoc (#4752)
* Update API GUIDE to include multi host training and add some colors (#4706)
* Update API GUIDE to include multi host training and add some colors
* address review comments
* Update README (#4734)
* Update README
* update user guide section title
* Add public readme for torchdynamo (#4744)
* Add public readme for torchdynamo
* Update index file
1 parent 8c7394c commit cf7995d

File tree

4 files changed: +152 −18 lines changed


API_GUIDE.md

Lines changed: 40 additions & 5 deletions
@@ -135,11 +135,19 @@ if __name__ == '__main__':
 ```
 
 There are three differences between this multi-device snippet and the previous
-single device snippet:
-
-- `xmp.spawn()` creates the processes that each run an XLA device.
-- `MpDeviceLoader` loads the training data onto each device.
-- `xm.optimizer_step(optimizer)` consolidates the gradients between cores and issues the XLA device step computation.
+single device snippet. Let's go over them one by one.
+
+- `xmp.spawn()`
+  - Creates the processes that each run an XLA device.
+  - Each process can only access the device assigned to it. For example, on a TPU v4-8 there will be 4 processes spawned, each owning one TPU device.
+  - Note that if you print `xm.xla_device()` in each process you will see `xla:0` everywhere. This is because each process can only see one device; it does not mean multi-processing is not functioning. The only exception is the PJRT runtime on TPU v2 and TPU v3, where there are `#devices/2` processes and each process has 2 threads (check this [doc](https://github.com/pytorch/xla/blob/master/docs/pjrt.md#tpus-v2v3-vs-v4) for more details).
+- `MpDeviceLoader`
+  - Loads the training data onto each device.
+  - `MpDeviceLoader` can wrap a torch dataloader. It preloads the data to the device and overlaps the data loading with device execution to improve performance.
+  - `MpDeviceLoader` also calls `xm.mark_step` for you after every `batches_per_execution` (default: 1) batches it yields.
+- `xm.optimizer_step(optimizer)`
+  - Consolidates the gradients between devices and issues the XLA device step computation.
+  - It is essentially `all_reduce_gradients` + `optimizer.step()` + `mark_step`, and it returns the reduced loss.
 
 The model definition, optimizer definition and training loop remain the same.
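To make the three APIs in the hunk above concrete, here is a minimal, hedged sketch of a multi-process training script (not part of this commit): the toy linear model, random tensor dataset, and hyperparameters are illustrative assumptions; only the `xmp.spawn` / `MpDeviceLoader` / `xm.optimizer_step` flow follows the documented pattern.

```python
# Illustrative sketch only -- the model, data, and hyperparameters are made up.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()  # each spawned process sees exactly one XLA device
    model = torch.nn.Linear(10, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    dataset = torch.utils.data.TensorDataset(
        torch.randn(64, 10), torch.randint(0, 2, (64,)))
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)
    # MpDeviceLoader wraps the torch DataLoader, preloads batches onto the
    # device, and calls xm.mark_step every `batches_per_execution` batches.
    mp_loader = pl.MpDeviceLoader(loader, device)

    for data, target in mp_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        # consolidates gradients across devices, then steps the optimizer
        xm.optimizer_step(optimizer)


if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=())
```

Launched with `PJRT_DEVICE=TPU python3 <script>.py`, this would start the worker processes described above and keep the replicas in sync through `xm.optimizer_step`.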

@@ -152,6 +160,33 @@ See the
 [full multiprocessing example](https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py)
 for more on training a network on multiple XLA devices with multi-processing.
 
+### Running on TPU Pods
+Multi-host setups differ from accelerator to accelerator. This doc covers the device-independent parts of multi-host training and uses TPU + PJRT runtime (currently available on the 1.13 and 2.x releases) as an example.
+
+Before you begin, please take a look at our user guide [here](https://cloud.google.com/tpu/docs/run-calculation-pytorch), which explains Google Cloud basics such as how to use the `gcloud` command and how to set up your project. You can also check [here](https://cloud.google.com/tpu/docs/how-to) for all Cloud TPU how-to guides. This doc focuses on the PyTorch/XLA side of the setup.
+
+Let's assume you have the mnist example from the above section in `train_mnist_xla.py`. For single-host multi-device training, you would ssh to the TPU VM and run a command like
+
+```
+PJRT_DEVICE=TPU python3 train_mnist_xla.py
+```
+
+Now, in order to run the same model on a TPU v4-16 (which has 2 hosts, each with 4 TPU devices), you will need to
+- Make sure each host can access the training script and training data. This is usually done by using the `gcloud scp` or `gcloud ssh` command to copy the training script to all hosts.
+- Run the same training command on all hosts at the same time.
+
+```
+gcloud alpha compute tpus tpu-vm ssh $USER-pjrt --zone=$ZONE --project=$PROJECT --worker=all --command="PJRT_DEVICE=TPU python3 train_mnist_xla.py"
+```
+
+The above `gcloud ssh` command will ssh to all hosts in the TPU VM pod and run the same command at the same time.
+
+> **NOTE:** You need to run the above `gcloud` command outside of the TPU VM.
+
+The model code and training script are the same for multi-process training and multi-host training. PyTorch/XLA and the underlying infrastructure will make sure each device is aware of the global topology and of its local and global ordinal. Cross-device communication will happen across all devices instead of only local devices.
+
+For more details about the PJRT runtime and how to run it on a pod, please refer to this [doc](https://github.com/pytorch/xla/blob/master/docs/pjrt.md#tpu). For more information about PyTorch/XLA on TPU pods and a complete guide to running a resnet50 with fake data on a TPU pod, please refer to this [guide](https://cloud.google.com/tpu/docs/pytorch-pods).
+
 ## XLA Tensor Deep Dive
 
 Using XLA tensors and devices requires changing only a few lines of code. But
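As a hedged illustration of the "local and global ordinal" point in the pod section above (this sketch is not part of the commit), the snippet below just prints each process's view of the topology using the standard `torch_xla.core.xla_model` helpers; the launch setup is assumed to be the same `xmp.spawn` pattern as before.

```python
# Illustrative sketch only: print each process's view of the pod topology.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Global ordinal is unique across all hosts; local ordinal is unique per host.
    print(f"device={xm.xla_device()} "
          f"global_ordinal={xm.get_ordinal()} "
          f"local_ordinal={xm.get_local_ordinal()} "
          f"world_size={xm.xrt_world_size()}")


if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=())
```

Run with the same `--worker=all` `gcloud` command shown above, every process reports a distinct global ordinal while local ordinals repeat on each host.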

README.md

Lines changed: 20 additions & 12 deletions
@@ -24,20 +24,32 @@ running on Cloud TPUs and learn how to use Cloud TPUs as PyTorch devices:
 
 The rest of this README covers:
 
+* [User Guide & Best Practices](#user-guide--best-practices)
 * [Running PyTorch on Cloud TPUs and GPU](#running-pytorchxla-on-cloud-tpu-and-gpu)
 Google Cloud also runs networks faster than Google Colab.
 * [Available docker images and wheels](#available-docker-images-and-wheels)
-* [API & Best Practices](#api--best-practices)
 * [Performance Profiling and Auto-Metrics Analysis](#performance-profiling-and-auto-metrics-analysis)
 * [Troubleshooting](#troubleshooting)
 * [Providing Feedback](#providing-feedback)
 * [Building and Contributing to PyTorch/XLA](#contributing)
+* [Additional Reads](#additional-reads)
 
 
 
 Additional information on PyTorch/XLA, including a description of its
 semantics and functions, is available at [PyTorch.org](http://pytorch.org/xla/).
 
+## User Guide & Best Practices
+
+Our comprehensive user guides are available at:
+
+[Documentation for the latest release](https://pytorch.org/xla)
+
+[Documentation for master branch](https://pytorch.org/xla/master)
+
+See the [API Guide](API_GUIDE.md) for best practices when writing networks that
+run on XLA devices (TPU, GPU, CPU and ...)
+
 ## Running PyTorch/XLA on Cloud TPU and GPU
 
 * [Running on a single Cloud TPU](#running-on-a-single-cloud-tpu-vm)
@@ -144,17 +156,6 @@ pip3 install torch_xla[tpuvm]
 
 This is only required on Cloud TPU VMs.
 
-## API & Best Practices
-
-In general PyTorch/XLA follows PyTorch APIs, some additional torch_xla specific APIs are available at:
-
-[Documentation for the latest release](https://pytorch.org/xla)
-
-[Documentation for master branch](https://pytorch.org/xla/master)
-
-See the [API Guide](API_GUIDE.md) for best practices when writing networks that
-run on Cloud TPUs and Cloud TPU Pods.
-
 ## Performance Profiling and Auto-Metrics Analysis
 
 With PyTorch/XLA we provide a set of performance profiling tooling and auto-metrics analysis which you can check the following resources:
@@ -181,3 +182,10 @@ See the [contribution guide](CONTRIBUTING.md).
 
 ## Disclaimer
 This repository is jointly operated and maintained by Google, Facebook and a number of individual contributors listed in the [CONTRIBUTORS](https://github.com/pytorch/xla/graphs/contributors) file. For questions directed at Facebook, please send an email to opensource@fb.com. For questions directed at Google, please send an email to pytorch-xla@googlegroups.com. For all other questions, please open up an issue in this repository [here](https://github.com/pytorch/xla/issues).
+
+## Additional Reads
+You can find additional useful reading materials in
+* [Performance debugging on Cloud TPU VM](https://cloud.google.com/blog/topics/developers-practitioners/pytorchxla-performance-debugging-tpu-vm-part-1)
+* [Lazy tensor intro](https://pytorch.org/blog/understanding-lazytensor-system-performance-with-pytorch-xla-on-cloud-tpu/)
+* [Scaling deep learning workloads with PyTorch / XLA and Cloud TPU VM](https://cloud.google.com/blog/topics/developers-practitioners/scaling-deep-learning-workloads-pytorch-xla-and-cloud-tpu-vm)
+* [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/)

docs/dynamo.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+## TorchDynamo (torch.compile) integration in PyTorch XLA
+
+TorchDynamo is a Python-level JIT compiler designed to make unmodified PyTorch programs faster. It provides a clean API for compiler backends to hook into, and its biggest feature is to dynamically modify Python bytecode right before it is executed. In the pytorch/xla 2.0 release, PyTorch/XLA provides an experimental backend for TorchDynamo for both inference and training.
+
+The way the XLA bridge works is that Dynamo provides a TorchFX graph when it recognizes a model pattern, and PyTorch/XLA uses the existing Lazy Tensor technology to compile the FX graph and return the compiled function.
+
+### Inference
+Here is a small code example of running resnet18 with `torch.compile`
+
+```python
+import torch
+import torchvision
+import torch_xla.core.xla_model as xm
+
+def eval_model(loader):
+  device = xm.xla_device()
+  xla_resnet18 = torchvision.models.resnet18().to(device)
+  xla_resnet18.eval()
+  dynamo_resnet18 = torch.compile(
+      xla_resnet18, backend='torchxla_trace_once')
+  for data, _ in loader:
+    output = dynamo_resnet18(data)
+```
+> **NOTE:** The inference backend name `torchxla_trace_once` is subject to change.
+
+With `torch.compile`, PyTorch/XLA only traces the resnet18 model once at init time and executes the compiled binary every time `dynamo_resnet18` is invoked, instead of tracing the model on every iteration. Note that Dynamo currently does not support fallback, so if there is an op that cannot be traced by XLA it will error out. We will fix this issue in the upcoming 2.1 release. Here is an inference speed analysis comparing Dynamo and Lazy using torch bench on a Cloud TPU v4-8:
+
+| model | Speed up |
+| --- | ----------- |
+| resnet18 | 1.768 |
+| resnet50 | 1.61 |
+| resnext50_32x4d | 1.328 |
+| alexnet | 1.261 |
+| mobilenet_v2 | 2.017 |
+| mnasnet1_0 | 1.686 |
+| vgg16 | 1.155 |
+| BERT_pytorch | 3.502 |
+| squeezenet1_1 | 1.674 |
+| timm_vision_transformer | 3.138 |
+| average | 1.9139 |
+
+### Training
+PyTorch/XLA also supports Dynamo for training, but it is very experimental and we are working with the PyTorch Compiler team to iterate on the implementation. In the 2.0 release it only supports the forward and backward pass, not the optimizer step. Here is an example of training a resnet18 with `torch.compile`
+
+```python
+import torch
+import torchvision
+import torch_xla.core.xla_model as xm
+
+def train_model(model, data, target):
+  loss_fn = torch.nn.CrossEntropyLoss()
+  pred = model(data)
+  loss = loss_fn(pred, target)
+  loss.backward()
+  return pred
+
+def train_model_main(loader):
+  device = xm.xla_device()
+  xla_resnet18 = torchvision.models.resnet18().to(device)
+  xla_resnet18.train()
+  dynamo_train_model = torch.compile(
+      train_model, backend='aot_torchxla_trace_once')
+  for data, target in loader:
+    output = dynamo_train_model(xla_resnet18, data, target)
+```
+
+> **NOTE:** The backend we use here is `aot_torchxla_trace_once` (subject to change), not `torchxla_trace_once`.
+
+We expect to extract and execute 3 graphs per training step, instead of the 1 graph per training step you get with Lazy tensors. Here is a training speed analysis comparing Dynamo and Lazy using torch bench on a Cloud TPU v4-8:
+
+| model | Speed up |
+| --- | ----------- |
+| resnet50 | 0.937 |
+| resnet18 | 1.003 |
+| BERT_pytorch | 1.869 |
+| resnext50_32x4d | 1.139 |
+| alexnet | 0.802 |
+| mobilenet_v2 | 0.672 |
+| mnasnet1_0 | 0.967 |
+| vgg16 | 0.742 |
+| timm_vision_transformer | 1.69 |
+| squeezenet1_1 | 0.958 |
+| average | 1.0779 |
+
+> **NOTE:** We run each model's fwd and bwd for a single step and then collect the e2e time. In the real world we run many steps per training job, which can easily hide the tracing cost behind the execution (since it is async). Lazy Tensor will have much better performance in that scenario.
+
+We are currently working on the optimizer support; it will be available in the nightly builds soon but won't be in the 2.0 release.
+
+### Take away
+TorchDynamo provides a really promising way for a compiler backend to hide the complexity from the user and easily retrieve the modeling code in a graph format. Compared with PyTorch/XLA's traditional Lazy Tensor way of extracting the graph, TorchDynamo can skip the graph tracing for every iteration, hence providing a much better inference response time. However, TorchDynamo does not trace the communication ops (like `all_reduce` and `all_gather`) yet, and it provides separate graphs for the forward and the backward pass, which hurts XLA performance. These feature gaps compared to Lazy Tensor make it less efficient in real-world training use cases, especially since the tracing cost can be overlapped with the execution in training. The PyTorch/XLA team will keep investing in TorchDynamo and work with upstream to mature the training story.

docs/source/index.rst

Lines changed: 2 additions & 1 deletion
@@ -98,6 +98,7 @@ test
 
 .. mdinclude:: ../../TROUBLESHOOTING.md
 .. mdinclude:: ../pjrt.md
-.. mdinclude:: ../ddp.md
+.. mdinclude:: ../dynamo.md
 .. mdinclude:: ../fsdp.md
+.. mdinclude:: ../ddp.md
 .. mdinclude:: ../gpu.md
