diff --git a/README.md b/README.md index 729d63f5009d..52ef4700df1f 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ # How to Run PyTorch with TPUs -First, create your [TPU](https://pantheon.corp.google.com/compute/tpus) node with the corresponding release you wish to consume (TPU software version: `pytorch-0.1`): +First, create your [TPU](https://pantheon.corp.google.com/compute/tpus) node with the corresponding release you wish to consume (TPU software version: `pytorch-0.5`): Once you've created a Cloud TPU node, you can train your PyTorch models by either: @@ -66,16 +66,16 @@ Follow these steps to train a PyTorch model with Docker on a TPU: # conda environments: # base * /anaconda3 - pytorch-0.1 /anaconda3/envs/pytorch-0.1 - pytorch-nightly /anaconda3/envs/pytorch-nightly + torch-xla-0.1 /anaconda3/envs/torch-xla-0.1 + torch-xla-nightly /anaconda3/envs/torch-xla-nightly - (vm)$ conda activate pytorch-0.1 - (pytorch-0.1)$ cd /usr/share/torch-xla-0.1/pytorch/xla - (pytorch-0.1)$ python test/test_train_mnist.py + (vm)$ conda activate torch-xla-0.1 + (torch-xla-0.1)$ cd /usr/share/torch-xla-0.1/pytorch/xla + (torch-xla-0.1)$ python test/test_train_mnist.py ``` To update the `torch` and `torch_xla` wheels to the latest nightly - distribution (only updates your pytorch-nightly conda env), run: + distribution (only updates your `torch-xla-nightly` conda env), run: ```Shell (vm)$ cd /usr/share/torch-xla-nightly/pytorch/xla (vm)$ . ./scripts/update_nightly_torch_wheels.sh @@ -118,21 +118,21 @@ Training on pods can be broken down into three main steps: 2. Let's say the command you ran to run a v3-8 was: `XLA_USE_BF16=1 python test/test_train_imagenet.py --fake_data`.
* To distribute training as a conda environment process: ``` -(pytorch-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla -(pytorch-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --conda-env=pytorch-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data +(torch-xla-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla +(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --conda-env=torch-xla-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data ``` * Or, to distribute training as a docker container: ``` -(pytorch-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla -(pytorch-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --docker-image=gcr.io/tpu-pytorch/xla:nightly --docker-run-flag=--rm=true --docker-run-flag=--shm-size=50GB --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data +(torch-xla-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla +(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --docker-image=gcr.io/tpu-pytorch/xla:nightly --docker-run-flag=--rm=true --docker-run-flag=--shm-size=50GB --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data ``` ## List of VMs If you opt not to use an [instance group](#create-your-instance-group), you can instead use a list of VM instances that you may have already created (or can create individually). Make sure that you create all the VM instances in the same zone as the TPU node, and that the VMs have the same configuration (datasets, VM size, disk size, etc.). Then you can [start distributed training](#start-distributed-training) after creating your TPU pod. The difference is in the `python -m torch_xla.distributed.xla_dist` command. For example, to use a list of VMs run the following command (e.g. 
conda with v3-32): ``` -(pytorch-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla -(pytorch-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --vm $VM1 --vm $VM2 --vm $VM3 --vm $VM4 --conda-env=pytorch-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data +(torch-xla-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla +(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --vm $VM1 --vm $VM2 --vm $VM3 --vm $VM4 --conda-env=torch-xla-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data ``` To learn more about TPU Pods, check out this [blog diff --git a/scripts/update_nightly_torch_wheels.sh b/scripts/update_nightly_torch_wheels.sh index 2987ba04937d..7b79fb6a7367 100755 --- a/scripts/update_nightly_torch_wheels.sh +++ b/scripts/update_nightly_torch_wheels.sh @@ -2,9 +2,9 @@ set -e set -x -# Activate pytorch-nightly conda env if not already in it -if [ "$CONDA_DEFAULT_ENV" != "pytorch-nightly" ]; then - conda activate pytorch-nightly +# Activate torch-xla-nightly conda env if not already in it +if [ "$CONDA_DEFAULT_ENV" != "torch-xla-nightly" ]; then + conda activate torch-xla-nightly fi $(dirname $0)/update_torch_wheels.sh diff --git a/torch_xla/distributed/xla_dist.py b/torch_xla/distributed/xla_dist.py index 5eeca33dd4b1..035b73384ce2 100755 --- a/torch_xla/distributed/xla_dist.py +++ b/torch_xla/distributed/xla_dist.py @@ -729,7 +729,7 @@ def run(self, cmd): parser = argparse.ArgumentParser( description='PyTorch on TPU distributed training', epilog=('Usage example: xla_dist.py --tpu=[TPU_NAME]' - ' --conda-env pytorch-nightly -- python train')) + ' --conda-env torch-xla-nightly -- python train')) cluster_group = parser.add_argument_group('Cluster Setup') cluster_group.add_argument(