26 changes: 13 additions & 13 deletions README.md
@@ -2,7 +2,7 @@

# How to Run PyTorch with TPUs

First, create your [TPU](https://pantheon.corp.google.com/compute/tpus) node with the corresponding release you wish to consume (TPU software version: `pytorch-0.1`):
First, create your [TPU](https://pantheon.corp.google.com/compute/tpus) node with the corresponding release you wish to consume (TPU software version: `pytorch-0.5`):
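For reference, a minimal sketch of creating such a node with the `gcloud` CLI (the node name, zone, and accelerator type below are placeholders; adjust them for your project):

```Shell
# Placeholder values -- substitute your own node name, zone, and accelerator type.
gcloud compute tpus create my-tpu-node \
  --zone=us-central1-a \
  --accelerator-type=v3-8 \
  --version=pytorch-0.5
```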

Once you've created a Cloud TPU node, you can train your PyTorch models by either:

@@ -66,16 +66,16 @@ Follow these steps to train a PyTorch model with Docker on a TPU:
# conda environments:
#
base * /anaconda3
pytorch-0.1 /anaconda3/envs/pytorch-0.1
pytorch-nightly /anaconda3/envs/pytorch-nightly
torch-xla-0.1 /anaconda3/envs/torch-xla-0.1
torch-xla-nightly /anaconda3/envs/torch-xla-nightly

(vm)$ conda activate pytorch-0.1
(pytorch-0.1)$ cd /usr/share/torch-xla-0.1/pytorch/xla
(pytorch-0.1)$ python test/test_train_mnist.py
(vm)$ conda activate torch-xla-0.1
(torch-xla-0.1)$ cd /usr/share/torch-xla-0.1/pytorch/xla
(torch-xla-0.1)$ python test/test_train_mnist.py
```

To update the wheels `torch` and `torch_xla` to the latest nightly
distribution (only updates your pytorch-nightly conda env), run:
distribution (only updates your `torch-xla-nightly` conda env), run:
```Shell
(vm)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(vm)$ . ./scripts/update_nightly_torch_wheels.sh
@@ -118,21 +118,21 @@ Training on pods can be broken down to largely 3 different steps:
2. Let's say the command you ran on a v3-8 was: `XLA_USE_BF16=1 python test/test_train_imagenet.py --fake_data`.
* To distribute training as a conda environment process:
```
(pytorch-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(pytorch-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --conda-env=pytorch-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data
(torch-xla-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --conda-env=torch-xla-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data
```

* Or, to distribute training as a docker container:
```
(pytorch-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(pytorch-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --docker-image=gcr.io/tpu-pytorch/xla:nightly --docker-run-flag=--rm=true --docker-run-flag=--shm-size=50GB --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data
(torch-xla-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --docker-image=gcr.io/tpu-pytorch/xla:nightly --docker-run-flag=--rm=true --docker-run-flag=--shm-size=50GB --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data
```
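Both variants assume `$TPU_POD_NAME` is set to the name of the TPU pod you created; a hedged sketch (the name is a placeholder):

```Shell
(vm)$ export TPU_POD_NAME=my-tpu-pod  # placeholder -- use your own TPU pod name
```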

## List of VMs
If you prefer not to use an [instance group](#create-your-instance-group), you can instead use a list of VM instances that you may have already created (or can create individually). Make sure that you create all the VM instances in the same zone as the TPU node, and that the VMs have the same configuration (datasets, VM size, disk size, etc.). Then you can [start distributed training](#start-distributed-training) after creating your TPU pod. The difference is in the `python -m torch_xla.distributed.xla_dist` command. For example, to use a list of VMs, run the following command (e.g., conda with a v3-32):
```
(pytorch-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(pytorch-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --vm $VM1 --vm $VM2 --vm $VM3 --vm $VM4 --conda-env=pytorch-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data
(torch-xla-nightly)$ cd /usr/share/torch-xla-nightly/pytorch/xla
(torch-xla-nightly)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --vm $VM1 --vm $VM2 --vm $VM3 --vm $VM4 --conda-env=torch-xla-nightly --env=XLA_USE_BF16=1 -- python test/test_train_imagenet.py --fake_data
```
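The `--vm` flags above assume `$VM1` through `$VM4` hold the names of your VM instances; a hedged sketch (instance names are placeholders):

```Shell
(vm)$ export VM1=my-vm-1 VM2=my-vm-2 VM3=my-vm-3 VM4=my-vm-4  # placeholder instance names
```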

To learn more about TPU Pods check out this [blog
6 changes: 3 additions & 3 deletions scripts/update_nightly_torch_wheels.sh
@@ -2,9 +2,9 @@
set -e
set -x

# Activate pytorch-nightly conda env if not already in it
if [ "$CONDA_DEFAULT_ENV" != "pytorch-nightly" ]; then
conda activate pytorch-nightly
# Activate torch-xla-nightly conda env if not already in it
if [ "$CONDA_DEFAULT_ENV" != "torch-xla-nightly" ]; then
conda activate torch-xla-nightly
fi

$(dirname $0)/update_torch_wheels.sh
2 changes: 1 addition & 1 deletion torch_xla/distributed/xla_dist.py
@@ -729,7 +729,7 @@ def run(self, cmd):
parser = argparse.ArgumentParser(
description='PyTorch on TPU distributed training',
epilog=('Usage example: xla_dist.py --tpu=[TPU_NAME]'
' --conda-env pytorch-nightly -- python train'))
' --conda-env torch-xla-nightly -- python train'))

cluster_group = parser.add_argument_group('Cluster Setup')
cluster_group.add_argument(