## Getting Started with PyTorch on Cloud TPUs

* https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb

PyTorch/XLA connects PyTorch to Cloud TPUs and exposes each TPU core as a PyTorch device.

A TPU has multiple cores, but this notebook uses only one of them; using multiple cores is covered later.

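## Installing PyTorch/XLA

On a Cloud TPU VM there is nothing to install by hand: when creating the TPU VM you can pick an image with PyTorch preinstalled. At the time of writing the latest one is "tpu-vm-pt-1.12", which ships with PyTorch 1.12.0 and PyTorch/XLA 1.12.0.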


## Creating Tensors on TPUs

With PyTorch/XLA you can work with Cloud TPUs much as you would with a CPU or GPU. Each Cloud TPU core is treated as a separate PyTorch device.

In [2]:
# imports pytorch
import torch

# imports the torch_xla package
import torch_xla
import torch_xla.core.xla_model as xm

In [3]:
print(torch.__version__, torch_xla.__version__)

1.12.0+cu102 1.12


PyTorch/XLA (`torch_xla`) lets PyTorch manage TPU devices. The function `xla_device()` returns the default TPU core as a device. The next cell creates a tensor on the TPU:

In [4]:
# Creates a tensor of ones on xla:1 (a Cloud TPU core)
dev = xm.xla_device()
t1 = torch.ones(3, 3, device = dev)
print(t1)

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], device='xla:1')


If the cell above fails with the error "RuntimeError: tensorflow/compiler/xla/xla_client/computation_client.cc:280 : Missing XLA configuration", XLA has not been configured yet. To fix it, run the following in a terminal:

`export XRT_TPU_CONFIG="localservice;0;localhost:51011"`

To make the setting persistent, the command can be added to /etc/profile.

* Reference: https://pytorch-lightning.readthedocs.io/en/latest/accelerators/tpu_faq.html
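
The same setting can also be applied from inside a Python process instead of the shell. The sketch below is a minimal illustration; it assumes that torch_xla reads `XRT_TPU_CONFIG` from the process environment, which is why the variable is set before any `torch_xla` import. The value is the one from the `export` command for this error.

```python
import os

# Set the XRT configuration before importing torch_xla (assumed ordering:
# the library is expected to read it from the process environment).
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("XRT_TPU_CONFIG", "localservice;0;localhost:51011")

xrt_config = os.environ["XRT_TPU_CONFIG"]
print(xrt_config)
```

Unlike editing /etc/profile, this only affects the current process, which can be convenient inside a notebook.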

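See the documentation at http://pytorch.org/xla/ for an overview of the functions PyTorch/XLA provides.

The tensor above lives on the first TPU core (`'xla:1'`). Now let's create one on another core: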

In [5]:
# Creating a tensor on the second Cloud TPU core
second_dev = xm.xla_device(n=2, devkind='TPU')
t2 = torch.zeros(3, 3, device = second_dev)
print(t2)

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], device='xla:2')


We recommend using `xm.xla_device()` to select the device.
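
To keep one script usable on machines with and without a TPU, device selection can be wrapped in a small helper. `pick_device` below is a hypothetical name, not part of PyTorch/XLA. The sketch assumes that a missing or unconfigured `torch_xla` surfaces as an exception, and falls back to the CPU in that case.

```python
import torch


def pick_device():
    """Return an XLA device if one can be obtained, otherwise the CPU.

    Hypothetical helper for illustration; not part of torch_xla.
    """
    try:
        import torch_xla.core.xla_model as xm
        return xm.xla_device()
    except Exception:
        # Covers both a missing torch_xla package and a failed
        # device lookup (for example, "Missing XLA configuration").
        return torch.device("cpu")


device = pick_device()
t = torch.ones(3, 3, device=device)
print(t.device)
```

The rest of the code can then pass `device` around without caring which backend was selected.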


Tensors created on a TPU behave just like any other tensor:

In [6]:
a = torch.randn(2, 2, device = dev)
b = torch.randn(2, 2, device = dev)
print(a + b)
print(b * 2)
print(torch.matmul(a, b))

tensor([[-1.1846, -0.7140],
        [-0.3259, -0.5264]], device='xla:1')
tensor([[-0.9715, -1.2307],
        [-2.1193,  0.7613]], device='xla:1')
tensor([[ 0.4448,  0.3940],
        [ 0.6057, -0.7984]], device='xla:1')


Functions from `torch` work as well:

In [7]:
# Creates random filters and inputs to a 1D convolution
filters = torch.randn(33, 16, 3, device = dev)
inputs = torch.randn(20, 16, 50, device = dev)
torch.nn.functional.conv1d(inputs, filters)

tensor([[[ -2.2614,  -7.4375,  -3.0452,  ...,   6.4813,   6.0025,  -3.8181],
         [ -2.1178,  -1.2323,   6.3152,  ...,  -8.1402,   1.5390,  10.5330],
         [  3.7358,  -6.1666,  -5.3654,  ...,   3.9503,   6.6946,  -1.0387],
         ...,
         [ -1.0524,  -7.5402,  -6.6635,  ...,  -5.7106,  -9.5255,   9.1400],
         [-12.9870,   1.4063,  -6.9533,  ...,  10.5729,   1.3097,  -5.2656],
         [  5.0329,   1.4415,   8.1006,  ...,  -3.4235,   3.5638,  -5.9472]],

        [[ -3.0059,   3.8605,   3.6280,  ...,  -8.1614, -13.1281,   5.2417],
         [ -3.7675,  -4.9035,  -1.3131,  ...,   4.4226, -11.7430,  11.3242],
         [-13.1958,   5.3812,   3.2664,  ...,  -4.4664,   5.2152,   2.1421],
         ...,
         [  5.9822,  -2.8872,   8.6605,  ...,  -9.1931,  -6.1449,  -6.7736],
         [  1.4102,   1.8250, -12.2252,  ...,   3.3475,  -7.8704,   1.3273],
         [  5.8575,  -0.6981,   3.5026,  ...,  -3.9181,   2.3322,   0.6250]],

        [[ -7.5734,  -1.9799, -13.0047,  ...

Tensors can also be moved between the CPU and a TPU. Note that in PyTorch a cross-device transfer always produces a copy of the tensor.



In [8]:
# Creates a tensor on the CPU
t_cpu = torch.randn(2, 2, device='cpu')
print(t_cpu)

# Transfers t_cpu to the TPU; what arrives is a copy
t_tpu = t_cpu.to(dev)
print(t_tpu)

# Transfers t_tpu back to the CPU; again, this is a copy
t_cpu_again = t_tpu.to('cpu')
print(t_cpu_again)

tensor([[-1.1306, -0.2727],
        [-0.5207,  0.3036]])
tensor([[-1.1306, -0.2727],
        [-0.5207,  0.3036]], device='xla:1')
tensor([[-1.1306, -0.2727],
        [-0.5207,  0.3036]])


In [9]:
id(t_cpu), id(t_cpu_again)

(140206247038624, 140208887314928)

As the ids show, these are two distinct tensor objects.
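
Distinct `id()` values only show that the two Python objects differ; they do not by themselves prove that the data was copied. As a CPU-only sketch (so it runs without a TPU), the snippet below contrasts a same-device `.to()`, which returns the original object, with a forced copy via `copy=True`. The forced copy is used here as a stand-in for the copy that a cross-device transfer produces.

```python
import torch

t_cpu = torch.zeros(2, 2)

# Same device and dtype: .to() returns the very same object.
same = t_cpu.to("cpu")

# copy=True forces a new tensor with its own storage.
forced = t_cpu.to("cpu", copy=True)

# An in-place change on the copy must not touch the original.
forced.add_(1.0)

is_same_object = same is t_cpu
shares_storage = forced.data_ptr() == t_cpu.data_ptr()
original_unchanged = bool((t_cpu == 0).all())
print(is_same_object, shares_storage, original_unchanged)
```

Comparing `data_ptr()` or mutating one tensor in place is a more direct test of copy semantics than comparing `id()`.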

## Running PyTorch modules and autograd on TPUs

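Modules and autograd are fundamental parts of PyTorch, and both work seamlessly with TPU tensors.

For each stateful operation, PyTorch provides a Module with the same functionality. A Module is a class that bundles data (its state) with methods. A linear layer, for example, is a Module, and because Modules carry state they can be placed on a device: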


In [10]:
# Creates a linear module
fc = torch.nn.Linear(5, 2, bias=True)

# Copies the module to the TPU core
fc = fc.to(dev)

# Creates a random feature tensor
features = torch.randn(3, 5, device=dev, requires_grad=True)

# Runs and prints the module
output = fc(features)
print(output)

tensor([[ 0.9235, -1.2827],
        [-0.6867,  0.3928],
        [-1.4930,  0.7258]], device='xla:1', grad_fn=<AddmmBackward0>)


Autograd is PyTorch's automatic differentiation system. If a Module lives on a TPU core, the gradients of its parameters live on that same core:

In [11]:
output.backward(torch.ones_like(output))
print(fc.weight.grad)

tensor([[ 1.6606,  1.4688,  1.9482,  0.6864, -1.0761],
        [ 1.6606,  1.4688,  1.9482,  0.6864, -1.0761]], device='xla:1')
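
Both rows of the printed gradient are identical. That follows from calling `backward` with a tensor of ones: the gradient for `weight[j, k]` is the sum of `features[:, k]` over the batch, whatever the output index `j` is. The check below is a small sketch of that reasoning, run on the CPU so that it does not require a TPU; passing `device=dev` to the tensor and moving the module with `.to(dev)` should give the same result on a TPU core.

```python
import torch

torch.manual_seed(0)

fc = torch.nn.Linear(5, 2, bias=True)
features = torch.randn(3, 5, requires_grad=True)

output = fc(features)
output.backward(torch.ones_like(output))

# With an all-ones upstream gradient, every row of weight.grad equals
# the column-wise sum of the input features over the batch.
expected = features.detach().sum(dim=0).expand(2, 5)
grads_match = torch.allclose(fc.weight.grad, expected)
print(grads_match)
```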


## Running Neural Networks on TPUs

Since Modules can be placed on a TPU, so can entire neural networks, which are themselves Modules:

In [12]:
import torch.nn as nn
import torch.nn.functional as F

# Simple example network from 
# https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


# Places network on the default TPU core
net = Net().to(dev)

# Creates random input on the default TPU core
input = torch.randn(1, 1, 32, 32, device=dev)

# Runs network
out = net(input)
print(out)

tensor([[ 0.0107, -0.0980,  0.1534,  0.0691,  0.0965,  0.0634, -0.0321,  0.0635,
          0.0590, -0.0895]], device='xla:1', grad_fn=<AddmmBackward0>)


It really is that simple: just set the device to a TPU core.
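
The `16 * 6 * 6` input size of `fc1` in the network above comes from how the 32x32 input shrinks through the two convolution and pooling stages. The arithmetic can be sketched in plain Python; `conv_out` and `pool_out` are hypothetical helper names, and the formulas assume the defaults used in that network (stride 1 and no padding for the convolutions, pooling stride equal to the window size).

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Output length of max pooling whose stride equals the window size."""
    return size // window


side = 32
side = pool_out(conv_out(side, 3), 2)  # conv1 + pool: 32 -> 30 -> 15
side = pool_out(conv_out(side, 3), 2)  # conv2 + pool: 15 -> 13 -> 6

# conv2 produces 16 channels, each side x side after pooling.
flat_features = 16 * side * side
print(side, flat_features)  # 6 576
```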