# Part 4: Using GPU acceleration with PyTorch

In [0]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except ImportError:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try:
    import torchbearer
except ImportError:
    !pip install torchbearer

## Manual use of `.cuda()`

Now the magic of PyTorch comes in. So far, we've only been using the CPU to do computation. When we want to scale to a bigger problem, that won't be feasible for very long.
PyTorch makes it really easy to use the GPU for accelerating computation. Consider the following code that computes the element-wise product of two large matrices:

In [0]:
import torch

t1 = torch.randn(1000, 1000)
t2 = torch.randn(1000, 1000)
t3 = t1*t2
print(t3)

tensor([[ 0.9414,  2.0151,  0.6757,  ...,  1.9189,  0.1562,  0.1325],
        [ 0.3274, -0.6311, -1.2401,  ...,  0.5978, -0.4244,  0.0220],
        [-0.8007, -0.1126, -0.0951,  ...,  0.0993,  0.5886,  0.2130],
        ...,
        [ 0.3461, -0.8226,  0.0394,  ..., -0.4308,  0.2861,  0.4572],
        [-0.0247, -0.6191, -0.1831,  ..., -2.3207,  0.0465, -0.5108],
        [-1.5556, -0.2022, -0.2424,  ...,  0.1038,  0.1627, -0.6700]])


By sending all the tensors that we are using to the GPU, all the operations on them will also run on the GPU without having to change anything else. If you're running a build of PyTorch without CUDA support (or you have no GPU), the following will throw an error; if CUDA is available, it will create the input matrices on the CPU, copy them to the GPU and perform the multiplication on the GPU itself:

In [0]:
t1 = torch.randn(1000, 1000).cuda()
t2 = torch.randn(1000, 1000).cuda()
t3 = t1*t2
print(t3)

tensor([[-0.2195,  1.3444, -0.1633,  ..., -0.1416, -0.2565, -0.0435],
        [-1.6846, -0.0031,  1.4869,  ..., -0.2306, -0.0970, -0.6455],
        [-1.0707, -0.5851,  1.0679,  ...,  0.9600, -2.6441, -0.2734],
        ...,
        [-0.0532,  0.0633,  2.2664,  ..., -0.0118, -0.6130,  0.1089],
        [-0.1117,  0.2210, -0.6556,  ..., -0.2617, -0.4815,  0.3398],
        [-1.3476,  0.1137, -0.9070,  ...,  0.5196,  0.4858, -0.5521]],
       device='cuda:0')


If you're running this workbook in colab, now enable GPU acceleration (`Runtime->Runtime Type` and add a `GPU` in the hardware accelerator pull-down). You'll then need to re-run all cells to this point.

If you were able to run the above with hardware acceleration, the print-out of the result tensor shows that it lives on the first GPU: note the `device='cuda:0'` at the end of the output. If you wanted to copy the tensor back to the CPU, you would use the `.cpu()` method.
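As a small sketch of the round trip described above: the variable names are illustrative, and the `torch.cuda.is_available()` guard is only an assumption added so that the cell also runs on a machine without a GPU.

```python
import torch

t = torch.randn(3, 3)
print(t.device)  # tensors are created on the CPU by default

if torch.cuda.is_available():
    t_gpu = t.cuda()      # copy to the default GPU
    print(t_gpu.device)
    t_back = t_gpu.cpu()  # copy back to host memory
else:
    # With no GPU there is nothing to copy; .cpu() on a CPU tensor
    # simply returns the original tensor.
    t_back = t.cpu()

print(t_back.device)
print(torch.equal(t, t_back))  # the values survive the round trip unchanged
```

Copying back to the CPU is needed, for example, before converting a tensor to a NumPy array with `.numpy()`.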

## Writing platform agnostic code

Most of the time you'd like to write code that is device agnostic; that is, it will run on a GPU if one is available, and otherwise fall back to the CPU. The recommended way to do this is as follows:

In [0]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
t1 = torch.randn(1000, 1000).to(device)
t2 = torch.randn(1000, 1000).to(device)
t3 = t1*t2
print(t3)

tensor([[-1.5173e+00, -9.4144e-01, -5.7393e-02,  ..., -3.6494e-03,
          1.5715e-02, -3.6293e-01],
        [-1.0048e-02, -1.8353e+00, -2.8213e+00,  ..., -2.3511e-01,
          2.3116e+00,  1.8033e+00],
        [-2.6288e-01, -1.6830e+00,  3.2183e-01,  ...,  6.3308e-01,
          4.0032e-01,  2.1916e+00],
        ...,
        [ 6.2640e-01,  1.3166e-02, -1.2638e-03,  ...,  2.3809e-01,
         -7.5367e-01, -1.5126e-01],
        [ 7.2562e-01,  4.4981e-01,  3.1707e-01,  ..., -3.0734e-01,
         -1.1695e-01, -1.4448e+00],
        [-1.9400e-01,  6.1879e-02, -2.4267e-01,  ..., -5.5031e-01,
         -8.7382e-01,  2.2382e-01]], device='cuda:0')
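A possible variation on the device-agnostic pattern, shown here only as a sketch: tensor factory functions such as `torch.randn` accept a `device` argument, so the tensor can be allocated on the target device directly rather than being created on the CPU and then copied. The variable `t4` is an illustrative addition.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Allocate directly on the target device (no intermediate CPU copy)
t1 = torch.randn(1000, 1000, device=device)
t2 = torch.randn(1000, 1000, device=device)
t3 = t1 * t2

# Tensors created with the *_like functions inherit the device of their input
t4 = torch.zeros_like(t3)
print(t3.device, t4.device)
```

For large tensors this avoids one host-to-device transfer per tensor; for small ones the difference is unlikely to matter.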


## Accelerating neural net training

If you wanted to accelerate the training of a neural net using raw PyTorch, you would have to copy both the model and the training data to the GPU. Unless you were using a really small dataset like MNIST, you would typically _stream_ the batches of training data to the GPU as you used them in the training loop:

```python
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = BaselineModel(784, 784, 10).to(device)

loss_function = ...
optimiser = ...

for epoch in range(10):
    for data in trainloader:
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        optimiser.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimiser.step()
```
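To make the loop above concrete, here is a minimal self-contained sketch. Everything specific in it is an assumption made for illustration: random tensors stand in for the real dataset and `trainloader`, a small `nn.Sequential` stands in for `BaselineModel`, and the choice of loss, optimiser and learning rate is arbitrary.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in data: 256 random "images" of 784 values, 10 classes
dataset = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))
trainloader = DataLoader(dataset, batch_size=64, shuffle=True)

# Hypothetical stand-in model; its parameters are moved to the device once
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
loss_function = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2):
    running_loss = 0.0
    for inputs, labels in trainloader:
        # Each batch is streamed to the device just before it is used
        inputs, labels = inputs.to(device), labels.to(device)

        optimiser.zero_grad()
        loss = loss_function(model(inputs), labels)
        loss.backward()
        optimiser.step()

        running_loss += loss.item()  # .item() copies the scalar loss back to the host
    print(f"epoch {epoch}: mean loss {running_loss / len(trainloader):.4f}")
```

Note that the model is moved once, before the loop, while the data is moved batch by batch inside it.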

Using Torchbearer, this becomes much simpler; you just tell the `Trial` to run on the GPU and that's it:

```python
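import torch
from torchbearer import Trial

# BetterCNN is assumed to be defined elsewhere in the lab
model = BetterCNN()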

loss_function = ...
optimiser = ...

device = "cuda:0" if torch.cuda.is_available() else "cpu"
trial = Trial(model, optimiser, loss_function, metrics=['loss', 'accuracy']).to(device)
trial.with_generators(trainloader)
trial.run(epochs=10)
```


## Multiple GPUs

Using multiple GPUs is beyond the scope of the lab, but if you have multiple CUDA devices, they can be referred to by index: `cuda:0`, `cuda:1`, `cuda:2`, etc. You have to be careful not to mix operations on different devices, and you would need to carefully orchestrate the movement of data between the devices (which can slow down your code to the point at which using the CPU would actually be faster).
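As a hedged illustration of the point about not mixing devices, the sketch below enumerates whatever CUDA devices are present and makes sure both operands live on the same device before combining them. The fallback to a single CPU device is an assumption added so the cell runs anywhere; with fewer than two GPUs the copy branch is simply skipped.

```python
import torch

n_gpus = torch.cuda.device_count()  # 0 on a CPU-only machine
devices = [torch.device(f"cuda:{i}") for i in range(n_gpus)] or [torch.device("cpu")]
print(devices)

a = torch.ones(2, 2, device=devices[0])
b = torch.ones(2, 2, device=devices[-1])  # a different device only if there are 2+ GPUs

if a.device != b.device:
    # Adding tensors that live on different devices raises a RuntimeError,
    # so one operand is explicitly copied across first.
    b = b.to(a.device)

c = a + b
print(c.device)
```

Each such `.to(...)` call between devices is a real memory transfer, which is where the slowdown mentioned above comes from.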

## Questions

__Answer the following questions (enter the answer in the box below each one):__

__1.__ What features of GPUs allow them to perform computations faster than a typical CPU?

The number of cores available for computation is far larger in a GPU than in a CPU. A state-of-the-art CPU currently available in PCs has about 16 cores, whereas a GPU has about 3000-4000 cores. The amount of parallelisation achievable with a GPU is therefore much higher than with a CPU.

__2.__ What is the biggest limiting factor for training large models with current generation GPUs?

It might have something to do with data access: not everything can be held in the highest-level caches, so regular reads from other caches are required, potentially demanding more data than can simultaneously fit on the data bus.

Something else could be that the entire model might not fit in VRAM, which could result in a really large performance hit.
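To get a rough feel for the VRAM point, the sketch below estimates how much memory a model's parameters occupy and, if a GPU is present, compares that with the device's total memory. The model here is a hypothetical stand-in, and the estimate is a lower bound only: training also needs memory for gradients, optimiser state and activations.

```python
import torch
from torch import nn

# Hypothetical stand-in model, used only to have something to measure
model = nn.Sequential(nn.Linear(784, 784), nn.ReLU(), nn.Linear(784, 10))

# Bytes needed just to store the parameters (float32 by default, 4 bytes each)
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {param_bytes / 1024**2:.2f} MiB")

if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"GPU 0 total memory: {total / 1024**3:.2f} GiB")
```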