# AI on a GPU


In this notebook, we'll create our biggest neural networks so far and compare their training speeds between GPU and CPU training.

A lot of this notebook mirrors PyTorch's [guide](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) on GPU training.

## What is a GPU?

A GPU is a type of processor designed for performing many operations in parallel. GPU stands for Graphics Processing Unit. 

Contrast this with a CPU. CPU stands for Central Processing Unit. A CPU can usually perform each individual operation faster, but it can run far fewer operations at the same time.

Inside a processor, different computational operations run on different **threads**. Threads run on a piece of hardware called a **core**.
The main hardware difference between CPUs and GPUs is that GPUs have many more cores, but each GPU core operates at a much slower speed than a CPU core.

![](images/num_cores.jpg)


## Do I have a GPU?

Not all computers have a GPU. The code below shows you how to check if you have a GPU that `torch` can use.

It uses the `torch.cuda` package, which adds support for CUDA tensor types. These implement the same functions as CPU tensors, but use GPUs for computation.



In [1]:
import torch
# import pyprofiler

cuda_available = torch.cuda.is_available() # check if cuda is available

print('Got GPU?', cuda_available)

Got GPU? False


If you don't have a GPU, you can run this notebook in [Google Colab](https://colab.research.google.com/github/AI-Core/Practical-ML-DS/blob/master/Chapter%203.%20Deep%20Learning/Module%206.%20GPU%20and%20cloud%20training/0.%20GPU.ipynb#scrollTo=Evkr9cnkU2D5).

### Wait, what's CUDA?

CUDA is a platform for parallel computing.
When something is described as a platform, in the proper sense of the word, it means that other things can be built on top of it.
CUDA is a platform that makes it possible for you to build software applications that utilise your GPU for parallel processing.
Part of the CUDA platform is a set of code extensions that handle low-level memory allocation and the distribution of parallelisable operations across the GPU, amongst many other things.
These code extensions are built into PyTorch in its ```.cuda``` module.

## I know I have a GPU! Why isn't it available?

Not only do you need to have the actual GPU hardware, but you also need to have the corresponding **driver** installed.
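
A missing driver is one cause. Another is a PyTorch build that was compiled without CUDA support. The sketch below gathers a few facts that help tell these cases apart. The helper name `describe_cuda_setup` is hypothetical and exists only for this illustration; the `torch` calls it uses are part of PyTorch.

```python
import torch

def describe_cuda_setup():
    """Collect basic facts about the local PyTorch/CUDA setup."""
    info = {
        'torch_version': torch.__version__,
        # None means this PyTorch build was compiled without CUDA support
        'built_with_cuda': torch.version.cuda,
        'cuda_available': torch.cuda.is_available(),
        'gpu_names': [],
    }
    if info['cuda_available']:
        # Only query device names when torch can actually use a GPU
        info['gpu_names'] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info

info = describe_cuda_setup()
for key, value in info.items():
    print(f'{key}: {value}')
```

If `built_with_cuda` is `None`, reinstalling a CUDA-enabled build of PyTorch is probably the first thing to try. If it is set but `cuda_available` is `False`, the driver is a more likely suspect.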

### What's a driver?

A driver is a piece of software that lets the operating system and a hardware device communicate with each other.
They can be notoriously difficult to install.
In the next session, we'll look at training models in cloud servers, where we won't have to worry about this ourselves.

## Moving data ```.to``` the GPU

Torch tensors and models have a ```.to``` method which moves them to a device (such as a GPU).
The argument to this method is usually a string, which is the name of the device.
This will usually be either `'cpu'` or `'cuda'`, for moving it to the CPU or the default GPU respectively.
For a tensor, ```.to``` returns a tensor on the target device rather than changing the original, so assign the result to a variable to keep it.

If we are using multiple GPUs, we can specify the index of the GPU which we wish to move the tensor to.
E.g. to move the tensor `x` to the third GPU, we would call `x.to('cuda:2')` (note the zero-indexing).

We can also check which device any torch tensor is currently stored on by using its ```.get_device()``` method, which returns the index of the GPU, or `-1` for a tensor on the CPU.

Trying to move a tensor to a GPU when torch cannot use one will throw an error. On a version of torch installed without CUDA support, this is `AssertionError: Torch not compiled with CUDA enabled`.

In [15]:
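x = torch.rand(1)
x = x.to('cpu') # .to returns a tensor on the target device, so reassign the result
print('x device:', x.get_device()) # -1 means the tensor is on the CPU
x = x.to('cuda:0') # throws an error if torch cannot use a GPU
print('x device:', x.get_device())
x = x.cpu() # move a tensor to the cpu
print('x device:', x.get_device())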

x device: -1


AssertionError: Torch not compiled with CUDA enabled
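
The cell above fails on a machine without a usable GPU. A minimal sketch of a guarded version follows, assuming we are happy to fall back to the CPU. It also shows that calling `.to` on a tensor leaves the original tensor where it was. The variable names `device` and `moved` are chosen just for this example.

```python
import torch

# Fall back to the CPU when torch cannot use a GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.rand(1)
print('before:', x.device)                 # where x lives when it is created

moved = x.to(device)                       # .to returns a tensor on the target device
print('x after x.to(device):', x.device)   # x itself has not moved
print('moved:', moved.device)

x = x.to(device)                           # reassign to keep the moved tensor
print('x after reassignment:', x.device)
```

The `.device` attribute used here gives a `torch.device` object, which is often easier to read than the integer returned by `.get_device()`.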

We can see the device types that torch expects if we give an invalid argument string.

In [9]:
x.to('random')

RuntimeError: Expected one of cpu, cuda, mkldnn, opengl, opencl, ideep, hip, msnpu, xla device type at start of device string: random

As an alternative to `.to`, we can use `.cuda`. This method takes in an optional index of the GPU to move the tensor to. If no argument is provided, the tensor is moved to the current GPU.

In [None]:
x.cuda(0) 

How many GPUs do we have?

In [None]:
torch.cuda.device_count()

We can create objects that represent a device by using `torch.device`.
See why this is slightly different to just using the string name [here](https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.device).

In [10]:
cuda0 = torch.device('cuda:0') # the device string needs a colon between the device type and the index
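
As a rough illustration of what a `torch.device` object gives us, the sketch below builds the same device in two ways and inspects its parts. Creating a `torch.device` only describes a device, so this should run even without a GPU. The variable names are chosen just for this example.

```python
import torch

cuda0 = torch.device('cuda:0')   # device type and index in one string
same = torch.device('cuda', 0)   # the same device, given as type and index
cpu = torch.device('cpu')

print(cuda0.type, cuda0.index)   # the two parts of the device
print(cuda0 == same)             # device objects can be compared

# Tensor factory functions accept a device, so a tensor can be created
# directly where it is needed instead of being moved afterwards
x = torch.zeros(2, 3, device=cpu)
print(x.device)
```

Creating a tensor directly on its target device avoids one transfer, which matters once we start timing things.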

We can also move models to a GPU, thanks to our friend `torch.nn.Module`.


In [13]:
import sys
sys.path.append('..')
from utils import NN
model = NN([1, 2])
model.to('cuda:0') # will throw an error if you don't have a GPU or don't have a version of torch with the cuda extensions installed

AssertionError: Torch not compiled with CUDA enabled
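
Unlike tensors, a module's `.to` moves its parameters in place and returns the module itself. A small sketch of this follows, using `torch.nn.Linear` as a stand-in for the `NN` class from `utils` (an assumption made so that the example is self-contained) and falling back to the CPU when no GPU is usable.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = torch.nn.Linear(1, 2)   # stand-in for the notebook's NN([1, 2])
returned = model.to(device)     # moves the parameters in place

print(returned is model)                 # the same module object is returned
print(next(model.parameters()).device)   # where the parameters now live
```

Because the move happens in place, `model.to(device)` does not need to be reassigned, although writing `model = model.to(device)` is harmless.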

## Will a GPU always speed things up?

Not necessarily, and there are a few reasons why.

### Moving tensors between devices takes time
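
To get a feel for this cost, here is a minimal sketch that times how long it takes to copy one batch of CIFAR-10-shaped data to the device. The function name `time_transfer` and the batch sizes are assumptions made for this example. On a machine without a usable GPU the "transfer" goes from CPU to CPU, so the times will be close to zero.

```python
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def time_transfer(batch_size, repeats=10):
    """Average seconds taken to copy one CIFAR-10-shaped batch to `device`."""
    x = torch.rand(batch_size, 3, 32, 32)   # created on the CPU
    if device.type == 'cuda':
        x.to(device)                 # warm-up copy so CUDA start-up cost is not timed
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        x.to(device)
    if device.type == 'cuda':
        torch.cuda.synchronize()     # wait for pending GPU work before stopping the clock
    return (time.perf_counter() - start) / repeats

for batch_size in (16, 256, 1024):
    print(batch_size, time_transfer(batch_size))
```

The calls to `torch.cuda.synchronize()` are there because GPU work is queued asynchronously. Without them, the clock could stop before the copy has finished.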

### Not all operations are parallelisable
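
One way to put a number on this is Amdahl's law, which is not part of the original notebook and is used here only as an illustration. If a fraction `p` of the work can be parallelised across `n` workers, the overall speed-up is at most `1 / ((1 - p) + p / n)`. The function name below is chosen just for this example.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on the speed-up when only part of the work runs in parallel."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / n_workers)

# Even with 1000 workers, the serial part limits the overall speed-up
for p in (0.5, 0.9, 0.99):
    print(p, round(amdahl_speedup(p, 1000), 1))
```

So if only half of the work is parallelisable, no number of GPU cores can make the whole thing more than twice as fast.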

Before we compare training with and without a GPU, let's spin up TensorBoard.

In [16]:
%load_ext tensorboard
%tensorboard --logdir runs

## First, let's create a CNN

In [17]:
import torchvision
import torchvision.transforms as transforms
from time import time
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

class CNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = torch.nn.MaxPool2d(2, 2)
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(3, 6, 5),
            torch.nn.ReLU(),
            self.pool,
            torch.nn.Conv2d(6, 16, 5, padding=3),
            torch.nn.ReLU(),
            torch.nn.Conv2d(16, 16, 5),
            torch.nn.ReLU(),
            self.pool,
            torch.nn.Conv2d(16, 16, 3),
            torch.nn.ReLU(),
            self.pool,
            torch.nn.Flatten()
        )
        self.fc = torch.nn.Sequential(
            torch.nn.Linear(64, 120),
            torch.nn.ReLU(),
            torch.nn.Linear(120, 84),
            torch.nn.ReLU(),
            torch.nn.Linear(84, 10)
        )

    def forward(self, x):
        x = self.conv(x)
        x = self.fc(x)
        return x


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100.0%Extracting ./data/cifar-10-python.tar.gz to ./data
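
The first linear layer expects 64 input features. A quick way to check where that number comes from is to push a dummy image through the convolutional layers and print the shape after each one. The sketch below rebuilds the same convolutional stack as the `CNN` class so that it can run on its own.

```python
import torch

pool = torch.nn.MaxPool2d(2, 2)
# Same layers, in the same order, as CNN.conv above
conv = torch.nn.Sequential(
    torch.nn.Conv2d(3, 6, 5), torch.nn.ReLU(), pool,
    torch.nn.Conv2d(6, 16, 5, padding=3), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, 5), torch.nn.ReLU(), pool,
    torch.nn.Conv2d(16, 16, 3), torch.nn.ReLU(), pool,
    torch.nn.Flatten(),
)

x = torch.zeros(1, 3, 32, 32)   # one dummy CIFAR-10-sized image
for layer in conv:
    x = layer(x)
    print(f'{layer.__class__.__name__:<10} {tuple(x.shape)}')
```

The last shape printed has 16 channels x 2 x 2 = 64 features per image, which matches `torch.nn.Linear(64, 120)`.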


Let's make a function to train a model on our CPU.

In [None]:
def trainCPU(model, dataloader, criterion, optimiser, writer, epochs=1):
    start = time()
    for epoch in range(epochs):
        for idx, (x, y) in enumerate(dataloader):
            h = model(x)
            loss = criterion(h, y)
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()
            writer.add_scalar('Train/Loss', loss.item(), epoch*len(dataloader) + idx) # .item() gives the loss as a Python float
        print(f'Epoch: {epoch}\tBatch: {idx}\tLoss: {loss.item()}')
    duration = time() - start
    print('Time taken on CPU:', duration)

nn = CNN()          
criterion = torch.nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(nn.parameters())

trainCPU(nn, train_loader, criterion, optimiser, writer)


Now let's train the same model for one epoch on a GPU. PyTorch makes this very easy, and not a lot changes.

In [2]:
def trainGPU(model, dataloader, criterion, optimiser, writer, epochs=1):
    start = time()
    # uses the global `device`, which is defined below this function
    model.to(device) # THIS LINE IS NEW - it recursively moves all of the model's parameters to the device
    for epoch in range(epochs):
        for idx, (x, y) in enumerate(dataloader):
            x, y = x.to(device), y.to(device) # THIS LINE IS NEW - move the example features and labels to the same device as the model
            h = model(x)
            loss = criterion(h, y)
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()
            writer.add_scalar('Train/Loss', loss.item(), epoch*len(dataloader) + idx) # .item() copies the loss back to the CPU as a Python float
        print(f'Epoch: {epoch}\tBatch: {idx}\tLoss: {loss.item()}')
    duration = time() - start
    print('Time taken on GPU:', duration)
        
device = torch.device('cuda' if cuda_available else 'cpu')
print('DEVICE:', device)
                  

nn = CNN()
nn.to(device)
optimiser = torch.optim.Adam(nn.parameters())
trainGPU(nn, train_loader, criterion, optimiser, writer) # trainGPU has no `gpu` argument - the device is chosen above



Compare the wall times (how long the code took to run) on the CPU and on the GPU.
Try this out with different batch sizes to see what proportion of the time is being used to move tensors between devices.
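
One possible way to run that experiment is sketched below. To keep it short and self-contained it uses random CIFAR-10-shaped data and a tiny linear model rather than the CNN and data loader above. These stand-ins, and the names `sync` and `transfer_fraction`, are assumptions made for this example. Without a usable GPU, everything stays on the CPU and the fraction will be close to zero.

```python
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def sync():
    """Wait for queued GPU work so that timings are accurate (does nothing on CPU)."""
    if device.type == 'cuda':
        torch.cuda.synchronize()

def transfer_fraction(batch_size, n_samples=2048):
    """Fraction of one epoch's wall time spent moving batches to `device`."""
    # Stand-ins for the real data and model: random inputs and a linear classifier
    features = torch.rand(n_samples, 3, 32, 32)
    labels = torch.randint(0, 10, (n_samples,))
    model = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(3 * 32 * 32, 10),
    ).to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimiser = torch.optim.Adam(model.parameters())

    transfer_time = 0.0
    sync()
    epoch_start = time.perf_counter()
    for i in range(0, n_samples, batch_size):
        x, y = features[i:i + batch_size], labels[i:i + batch_size]
        sync()
        t0 = time.perf_counter()
        x, y = x.to(device), y.to(device)   # the part we want to measure
        sync()
        transfer_time += time.perf_counter() - t0
        loss = criterion(model(x), y)
        loss.backward()
        optimiser.step()
        optimiser.zero_grad()
    sync()
    return transfer_time / (time.perf_counter() - epoch_start)

for batch_size in (32, 256, 1024):
    print(batch_size, round(transfer_fraction(batch_size), 3))
```

Synchronising on every batch adds a little overhead of its own, so treat the numbers as a rough guide rather than an exact measurement.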