
# Tutorial 1: Introduction to PyTorch


## Setup
Adapted from https://colab.research.google.com/github/PytorchLightning/lightning-tutorials/blob/publication/.notebooks/course_UvA-DL/01-introduction-to-pytorch.ipynb#scrollTo=c5bb5655

In [None]:
! pip install --quiet "urllib3" "pytorch-lightning>=1.4, <2.1.0" "setuptools==67.7.2" "torch>=1.8.1, <2.1.0" "torchmetrics>=0.7, <0.12" "matplotlib>=3.0.0, <3.8.0" "ipython[notebook]==7.34.0" "lightning>=2.0.0rc0" "matplotlib"


Set up the environment:

In [None]:
import time

import matplotlib.pyplot as plt

%matplotlib inline
import matplotlib_inline.backend_inline
import numpy as np
import torch
import torch.nn as nn
import torch.utils.data as data
from matplotlib.colors import to_rgba
from torch import Tensor
from tqdm.notebook import tqdm  # Progress bar

matplotlib_inline.backend_inline.set_matplotlib_formats("svg", "pdf")  # For export

## The Basics of PyTorch

As a first step, we can check the torch version:

In [None]:
print("Using torch", torch.__version__)

Using torch 2.0.1+cu118


As in every machine learning framework, PyTorch provides stochastic functions, such as those that generate random numbers.
However, it is good practice to set up your code so that it is reproducible with the exact same random numbers.
This is why we set a seed below.

In [None]:
torch.manual_seed(42)  # Setting the seed

<torch._C.Generator at 0x7afc2c0f8e10>
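
As a quick check of what the seed does, the sketch below draws random numbers twice and resets the seed in between. The variable names `first` and `second` are only illustrative.

```python
import torch

torch.manual_seed(42)
first = torch.rand(3)

torch.manual_seed(42)  # Resetting the seed restarts the random sequence
second = torch.rand(3)

# Both draws start from the same generator state, so they are identical
print(torch.equal(first, second))
```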

### Tensors

The name "tensor" is a generalization of concepts you already know.
For instance, a vector is a 1-D tensor, and a matrix a 2-D tensor.

#### Initialization

Let's first start by looking at different ways of creating a tensor!

In [None]:
x = Tensor(2, 3, 4)
print(x)

tensor([[[9.9130e+04, 4.4118e-41, 9.9130e+04, 4.4118e-41],
         [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
         [1.2845e+31, 1.6045e+02, 1.3926e+19, 8.6543e+05]],

        [[1.4217e+19, 8.7765e+05, 5.1580e-02, 2.0535e-19],
         [1.8617e+25, 5.9423e-02, 1.5870e-19, 5.2053e+34],
         [1.8500e+20, 2.0333e+32, 1.1259e+24, 1.1700e-19]]])


The function `torch.Tensor` allocates memory for the desired tensor but does not initialize it, so the tensor contains whatever values were already in that memory.
To directly assign values to the tensor during initialization, there are many alternatives including:

* `torch.zeros`: Creates a tensor filled with zeros
* `torch.ones`: Creates a tensor filled with ones
* `torch.rand`: Creates a tensor with random values uniformly sampled between 0 and 1
* `torch.randn`: Creates a tensor with random values sampled from a normal distribution with mean 0 and variance 1
* `torch.arange`: Creates a tensor containing the values $N,N+1,N+2,...,M-1$ when called as `torch.arange(N, M)` (the end value $M$ is excluded)
* `torch.Tensor` (input list): Creates a tensor from the list elements you provide
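
The initializers listed above can be tried side by side. This is a minimal sketch; the shape `(2, 3)` and the variable names are arbitrary choices for illustration.

```python
import torch

zeros = torch.zeros(2, 3)    # All entries are 0
ones = torch.ones(2, 3)      # All entries are 1
uniform = torch.rand(2, 3)   # Uniform samples from [0, 1)
normal = torch.randn(2, 3)   # Samples from a standard normal distribution
steps = torch.arange(2, 6)   # 2, 3, 4, 5 (the end value 6 is excluded)

for name, t in [("zeros", zeros), ("ones", ones), ("uniform", uniform),
                ("normal", normal), ("steps", steps)]:
    print(name, tuple(t.shape), t.dtype)
```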

In [None]:
# Create a tensor from a (nested) list
x = Tensor([[1, 2], [3, 4]])
print(x)

tensor([[1., 2.],
        [3., 4.]])


In [None]:
# Create a tensor with random values between 0 and 1 with the shape [2, 3, 4]
x = torch.rand(2, 3, 4)
print(x)

tensor([[[0.9811, 0.0874, 0.0041, 0.1088],
         [0.1637, 0.7025, 0.6790, 0.9155],
         [0.2418, 0.1591, 0.7653, 0.2979]],

        [[0.8035, 0.3813, 0.7860, 0.1115],
         [0.2477, 0.6524, 0.6057, 0.3725],
         [0.7980, 0.8399, 0.1374, 0.2331]]])


You can obtain the shape of a tensor in the same way as in numpy (`x.shape`), or using the `.size` method:

In [None]:
shape = x.shape
print("Shape:", x.shape)

size = x.size()
print("Size:", size)

dim1, dim2, dim3 = x.size()
print("Size:", dim1, dim2, dim3)

Shape: torch.Size([2, 3, 4])
Size: torch.Size([2, 3, 4])
Size: 2 3 4


#### Tensor to Numpy, and Numpy to Tensor

Tensors can be converted to numpy arrays, and numpy arrays back to tensors.
To transform a numpy array into a tensor, we can use the function `torch.from_numpy`:

In [None]:
np_arr = np.array([[1, 2], [3, 4]])
tensor = torch.from_numpy(np_arr)

print("Numpy array:", np_arr)
print("PyTorch tensor:", tensor)

Numpy array: [[1 2]
 [3 4]]
PyTorch tensor: tensor([[1, 2],
        [3, 4]])


To transform a PyTorch tensor back to a numpy array, we can use the function `.numpy()` on tensors:

In [None]:
tensor = torch.arange(4)
np_arr = tensor.numpy()

print("PyTorch tensor:", tensor)
print("Numpy array:", np_arr)

PyTorch tensor: tensor([0, 1, 2, 3])
Numpy array: [0 1 2 3]


The conversion of tensors to numpy requires the tensor to be on the CPU, and not the GPU (more on GPU support in a later section).
In case you have a tensor on GPU, you need to call `.cpu()` on the tensor beforehand.
Hence, you get a line like `np_arr = tensor.cpu().numpy()`.
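
Two details of these conversions are easy to miss, and the sketch below illustrates both. First, `torch.from_numpy` shares memory with the numpy array instead of copying it. Second, a tensor that tracks gradients has to be detached before it can be converted. The variable names are illustrative.

```python
import numpy as np
import torch

np_arr = np.array([1, 2, 3])
shared = torch.from_numpy(np_arr)  # Shares memory with np_arr, no copy is made
np_arr[0] = 100                    # The change is visible through the tensor
print(shared)

t = torch.arange(4, dtype=torch.float32, requires_grad=True)
# detach() drops the autograd history; cpu() does nothing if t is already on the CPU
as_numpy = t.detach().cpu().numpy()
print(as_numpy)
```

The chain `.detach().cpu().numpy()` works regardless of where the tensor lives, which makes it a convenient default.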

#### Operations

Most operations that exist in numpy also exist in PyTorch.
A full list of operations can be found in the [PyTorch documentation](https://pytorch.org/docs/stable/tensors.html#), but we will review the most important ones here.

The simplest operation is to add two tensors:

In [None]:
x1 = torch.rand(2, 3)
x2 = torch.rand(2, 3)
y = x1 + x2

print("X1", x1)
print("X2", x2)
print("Y", y)

X1 tensor([[0.9578, 0.3313, 0.3227],
        [0.0162, 0.2137, 0.6249]])
X2 tensor([[0.4340, 0.1371, 0.5117],
        [0.1585, 0.0758, 0.2247]])
Y tensor([[1.3918, 0.4683, 0.8345],
        [0.1747, 0.2895, 0.8496]])


Calling `x1 + x2` creates a new tensor containing the sum of the two inputs.
However, we can also use in-place operations that are applied directly on the memory of a tensor.
We therefore change the values of `x2` without the chance to re-access the values `x2` had before the operation.
An example is shown below:

In [None]:
x1 = torch.rand(2, 3)
x2 = torch.rand(2, 3)
print("X1 (before)", x1)
print("X2 (before)", x2)

x2.add_(x1)
print("X1 (after)", x1)
print("X2 (after)", x2)

X1 (before) tensor([[0.0624, 0.1816, 0.9998],
        [0.5944, 0.6541, 0.0337]])
X2 (before) tensor([[0.1716, 0.3336, 0.5782],
        [0.0600, 0.2846, 0.2007]])
X1 (after) tensor([[0.0624, 0.1816, 0.9998],
        [0.5944, 0.6541, 0.0337]])
X2 (after) tensor([[0.2340, 0.5152, 1.5780],
        [0.6545, 0.9386, 0.2343]])


In-place operations are usually marked with an underscore postfix (for example `x2.add_(x1)` instead of `x2.add(x1)`).

Another common operation aims at changing the shape of a tensor.
A tensor of size (2,3) can be re-organized to any other shape with the same number of elements (e.g. a tensor of size (6), or (3,2), ...).
In PyTorch, this operation is called `view`:

In [None]:
x = torch.arange(6)
print("X", x)

X tensor([0, 1, 2, 3, 4, 5])


In [None]:
x = x.view(2, 3)
print("X", x)

X tensor([[0, 1, 2],
        [3, 4, 5]])


In [None]:
x = x.permute(1, 0)  # Swapping dimension 0 and 1
print("X", x)

X tensor([[0, 3],
        [1, 4],
        [2, 5]])
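
One caveat when combining these two operations: `permute` only changes how the existing memory is indexed, so its result is generally not contiguous, and a subsequent `view` can fail. The sketch below shows the failure and one way around it with `reshape`. The variable names are illustrative.

```python
import torch

x = torch.arange(6).view(2, 3)
xt = x.permute(1, 0)           # Shape [3, 2]; no data is copied
print(xt.is_contiguous())      # The memory layout no longer matches the shape

try:
    xt.view(6)                 # view needs a compatible memory layout
except RuntimeError as err:
    print("view failed:", type(err).__name__)

flat = xt.reshape(6)           # reshape copies the data when it has to
print(flat)
```

Calling `xt.contiguous().view(6)` is an equivalent alternative to `reshape` here.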


Other commonly used operations include matrix multiplications, which are essential for neural networks.
Quite often, we have an input vector $\mathbf{x}$, which is transformed using a learned weight matrix $\mathbf{W}$.
There are multiple ways and functions to perform matrix multiplication, some of which we list below:

* `torch.matmul`: Performs the matrix product over two tensors, where the specific behavior depends on the dimensions.
If both inputs are matrices (2-dimensional tensors), it performs the standard matrix product.
For higher dimensional inputs, the function supports broadcasting (for details see the [documentation](https://pytorch.org/docs/stable/generated/torch.matmul.html?highlight=matmul#torch.matmul)).
Can also be written as `a @ b`, similar to numpy.
* `torch.mm`: Performs the matrix product over two matrices, but doesn't support broadcasting (see [documentation](https://pytorch.org/docs/stable/generated/torch.mm.html?highlight=torch%20mm#torch.mm))
* `torch.bmm`: Performs the matrix product with support for a batch dimension.
If the first tensor $T$ is of shape ($b\times n\times m$), and the second tensor $R$ ($b\times m\times p$), the output $O$ is of shape ($b\times n\times p$), and has been calculated by performing $b$ matrix multiplications of the submatrices of $T$ and $R$: $O_i = T_i @ R_i$
* `torch.einsum`: Performs matrix multiplications and more (i.e. sums of products) using the Einstein summation convention.
Explanation of the Einstein sum can be found in assignment 1.
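
To see how the functions listed above relate to each other, the following sketch computes the same batched product in four ways. The shapes ($b=4$, $n=2$, $m=3$, $p=5$) and the einsum subscripts are chosen for illustration only.

```python
import torch

torch.manual_seed(0)
T = torch.rand(4, 2, 3)  # b=4 matrices of shape n x m = 2 x 3
R = torch.rand(4, 3, 5)  # b=4 matrices of shape m x p = 3 x 5

out_bmm = torch.bmm(T, R)                        # Batched matrix product
out_matmul = torch.matmul(T, R)                  # Same result, also written T @ R
out_einsum = torch.einsum("bnm,bmp->bnp", T, R)  # Sum over the shared index m
# torch.mm only handles 2-D inputs, so we loop over the batch dimension
out_loop = torch.stack([torch.mm(T[i], R[i]) for i in range(T.shape[0])])

print(out_bmm.shape)
print(torch.allclose(out_bmm, out_matmul),
      torch.allclose(out_bmm, out_einsum),
      torch.allclose(out_bmm, out_loop))
```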

Usually, we use `torch.matmul` or `torch.bmm`. We can try a matrix multiplication with `torch.matmul` below.

In [None]:
x = torch.ones(6)
x = x.view(2, 3)
print("X", x)

X tensor([[1., 1., 1.],
        [1., 1., 1.]])


In [None]:
W = torch.ones(9).view(3, 3)  # We can also stack multiple operations in a single line
print("W", W)

W tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])


In [None]:
h = torch.matmul(x, W)  # Verify the result by calculating it by hand too!
print("h", h)

h tensor([[3., 3., 3.],
        [3., 3., 3.]])


#### Indexing

We often need to select a part of a tensor.
Indexing works just like in numpy, so let's try it:

In [None]:
x = torch.arange(12).view(3, 4)
print("X", x)

X tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])


In [None]:
print(x[:, 1])  # Second column

tensor([1, 5, 9])


In [None]:
print(x[0])  # First row

tensor([0, 1, 2, 3])


In [None]:
print(x[:2, -1])  # First two rows, last column

tensor([3, 7])


In [None]:
print(x[1:3, :])  # Last two rows

tensor([[ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
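
Beyond slices, numpy-style boolean masks and index lists also work on tensors. The sketch below is an illustrative addition; it also shows that basic indexing returns a view of the original tensor rather than a copy.

```python
import torch

x = torch.arange(12).view(3, 4)

mask_selected = x[x > 8]   # Boolean mask: all elements greater than 8
rows = x[[0, 2]]           # Index list: first and last row
print(mask_selected)
print(rows)

row = x[0]                 # Basic indexing returns a view, not a copy
row[0] = -1                # Writing to the view also changes x[0, 0]
print(x[0, 0])
```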


### Dynamic Computation Graph and Backpropagation

One of the main reasons for using PyTorch in Deep Learning projects is that we can automatically get **gradients/derivatives** of functions that we define.

If we use weight matrices in our function that we want to learn, then those are called the **parameters** or simply the **weights**.

- Given an input $\mathbf{x}$, we define our function by **manipulating** that input.
- As we manipulate our input, we are automatically creating a **computational graph**.
- PyTorch is a **define-by-run** framework; this means that we can just do our manipulations, and PyTorch will keep track of that graph for us.

So, to recap: the only thing we have to do is to compute the **output**, and then we can ask PyTorch to automatically get the **gradients**.

The first thing we have to do is to specify which tensors require gradients.
By default, when we create a tensor, it does not require gradients.

In [None]:
x = torch.ones((3,))
print(x.requires_grad)

False


We can change this for an existing tensor using the function `requires_grad_()` (the underscore indicates that this is an in-place operation).
Alternatively, when creating a tensor, you can pass the argument
`requires_grad=True` to most initializers we have seen above.

In [None]:
x.requires_grad_(True)
print(x.requires_grad)

True
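
The alternative of passing `requires_grad=True` at creation time can be sketched as follows. The names `w`, `v`, and `out` are illustrative. Note that the result of an operation requires gradients as soon as one of its inputs does.

```python
import torch

w = torch.ones(3, requires_grad=True)  # Tracked from the moment it is created
v = torch.ones(3)                      # Not tracked by default
out = (w * v).sum()                    # Depends on w, so it is tracked as well

print(w.requires_grad, v.requires_grad, out.requires_grad)
```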


In order to get familiar with the concept of a computation graph, we will create one for the following function:

$$y = \frac{1}{|x|}\sum_i \left[(x_i + 2)^2 + 3\right]$$

You could imagine that $x$ are our parameters, and we want to optimize (either maximize or minimize) the output $y$.
For this, we want to obtain the gradients $\partial y / \partial \mathbf{x}$.
For our example, we'll use $\mathbf{x}=[0,1,2]$ as our input.

In [None]:
x = torch.arange(3, dtype=torch.float32, requires_grad=True)  # Only float tensors can have gradients
print("X", x)

X tensor([0., 1., 2.], requires_grad=True)


Now let's build the computation graph step by step.
You can combine multiple operations in a single line, but we will
separate them here to get a better understanding of how each operation
is added to the computation graph.

In [None]:
a = x + 2
b = a**2
c = b + 3
y = c.mean()
print("Y", y)

Y tensor(12.6667, grad_fn=<MeanBackward0>)


Using the statements above, we have created a computation graph that looks similar to the figure below:

<center style="width: 100%"><img src="https://github.com/Lightning-AI/lightning-tutorials/raw/main/course_UvA-DL/01-introduction-to-pytorch/pytorch_computation_graph.svg" width="200px"></center>

- We calculate $a$ based on the inputs $x$ and the constant $2$, $b$ is $a$ squared, and so on.
- The visualization is an abstraction of the dependencies between inputs and outputs of the operations we have applied.
- Each node of the computation graph has automatically defined a function for calculating the gradients with respect to its inputs, `grad_fn`.
- We can perform backpropagation on the computation graph by calling the
function `backward()` on the last output, which effectively calculates
the gradients for each tensor that has the property
`requires_grad=True`:

In [None]:
y.backward()

`x.grad` will now contain the gradient $\partial y/ \partial \mathbf{x}$, and this gradient indicates how a change in $\mathbf{x}$ will affect output $y$ given the current input $\mathbf{x}=[0,1,2]$:

In [None]:
print(x.grad)

tensor([1.3333, 2.0000, 2.6667])


We can also verify these gradients by hand.
We will calculate the gradients using the chain rule, in the same way as PyTorch did it:

$$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial c_i}\frac{\partial c_i}{\partial b_i}\frac{\partial b_i}{\partial a_i}\frac{\partial a_i}{\partial x_i}$$

Note that we have simplified this equation to index notation, using the fact that all operations besides the mean do not combine the elements in the tensor.
The partial derivatives are:

$$
\frac{\partial a_i}{\partial x_i} = 1,\hspace{1cm}
\frac{\partial b_i}{\partial a_i} = 2\cdot a_i,\hspace{1cm}
\frac{\partial c_i}{\partial b_i} = 1,\hspace{1cm}
\frac{\partial y}{\partial c_i} = \frac{1}{3}
$$

Hence, with the input being $\mathbf{x}=[0,1,2]$, our gradients are $\partial y/\partial \mathbf{x}=[4/3,2,8/3]$.
The previous code cell should have printed the same result.
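
The hand calculation can also be checked in code. The sketch below rebuilds the same graph, multiplies the partial derivatives from the chain rule, and compares the result with autograd. The name `manual_grad` is illustrative.

```python
import torch

x = torch.arange(3, dtype=torch.float32, requires_grad=True)
y = ((x + 2) ** 2 + 3).mean()
y.backward()

# Chain rule: dy/dx_i = (1/3) * 1 * 2 * a_i * 1 with a_i = x_i + 2
manual_grad = 2 * (x.detach() + 2) / x.numel()

print("autograd:", x.grad)
print("manual:  ", manual_grad)
print(torch.allclose(x.grad, manual_grad))
```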

### GPU support

First, let's check whether you have a GPU available:

In [None]:
gpu_avail = torch.cuda.is_available()
print(f"Is the GPU available? {gpu_avail}")

Is the GPU available? True


- By default, all tensors you create are stored on the CPU.
We can push a tensor to the GPU by using the function `.to(...)`, or `.cuda()`.
- It is often a good practice to define a `device` object in your code which points to the GPU if you have one, and otherwise to the CPU.

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("Device", device)

Device cuda


Now let's create a tensor and push it to the device:

In [None]:
x = torch.zeros(2, 3)
x = x.to(device)
print("X", x)

X tensor([[0., 0., 0.],
        [0., 0., 0.]], device='cuda:0')


- In case you have a GPU, you should now see the attribute `device='cuda:0'` being printed next to your tensor.
- The zero next to cuda indicates that this is the zero-th GPU device on your computer.
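
Instead of creating a tensor on the CPU and moving it afterwards, most initializers also accept a `device` argument. The sketch below compares both routes and falls back to the CPU if no GPU is present.

```python
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

a = torch.zeros(2, 3, device=device)  # Allocated directly on the device
b = torch.ones(2, 3).to(device)       # Created on the CPU, then copied over
c = a + b                             # Both operands must be on the same device

print(c.device)
```

Creating the tensor directly on the device avoids the intermediate CPU copy.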

We can also compare the runtime of a large matrix multiplication on the CPU with the same operation on the GPU:

In [None]:
x = torch.randn(5000, 5000)

# CPU version
start_time = time.time()
_ = torch.matmul(x, x)
end_time = time.time()
print(f"CPU time: {(end_time - start_time):6.5f}s")

# GPU version
if torch.cuda.is_available():
    x = x.to(device)
    # CUDA is asynchronous, so we need to use different timing functions
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    _ = torch.matmul(x, x)
    end.record()
    torch.cuda.synchronize()  # Waits for everything to finish running on the GPU
    print(f"GPU time: {0.001 * start.elapsed_time(end):6.5f}s")  # Milliseconds to seconds

CPU time: 2.02440s
GPU time: 0.08774s


As `matmul` operations are very common in neural networks, we can already see the great benefit of training a NN on a GPU.
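
Single measurements like the one above can be noisy, for example because the first GPU call includes one-time initialization. A slightly more careful measurement, sketched here with a hypothetical helper `time_matmul` (not part of PyTorch), adds a warm-up run and averages over a few repetitions.

```python
import time

import torch


def time_matmul(size, device, repeats=3):
    """Return the average time in seconds of a (size x size) matrix product."""
    x = torch.randn(size, size, device=device)
    torch.matmul(x, x)  # Warm-up run, not timed
    if device.type == "cuda":
        torch.cuda.synchronize()  # Wait for the warm-up to finish
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(x, x)
    if device.type == "cuda":
        torch.cuda.synchronize()  # Wait for all queued GPU work
    return (time.perf_counter() - start) / repeats


device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
elapsed = time_matmul(256, device)
print(f"Average time on {device}: {elapsed:6.5f}s")
```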

The CPU and the GPU use separate random number generators.
Hence, we also set the seed on the GPU explicitly to ensure reproducible code.

In [None]:
# GPU operations have a separate seed we also want to set
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
    torch.cuda.manual_seed_all(42)

# Additionally, some operations on a GPU are implemented non-deterministically for efficiency
# We want to ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
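
These settings are often bundled into a single helper. The function `set_seed` below is a hypothetical convenience wrapper, not a PyTorch API. Seeding Python's `random` module and numpy as well is an extra assumption, useful if your code draws random numbers from them.

```python
import random

import numpy as np
import torch


def set_seed(seed):
    """Seed the common random number generators and request deterministic cuDNN."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

set_seed(42)
a = torch.rand(2, device=device)
set_seed(42)
b = torch.rand(2, device=device)
print(torch.equal(a, b))  # Same seed, same numbers on the chosen device
```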