# Exercise 2
## Authors: E. Vercesi; A. Dei Rossi, G. Dominici, S. Huber

In this exercise session you are going to learn the basics of PyTorch. 
PyTorch is a Python library for scientific computing (as much as NumPy), but which can additionally run on GPUs. 
Hence, this is the computing library of choice for Deep Learning applications. 
PyTorch is developed by Meta. You might have also heard of its main competitor TensorFlow (Google). Although both have basically the same functionalities, in this course we would like you to stick to Pytorch.
If you haven't done Exercise 1 on NumPy yet, we highly encourage to do it first: NumPy and PyTorch offer a vast overlap of functionalities, so understanding NumPy first is going to boost greatly your understanding of PyTorch.
To begin with, make sure you have installed it. 

In [26]:
import torch  # If you see errors, use conda or pip to install torch in your virtual environment.
import numpy as np

torch.manual_seed(42)  # manual seed is to ensure repeatability of random numbers. 

<torch._C.Generator at 0x1116a1fd0>

**Question (for fun):** Why the seed is often [42](https://www.youtube.com/watch?v=aboZctrHfK8)?


## Create tensors

Tensors are like numpy arrays, but they can live in the GPU.

1. Create a tensor out of a Python list [1, 2, 3]
2. Create a tensor out of a NumPy array [[2, 3, 4], [4, 3, 2]] (see method [`.from_numpy()`](https://pytorch.org/docs/stable/generated/torch.from_numpy.html))
3. Convert the tensor of point 2 back to a NumPy array. (see method [`.numpy()`](https://pytorch.org/docs/stable/generated/torch.Tensor.numpy.html))

In [27]:
## 1: create a tensor out of a Python list
tens = torch.Tensor([1,2,3]) 
print(tens)

## 2: create a tensor out of a NumPy array
a = np.array([[2,3,4],[4,3,2]])
t = torch.from_numpy(a)
print(t)

## 3: Convert the tensor back to NumPy.
print(torch.Tensor.numpy(t))

tensor([1., 2., 3.])
tensor([[2, 3, 4],
        [4, 3, 2]])
[[2 3 4]
 [4 3 2]]


Check the `.dtype` attribute of the above created tensors. Create a tensor of size 3 with values [1, 2, 3] but forcing the dtype to be float64.

In [28]:
## 1: create [1, 2, 3] with dtype float64
tens = torch.tensor([1,2,3], dtype=torch.float64)
print(tens)


tensor([1., 2., 3.], dtype=torch.float64)


PyTorch also offers some more advanced functions that can be used to create well known matrices:

1. Create an identity matrix of size (5, 5) (see [`torch.eye()`](https://pytorch.org/docs/stable/generated/torch.eye.html)).
2. Create a matrix of all zeros of size (3, 4) (see [`torch.zeros()`](https://pytorch.org/docs/stable/generated/torch.zeros.html).
3. Create a matrix of all ones of size (2, 3) (see [`torch.ones()`](https://pytorch.org/docs/stable/generated/torch.ones.html).
4. Given a tensor of size (3, 2) of your choice, create a matrix of the same size (3, 2) filled with ones (equivalently zeros) (see [`torch.zeros_like()`](https://pytorch.org/docs/stable/generated/torch.zeros_like.html)
5. Create a matrix of size (3, 4) filled with numbers from 0 to 11 inclusive (same as in NumPy). Try both [`torch.arange()`](https://pytorch.org/docs/stable/generated/torch.arange.html) and [`torch.linspace()`](https://pytorch.org/docs/stable/generated/torch.linspace.html).

In [29]:
## 1:
identity = torch.eye(5,5)
print(identity)

## 2:
all_zeros = torch.zeros(3, 4)
print(all_zeros)

## 3:
all_ones = torch.ones(2,3)
print(all_ones)

## 4:
tens = torch.tensor([[1,2,3],[4,5,6]])
zeros_like = torch.zeros_like(tens)
print(zeros_like)

## 5:
## torch.arange()
tens = torch.arange(start=0, end=12, out=torch.zeros(3, 4))
print(tens)

## torch.linspace()
tens2 = torch.linspace(start=0, end=12, steps=12, out=torch.zeros(3, 4))
print(tens2)

tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]])
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
tensor([[1., 1., 1.],
        [1., 1., 1.]])
tensor([[0, 0, 0],
        [0, 0, 0]])
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
tensor([[ 0.0000,  1.0909,  2.1818,  3.2727],
        [ 4.3636,  5.4545,  6.5455,  7.6364],
        [ 8.7273,  9.8182, 10.9091, 12.0000]])


### Random arrays

As in NumPy, you have a big choice of random distributions to sample you arrays from.
Try to do the same random arrays you tried to do in NumPy in Exercise 1:
1) Create a random tensor of size 4 of uniform floating point numbers in the interval [0, 1). (see [`torch.rand`](https://pytorch.org/docs/stable/generated/torch.rand.html))
2) Create a random tensor of size (3, 2) of uniform floating point numbers in the interval [0, 5). (hint: generate numbers in the interval [0, 1) and scale them up by 5).
3) Create a random tensor of size (2, 1, 2) of integers in the interval [10, 20]. (see [`torch.randint`](https://pytorch.org/docs/stable/generated/torch.randint.html), caraful with border conditions!)
4) Create a random tensor of size 10 over the normal distribution, mean 3 and std dev 2. (see [`torch.normal`](https://pytorch.org/docs/stable/generated/torch.normal.html))

In [30]:
## 1:
tens = torch.rand(4)
print(tens)

## 2:
tens = torch.rand(3, 2)
print(tens+4)

## 3:
tens = torch.randint(low=10, high=21, size=(2,1,2))
print(tens)

## 4:
tens = torch.normal(mean=3, std=2, size=(10,))
print(tens)

tensor([0.8823, 0.9150, 0.3829, 0.9593])
tensor([[4.3904, 4.6009],
        [4.2566, 4.7936],
        [4.9408, 4.1332]])
tensor([[[11, 20]],

        [[13, 14]]])
tensor([3.7531, 2.6384, 3.7861, 3.8654, 0.2746, 5.7129, 4.3376, 1.5846, 2.3466,
        2.4424])


## Device (GPU vs CPU)

In this section we will learn how do computation using the GPU instead of the CPU: notice that this is the reason why in Deep Learning applications PyTorch is used over NumPy.

By default, tensors are accessed by the CPU. You can check it easily using the [`.device()`](https://pytorch.org/docs/stable/tensor_attributes.html#torch.device) method.
1) Create an identity matrix of size (4, 4) and access its device attribute.

In [32]:
## 1: see .device of a matrix

v = torch.eye(4,4)  # create a tensor
print(torch.cuda.current_device())

AssertionError: Torch not compiled with CUDA enabled

Hence, every time we want to use the GPU, we need to explicitly move the tensors to the desired device. Careful here: your laptop doesn't necessarily have a dedicated GPU. And, even if it has one, it might not be compatible with CUDA (the NVIDIA interface that allows computations to be performed on the GPU).

You can check if CUDA is available on your machine by simply using [`cuda.is_available()`](https://pytorch.org/docs/stable/generated/torch.cuda.is_available.html).

In [33]:
torch.cuda.is_available()

False

If the above returns False, it could be either because you didn't install correctly CUDA, or because you laptop doesn't have a GPU compatible with it. 
If you have a macbook with Apple Silicon processors, you can still use the device `mps`:

In [34]:
# For mac M1/2/3 users
torch.backends.mps.is_available()

True

We can set the device either to these three options:
- `cuda` (if you have a NVIDIA graphics card). Might be `cuda:0` etc if you have more than one.
- `mps` (if you have a MacBook with M1/2/3 processor)
- `cpu` otherwise

If your laptop doesn't have any of the above-mentioned devices apart from the CPU, you can use Google Colab's or Kaggle's notebooks: they offer free hours of GPU per week (they count the hours the kernel is running, not if you are actually using the notebook. Remember to shut it down when you don't use it!!!)

In [35]:
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
device

device(type='mps')

Finally, move the tensor `v` you created earlier to the most convenient device. Use function [`.to`](https://pytorch.org/docs/stable/generated/torch.Tensor.to.html). Careful: is it an in-place method? Check that the device is indeed correct.

In [36]:
# Move vector v to the correct device. Check it is indeed on the desired device.
v = v.to(device)

You can also create a tensor and send it directly to the correct device. 
1. Create a tensor of ones of size (3, 3) and specify in its constructor the `device` attribute. Check that, indeed, the tensor has been initialized with the correct device.

In [37]:
## 1: Create a tensor and initialize it to the correct device.
print(torch.tensor([1,2,3], device=device))

tensor([1, 2, 3], device='mps:0')


Later on, we will compute empirically how much faster are GPUs than CPUs for computing large calculations.

If you are using Kaggle platform for your projects (we recommend you to do that), you have at your disposal 30h/week of free GPUs: in order to activate it, you need to open a notebook, go to settings -> accelerator and you can select a GPU from there. If the GPU options are non-clickable, it is because you have to verify your account using your phone number. Go to home -> your picture (top right border) -> settings -> phone verification. Before the options become actually clickable you will need to wait a few minutes (<5').

If you are using Google Colab (also recommended), you can activate GPUs by opening a notebook -> top-right arrow pointing downward -> change runtime type -> select something which is not `CPU`.

## Working with tensors' dimensions

In this section we will learn how to manipulate tensor's dimensions. Notice that they are extremely similar to NumPy methods: hence, if you have done exercise 1, this section should be quite straightforward.

### Access elements and slicing 

Create an identity matrix of size (4, 4) and access 
1. The element in position [0, 0]
2. The last element
3. Element in position [2, 3]
Check that the returned elements are what you expect.

In [38]:
# Create the identity matrix
identity = torch.eye(4,4)

## 1: access element in [0, 0]
print(identity[0][0])

## 2: access element in [3, 3]
print(identity[-1][-1])

## 3: access element in [2, 3]
print(identity[2][3])

tensor(1.)
tensor(1.)
tensor(0.)


### Slicing

1. Create a random tensor of size (3, 4) of integers in the interval [5, 10].
2. Print the second row.
3. Print the third column.
4. Print the sub-matrix spanning from the second to the third row, from the second to the third column.

In [39]:
## 1: create a random tensor of size [3, 4].
tens = torch.randint(low=5, high=11, size=(3,4))
print(tens)

## 2: print the second row.
print(tens[1,:])

## 3: print the third column.
print(tens[:,2])

## 4: sub-matrix
print(tens[1:3,1:3])

tensor([[10,  9, 10,  7],
        [ 9,  5,  7,  5],
        [ 6,  8,  8, 10]])
tensor([9, 5, 7, 5])
tensor([10,  7,  8])
tensor([[5, 7],
        [8, 8]])


### Access tensors' dimensions

1. Create a tensor $v$ of size (3, 4, 2, 4, 1) of random floats in [0, 1)
2. Print its shape. You can use both `.shape` and `.size()`, try them both.
3. Print its third dimension's size (2 in our example). Check `.size()` function.
4. Print the number of dimensions of our vector (5 in our example). Check `.ndim`.

In [40]:
## 1: create a random tensor v of size (3, 4, 2, 4, 1).
tens = torch.rand(size=(3,4,3,4,2))

## 2: print v's shape using .shape and .size().
print(f'Shape: {tens.shape}')
print(f'Size: {tens.size()}')

## 3: print the size of the third dimension of v.
print(f'3rd size: {tens.size(2)}')

## 4: print the number of dimensions of v.
print(f'Dimensions: {tens.ndim}')

Shape: torch.Size([3, 4, 3, 4, 2])
Size: torch.Size([3, 4, 3, 4, 2])
3rd size: 3
Dimensions: 5


## Permute dimensions

You can invert the order of the dimensions of a tensor. Create a random tensor of integers in the interval [0, 10) of size (2, 3, 4) and permute its dimensions so that the final size is (4, 2, 3). See [`torch.permute`](https://pytorch.org/docs/stable/generated/torch.permute.html)

In [41]:
# Create a random tensor. Check its shape (2, 3, 4)
tens = torch.rand(size=(2,3,4))
print(tens.size())

# Permute its dimensions. Check its shape (4, 2, 3)
print(tens.permute(2,0,1).shape)

torch.Size([2, 3, 4])
torch.Size([4, 2, 3])


## Squeeze/unsqueeze

If you want increase the number of dimensions of your vector (similar to `np.newaxis`, this might turn useful in the context of broadcasting), you can use [`torch.unsqueeze`](https://pytorch.org/docs/stable/generated/torch.unsqueeze.html). If you want to reduce the number of dimensions of your vector by dropping dimensions of size 1 you can use [`torch.squeeze`](https://pytorch.org/docs/stable/generated/torch.squeeze.html) instead.

1. Create a random tensor uniform in [0, 1) of size (2, 2). Insert a new dimension so that the final shape is (2, 1, 2)
2. Add a dimension to the tensor of point 1, so that the final shape is (2, 1, 2, 1). Try to use negative indices as the argument of `torch.unsqueeze()`.
3. Turn the tensor back to its original shape (2, 2) by using `torch.squeeze()`.

In [42]:
## 1: Create a tensor of size (2, 2). Unsqueeze it so that its final shape is (2, 1, 2)
tens = torch.rand(size=(2,2))
tens = tens.unsqueeze(1)
print(tens.shape)

## 2: Add an additional dimension to the tensor so that its shape is (2, 1, 2, 1). Use negative indices
tens = tens.unsqueeze(-1)
print(tens.shape)

## 3: Turn the tensor back to shape (2, 2)
tens = tens.squeeze(-1)
tens = tens.squeeze(1)
print(tens.shape)

torch.Size([2, 1, 2])
torch.Size([2, 1, 2, 1])
torch.Size([2, 2])


## Concatenate and stack

If you have two tensors of compatible sizes, you can merge them into a unique tensor along one of their axes.
In order to get some intuition, think about having 2 2-dimensional tensors of size (3, 4). You can merge them along the first axis and get the final shape be (6, 4), or you can merge them along the second axis and get the final shape to be (3, 8), or you can go in 3D stacking one over the other (along the z-axis) and get a shape of (2, 3, 4). 

This is precisely what [`torch.concat`](https://pytorch.org/docs/stable/generated/torch.cat.html#torch.cat) (also called `.cat`) and [`torch.stack`](https://pytorch.org/docs/stable/generated/torch.stack) do. 
You should already be familiar with NumPy `axis` attribute. In PyTorch it is called `dim`.

1. Concat $v$ and $w$ along the first dimension. Check that the final shape is (6, 4).
2. Concat $v$ and $w$ along the second dimension. Check that the final shape is (3, 8).
3. Concat $v$ and $w$ along a new dimension. Check that the final shape is (2, 3, 4).

In [43]:
v = torch.randint(0, 10, (3,4))
w = torch.randint(0, 10, (3,4))

## 1: concat along first dimension.
print(torch.cat((v, w), dim=0).shape)

## 2: concat along second dimension.
print(torch.cat((v, w), dim=1).shape)


## 3: concat along new dimension.
print(torch.stack((v,w)).shape)

torch.Size([6, 4])
torch.Size([3, 8])
torch.Size([2, 3, 4])


## Broadcasting

Same as in NumPy, also PyTorch tensors allow [broadcasting](https://pytorch.org/docs/stable/notes/broadcasting.html).
When performing element-wise operations (like sums) on two tensors of mismatching sizes, the smaller tensor can adapt to the size of the larger tensor in case these simple rules apply:

- Each tensor has at least one dimension.
- When iterating over the dimension sizes, starting at the trailing (right-most) dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.

Let us see an example:

Assume you have $v = [[1, 2, 3], [4, 5, 6]]$ shape (2, 3) and $w=[3, 2, 1]$ shape (3). If we want to perform $v + w$ (element by element sum), it is clear that the dimensions don't match, but with the help of broadcasting we can still do it: $w$ is simply enlarged to reach size (2, 3) by copying itself on the first axis twice. Then, it is possible to perform element by element sum $v+w$.

Let's put broadcasting in practice:

1. Perform the above described example $v+w$ using tensors, check that the result size is (2, 3) and that numbers add up.
2. $r = [[1, 2], [3, 4], [5, 6]]$ and $l=[1, 2, 3]$. Compute $r + l$. It should raise errors. Why?
3. Adjust the size of $l$ in example 2 so that the sum works. What size should $l$ have in order for broadcasting to work on $r + l$?
4. Create random integers tensors $s$ of size (2, 1, 3, 1) and $t$ of size (1, 3, 1, 3). Does broadcasting work here in order to compute $s+t$? In case it does, predict the final shape of the result. 
 
 

In [50]:
## 1: compute v + w
v=torch.tensor([[1,2,3],[4,5,6]])
w=torch.tensor([3,2,1])
print((v+w).shape)
print((v+w))

## 2: compute r + l. It doesn't work, why?
r=torch.tensor([[1,2],[3,4],[5,6]])
l=torch.tensor([1,2,3])
# Error all dimensions mismatch

## 3: adjust the size of l, and compute r + l
l=l.unsqueeze(1)
print((r+l).shape)
print(r+l)
# Adding a dimension to l to make the left most equal,
# the right one will be expanded

## 4: compute s + t
s=torch.rand(size=(2,1,3,1))
t=torch.rand(size=(1,3,1,3))
print('Predicted shape: 2,3,3,3')
print((s+t).shape)


torch.Size([2, 3])
tensor([[4, 4, 4],
        [7, 7, 7]])
torch.Size([3, 2])
tensor([[2, 3],
        [5, 6],
        [8, 9]])
Predicted shape: 2,3,3,3
torch.Size([2, 3, 3, 3])


## PyTorch functions

In this section we are going to learn the basic functions of PyTorch.

### Mean, min, max, sum ...

These functions are quite self-explanatory, and they work the same way as in NumPy. The only detail we ought to pay attention to is the axis along we want to perform the function (in NumPy it was called `axis`, in PyTorch `dim`).

Create a random tensor $v$ of ints of size (3, 2, 4) and print it.

In order to be sure you have understood what is going on, always try to predict the result and then check that your prediction is wrong/correct.

1. Compute the min value in the entire tensor.
2. Compute the max value along axis 0.
3. Compute the min along axis 1.
4. Multi-dimensional axes: take the sum over axes (0, 1). 


In [51]:
# Create v of shape (3, 2, 4)
v=torch.rand(size=(3,2,4))
print(v)

tensor([[[0.8377, 0.5398, 0.5226, 0.3769],
         [0.0472, 0.0299, 0.2610, 0.2458]],

        [[0.6558, 0.3544, 0.3044, 0.9767],
         [0.6742, 0.8565, 0.2579, 0.2958]],

        [[0.6838, 0.1669, 0.1731, 0.4759],
         [0.3171, 0.1252, 0.7966, 0.9021]]])


In [56]:
## 1: Compute the min value of v.
print(torch.min(v))

## 2: Compute the max along axis 0.
print(torch.max(v, dim=0))

## 3: Compute the min along axis 1.
print(torch.min(v, dim=1))

## 4: Compute the sum over axes (0, 1)
print(torch.sum(v, dim=(0,1)))

tensor(0.0299)
torch.return_types.max(
values=tensor([[0.8377, 0.5398, 0.5226, 0.9767],
        [0.6742, 0.8565, 0.7966, 0.9021]]),
indices=tensor([[0, 0, 0, 1],
        [1, 1, 2, 2]]))
torch.return_types.min(
values=tensor([[0.0472, 0.0299, 0.2610, 0.2458],
        [0.6558, 0.3544, 0.2579, 0.2958],
        [0.3171, 0.1252, 0.1731, 0.4759]]),
indices=tensor([[1, 1, 1, 1],
        [0, 0, 1, 1],
        [1, 1, 0, 0]]))
tensor([3.2157, 2.0726, 2.3156, 3.2732])


### dot, matmul, transpose, *

Unlike NumPy, Torch has a stricter policy on these operands:

- `*`: is the Hadamard product, element-wise product.
- `dot`: only used to compute the dot product of two 1-dimensional tensors. Remember how confusing the dot product between multi-dimension NumPy vectors is (see Exercise 1)? PyTorch avoids this issue by simply forbidding the dimension of the input tensors to be greater than 1.
- `matmul`: or its alias `@` computes the matrix product. Can be used for larger than 2-dimensional tensors (it applies broadcasting, as much as in NumPy). Notice that the complexity of multiplying two $n\times n$ matrices is $O(n^3)$. We are taking advantage of its relatively high time-complexity in order to show how much faster are GPUs wrt CPUs.

1. Create two random integer tensors $A$ and $B$ of compatible(?) sizes and compute their Hadamard product (element by element product). Try these sizes (predict whether they work or not):
    - $A$ size (3, 4), $B$ size (3, 4).
    - $A$ size (3, 4), $B$ size (4, 4).
    - $A$ size (3, 4), $B$ size (1, 4).
2. Create two random 1-dimensional tensors $v, w$ and compute their dot product. If you can use multiple ways to compute it, check that indeed they return the same value.
3. Create $C$ of size (3, 4) and $D$ of size (4, 3). Compute the matrix product. Are the sizes compatible?
4. Create $E$ of size (3, 3) and $F$ of size (4, 3). Compute the matrix product. Are the sizes compatible? If not, use the transpose operator to adjust the dimensions of one of the two matrices and compute the matrix product.




In [66]:
## 1: Create A, B and perform hadamard product
# A (3,4), B (3,4)
a=torch.rand(size=(3,4))
b=torch.rand(size=(3,4))
print(a*b)

# A (3,4), B (4,4)
a=torch.rand(size=(3,4))
b=torch.rand(size=(4,4))
# Not work, bigger dimensions

# A (3,4), B (1,4)
a=torch.rand(size=(3,4))
b=torch.rand(size=(1,4))
print(a*b)
# Work, smaller dim is broadcasted

## 2: Create 1-dimensional tensors v, w and compute their dot product.
v=torch.rand(size=(3,))
w=torch.rand(size=(3,))
print(v*w) # element-wise
print(torch.dot(v,w)) # equal to matmul
print(torch.matmul(v,w)) # equal to dot
print()

## 3: Compute matrix product of C and D.
c=torch.rand(size=(3,4))
d=torch.rand(size=(4,3))
print(torch.matmul(c,d))

## 4: adjust dimensions using .T, and compute the matrix product E @ F
e=torch.rand(size=(3,3))
f=torch.rand(size=(4,3))
print(torch.matmul(e,f.T))


tensor([[0.7078, 0.1388, 0.0141, 0.0495],
        [0.3971, 0.7466, 0.1627, 0.2145],
        [0.4738, 0.1941, 0.0821, 0.4177]])
tensor([[0.1567, 0.2391, 0.0105, 0.2605],
        [0.2786, 0.1061, 0.1162, 0.3285],
        [0.4849, 0.1626, 0.0314, 0.5266]])
tensor([0.0014, 0.1091, 0.1050])
tensor(0.2155)
tensor(0.2155)

tensor([[0.9410, 1.5292, 2.0286],
        [0.3608, 0.6621, 0.9526],
        [0.9816, 1.2992, 1.9389]])
tensor([[0.5395, 0.5800, 0.8887, 1.0214],
        [0.4686, 0.4916, 0.6768, 0.8854],
        [0.2077, 0.1947, 0.2746, 0.3073]])


Now, we try to prove empirically that GPUs are actually faster than CPUs at doing large calculations.

Create a large tensor $G, H$ both of size (15000, 15000). Take their matrix product and measure how long it takes (use `%%time` cell magic notebook function).

In [67]:
# Create E and F
G = torch.rand(15000, 15000)
H = torch.rand(15000, 15000)

In [68]:
%%time
G @ H

CPU times: user 5.71 s, sys: 387 ms, total: 6.09 s
Wall time: 3.09 s


tensor([[3785.3940, 3736.2322, 3745.4905,  ..., 3756.3398, 3790.1528,
         3786.3767],
        [3773.0840, 3734.7131, 3755.4543,  ..., 3776.8916, 3823.2778,
         3786.1375],
        [3809.5737, 3760.7314, 3766.8491,  ..., 3791.2261, 3828.6909,
         3807.8506],
        ...,
        [3806.4031, 3771.0356, 3763.5601,  ..., 3778.9180, 3803.5864,
         3786.3508],
        [3786.4512, 3750.9412, 3757.9817,  ..., 3773.0549, 3811.6399,
         3810.8909],
        [3796.5281, 3740.4800, 3771.4990,  ..., 3812.6238, 3819.9070,
         3784.4470]])

Move $E$ and $F$ to the more convenient device at your disposal (different from CPU, if possible), and compute the same matrix product.

In [71]:
# Move the tensors to GPU in another cell, so that the time is not counted.
G = G.to(device)
H = H.to(device)

In [72]:
%%time
G @ H

CPU times: user 708 μs, sys: 685 μs, total: 1.39 ms
Wall time: 605 μs


tensor([[3785.3940, 3736.2322, 3745.4905,  ..., 3756.3362, 3790.1523,
         3786.3767],
        [3773.0840, 3734.7131, 3755.4543,  ..., 3776.8896, 3823.2729,
         3786.1440],
        [3809.5737, 3760.7314, 3766.8491,  ..., 3791.2261, 3828.6880,
         3807.8479],
        ...,
        [3806.4031, 3771.0356, 3763.5601,  ..., 3778.9099, 3803.5769,
         3786.3477],
        [3786.4512, 3750.9412, 3757.9817,  ..., 3773.0645, 3811.6387,
         3810.8950],
        [3796.5281, 3740.4800, 3771.4990,  ..., 3812.6321, 3819.9019,
         3784.4470]], device='mps:0')

Side note: on my laptop (MacBook) I noticed a performance improvement by $\approx\times 10$. On Kaggle the performance improvement is much larger (from >20'' to <<1').
When you have done this task, you might want to shut down your notebook and start from the cells below since resource usage might be quite demanding. Also, if you are using Kaggle, you might consider shutting down the GPU, since the bottom cells can be done with CPU only.

### PyTorch functionals

As you will learn by attending this class, one of the key features of neural networks are their non-linear functions.
PyTorch has already implemented a great amount of them in the package `torch.nn.functionals`.
Create a random tensor $A$ of size (3, 2) and apply to it:

1. [ReLu](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html)
2. [Tanh](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html)
3. [Sigmoid](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html)
4. [Softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html) (it requires an axis: pick axis 1, predict the output shape).

If you are not familiar with them don't worry, you will learn in the remainder of the course what these functions are used for.

In [90]:
import torch
import torch.nn.functional as f

In [93]:
# Create A
A = torch.rand(3, 2) * 10 - 5
#print(A)

## 1: apply F.relu
print(f.relu(A))
print()

## 2: apply F.tanh
print(f.tanh(A))
print()

## 3: apply F.sigmoid
print(f.sigmoid(A))
print()

## 4: apply F.softmax with dim=1
print(f.softmax(A, dim=1))
print()

tensor([[0.0000, 4.9915],
        [0.0000, 0.0000],
        [0.0000, 3.7730]])

tensor([[-0.9993,  0.9999],
        [-0.9696, -0.9083],
        [-0.8759,  0.9989]])

tensor([[0.0188, 0.9933],
        [0.1105, 0.1798],
        [0.2046, 0.9775]])

tensor([[1.3052e-04, 9.9987e-01],
        [3.6165e-01, 6.3835e-01],
        [5.8757e-03, 9.9412e-01]])



## Gradients

One of the useful features of PyTorch is that it is possible to compute automatically the gradient of functions. 
As you will see, the gradient of a function is one of the key ingredients of the backpropagation algorithm, used to train neural nets.

Assume we have tensor $x = [2], y = [2]$. We have $z = 2x^2 + 3y = [14]$.

We know that $\frac{\delta z}{\delta x} = 4x$, $\frac{\delta z}{\delta y} = 3$. Since we are evaluating the point $x=2, y=2$, we get that the gradient is (8, 3). The gradients are going to be stored in $x.grad$ and $y.grad$ if we specify the option `requires_grad=True`. We can let PyTorch compute the gradients by invoking `z.backward()`. Check that indeed `x.grad` and `y.grad` hold the desired values.


In [94]:
x = torch.tensor([2], dtype=torch.float64, requires_grad=True)
y = torch.tensor([2], dtype=torch.float64, requires_grad=True)
z = 2 * x*x + 3 * y
z.backward()
print(x.grad)
print(y.grad)

tensor([8.], dtype=torch.float64)
tensor([3.], dtype=torch.float64)


1. Create tensors $s = [1]$ and $t = [1]$, define a new variable $w = 5s + 6$ and compute their gradient. What is the gradient associated to $t$? (Notice that $w$ does not depend on $t$). 
2. What happens if I try to define an integer tensor with `requires_grad=True`?
3. What happens if I call `numpy()` on a tensor that has `requires_grad=True`?

In [98]:
## 1: gradient of t for w = 5s + 6.
s = torch.tensor([1], dtype=torch.float64, requires_grad=True)
t = torch.tensor([1], dtype=torch.float64, requires_grad=True)
w = 5*s+6
w.backward()
print(t.grad) # none, w not depend on t

## 2: integer tensor with requires_grad.
# only floating point can have gradient

## 3: compute .numpy() of a tensor with requires_grad.
# t = torch.tensor([1], dtype=torch.float64, requires_grad=True).numpy() can't call directly
# must use tensor.detach().numpy()

None


In this last section we point out a very important feature of gradients, namely that they are *cumulative*! In order to see what does that mean, let's see in practice the example that was given in class:

1. Create tensors $x=[2], y=[3]$ (with flag `requires_grad=True`).
2. Compute $z = x * x + y$ and perform the backward pass.
3. Check that the gradients are as expected: $\frac{\delta z}{\delta x}=2x=4$, $\frac{\delta z}{\delta y} = 1$.
4. Compute $g = xy + 3x$ and perform che backward pass.
5. Check out the gradients: $\frac{\delta g}{\delta x}=y + 3=6$, $\frac{\delta g}{\delta y} = x = 2$.
6. Do you see the expected value? Can you explain why? (hint: gradients are *cumulative*).
7. In order to fix this potential issue, use `x/y.grad.zero_()` in between the computation of $z$ and $g$. Do you observe the expected gradient now?

In [101]:
## 1: create x, y.
x = torch.tensor([2], dtype=torch.float64, requires_grad=True)
y = torch.tensor([3], dtype=torch.float64, requires_grad=True)

## 2: compute z.
z = x*x+y
z.backward()

## 3: check out gradients of x, y.
print(f'x gradient: {x.grad}')
print(f'y gradient: {y.grad}')

## 4: compute g.
g = x*y+3*x
g.backward()

## 5: check out gradients of x, y
print(f'x gradient: {x.grad}')
print(f'y gradient: {y.grad}')

## 6: Gradients are comulative so they are not 0 
# when I compute them with g
print()

## 7: Repeat 1-5 using torch.zero_grad()
x.grad.zero_()
y.grad.zero_()

# compute z.
z = x*x+y
z.backward()
# check out gradients of x, y.
print(f'x gradient: {x.grad}')
print(f'y gradient: {y.grad}')

x.grad.zero_()
y.grad.zero_()

# compute g.
g = x*y+3*x
g.backward()
# check out gradients of x, y
print(f'x gradient: {x.grad}')
print(f'y gradient: {y.grad}')


x gradient: tensor([4.], dtype=torch.float64)
y gradient: tensor([1.], dtype=torch.float64)
x gradient: tensor([10.], dtype=torch.float64)
y gradient: tensor([3.], dtype=torch.float64)

x gradient: tensor([4.], dtype=torch.float64)
y gradient: tensor([1.], dtype=torch.float64)
x gradient: tensor([6.], dtype=torch.float64)
y gradient: tensor([2.], dtype=torch.float64)
