## NPFL129 PyTorch Tutorial
(link to this notebook: http://bit.ly/3LyZ2uV)


### Based on:
* Dilara Soylu, Ethan Chi, "CS224N: PyTorch Tutorial (Winter '24)", (https://colab.research.google.com/github/ryanyuchen/NLP-Pytorch/blob/main/CS224N_PyTorch_Tutorial.ipynb)

In this notebook, we give a basic introduction to `PyTorch` and work on a toy NLP task. The following resources were used in the preparation of this notebook:
* Official PyTorch Documentation on [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) by Soumith Chintala
* PyTorch Tutorial Notebook, [Build Basic Generative Adversarial Networks (GANs) | Coursera](https://www.coursera.org/learn/build-basic-generative-adversarial-networks-gans) by Sharon Zhou, offered on Coursera

# Please make a copy into your own Drive!

## Introduction
[PyTorch](https://pytorch.org/) is a deep learning framework, one of the most widely used alongside [TensorFlow](https://www.tensorflow.org/) and, historically, [Theano](https://en.wikipedia.org/wiki/Theano_(software)). It can be installed via Pip or Conda, as described [here](https://pytorch.org/). Let's start by importing PyTorch:

In [None]:
import torch
import torch.nn as nn

# Import pprint, a module we use to make our print statements prettier
import pprint
pp = pprint.PrettyPrinter()

We are all set to start our tutorial. Let's dive in!

## Part 1: Tensors

**Tensors** are PyTorch's most basic building block. Each tensor is a multi-dimensional matrix; for example, a 256x256 square image might be represented by a `3x256x256` tensor, where the first dimension represents color. Here's how to create a tensor:


In [None]:
list_of_lists = [
  [1, 2, 3],
  [4, 5, 6],
]
print(list_of_lists)


[[1, 2, 3], [4, 5, 6]]


In [None]:
# Initializing a tensor
data = torch.tensor(list_of_lists)
print(data)
print(data.dtype)

tensor([[1, 2, 3],
        [4, 5, 6]])
torch.int64


Each tensor has a **data type**: the major data types you'll need to worry about are floats (`torch.float32`) and integers (`torch.int`). You can specify the data type explicitly when you create the tensor:

In [None]:
# Initializing a tensor with an explicit data type
# Notice the dots after the numbers, which specify that they're floats
data = torch.tensor(list_of_lists, dtype=torch.float32)
print(data)
print(data.dtype)

tensor([[1., 2., 3.],
        [4., 5., 6.]])
torch.float32


In [None]:
# The tensor data type is inferred automatically from the data
# A single float in the input is enough to make the whole tensor float32
other_list_of_lists = [
                  [0.11111111, 1],
                  [2, 3],
                  [4, 5]
                ]
data = torch.tensor(other_list_of_lists)
print(data)
print(data.dtype)

tensor([[0.1111, 1.0000],
        [2.0000, 3.0000],
        [4.0000, 5.0000]])
torch.float32


In [None]:
# You do not need to cast the tensors explicitly: PyTorch promotes int32 * float32 to float32
const = torch.tensor([5], dtype=torch.int32)
result = data * const
print(result)
print(result.dtype)

tensor([[ 0.5556,  5.0000],
        [10.0000, 15.0000],
        [20.0000, 25.0000]])
torch.float32


In [None]:
# You can recast the result manually
result_int = result.type(torch.int32)
print(result_int)
print(result_int.dtype)


tensor([[ 0,  5],
        [10, 15],
        [20, 25]], dtype=torch.int32)
torch.int32


In [None]:
# You can also cast to a dtype of an existing tensor
result_float = result_int.type_as(data)
print(result_float)
print(result_float.dtype)

tensor([[ 0.,  5.],
        [10., 15.],
        [20., 25.]])
torch.float32
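
The automatic casting shown above follows PyTorch's type promotion rules. The sketch below is only an illustration with made-up tensors (`int_t`, `float_t` and `truncated` are not part of the original notebook). It uses `torch.result_type` to ask which dtype an operation would produce, and shows that casting floats to integers truncates toward zero.

```python
import torch

# Illustrative tensors (assumed values, not from the notebook)
int_t = torch.tensor([1, 2, 3], dtype=torch.int32)
float_t = torch.tensor([0.5, 0.5, 0.5], dtype=torch.float32)

# torch.result_type reports the dtype that combining the two tensors would produce
promoted = torch.result_type(int_t, float_t)
print(promoted)

# The actual product has the promoted dtype
mixed = int_t * float_t
print(mixed.dtype)

# .to(dtype) is another way to cast; float -> int drops the fractional part
truncated = torch.tensor([0.9, -0.9, 2.5]).to(torch.int32)
print(truncated)
```

Note that the cast to an integer type does not round to the nearest integer, which is why `0.5556` became `0` in the cell above.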


You can also inter-convert tensors with **NumPy arrays**:

In [None]:
import numpy as np

# numpy.ndarray --> torch.Tensor:
arr = np.array([[1, 0, 5]])
data = torch.tensor(arr)
print("This is a torch.tensor", data)

# torch.Tensor --> numpy.ndarray:
new_arr = data.numpy()
print("This is a np.ndarray", new_arr)

This is a torch.tensor tensor([[1, 0, 5]])
This is a np.ndarray [[1 0 5]]
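
One detail worth knowing when converting between the two libraries is whether the data is copied. The following sketch (variable names are illustrative) contrasts `torch.tensor`, which copies the array, with `torch.from_numpy`, which shares memory with it. For CPU tensors, `.numpy()` shares memory as well.

```python
import numpy as np
import torch

arr = np.array([1, 0, 5])

copied = torch.tensor(arr)      # makes an independent copy of the data
shared = torch.from_numpy(arr)  # shares the underlying memory with `arr`

# Modify the NumPy array in place
arr[0] = 100
print(copied)  # the copy is unaffected
print(shared)  # the shared tensor reflects the change

# For CPU tensors, .numpy() also returns a view of the same memory
back = shared.numpy()
back[1] = -1
print(arr)
```

If in-place modifications must not leak between the array and the tensor, copying with `torch.tensor(arr)` is the safer choice.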


Utility functions also exist to create tensors with given shapes and contents:

In [None]:
zeros = torch.zeros(2, 5)  # a tensor of all zeros
print(zeros)


tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])


In [None]:
ones = torch.ones(3, 4)   # a tensor of all ones
print(ones)

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])


In [None]:
rr = torch.arange(1, 10) # range from [1, 10)
print(rr)

tensor([1, 2, 3, 4, 5, 6, 7, 8, 9])


In [None]:
rr + 2

tensor([ 3,  4,  5,  6,  7,  8,  9, 10, 11])

In [None]:
rr ** 2

tensor([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

In [None]:
a = torch.tensor([[1, 2], [2, 3], [4, 5]])      # (3, 2)
b = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])  # (2, 4)

print("A is", a)
print("B is", b)
print("The product is", a.matmul(b)) #(3, 4)
print("The other product is", a @ b) # +, -, *, @

A is tensor([[1, 2],
        [2, 3],
        [4, 5]])
B is tensor([[1, 2, 3, 4],
        [5, 6, 7, 8]])
The product is tensor([[11, 14, 17, 20],
        [17, 22, 27, 32],
        [29, 38, 47, 56]])
The other product is tensor([[11, 14, 17, 20],
        [17, 22, 27, 32],
        [29, 38, 47, 56]])


The **shape** of a matrix (which can be accessed by the `.shape` attribute or the `.size()` method) is defined as the dimensions of the matrix. Here are some examples:

In [None]:
matr_2d = torch.tensor([[1, 2, 3], [4, 5, 6]])
# print the shape of a tensor
print(matr_2d.shape)
print(matr_2d.size())

# print the size of a tensor dimension
print(matr_2d.shape[0])
print(matr_2d.size(0))

print(matr_2d)

torch.Size([2, 3])
torch.Size([2, 3])
2
2
tensor([[1, 2, 3],
        [4, 5, 6]])


In [None]:
matr_3d = torch.tensor([[[1, 2, 3, 4], [-2, 5, 6, 9]], [[5, 6, 7, 2], [8, 9, 10, 4]], [[-3, 2, 2, 1], [4, 6, 5, 9]]])
print(matr_3d)
print(matr_3d.shape)

tensor([[[ 1,  2,  3,  4],
         [-2,  5,  6,  9]],

        [[ 5,  6,  7,  2],
         [ 8,  9, 10,  4]],

        [[-3,  2,  2,  1],
         [ 4,  6,  5,  9]]])
torch.Size([3, 2, 4])


**Reshaping** tensors can be used to make batch operations easier (more on that later), but be careful that the data is reshaped in the order you expect:

In [None]:
rr = torch.arange(1, 16)
print("The shape is currently", rr.shape)
print("The contents are currently", rr)
print()
rr = rr.view(5, 3)
print("After reshaping, the shape is currently", rr.shape)
print("The contents are currently:\n", rr)

The shape is currently torch.Size([15])
The contents are currently tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

After reshaping, the shape is currently torch.Size([5, 3])
The contents are currently:
 tensor([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12],
        [13, 14, 15]])


In [None]:
# If we want to order the data in a vertical listing, we need the help of the .transpose() method
rr = torch.arange(1, 16)
print("1d contents", rr)
rr = rr.view(3, 5)
print("Reshape to a 'transposed' shape:\n", rr)
rr = rr.transpose(0, 1)
print("Transpose to get the desired shape and ordering:\n", rr)

1d contents tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
Reshape to a 'transposed' shape:
 tensor([[ 1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10],
        [11, 12, 13, 14, 15]])
Transpose to get the desired shape and ordering:
 tensor([[ 1,  6, 11],
        [ 2,  7, 12],
        [ 3,  8, 13],
        [ 4,  9, 14],
        [ 5, 10, 15]])


In [None]:
# For an `N`-dimensional tensor, you only need to specify the sizes of `N-1` dimensions:
# one dimension can be given as -1 and its size is inferred from the others
rr = torch.arange(1, 17)
print("1d contents", rr)
rr = rr.view(4, -1, 2)
print("Reshaped contents", rr)
print("Reshaped shape", rr.shape)

1d contents tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])
Reshaped contents tensor([[[ 1,  2],
         [ 3,  4]],

        [[ 5,  6],
         [ 7,  8]],

        [[ 9, 10],
         [11, 12]],

        [[13, 14],
         [15, 16]]])
Reshaped shape torch.Size([4, 2, 2])
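
A caveat when combining `.transpose()` with `.view()`: a transposed tensor is generally no longer contiguous in memory, and `.view()` requires a compatible memory layout. The sketch below (with an illustrative tensor `t`) shows the failure and two common workarounds, `.reshape()` and `.contiguous().view()`.

```python
import torch

# Transposing only changes strides, so the result is not contiguous
t = torch.arange(6).view(2, 3).transpose(0, 1)  # shape (3, 2)
print(t.is_contiguous())

# .view() cannot flatten a non-contiguous tensor like this one
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print("view failed:", view_failed)

# .reshape() copies the data when a view is not possible
flat = t.reshape(6)
# Equivalent explicit version: make the tensor contiguous first, then view
flat2 = t.contiguous().view(6)
print(flat)
```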


One of the reasons why we use **tensors** is *vectorized operations*: operations that can be conducted in parallel over a particular dimension of a tensor.

In [None]:
data = torch.arange(1, 36, dtype=torch.float32).view(5, 7)
#data = torch.ones(5, 7)
print("Data is:", data)

# We can perform operations like *sum* over each row...
print("Taking the sum over rows:")
print(data.sum(dim=1))

# or over each column.
print("Taking the sum over columns:")
print(data.sum(dim=0))

# Other operations are available:
print("Taking the stdev over rows:")
print(data.std(dim=1))


Data is: tensor([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14.],
        [15., 16., 17., 18., 19., 20., 21.],
        [22., 23., 24., 25., 26., 27., 28.],
        [29., 30., 31., 32., 33., 34., 35.]])
Taking the sum over rows:
tensor([ 28.,  77., 126., 175., 224.])
Taking the sum over columns:
tensor([ 75.,  80.,  85.,  90.,  95., 100., 105.])
Taking the stdev over rows:
tensor([2.1602, 2.1602, 2.1602, 2.1602, 2.1602])


In [None]:
data = torch.arange(1, 7, dtype=torch.float32).view(1, 2, 3)
print(data.sum(dim=0).sum(dim=0))
print(data.sum(dim=0).sum(dim=0).shape)

tensor([5., 7., 9.])
torch.Size([3])


In [None]:
data.sum()

tensor(21.)
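
Reductions such as `.sum(dim=...)` drop the reduced dimension by default. As a small sketch (the tensor and variable names are illustrative), passing `keepdim=True` keeps that dimension with size 1, which makes the result directly broadcastable against the original tensor, for example to normalize each row.

```python
import torch

data = torch.arange(1, 7, dtype=torch.float32).view(2, 3)

# By default the reduced dimension disappears
row_sums = data.sum(dim=1)
print(row_sums.shape)

# keepdim=True keeps it as a dimension of size 1
row_sums_kept = data.sum(dim=1, keepdim=True)
print(row_sums_kept.shape)

# Shape (2, 3) divided by shape (2, 1): each row is divided by its own sum
normalized = data / row_sums_kept
print(normalized.sum(dim=1))
```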

### Exercise

Write code that creates a `torch.tensor` with the following contents:
$\begin{bmatrix} 1 & 2.2 & 9.6 \\ 4 & -7.2 & 6.3 \end{bmatrix}$

Now compute the average of each row (`.mean()`) and each column.

What's the shape of the results?



In [None]:
m = torch.tensor([[1, 2.2, 9.6], [4, -7.2, 6.3]])
m.mean(1)

tensor([4.2667, 1.0333])

**Indexing**

You can access arbitrary elements of a tensor using the `[]` operator. A special `:` index can be used to get all elements from that specific dimension.

In [None]:
# Initialize an example tensor
x = torch.arange(1, 13, dtype=torch.float32).view(3, 2, 2)
print(x)
print(x.shape)

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])
torch.Size([3, 2, 2])


In [None]:
# Access the 0th element, which is the first row
print(x[0]) # Equivalent to x[0, :] or x[0, :, :]

print(x[0, :])

print(x[0, :, :])

tensor([[1., 2.],
        [3., 4.]])
tensor([[1., 2.],
        [3., 4.]])
tensor([[1., 2.],
        [3., 4.]])


We can also index into multiple dimensions with `:`.

In [None]:
# If we want to extract 0th element from the "column", we need to explicitly state to take all rows
x[:, 0]

tensor([[ 1.,  2.],
        [ 5.,  6.],
        [ 9., 10.]])

In [None]:
# Get the top left element of each element in our tensor
x[:, 0, 0]

tensor([1., 5., 9.])

In [None]:
print(x)
print(x[:, :, :])  # this is same as print(x)

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])
tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])


We can also access arbitrary elements in each dimension.

In [None]:
# Let's access the 0th and 1st elements, each twice
# same as stacking x[0], x[0], x[1], x[1]
i = torch.tensor([0, 0, 1, 1])
x[i]

tensor([[[1., 2.],
         [3., 4.]],

        [[1., 2.],
         [3., 4.]],

        [[5., 6.],
         [7., 8.]],

        [[5., 6.],
         [7., 8.]]])

In [None]:
# This is similar to stacking of 4 tensors extracted by individual indices
x_stacked = torch.stack([x[0], x[0], x[1], x[1]], dim=0)
print("Stacked tensor:\n", (x_stacked))
print("Stacked shape: ", x_stacked.shape)

Stacked tensor:
 tensor([[[1., 2.],
         [3., 4.]],

        [[1., 2.],
         [3., 4.]],

        [[5., 6.],
         [7., 8.]],

        [[5., 6.],
         [7., 8.]]])
Stacked shape:  torch.Size([4, 2, 2])


In [None]:
# We can also do a concatenation
x_concat = torch.cat([x[0], x[0], x[1], x[1]], dim=0)
print("Concatenated tensor:\n", (x_concat))
print("Concatenated shape: ", x_concat.shape)

Concatenated tensor:
 tensor([[1., 2.],
        [3., 4.],
        [1., 2.],
        [3., 4.],
        [5., 6.],
        [7., 8.],
        [5., 6.],
        [7., 8.]])
Concatenated shape:  torch.Size([8, 2])


In [None]:
# Let's access the 0th elements of the 1st and 2nd elements
# This works thanks to broadcasting (index-array j has shape 1)

i = torch.tensor([1, 2])
j = torch.tensor([0])
x[i, j]

tensor([[ 5.,  6.],
        [ 9., 10.]])

In [None]:
# We can also use a matrix of indices instead of arrays.
indices = [[1, 2], [0, 0]]
print("Full index list:\n", x[tuple(indices)])

indices = [[1, 2], [0]]
print("Broadcasting over the second index list:\n", x[tuple(indices)])


Full index list:
 tensor([[ 5.,  6.],
        [ 9., 10.]])
Broadcasting over the second index list:
 tensor([[ 5.,  6.],
        [ 9., 10.]])
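
To make the semantics of indexing with several index tensors more explicit: the index tensors are paired up element by element, unlike slices, which select a full block. The sketch below rebuilds the same `x` as above and compares the result with manually stacked elements (`manual` and `sliced` are illustrative names).

```python
import torch

x = torch.arange(1, 13, dtype=torch.float32).view(3, 2, 2)

i = torch.tensor([1, 2])
j = torch.tensor([0, 0])

# Index tensors are paired: this selects x[1, 0] and x[2, 0]
paired = x[i, j]
manual = torch.stack([x[1, 0], x[2, 0]])
print(torch.equal(paired, manual))

# Slices instead select every combination of the given ranges
sliced = x[1:3, 0:2]
print(paired.shape, sliced.shape)
```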


We can get a `Python` scalar value from a tensor with `.item()`.

In [None]:
x[0, 0, 0]

tensor(1.)

In [None]:
x[0, 0, 0].item()

1.0

### Exercise:

Write code that creates a `torch.tensor` with the following contents:
$\begin{bmatrix} 1 & 2.2 & 9.6 \\ 4 & -7.2 & 6.3 \end{bmatrix}$

How do you get the second column? The first row?



In [None]:
# Write your code here...

## Autograd
PyTorch is well known for its automatic differentiation feature. We can call the `.backward()` method to ask `PyTorch` to calculate the gradients, which are then stored in the `grad` attribute.

In [None]:
# Create an example tensor
# The requires_grad argument tells PyTorch to track operations on x so that gradients can be computed
x = torch.tensor([2.], requires_grad=True)  # requires_grad is False by default
print(x)

# Print the gradient if it is calculated
# Currently None since .backward() has not been called yet
print("Tensor's current gradient value: ", x.grad)

tensor([2.], requires_grad=True)
Tensor's current gradient value:  None


In [None]:
# Calculating the gradient of y with respect to x
y = x * x * 3 # 3x^2
y.backward()
pp.pprint(x.grad) # d(y)/d(x) = d(3x^2)/d(x) = 6x = 12

tensor([12.])


Let's run backprop from a different tensor again to see what happens.

In [None]:
z = (x ** 2) * 3 # 3x^2
z.backward()
pp.pprint(x.grad)

z1 = (x ** 2) * 3 # 3x^2
z1.backward()
pp.pprint(x.grad)

# Calling .backward() a second time through the same graph throws a RuntimeError
#z.backward()
#pp.pprint(x.grad)

tensor([24.])
tensor([36.])


In [None]:
# Before processing the next batch of data, you need to reset the gradients first
x.grad = None
z = x * x * 3 # 3x^2
z.backward()
# y = x * x * 3
pp.pprint(x.grad)

tensor([12.])


In [None]:
z = x * x * 3 # 3x^2
z.backward()
# y = x * x * 3
pp.pprint(x.grad)

tensor([24.])


We can see that the `x.grad` is updated to be the sum of the gradients calculated so far. When we run backprop in a neural network, we sum up all the gradients for a particular neuron before making an update. This is exactly what is happening here! This is also the reason why we need to run `zero_grad()` in every training iteration (more on this later). Otherwise our gradients would keep building up from one training iteration to the other, which would cause our updates to be wrong.
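
The accumulation behaviour described above can be condensed into one self-contained sketch. It is an illustration only: it resets the gradient with the in-place `x.grad.zero_()`, whereas the cells above assign `x.grad = None`. Both approaches are commonly used.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(3x^2)/dx = 6x = 12
(3 * x ** 2).sum().backward()
first = x.grad.clone()

# Second backward pass without resetting: gradients add up to 24
(3 * x ** 2).sum().backward()
accumulated = x.grad.clone()

# Reset the stored gradient in place, then run backward again
x.grad.zero_()
(3 * x ** 2).sum().backward()
after_reset = x.grad.clone()

print(first, accumulated, after_reset)
```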

## Customized Backward Function
In some rare cases, you might want to design your own operators, or calculate higher-order gradients that are not supported by PyTorch. In these cases you can define your own function with customized forward and backward computation. However, always check whether something is already implemented in PyTorch (which is very likely) before writing your own forward and backward computation. See more at https://pytorch.org/docs/stable/notes/extending.html.
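
As a minimal sketch of what such a customized function can look like, the example below subclasses `torch.autograd.Function` to implement squaring with a hand-written backward pass. The class name `Square` is hypothetical and chosen only for illustration. In practice `x ** 2` already does this, so the example serves only to show the mechanics.

```python
import torch

class Square(torch.autograd.Function):
    """Illustrative custom operator computing x^2 with a manual gradient."""

    @staticmethod
    def forward(ctx, x):
        # Save the input, it is needed to compute the gradient later
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: d(x^2)/dx = 2x, multiplied by the incoming gradient
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)  # custom Functions are called through .apply
y.sum().backward()
print(x.grad)
```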

## Using GPUs

With large models, you can significantly speed up training (and, to some extent, inference) by running the model computation on GPU devices.

NOTE: By default, this Colab notebook runs on a machine without GPU resources.

In [None]:
# Use GPU, if available, otherwise, run on CPU
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} is available.")
else:
    print("No GPU available. Training will run on CPU.")

GPU: Tesla T4 is available.


In [None]:
# In practice, you want to set a variable based on the GPU availability and assign this variable to parts of the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [None]:
# Create a tensor on the CPU
tensor = torch.randn((3, 3))
print("This tensor is on device: ", tensor.device)
# Move the tensor to the GPU
tensor_gpu = tensor.to(device)
print("This tensor is on device: ", tensor_gpu.device)

This tensor is on device:  cpu
This tensor is on device:  cuda:0
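
Modules can be moved to a device in the same way as tensors. The sketch below is device-agnostic (it falls back to the CPU when no GPU is present) and uses an arbitrary `nn.Linear(3, 2)` layer purely as an example. The important point is that a model and its inputs must live on the same device.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# .to(device) moves all parameters of the module to the device
layer = nn.Linear(3, 2).to(device)

# Tensors can be created directly on the target device
inputs = torch.ones(4, 3, device=device)

outputs = layer(inputs)
print(outputs.device)

# Move results back to the CPU, e.g. before converting them to NumPy
result_cpu = outputs.detach().cpu()
print(result_cpu.device)
```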


## Neural Network Module

So far we have looked into the tensors, their properties and basic operations on tensors. These are especially useful to get familiar with if we are building the layers of our network from scratch. We will utilize these in Assignment 3, but moving forward, we will use predefined blocks in the `torch.nn` module of `PyTorch`. We will then put together these blocks to create complex networks. Let's start by importing this module with an alias so that we don't have to type `torch` every time we use it.

In [None]:
import torch.nn as nn

### **Linear Layer**
We can use `nn.Linear(H_in, H_out)` to create a linear layer. It takes an input of shape `(N, *, H_in)` and produces an output of shape `(N, *, H_out)`. The `*` denotes that there can be an arbitrary number of dimensions in between. The linear layer performs the operation `Ax+b`, where `A` and `b` are initialized randomly (PyTorch stores the weight with shape `(H_out, H_in)`, so for a batch of row vectors the computation is `x @ A.T + b`). If we don't want the linear layer to learn the bias parameters, we can initialize our layer with `bias=False`.

In [None]:
# Create the inputs
input = torch.ones(4, 3)
# (N, *, H_in) -> (N, *, H_out)


# Make a linear layer transforming (N, *, H_in)-dimensional inputs to
# (N, *, H_out)-dimensional outputs
linear = nn.Linear(3, 2)
linear_output = linear(input)
print("Output of Linear", linear_output)
print("Output shape", linear_output.shape)
print("Linear weights and biases:\n", linear.weight, linear.bias)

Output of Linear tensor([[-0.1373,  0.3348],
        [-0.1373,  0.3348],
        [-0.1373,  0.3348],
        [-0.1373,  0.3348]], grad_fn=<AddmmBackward0>)
Output shape torch.Size([4, 2])
Linear weights and biases:
 Parameter containing:
tensor([[-0.1842, -0.2575,  0.2432],
        [-0.3184,  0.4725, -0.1700]], requires_grad=True) Parameter containing:
tensor([0.0611, 0.3507], requires_grad=True)


In [None]:
print("All parameters :\n", list(linear.parameters())) # Ax + b

print("Linear weights:\n", linear.weight)
print("Linear biases:\n", linear.bias)

All parameters :
 [Parameter containing:
tensor([[-0.3111,  0.0479, -0.4732],
        [-0.0173,  0.3803,  0.0819]], requires_grad=True), Parameter containing:
tensor([-0.4634, -0.0155], requires_grad=True)]
Linear weights:
 Parameter containing:
tensor([[-0.3111,  0.0479, -0.4732],
        [-0.0173,  0.3803,  0.0819]], requires_grad=True)
Linear biases:
 Parameter containing:
tensor([-0.4634, -0.0155], requires_grad=True)


In [None]:
# The input data has shape [batch_size, feature_dim], here [4, 3]
# The output has shape [batch_size, output_dim], here [4, 2]

# The weight of the linear layer has shape (output_dim, feature_dim), here (2, 3)
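
To connect the layer with the tensor operations from Part 1, the following sketch checks that the output of `nn.Linear` matches a manual matrix product with the transposed weight plus the bias. The seed and the variable names are only illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility of the sketch

linear = nn.Linear(3, 2)
inputs = torch.ones(4, 3)

# The weight is stored as (out_features, in_features), hence the transpose
manual = inputs @ linear.weight.T + linear.bias

matches = torch.allclose(linear(inputs), manual)
print(linear.weight.shape, matches)
```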

### **Other Module Layers**
There are several other preconfigured layers in the `nn` module. Some commonly used examples are `nn.Conv2d`, `nn.ConvTranspose2d`, `nn.BatchNorm1d`, `nn.BatchNorm2d`, `nn.Upsample` and `nn.MaxPool2d` among many others. We will learn more about these as we progress in the course. For now, the only important thing to remember is that we can treat each of these layers as plug and play components: we will be providing the required dimensions and `PyTorch` will take care of setting them up.

### **Activation Function Layer**
We can also use the `nn` module to apply activation functions to our tensors. Activation functions are used to add non-linearity to our network. Some examples of activation functions are `nn.ReLU()`, `nn.Sigmoid()` and `nn.LeakyReLU()`. Activation functions operate on each element separately, so the shape of the tensors we get as an output is the same as the shape of the ones we pass in.

In [None]:
linear_output

tensor([[-0.1373,  0.3348],
        [-0.1373,  0.3348],
        [-0.1373,  0.3348],
        [-0.1373,  0.3348]], grad_fn=<AddmmBackward0>)

In [None]:
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
output

tensor([[0.4657, 0.5829],
        [0.4657, 0.5829],
        [0.4657, 0.5829],
        [0.4657, 0.5829]], grad_fn=<SigmoidBackward0>)

### **Putting the Layers Together**
So far we have seen that we can create layers and pass the output of one as the input of the next. Instead of creating intermediate tensors and passing them around, we can use `nn.Sequential`, which does exactly that.

In [None]:
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid()
)

input = torch.ones(2,3,4)
output = block(input)
output

tensor([[[0.3491, 0.1892],
         [0.3491, 0.1892],
         [0.3491, 0.1892]],

        [[0.3491, 0.1892],
         [0.3491, 0.1892],
         [0.3491, 0.1892]]], grad_fn=<SigmoidBackward0>)
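
As a quick sanity check of that claim, the sketch below applies the layers of an `nn.Sequential` block one by one (the block supports integer indexing) and compares the result with calling the block directly. The seed is arbitrary and only there to make the sketch reproducible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for the illustration

block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid(),
)
inputs = torch.ones(2, 3, 4)

# Apply the layers manually, in order: block[0] is the Linear, block[1] the Sigmoid
chained = block[1](block[0](inputs))

same = torch.allclose(block(inputs), chained)
print(same, chained.shape)
```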

### Custom Modules

Instead of using the predefined modules, we can also build our own by extending the `nn.Module` class. For example, we can build our own version of `nn.Linear` (which also extends `nn.Module`) using the tensors introduced earlier! We can also build new, more complex modules, such as a custom neural network. You will be practicing these in a later assignment.

To create a custom module, the first thing we have to do is to extend `nn.Module`. We can then initialize our parameters in the `__init__` function, starting with a call to the `__init__` function of the super class. All class attributes that are `nn` module objects are registered as submodules, and their parameters can be learned during training. Plain tensors are not parameters, but they can be turned into parameters by wrapping them in the `nn.Parameter` class.

All classes extending `nn.Module` are also expected to implement a `forward(x)` function, where `x` is a tensor. This is the function that is called when an input is passed to our module, such as in `model(x)`.

In [None]:
class MultilayerPerceptronBinary(nn.Module):
  "MLP for logistic regression."

  def __init__(self, x_size, h_size):
    # Call to the __init__ function of the super class
    super(MultilayerPerceptronBinary, self).__init__()

    # Bookkeeping: Saving the initialization parameters
    self.input_size = x_size
    self.hidden_size = h_size

    # Defining of our model
    # There isn't anything specific about the naming of `self.model`. It could
    # be something arbitrary.
    self.model = nn.Sequential(
        nn.Linear(self.input_size, self.hidden_size),
        nn.ReLU(),
        nn.Linear(self.hidden_size, self.input_size),
        nn.Sigmoid()
    )

  def forward(self, x):
    y = self.model(x)
    return y

Here is an alternative way to define the same kind of class. You can see that we can replace `nn.Sequential` by defining the individual layers in the `__init__` method and connecting them in the `forward` method.

In [None]:
class MultilayerPerceptron(nn.Module):
  "MLP for multiclass logistic regression."

  def __init__(self, x_size, h_size, y_size):
    # Call to the __init__ function of the super class
    super(MultilayerPerceptron, self).__init__()

    # Bookkeeping: Saving the initialization parameters
    self.input_size = x_size
    self.hidden_size = h_size
    self.output_size = y_size

    # Defining of our layers
    self.hidden_linear = nn.Linear(self.input_size, self.hidden_size)
    self.hidden_act = nn.ReLU()
    self.output_linear = nn.Linear(self.hidden_size, self.output_size)
    self.output_act = nn.Softmax(dim=-1)  # normalize over the last (class) dimension

  def forward(self, x):
    h_in = self.hidden_linear(x)
    h = self.hidden_act(h_in)
    # Alternatively, you can define more complex behaviour, e.g. a "residual connection" between the input tensor and the hidden layer output
    # h = self.hidden_act(self.hidden_linear(x)) + x

    y_in = self.output_linear(h)
    y = self.output_act(y_in)
    return y  # returns a probability distribution over the K=`y_size` labels

Now that we have defined our class, we can instantiate it and see what it does.

In [None]:
# Make a sample input
input = torch.randn(2, 5)

# Create our model
model = MultilayerPerceptronBinary(5, 3)

# Pass our input through our model
model(input)

tensor([[0.4572, 0.4271, 0.4568, 0.3642, 0.4728],
        [0.4582, 0.4279, 0.4583, 0.3604, 0.4683]], grad_fn=<SigmoidBackward0>)

We can inspect the parameters of our model with `named_parameters()` and `parameters()` methods.

In [None]:
list(model.named_parameters())

[('model.0.weight',
  Parameter containing:
  tensor([[-0.3012, -0.0858,  0.0569, -0.0735, -0.1743],
          [ 0.1012, -0.0810, -0.3994, -0.2952, -0.3351],
          [ 0.3343, -0.1778,  0.0729, -0.3328, -0.4405]], requires_grad=True)),
 ('model.0.bias',
  Parameter containing:
  tensor([-0.3644,  0.0364, -0.3522], requires_grad=True)),
 ('model.2.weight',
  Parameter containing:
  tensor([[ 0.1213,  0.1365,  0.2605],
          [ 0.2877,  0.5309,  0.5119],
          [ 0.5018, -0.2756,  0.0433],
          [ 0.3594,  0.1354, -0.2512],
          [-0.5266, -0.4991, -0.3710]], requires_grad=True)),
 ('model.2.bias',
  Parameter containing:
  tensor([ 0.4163, -0.4924, -0.0918,  0.1731,  0.5422], requires_grad=True))]
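
The earlier remark that plain tensors only become parameters when wrapped in `nn.Parameter` can be checked with `named_parameters()`. The class below, `MyLinear`, is a hypothetical re-implementation of a linear layer written only for this illustration. Its initialization scheme (small random weights, zero bias) is an assumption and differs from what `nn.Linear` uses.

```python
import torch
import torch.nn as nn

class MyLinear(nn.Module):
    """Hypothetical hand-written linear layer, for illustration only."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Wrapped in nn.Parameter: registered and visible to optimizers
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # A plain tensor attribute is NOT registered as a parameter
        self.scale = torch.ones(1)

    def forward(self, x):
        return x @ self.weight.T + self.bias

layer = MyLinear(3, 2)
names = [name for name, _ in layer.named_parameters()]
out = layer(torch.ones(4, 3))
print(names, out.shape)
```

Only `weight` and `bias` are listed, so an optimizer constructed from `layer.parameters()` would never update `scale`.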

## Optimization
We have shown how gradients are calculated with the `backward()` function. Having the gradients isn't enough for our models to learn. We also need to know how to update the parameters of our models. This is where the optimizers come in. The `torch.optim` module contains several optimizers that we can use. Some popular examples are `optim.SGD` and `optim.Adam`. When initializing an optimizer, we pass our model parameters, which can be accessed with `model.parameters()`, telling the optimizer which values it will be optimizing. Optimizers also have a learning rate (`lr`) parameter, which determines how big an update will be made in every step. Different optimizers have different hyperparameters as well.

In [None]:
import torch.optim as optim
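
To see what an optimizer step actually does, the sketch below compares `optim.SGD` (without momentum or weight decay) against the plain update rule `w <- w - lr * grad` on a toy parameter. The tensor `w` and the quadratic loss are made up for the illustration.

```python
import torch
import torch.optim as optim

# A toy parameter and a toy loss whose gradient is 2w
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
loss.backward()

# What plain gradient descent would produce: w - lr * grad
expected = (w - 0.1 * w.grad).detach().clone()

# Let the optimizer perform the update, then clear the gradients
optimizer.step()
optimizer.zero_grad()

print(w, expected)
```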

After we have our optimizer, we can define a `loss` that we want to optimize. We can either define the loss ourselves, or use one of the predefined loss functions in `PyTorch`, such as `nn.BCELoss()`. Let's put everything together now! We will start by creating some dummy data.

In [None]:
# Create the y data
y = torch.ones(10, 5)

# Add some noise to our goal y to generate our x
# We want our model to predict the original labels despite the noise
x = y + torch.randn_like(y)
x

tensor([[-0.0439,  1.5338, -0.2522,  2.1946,  0.7624],
        [ 1.9122,  2.1143,  2.7625,  2.0433,  1.7148],
        [ 3.8834,  0.4632,  1.7572,  1.1206, -0.4500],
        [ 0.4880, -0.1021,  0.4833,  1.1883,  0.5024],
        [ 1.8483,  0.1161,  1.2318,  3.5066,  0.3577],
        [-0.5430,  0.2910, -0.8017,  2.5302,  0.5991],
        [ 1.1935,  1.4306,  2.5219,  1.1166,  0.1193],
        [ 1.8744,  0.6625,  1.0545,  1.2671,  1.4071],
        [ 1.3618, -0.6269,  1.5119,  0.7388,  1.1972],
        [ 0.7703,  1.2040,  1.5003,  2.0767,  1.7160]])

Now, we can define our model, optimizer and the loss function.

In [None]:
# Instantiate the model
model = MultilayerPerceptronBinary(5, 3)

# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=1)
#optimizer = optim.Adam(model.parameters(), lr=1e-1)

# Define loss using a predefined loss function
loss_function = nn.BCELoss()

# Calculate how our model is doing now
y_pred = model(x)
loss = loss_function(y_pred, y)
print(loss)

tensor(0.6457, grad_fn=<BinaryCrossEntropyBackward0>)
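
For reference, the value computed by `nn.BCELoss` can be reproduced by hand from the binary cross-entropy formula, averaged over all elements. The predictions and targets below are made-up numbers used only for this sketch.

```python
import torch
import torch.nn as nn

# Illustrative predictions (probabilities) and binary targets
y_pred = torch.tensor([0.9, 0.2, 0.6])
y_true = torch.tensor([1.0, 0.0, 1.0])

builtin = nn.BCELoss()(y_pred, y_true)

# Binary cross-entropy: -[y * log(p) + (1 - y) * log(1 - p)], averaged
manual = -(y_true * torch.log(y_pred)
           + (1 - y_true) * torch.log(1 - y_pred)).mean()

print(builtin, manual)
```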


Let's see if we can have our model achieve a smaller loss. Now that we have everything we need, we can setup our training loop.

In [None]:
# Set the number of epochs, which determines the number of training iterations
n_epoch = 10

for epoch in range(n_epoch):
  # Set the gradients to 0
  optimizer.zero_grad()

  # Get the model predictions
  y_pred = model(x)

  # Get the loss
  loss = loss_function(y_pred, y)

  # Print stats
  print(f"Epoch {epoch}: training loss: {loss}")

  # Compute the gradients
  loss.backward()

  # Take a step to optimize the weights
  optimizer.step()


Epoch 0: training loss: 0.6456535458564758
Epoch 1: training loss: 0.5407567024230957
Epoch 2: training loss: 0.4014311730861664
Epoch 3: training loss: 0.24919284880161285
Epoch 4: training loss: 0.14228886365890503
Epoch 5: training loss: 0.0883583053946495
Epoch 6: training loss: 0.061105288565158844
Epoch 7: training loss: 0.04563061147928238
Epoch 8: training loss: 0.03591383248567581
Epoch 9: training loss: 0.029342083260416985


In [None]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.2264, -0.4170, -0.0122, -0.2925,  0.1127],
         [-0.0281,  0.4767,  1.0541,  0.2641,  0.8284],
         [ 0.2989, -0.2280,  0.0177,  0.0099, -0.3006]], requires_grad=True),
 Parameter containing:
 tensor([-0.1330,  0.7851, -0.0184], requires_grad=True),
 Parameter containing:
 tensor([[-0.4401,  0.8581,  0.2906],
         [-0.4191,  0.6303, -0.0774],
         [ 0.0524,  0.5051, -0.5452],
         [ 0.0107,  0.8030, -0.3300],
         [-0.0989,  0.8633,  0.4599]], requires_grad=True),
 Parameter containing:
 tensor([0.4768, 0.5623, 0.9843, 0.5688, 0.2585], requires_grad=True)]

You can see that our loss is decreasing. Let's check the predictions of our model now and see if they are close to our original `y`, which was all `1s`.

In [None]:
# See how our model performs on the training data
y_pred = model(x)
y_pred

tensor([[0.9887, 0.9706, 0.9656, 0.9867, 0.9863],
        [0.9892, 0.9715, 0.9665, 0.9872, 0.9869],
        [0.9653, 0.9343, 0.9347, 0.9622, 0.9579],
        [0.9978, 0.9911, 0.9868, 0.9972, 0.9974],
        [0.9391, 0.9008, 0.9140, 0.9399, 0.9300],
        [0.9984, 0.9930, 0.9890, 0.9979, 0.9981],
        [0.9744, 0.9471, 0.9451, 0.9715, 0.9689],
        [0.9568, 0.9232, 0.9260, 0.9535, 0.9476],
        [0.8857, 0.8476, 0.8709, 0.8848, 0.8628],
        [0.9874, 0.9683, 0.9635, 0.9853, 0.9848]], grad_fn=<SigmoidBackward0>)

In [None]:
# Create test data and check how our model performs on it
x2 = y + torch.randn_like(y)
y2_pred = model(x2)
y2_pred

tensor([[0.9906, 0.9974, 0.9942, 0.9888, 0.9918],
        [0.9925, 0.9977, 0.9947, 0.9918, 0.9914],
        [0.9806, 0.9941, 0.9886, 0.9778, 0.9839],
        [0.9945, 0.9992, 0.9979, 0.9890, 0.9981],
        [0.9807, 0.9923, 0.9855, 0.9824, 0.9744],
        [0.9188, 0.9486, 0.9304, 0.9441, 0.8622],
        [0.8803, 0.9372, 0.9202, 0.8994, 0.8725],
        [0.8044, 0.8908, 0.8759, 0.8319, 0.8155],
        [0.9342, 0.9660, 0.9509, 0.9475, 0.9125],
        [0.9086, 0.9599, 0.9452, 0.9136, 0.9192]], grad_fn=<SigmoidBackward0>)

## Demo: mlp_classification_sgd in PyTorch

This is a simple example solution of the mlp_classification_sgd task using PyTorch.

In [None]:
import sklearn.datasets
import sklearn.model_selection

# Default argparse arguments
seed = 42
classes = 10
hidden_size = 50
epochs = 100
test_size = 797
batch_size = 100

# Set random seed
generator = np.random.RandomState(seed)

# Load the digits dataset.
data, target = sklearn.datasets.load_digits(n_class=classes, return_X_y=True)

# Split the dataset into a train set and a test set using
# `sklearn.model_selection.train_test_split`, passing
# `test_size=test_size, random_state=seed`.
train_data, test_data, train_target, test_target = sklearn.model_selection.train_test_split(
    data, target, test_size=test_size, random_state=seed)

# Convert the data to torch.tensor
train_data = torch.tensor(train_data, dtype=torch.float32)
test_data = torch.tensor(test_data, dtype=torch.float32)
train_target = torch.tensor(train_target, dtype=torch.int64)
test_target = torch.tensor(test_target, dtype=torch.int64)

# Create the model, loss_fn and optimizer
model = MultilayerPerceptron(train_data.shape[1], hidden_size, classes)  # We do not explicitly specify the weight initialization interval
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)
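
A note on `nn.CrossEntropyLoss`: according to the PyTorch documentation, it is called as `loss(input, target)` and expects the input to be unnormalized scores (logits), because it applies `log_softmax` internally. The sketch below, with made-up logits and targets, shows this equivalence and that passing already-softmaxed probabilities yields a different value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative logits for 2 examples and 3 classes, with integer class targets
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# CrossEntropyLoss on logits ...
ce = nn.CrossEntropyLoss()(logits, targets)
# ... equals negative log-likelihood on log-softmaxed logits
manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# Feeding probabilities applies softmax twice and changes the loss value
double = nn.CrossEntropyLoss()(F.softmax(logits, dim=-1), targets)

print(ce, manual, double)
```

When a model already ends with a softmax layer, one common alternative is to return logits from the model and apply softmax only when probabilities are needed.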

In [None]:
# Training loop
for epoch in range(epochs):
    permutation = generator.permutation(train_data.shape[0])

    optimizer.zero_grad()
    loss = 0
    train_accuracy = 0
    test_accuracy = 0
    n_batches = permutation.shape[0] // batch_size  # only complete batches are used
    for i in range(n_batches):
        # Prepare batched training data
        batch_indices = permutation[i * batch_size : (i+1) * batch_size]
        batch_train = train_data[batch_indices]
        batch_target = train_target[batch_indices]
        batch_target_probs = torch.nn.functional.one_hot(batch_target, num_classes=classes).type(torch.float32)

        # Compute label probability distribution
        y_probs = model(batch_train)

        # Compute loss
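        # nn.CrossEntropyLoss takes (input, target): the model output first, the target distribution second
        loss += loss_function(y_probs, batch_target_probs).mean()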

        # Get predictions and compute accuracy
        y_pred = y_probs.argmax(-1)
        train_accuracy += (y_pred == batch_target).type(torch.float32).mean()

    # Average over all batches
    loss = loss / n_batches  # average the loss over the number of batches
    train_accuracy = train_accuracy / n_batches

    # Compute gradients
    loss.backward()

    # Update weights
    optimizer.step()

    # Evaluate on test dataset
    # test_data and test_target are already tensors; no gradients are needed for evaluation
    with torch.no_grad():
        y_pred_test = model(test_data).argmax(-1)
    test_accuracy = (y_pred_test == test_target).type(torch.float32).mean()

    print("After epoch {}: train acc {:.1f}%, test acc {:.1f}%".format(
        epoch + 1, 100 * train_accuracy, 100 * test_accuracy))


After epoch 1: train acc 5.4%, test acc 9.4%
After epoch 2: train acc 10.1%, test acc 12.2%
After epoch 3: train acc 12.8%, test acc 11.9%
After epoch 4: train acc 12.2%, test acc 12.2%
After epoch 5: train acc 12.0%, test acc 12.3%
After epoch 6: train acc 12.3%, test acc 13.3%
After epoch 7: train acc 12.9%, test acc 14.3%
After epoch 8: train acc 14.7%, test acc 16.8%
After epoch 9: train acc 16.2%, test acc 18.8%
After epoch 10: train acc 18.8%, test acc 21.5%
After epoch 11: train acc 22.7%, test acc 24.8%
After epoch 12: train acc 24.7%, test acc 26.6%
After epoch 13: train acc 27.0%, test acc 28.9%
After epoch 14: train acc 29.0%, test acc 30.4%
After epoch 15: train acc 30.9%, test acc 32.0%
After epoch 16: train acc 32.1%, test acc 33.4%
After epoch 17: train acc 33.2%, test acc 34.3%
After epoch 18: train acc 34.1%, test acc 35.1%
After epoch 19: train acc 35.3%, test acc 35.6%
After epoch 20: train acc 35.9%, test acc 36.3%
After epoch 21: train acc 37.0%, test acc 37.1%
Aft