## 2. PyTorch modules and automatic differentiation

PyTorch provides an extensive set of modules and functions for machine learning.
These modules build on the `torch.nn.Module` base class, which provides a blueprint for how a neural network layer works.
The basic idea is that each component of the model, including the model itself, should be a subclass of `Module`.
Inheriting from `Module` brings attributes and methods that allow PyTorch to handle the model and its parameters seamlessly.

For instance, the method `parameters()` returns an iterator over the parameters of the model, which can be used to update them during training.
It is one of the basic methods defined by `Module`.
To fetch the parameters of the model, PyTorch recursively collects the parameters of each submodule.
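
As a minimal sketch of this recursion, the snippet below builds a small throwaway model (the name `model` and the layer sizes are illustrative assumptions; it uses layers introduced later in this section) and lists the parameters that `Module` collects from its submodules:

```python
import torch
from torch import nn

# Hypothetical toy model, used only to illustrate parameter collection
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 3))

# named_parameters() walks the submodules and prefixes each name
# with the position of the submodule inside the container
names = []
for name, param in model.named_parameters():
    names.append(name)
    print(name, tuple(param.shape))

# Total number of scalar parameters: (10*5 + 5) + (5*3 + 3)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```

Note that `nn.ReLU` contributes nothing here, since it has no learnable parameters.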

In [1]:
import torch
from torch import nn

The most basic block is the linear layer, which defines an affine transformation of the input.

`Linear` is characterized by:
- `in_features` - the number of input features
- `out_features` - the number of output features
- `bias` - an optional boolean flag indicating whether to include a bias term

$$\text{output} = W\cdot\text{input} + b$$

The linear layer is created as an **instance of the `torch.nn.Linear` class**.
Thus, we first instantiate it, then we can use it as a function to transform input tensors.

In [20]:
lin = torch.nn.Linear(
    in_features=10,
    out_features=1,
    bias=True
)

lin

Linear(in_features=10, out_features=1, bias=True)

We can see its parameters by accessing the `weight` and `bias` attributes, or by calling the `parameters()` method (or, even better, `named_parameters()`).

In [21]:
print(lin.weight, "\n", lin.bias)

print("------------------")
for name, param in lin.named_parameters():
    print(name, ":", param.size())
    print(param)
    print("\n")

Parameter containing:
tensor([[ 0.0626,  0.1555, -0.2141, -0.2305,  0.1476,  0.2546, -0.1666, -0.1842,
         -0.0156, -0.2710]], requires_grad=True) 
 Parameter containing:
tensor([0.1635], requires_grad=True)
------------------
weight : torch.Size([1, 10])
Parameter containing:
tensor([[ 0.0626,  0.1555, -0.2141, -0.2305,  0.1476,  0.2546, -0.1666, -0.1842,
         -0.0156, -0.2710]], requires_grad=True)


bias : torch.Size([1])
Parameter containing:
tensor([0.1635], requires_grad=True)




**Take note of the `requires_grad` attribute of the parameters.
We'll discuss it later.**

In [22]:
x = torch.linspace(0.1,1,10)
print(x)
print("------------------")
y = lin(x)
y

tensor([0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000, 0.7000, 0.8000, 0.9000,
        1.0000])
------------------


tensor([-0.2780], grad_fn=<AddBackward0>)

**Take note of the `grad_fn=<AddBackward0>` attribute of the output tensor.
We'll discuss it later.**

**Question:**

What is the Python pattern that allows us to call an object as a function?

```python
o = MyObject(...)
o(x) # calling an instance of MyObject as a function
```

*your answer here*

To support the basic operations of its layers, PyTorch provides a set of functions in the `torch.nn.functional` module.

The affine transformation is implemented as the `torch.nn.functional.linear` function.

In [23]:
nn.functional.linear(x, lin.weight, lin.bias)

tensor([-0.2780], grad_fn=<AddBackward0>)
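
To make the affine transformation explicit, it can also be written by hand. The sketch below rebuilds a fresh layer (so the numbers differ from the cells above; the seed is an arbitrary choice) and compares the manual expression `x @ W.T + b` with the functional call and with the module call:

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
lin = nn.Linear(in_features=10, out_features=1)
x = torch.linspace(0.1, 1, 10)

# weight has shape (out_features, in_features), hence the transpose
manual = x @ lin.weight.T + lin.bias
functional = nn.functional.linear(x, lin.weight, lin.bias)
module = lin(x)

print(torch.allclose(manual, functional, atol=1e-6))
print(torch.allclose(manual, module, atol=1e-6))
```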

**Question:**

Why don't we use the `torch.nn.functional.linear` function directly instead of defining a layer and calling it as a function?

*your answer here*

### Automatic differentiation

The two unfamiliar names we saw before, `requires_grad` and `grad_fn`, are related to the way PyTorch keeps track of the operations performed on the tensors.
Each time we perform an operation on a tensor, PyTorch keeps track of the operation in a **computational graph**.

![](img/compgraph.png)

The `grad_fn=<AddBackward0>` we observed on `y` makes explicit reference to the last operation performed to obtain `y` itself.
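
As a small illustration, independent of the layer above, the following sketch chains two operations on a scalar tensor and inspects the nodes that PyTorch records (the tensor names `a`, `b`, `c` are arbitrary):

```python
import torch

a = torch.tensor(2.0, requires_grad=True)  # leaf tensor created by the user
b = a * 3   # recorded as a multiplication node
c = b + 1   # recorded as an addition node

print(a.grad_fn)  # leaf tensors have no grad_fn
print(b.grad_fn)
print(c.grad_fn)

# Each node keeps references to the nodes that produced its inputs,
# which is how the graph can be walked backwards
print(c.grad_fn.next_functions)
```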

Under the hood, PyTorch defines the tools to compute the backward pass on this function, i.e., the derivative of the summation.

**Notice: unless you want to define exotic new computations, you'll never need to define the backward behavior of a function: PyTorch does everything by itself by combining the building blocks it already provides.**

We can prompt PyTorch to compute the gradients for each of the tensors involved in the computation by calling `backward()` on `y`:

In [24]:
y.backward()

We can now inspect the gradients by accessing the `.grad` attribute of the tensors involved in the computation:

In [27]:
print(lin.weight.grad)

print(lin.bias.grad)

print(x.grad)

tensor([[0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000, 0.7000, 0.8000, 0.9000,
         1.0000]])
tensor([1.])
None
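
These values can be checked against the analytical derivatives. Since $y = \sum_i w_i x_i + b$, we expect $\partial y / \partial w_i = x_i$ and $\partial y / \partial b = 1$. A minimal sketch that rebuilds a fresh layer, so that no gradients from earlier cells are involved:

```python
import torch
from torch import nn

lin = nn.Linear(in_features=10, out_features=1)
x = torch.linspace(0.1, 1, 10)

y = lin(x)
y.backward()

# dy/dw equals the input, dy/db equals one
weight_ok = torch.allclose(lin.weight.grad, x.unsqueeze(0))
bias_ok = torch.allclose(lin.bias.grad, torch.ones(1))
print(weight_ok, bias_ok)
```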


**Question:**

Why is `x.grad` `None`?

The answer has to do with the second name, `requires_grad`.
If a tensor does not require gradients, PyTorch will not compute them.

We can force `x` to require gradients. Notice the change:

In [28]:
x.requires_grad = True
y = lin(x)
y.backward()
print(x.grad)

tensor([ 0.0626,  0.1555, -0.2141, -0.2305,  0.1476,  0.2546, -0.1666, -0.1842,
        -0.0156, -0.2710])
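
One side effect worth knowing: `backward()` adds to the existing `.grad` values instead of overwriting them, so calling it a second time, as in the cell above, accumulates the gradients of the layer. The sketch below, on a fresh layer, shows the accumulation and one way to reset it with `zero_grad()`:

```python
import torch
from torch import nn

lin = nn.Linear(in_features=10, out_features=1)
x = torch.linspace(0.1, 1, 10)

lin(x).backward()
first = lin.weight.grad.clone()

lin(x).backward()  # second backward pass: gradients are summed
doubled = torch.allclose(lin.weight.grad, 2 * first)
print(doubled)

lin.zero_grad()  # reset the gradients before the next backward pass
print(lin.weight.grad)  # None or a tensor of zeros, depending on the PyTorch version
```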


Notice another thing. Let's multiply two random tensors:

In [29]:
t1 = torch.rand(2, 3, 4)
t2 = torch.randn(2, 3, 4)
t3 = t1 * t2
t3

tensor([[[ 0.2032,  0.2158, -1.4581,  0.2025],
         [ 1.4985, -0.1969,  0.2244, -0.0836],
         [-1.0066, -0.2862,  0.2189,  0.8253]],

        [[ 0.0450, -0.5406,  1.4224,  0.5126],
         [ 0.0625,  0.7091, -0.4182, -0.7371],
         [ 0.5692,  0.9028,  0.2584, -0.0439]]])

**Question:**

Why does `t3` not have a `grad_fn`?

*Your answer here*

### Our first neural network

A regular Multilayer Perceptron (MLP) is defined as a sequence of linear layers with non-linearities in between.

The classical way of building MLPs in PyTorch is by defining a class that inherits from `torch.nn.Module` and instantiates the layers in the constructor.

Let's define a simple MLP with two linear layers (i.e., one hidden layer), a ReLU activation function, and batch normalization.

The two basic things to remember when you're defining a new `Module` are:

* The constructor should call the constructor of the parent class (**this sets up the important attributes and methods mentioned before**)
* You should define the `forward` method, which describes how the input is transformed into the output.

In [31]:
class MyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 5)
        self.batch_norm = nn.BatchNorm1d(5)
        self.fc2 = nn.Linear(5, 3)

    def forward(self, x):
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.batch_norm(x)
        x = self.fc2(x)
        return x

mlp = MyMLP()
mlp

MyMLP(
  (fc1): Linear(in_features=10, out_features=5, bias=True)
  (batch_norm): BatchNorm1d(5, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=5, out_features=3, bias=True)
)

We can already try to call this model on a random vector of size 10:

In [34]:
mlp(torch.rand(10))

ValueError: expected 2D or 3D input (got 1D input)

This error is raised by the `BatchNorm1d` layer, which expects its input to come in batches; a bare `Linear` layer accepts a 1D input, as we saw above with `lin(x)`.

For an MLP, the expected shape is $B \times D$, where $B$ is the batch size and $D$ is the number of features.

In [36]:
out = mlp(torch.rand(4, 10)) # batch size of 4
print(out, "\n", out.size())

tensor([[ 0.8375,  0.5648, -0.6774],
        [ 0.3175,  0.4420, -0.0280],
        [ 0.2310,  0.3613,  0.7035],
        [ 0.3267,  0.1695,  0.1518]], grad_fn=<AddmmBackward0>) 
 torch.Size([4, 3])
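
If only a single sample is available, a batch dimension can be added with `unsqueeze(0)`. The sketch below uses a stand-in model with the same layers (the name `model` is an assumption). It also switches to evaluation mode, because in training mode `BatchNorm1d` cannot compute batch statistics from a single sample:

```python
import torch
from torch import nn

# Stand-in model with the same layer sizes as the MLP above
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.BatchNorm1d(5),
    nn.Linear(5, 3),
)

sample = torch.rand(10)      # one sample, shape (D,)
batch = sample.unsqueeze(0)  # shape (1, D): a batch of size 1

model.eval()                 # BatchNorm1d now uses its running statistics
with torch.no_grad():        # no graph is needed for a quick shape check
    out = model(batch)

print(batch.shape, out.shape)
```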


If a model is composed of sequential layers and the information is not split into multiple branches, you can use the `torch.nn.Sequential` class, which allows you to define the model as a sequence of layers.

In [38]:
mlp = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(), # careful about the relu, we must pass `nn.Module`s here, not functions
    nn.BatchNorm1d(5),
    nn.Linear(5, 3)
)

out = mlp(torch.rand(4, 10))
out

tensor([[-0.4501,  0.4791, -0.1262],
        [-0.1265,  0.2621, -0.1062],
        [-0.1583, -0.3855, -0.3916],
        [-0.1800, -1.2625, -0.5996]], grad_fn=<AddmmBackward0>)
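
Regarding the comment on `nn.ReLU()` above: the module and the function compute the same thing; the module form simply wraps the function so that it can be placed inside a container such as `nn.Sequential`. A quick check, on an arbitrary random input:

```python
import torch
from torch import nn

x = torch.randn(4, 5)
relu_module = nn.ReLU()

# Same values from the module and from the functional form
same = torch.equal(relu_module(x), nn.functional.relu(x))
print(same)

# Only the first one is an nn.Module, so only it can go into nn.Sequential
is_module = isinstance(relu_module, nn.Module)
is_function_module = isinstance(nn.functional.relu, nn.Module)
print(is_module, is_function_module)
```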