# Lab 2 - Basics of Autograd in Python (and first overview of _backpropagation_)

What did we see last time?

- PyTorch basics
- Construction of a multilayer perceptron using PyTorch API

**What caught our attention?**

![](imgs/02/grad.jpg)

## Differentiation in PyTorch

PyTorch is built with support for differentiation in mind.
In the end, Deep Learning (for now) is all about differentiation and building cascades of differentiable functions into complicated multilayer deep neural networks.

Essentially, all PyTorch built-ins support differentiability (unless the function is not differentiable, of course).
Today we will see how to compute derivatives in PyTorch.
Also, we will learn how to create differentiable modules using PyTorch APIs.

#### Notation and recall

1. **Function** $f:\mathbb{R}\rightarrow\mathbb{R}$, given $x\in\mathbb{R}$, derivative is $\frac{\partial f}{\partial x}$
2. **Scalar function** $f:\mathbb{R}^d\rightarrow\mathbb{R}$: given a vector $\mathbf{x}=(x_1,\dots,x_d)\in\mathbb{R}^d$, we calculate the derivative of $f$ w.r.t. each component of $\mathbf{x}$ and obtain the gradient $\nabla f = (\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_d})$
3. **Vector function** $f:\mathbb{R}^d\rightarrow\mathbb{R}^k$: given $\mathbf{x}$, we have $f(\mathbf{x})=(f_1(\mathbf{x}),\dots,f_k(\mathbf{x}))$, hence we can calculate $k$ gradients, which we can gather as the rows of the Jacobian: $J_f=\begin{pmatrix}\frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_d}\\\vdots&\ddots&\vdots\\\frac{\partial f_k}{\partial x_1} & \dots & \frac{\partial f_k}{\partial x_d}\end{pmatrix} \in \mathbb{R}^{k\times d}$
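
As a quick illustration of case 3, the sketch below builds the Jacobian of a small vector function with `torch.autograd.functional.jacobian`. The function `f` is a hypothetical example chosen only for this illustration; the point is that the result has one row per output component and one column per input component.

```python
import torch
from torch.autograd.functional import jacobian

# Hypothetical vector function f: R^3 -> R^2, chosen only for illustration
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)

print(J.shape)  # one row per output f_i, one column per input x_j
print(J)
```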


### grad functionality

"Under-the-hood", each PT Tensor has an attribute `requires_grad`

In [None]:
import torch

x = torch.rand(3,3)

x

In [None]:
x.requires_grad

We can manually set this to `True`, or directly create a Tensor that supports grad.

In [None]:
x.requires_grad = True

x

In [None]:
torch.rand(3, 3, requires_grad=True)

### Case 1

Suppose we are in case 1: $f:\mathbb{R}\rightarrow\mathbb{R}$.

For instance, $f(x) = x^2$.

We could apply $f$ to a singleton tensor and calculate the derivative.

We expect the derivative to be... ?

In [None]:
x = torch.rand(1, requires_grad=True)

print("x:", x)

y = x**2

print("y:", y)

To calculate the gradient, we call `backward()` on the Tensor. Which one, `x` or `y`?

In [None]:
y.backward()

We can inspect the gradient of `x` by accessing its `grad` attribute:

In [None]:
x.grad

Let's check that it's correct...

In [None]:
x.grad == 2*x

Notice that, when no gradient has been computed for a tensor, its `grad` attribute is automatically set to `None` to save memory

In [None]:
torch.rand(3,3).grad is None
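
The check for case 1 can also be made reproducible. The sketch below assumes a fixed input instead of `torch.rand`, and compares with `torch.allclose`, which is usually a safer habit than `==` when comparing floating-point tensors.

```python
import torch

x = torch.tensor([3.0], requires_grad=True)  # fixed value instead of torch.rand
y = x ** 2
y.backward()

expected = 2 * x.detach()  # analytical derivative of x^2
print(x.grad)              # tensor([6.])
print(torch.allclose(x.grad, expected))
```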

### Case 2 (scalar function)

We can use the same `.backward()` call to get the gradient of a scalar function.

Now x will be a vector (or a matrix, it doesn't really matter for our case) and we will apply to it a function which returns a single scalar.

One example may be $f(\mathbf{x})=\sum_{i=1}^d x_i$.

**Q**: What is the gradient we expect to obtain? A vector of ones

In [None]:
x = torch.rand([5], requires_grad=True)

y = x.sum()

y.backward()
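
To see the result of case 2 explicitly, here is a minimal sketch with fixed inputs. The second function, $\sum_i u_i^2$, is an extra example added for illustration; its gradient is $2\mathbf{u}$.

```python
import torch

# Sum of the components: the gradient is a vector of ones
x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
x.sum().backward()
print(x.grad)  # tensor([1., 1., 1., 1., 1.])

# Sum of squares (illustrative extra example): the gradient is 2 * u
u = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
(u ** 2).sum().backward()
print(u.grad)  # tensor([2., 4., 6.])
```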


### Case 3 (vector function)

Unfortunately, a plain `backward()` call cannot compute the gradient of a vector of values: it works only when the output is a single scalar.
So just 1 node as output!

If we wanted to compute the gradient on a vector function, what could we do?

1. Use forward-mode differentiation, which recent PT releases provide (e.g. `torch.autograd.forward_ad`) but which we do not cover in this lab
2. Use the PT backward functionality (complete as homework): process one component at a time in a for loop, storing all the gradients
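
One way to see why a scalar output is needed: calling `backward()` with no arguments on a non-scalar tensor raises an error, while passing a `gradient` vector $\mathbf{v}$ makes PyTorch compute the vector-Jacobian product $\mathbf{v}^\top J_f$. The sketch below uses a hypothetical elementwise function and a $\mathbf{v}$ that selects the first output component.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # vector output: y.backward() with no arguments would raise an error

v = torch.tensor([1.0, 0.0, 0.0])  # selects the first output component
y.backward(gradient=v)             # computes v^T J, here the first row of J

print(x.grad)  # tensor([2., 0., 0.])
```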

**Q**: Why is backward differentiation (rather than forward) the useful one in our case?


## Composition of functions

We can also use `backward` to compute the gradient of a composition of functions. For our objective, it will be very useful to think in terms of a computational graph.

We can view $y=g(f(x))$ as

![](imgs/02/compgra1.jpg)

We might extend this and add a hidden node $z$ between $f$ and $g$

![](imgs/02/compgra2.jpg)

Supposing $f(x)=\log(x)$ and $g(x)=x^2$, we can reproduce this example in PyTorch.

**Q**

- What do we expect to get from $\partial g/\partial z$?

- And from $\partial f/\partial x$?

- And from $\partial g/\partial x$?

- More specifically, what technique do we use to calculate this final gradient?

In [None]:

x = torch.rand(1, requires_grad=True)

print("x:", x, "\n")

z = x.log()

y = z**2

print("y:", y, "\n")


By printing `y`, we can see that the tensor has a specific gradient function attached.
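
The attached gradient function can also be inspected directly through the `grad_fn` attribute. The following is a small sketch with a fixed input: tensors created by the user are leaves of the graph and have no `grad_fn`, while the results of operations carry the node that produced them.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
z = x.log()
y = z ** 2

print(x.is_leaf, x.grad_fn)  # True None: x was created by the user
print(z.grad_fn)             # graph node recorded for the log
print(y.grad_fn)             # graph node recorded for the power
```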

Let us now compute the gradient...

In [None]:

y.backward()

print("gradient of x:", x.grad, "\nQ:(gradient of x w.r.t. what?)")
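
A gradient computed through a composition can always be checked against the chain rule done by hand. The sketch below uses a different pair of functions, $f(x)=\sin(x)$ and $g(z)=\exp(z)$, chosen only for illustration so that the questions above stay open.

```python
import math
import torch

x = torch.tensor([0.5], requires_grad=True)
z = torch.sin(x)  # inner function f
y = torch.exp(z)  # outer function g
y.backward()

# chain rule by hand: dy/dx = dy/dz * dz/dx = exp(sin(x)) * cos(x)
by_hand = math.exp(math.sin(0.5)) * math.cos(0.5)
print(x.grad.item(), by_hand)
```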

Let us access $\partial g/\partial z$

In [None]:
## your code here

## A more complicated example

![](imgs/02/compgra3.jpg)

In [None]:
x_1 = torch.tensor([3.0], requires_grad=True)

x_2 = torch.tensor([2.0], requires_grad=True)

print("x_1:  ", x_1)
print("x_2:  ", x_2)

Construct `c`, calculate the gradient and access it for both `x_1` and `x_2`

In [None]:
## your code here

### Gradient accumulation

Let us see another feature of torch differentiation functionalities.

We can call `backward()` multiple times; let us see what happens.

In [None]:
## repeat the computation for c..
c.backward()
print(x_1.grad, x_2.grad)

**Q**: what is happening? Why is the gradient not the same?
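
A minimal sketch of the same behaviour on a one-element tensor (the tensor `a` and the function $a^2$ are illustrative assumptions, not the graph from the figure): each `backward()` call adds to the stored `grad`, and `grad.zero_()` resets it in place.

```python
import torch

a = torch.tensor([2.0], requires_grad=True)

(a ** 2).backward()
first = a.grad.clone()
print(first)        # tensor([4.])

(a ** 2).backward()  # new graph, gradient added to the stored one
second = a.grad.clone()
print(second)       # tensor([8.])

a.grad.zero_()       # reset in place before the next backward pass
(a ** 2).backward()
after_reset = a.grad.clone()
print(after_reset)  # tensor([4.])
```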

## Building a custom, non-parametric PyTorch module

Basically, we want to create a module which is not controlled by any parameter, be it trainable or non-trainable.

As an example, we might have the **Leaky ReLU**, an activation function which can be used in place of the better-known ReLU.

$\text{LeakyReLU}(x) = \max\{0.01\cdot x, x\}$

![](https://i1.wp.com/clay-atlas.com/wp-content/uploads/2019/10/image-37.png?resize=640%2C480&ssl=1)

We can construct it like a basic PyTorch module, analogously to the MultiLayer Perceptron which we built (but not trained) at the end of Lab 1.


In [None]:
class LeakyReLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, data):
        return torch.max(data, data*0.01)

and that's it. We may plug it into a neural network module and it'll work just fine, both for the forward and backward pass.

If we want, we can also use it as-is:

In [None]:
leaky_relu = LeakyReLU()

# use a float tensor: torch.arange returns integers by default
leaky_relu(torch.arange(-10, 10, dtype=torch.float32)) # is identical to leaky_relu.forward(...) on the same input

Let us test its autodiff functionality:

In [None]:
x = torch.tensor([1.0, -1.0], requires_grad=True)

y = leaky_relu(x).sum() # sum to get one single value out of it.

print("y:", y)

y.backward()

print("dy/dx:", x.grad)

We see that the gradient gets calculated automatically without our intervention in defining a gradient function.
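
As a sanity check, the module can be compared with PyTorch's own `torch.nn.functional.leaky_relu`, whose `negative_slope` is set to `0.01` here to match. The sketch redefines the module so that it runs on its own; the input values are arbitrary.

```python
import torch
import torch.nn.functional as F

class LeakyReLU(torch.nn.Module):
    def forward(self, data):
        return torch.max(data, data * 0.01)

x = torch.tensor([1.0, -1.0, 3.0, -4.0], requires_grad=True)
x_ref = x.detach().clone().requires_grad_(True)  # independent copy for the built-in

LeakyReLU()(x).sum().backward()
F.leaky_relu(x_ref, negative_slope=0.01).sum().backward()

print(x.grad)      # 1 for positive inputs, 0.01 for negative ones
print(x_ref.grad)  # same values from the built-in
```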

But what if that was not already implemented in PyTorch? What if we needed to use some function that cannot be constructed by using PyTorch built-ins?

In this case, we must define a function class which inherits from `torch.autograd.Function`.

Such a class has two compulsory methods, `forward` and `backward`, which implement the forward and the backward pass respectively.

Both methods have a compulsory first argument, the **context** (`ctx` for brevity).
From the context we can retrieve information about the entities involved in the calculation of the gradient.

The context is **built upon calling the `forward` method**, so that, during the `backward` call, we can obtain information such as which tensors were used in `forward` and whether a tensor requires the gradient.

In our case, the derivative is the following:
$\frac{\partial\text{LeakyReLU}}{\partial x} = \begin{cases} 1 & \text{if }x>0 \\ 0.01 & \text{if }x<0\end{cases}$ (the function is not differentiable at $x=0$, where the implementation below uses $1$), so we only need to save $x$, i.e., the data coming into the module.

Moreover, the `backward` method needs an additional argument, `grad_output`, which conveys the gradient that is _entering_ the Function (be mindful, we're running _backward_, so the gradient _enters the function_ from its output side, i.e., from the nodes that come after it in the forward pass).

This is necessary in order to build a cascade of sequential modules, each applied after the other. This calls for the application of the **chain rule** for the computation of the gradient of **compositions of functions**:

$$
z = g(f(x)): \\
y = f(x) \wedge z = g(y) \\
$$

![](imgs/02/compgra_forward.jpg)

Then, switching to the derivative:

$\Rightarrow \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot\frac{\partial y}{\partial x}$

![](imgs/02/compgra_backward2.gif)

So, the need for an **incoming** gradient becomes clear: each module multiplies it by its own local gradient, and the result gets passed on to the previous node in the computational graph.
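
The role of the incoming gradient can be seen on a toy case. The sketch below defines a hypothetical `Square` Function (not part of PyTorch, written only for this illustration) computing $y=x^2$, and composes it with $z=5y$, so that the incoming gradient is $\partial z/\partial y = 5$.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical toy Function computing x**2, for illustration only."""

    @staticmethod
    def forward(ctx, input_):
        ctx.save_for_backward(input_)
        return input_ ** 2

    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors
        local_grad = 2 * input_          # dy/dx of this node
        return grad_output * local_grad  # chain rule: dz/dy * dy/dx

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)  # y = x^2
z = 5 * y            # dz/dy = 5 is the gradient entering Square.backward
z.backward()

print(x.grad)  # tensor([30.])
```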

In [None]:
class LeakyReLU_Fun(torch.autograd.Function):
    @staticmethod # mind the decorator
    def forward(ctx, input_):
        ctx.save_for_backward(input_) # the parameters that will be involved in the gradient
        return torch.max(input_, input_ * 0.01)
    
    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors # these are the variables which we need to backpropagate the gradient to (only the input)
        # the gradient is 1 for positive x's, 0.01 for negative x's
        grad_input = torch.ones_like(input_)
        grad_input[input_<0] = 0.01
        # now, we need to rescale for the grad_output
        grad_input *= grad_output
        '''
        a valid alternative (maybe better performing?):
        grad_input = grad_output.clone()
        grad_input[input_<0] *= 0.01
        '''
        return grad_input
        

In [None]:
fun = LeakyReLU_Fun.apply
x = torch.linspace(-5,5,11, requires_grad=True)
y = fun(x)
z = y.sum()
z.backward()

In [None]:
x

In [None]:
x.grad
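
A hand-written `backward` is easy to get wrong, so it may be worth comparing it against numerical differentiation with `torch.autograd.gradcheck`. The sketch below repeats the Function (using the alternative form of `backward` mentioned in its comment) so that it runs on its own; the inputs are double precision, as `gradcheck` expects, and are kept away from the kink at $0$. The tolerance values are assumptions.

```python
import torch

class LeakyReLU_Fun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_):
        ctx.save_for_backward(input_)
        return torch.max(input_, input_ * 0.01)

    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input_ < 0] *= 0.01  # slope 0.01 on the negative side
        return grad_input

# double precision inputs, none of them at the non-differentiable point 0
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], dtype=torch.float64, requires_grad=True)

# compares the analytical backward with finite differences; raises on mismatch
ok = torch.autograd.gradcheck(LeakyReLU_Fun.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```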

Let us then revisit our `LeakyReLU` module from before

In [None]:
class LeakyReLU_Better(torch.nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, X):
        return LeakyReLU_Fun.apply(X)

In [None]:
LeakyReLU_Better()(x)

## Building a custom parametric module

We wish to extend our Leaky ReLU module to the Parametric ReLU: $\text{ParamReLU}(x) = \max\{\alpha\cdot x, x\}, \alpha \in [0,1)$.

![](https://pytorch.org/docs/stable/_images/PReLU.png)

Parametric ReLU with $\alpha=0.25$

In [None]:
class ParamReLU_Fun(torch.autograd.Function):
    @staticmethod # mind the decorator
    def forward(ctx, input_, alpha:float):
        assert alpha >= 0 and alpha < 1, f"alpha should be >= 0 and < 1. Found {alpha}."
        ctx.save_for_backward(input_) # the parameters that will be involved in the gradient
        ctx.alpha = alpha # note that we don't use self.alpha
        return torch.max(input_, input_ * alpha)
    
    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors # these are the variables which we need to backpropagate the gradient to (only the input)
        grad_input = grad_output.clone()
        grad_input[input_<0] *= ctx.alpha
        return grad_input, None

In [None]:
class ParamReLU(torch.nn.Module):
    def __init__(self, alpha):
        super().__init__()
        self.alpha = alpha
    
    def forward(self, X):
        return ParamReLU_Fun.apply(X, self.alpha)

In [None]:
prelu = ParamReLU(0.25)
x = torch.linspace(-5,5,11, requires_grad=True)
y = prelu(x)
z = y.sum()
z.backward()
print(x.grad)
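
Two remarks on the Function above, sketched in code: `backward` returns one value per argument of `forward`, so the plain float `alpha` gets `None`; and the input gradient can be compared with the built-in `torch.nn.functional.prelu`, which takes the slope as a tensor. The block repeats the Function so that it runs on its own; the input values are arbitrary and avoid $0$.

```python
import torch
import torch.nn.functional as F

class ParamReLU_Fun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_, alpha: float):
        ctx.save_for_backward(input_)
        ctx.alpha = alpha
        return torch.max(input_, input_ * alpha)

    @staticmethod
    def backward(ctx, grad_output):
        input_, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input_ < 0] *= ctx.alpha
        # one return value per forward argument: alpha is a plain float -> None
        return grad_input, None

x = torch.tensor([-2.0, -1.0, 1.0, 2.0], requires_grad=True)
x_ref = x.detach().clone().requires_grad_(True)  # independent copy for the built-in

ParamReLU_Fun.apply(x, 0.25).sum().backward()
F.prelu(x_ref, torch.tensor([0.25])).sum().backward()

print(x.grad)      # alpha for negative inputs, 1 for positive ones
print(x_ref.grad)  # same values from the built-in
```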

We have covered:

1. The construction of a non-parametric differentiable module
2. The construction of a parametric, non-trainable, differentiable module

What's missing?

## Extra

**Backpropagation**

Let us suppose we have the following calculation

$\mathbf{x} = [1,~2,~-1,~3,~5]$

$y = f(\mathbf{x}) = \log\{[\exp (x_1 \cdot x_2 )]^2 + \sin (x_3 + x_4 + x_5) \cdot x_5\}$

Find

$\nabla f(\mathbf{x})$

In [None]:
# try it in Python!

**Backpropagation with PyTorch modules**

$\mathbf{x} = [1,~2,~-1,~3,~5],~~\mathbf{w_1} = [3,~0,~1,~-3,~0.5]$

$y = \sigma(\mathbf{w_1}^\top \mathbf{x})$, where $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1+\exp(-x)}$


In [None]:
class TrivialBackpropagationExample(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.module1 = # what goes here?
        self.module1.weight.data = torch.Tensor([3, 0, 1, -3, .5])
    
    def forward(self, data):
        return torch.sigmoid(self.module1(data))

Let us try it...

In [None]:
model = TrivialBackpropagationExample()

x = torch.tensor([1,2,-1,3,5], dtype=torch.float32, requires_grad=True)

y = model(x)

y.backward()

print(x.grad)