# Basics of automatic differentiation in PyTorch

In this notebook, you will go through the basic notions of automatic differentiation (aka autodiff) in PyTorch.

## 1. Manual differentiation in pure Python

Before starting with PyTorch and its automatic differentiation features, let us have a look at how to do manual differentiation in Python.

To do so, we will use a very basic example in 1D: let $x$ be a scalar and let $y$ be defined as:

$$y = (x - .5)^2$$

Our goal will be to tune $x$ in order to minimize $y$.

**Question 1.1.** Define a function `f` that takes `x` as input and returns `y` as defined above. 

To minimize $y$, we will use a strategy called gradient descent.
The idea of gradient descent is to iteratively update $x$ by moving it in the opposite direction of the gradient $\frac{\partial y}{\partial x}$.
We hence need to be able to compute $\frac{\partial y}{\partial x}$.
Since we do not rely on autodiff for now, we need to provide the explicit formula for this derivative.

**Question 1.2.** Define a function `grad_f` that takes `x` as input and returns $\frac{\partial y}{\partial x}$.
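A hand-written derivative is easy to get wrong, so it can help to compare it against a numerical estimate. The sketch below does this with a central finite difference on a different toy function; `g`, `grad_g`, `finite_difference` and the step `h` are hypothetical names and values chosen for illustration, not part of the exercise.

```python
def g(u):
    # Hypothetical example function (deliberately not the f of the exercise)
    return u ** 3 + 2.0 * u


def grad_g(u):
    # Derivative of g worked out by hand: 3 u^2 + 2
    return 3.0 * u ** 2 + 2.0


def finite_difference(func, u, h=1e-5):
    # Central difference approximation of the derivative of func at u
    return (func(u + h) - func(u - h)) / (2.0 * h)


u0 = 1.5
analytic = grad_g(u0)
numeric = finite_difference(g, u0)
# The two values should agree up to a small numerical error
print(analytic, numeric)
```

The same check can be applied to your own `f` and `grad_f` once they are written.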

Concretely, given a step size $\eta > 0$, each iteration applies the following update rule:

$$x \leftarrow x - \eta \frac{\partial y}{\partial x}$$
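To see the update rule in action, here is a minimal sketch on a different quadratic, $h(u) = u^2 - 4u + 7$, whose minimizer is $u = 2$. The function, the starting point, the step size and the number of steps are all assumptions made for this illustration.

```python
def h(u):
    # Hypothetical quadratic with its minimum at u = 2
    return u ** 2 - 4.0 * u + 7.0


def grad_h(u):
    # Derivative of h, written by hand
    return 2.0 * u - 4.0


u = 10.0     # arbitrary starting value
eta = 0.1    # step size
history = [u]

for _ in range(50):
    # Move against the gradient, scaled by the step size
    u = u - eta * grad_h(u)
    history.append(u)

print(u)  # expected to be close to 2
```

With this step size, the distance to the minimizer shrinks by a constant factor at every iteration, which is why a few dozen steps are enough here.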

**Question 1.3.** Define a starting value for `x` and a step size `eta` and apply gradient descent for 30 steps.

**Question 1.4.** Is the resulting value for `x` close to the value you would expect as a minimizer for $y = f(x)$?

_YOUR ANSWER HERE_

## 2. PyTorch and the automatic computation of gradients

In practice, PyTorch is very similar to NumPy. One main difference is that you can ask, at any moment, for the automatic computation of gradients.

To trigger the computation of $\frac{\partial a}{\partial b}$ for a scalar tensor `a`, write:

```python
a.backward()
```

This triggers the computation of the gradient of `a` with respect to every tensor that was created with `requires_grad=True` and was involved in the computation of `a`.

For each such tensor `b`, the gradient $\frac{\partial a}{\partial b}$ is then stored in `b.grad`.
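As a small illustration, the sketch below builds a scalar `a` from two tensors and reads their gradients after `backward()`. The expression and the values are made up for this example.

```python
import torch

# Two scalar tensors for which gradients are tracked (values are arbitrary)
b = torch.tensor(2.0, requires_grad=True)
c = torch.tensor(3.0, requires_grad=True)

# a depends on both b and c
a = b * c + b ** 2

# Compute da/db and da/dc in one call
a.backward()

print(b.grad)  # da/db = c + 2 * b
print(c.grad)  # da/dc = b
```

Here the hand-derived values are $c + 2b = 7$ and $b = 2$, which is what the two `.grad` attributes should hold.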

**Question 2.1.** Fill in the code below to check what `x.grad` contains before `backward()` is called:

```python
import torch

def f(x):
    return (x - .5) ** 2

x = torch.tensor(0.125, requires_grad=True)
y = f(x)

# Fill the code here
```

**Question 2.2.** Now, trigger the computation of gradients $\frac{\partial y}{\partial x}$ and print this gradient.

## 3. Gradient descent in PyTorch

**Question 3.1.** Implement the gradient descent from Section 1 again, this time in PyTorch. You no longer need `grad_f` in your computations.
Each iteration should consist of:
1. computing `y` from the current value of `x`;
2. triggering gradient computation by calling `y.backward()`;
3. updating `x` (this step must be wrapped in a `with torch.no_grad():` block);
4. zeroing out the gradient of `x` so that gradients do not accumulate across steps.
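The last two steps are easier to motivate with a small experiment. The sketch below, on a made-up tensor `t`, shows that repeated `backward()` calls add to `.grad` rather than overwrite it, and that an in-place update of a tensor requiring gradients is rejected outside a `torch.no_grad()` block.

```python
import torch

t = torch.tensor(1.0, requires_grad=True)

# Gradients accumulate: each backward() call adds to t.grad
(3.0 * t).backward()
first = t.grad.item()        # gradient after one call
(3.0 * t).backward()
second = t.grad.item()       # gradient after two calls, without reset

# Resetting the gradient in place
t.grad.zero_()
after_reset = t.grad.item()

# An in-place update of a leaf tensor that requires grad raises an error
try:
    t -= 0.1
    raised = False
except RuntimeError:
    raised = True

# The same update is allowed when gradient tracking is disabled
with torch.no_grad():
    t -= 0.1

print(first, second, after_reset, raised, t)
```

After the protected update, `t` still has `requires_grad=True`, so it can be used in the next iteration of a loop.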

```python
x = torch.tensor(0.125, requires_grad=True)

stepsize = 0.1
n_iter = 30

for i in range(n_iter):
    # Compute y and force gradient computations

    with torch.no_grad():
        # Update x
        pass
    # Zero-out gradients (code below is OK, leave it as it is)
    x.grad.zero_()
```

## 4. Wrap-up: optimizing parameters of a univariate linear regression model

Below is some code to generate (and visualize) the synthetic dataset you will use in this section.

```python
import matplotlib.pyplot as plt

X = torch.rand(100, 1)
# w* = -3, b* = 1.5
y = -3. * X + 1.5 + 0.4 * torch.randn(X.size())

plt.scatter(X.numpy(), y.numpy())
```

You will try to fit a linear regression model to this dataset.

**Question 4.1.** Given the code that generated the dataset, what should be the ideal values for $w$ and $b$ in your linear model?

_YOUR ANSWER HERE_

**Question 4.2.** Implement a function `mse` that takes `X`, `y`, `w`, `b` as inputs and returns the mean squared error of the linear model parametrized by `w` and `b` on the dataset $(X, y)$, that is, $\frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)^2$ where $n$ is the number of samples.

**Question 4.3.** Implement a gradient descent loop that fits `w` and `b` by minimizing the mean squared error on the provided dataset. Use a step size of 0.1 and perform 1000 iterations of the algorithm.
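Once the gradient descent loop has run, one possible sanity check is to compare the fitted `w` and `b` with the closed-form least-squares solution. The sketch below uses `torch.linalg.lstsq` for that purpose; the fixed seed and the names `A`, `w_ref` and `b_ref` are assumptions added for this illustration, and the dataset is regenerated with the same recipe as above.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, only to make the run reproducible

# Same data-generating recipe as in the dataset cell
X = torch.rand(100, 1)
y = -3. * X + 1.5 + 0.4 * torch.randn(X.size())

# Design matrix: one column for the input, one column of ones for the intercept
A = torch.cat([X, torch.ones_like(X)], dim=1)   # shape (100, 2)

# Least-squares solution of A @ [w, b] = y
solution = torch.linalg.lstsq(A, y).solution    # shape (2, 1)
w_ref, b_ref = solution[0].item(), solution[1].item()

print(w_ref, b_ref)
```

Because of the noise in `y`, these reference values are only expected to be near the generating parameters, and a well-tuned gradient descent should land close to them.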