In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

# Understanding and implementing a neural network

This exercise consists of two sub-exercises that will help you understand how backpropagation is carried out and how a neural network is implemented.

A) Backpropagation - The simplest backpropagation explanation and exercise (with pen and paper)

B) Implementation of a convolutional neural network (with Python)



## A. Backpropagation


### Simple neural network
For this exercise we will walk through the simplest possible backpropagation example. The neural network that will be used is as follows:

<img src="assets/01_simple_network.jpg" width="400" align="center"/>

This network is used to predict a value of $\hat{y}$, given the input $x$, where both $x$ and $y$ are scalars. This network contains only one fully connected layer (without a bias), therefore the output can be calculated as $\hat{y}=w\cdot x$.


### Training set
The model needs to be trained to obtain the optimal value for the weight $w$. Normally a large training set is used to find the optimal values for all the weights, but for simplification purposes, our training set consists of a single input-output pair, which is as follows:


| Input ($x$) | Desired output ($y$) |
| :--- | :--- |
| 1.5 | 0.5 |

Because this is such an easy example, we know that the solution to this optimization problem is $w = \frac{y}{x} = \frac{0.5}{1.5} \approx 0.33$. However, neural networks are normally used for much more complex optimization problems with millions of parameters (weights) and many more training examples. In those cases the best solution cannot simply be calculated directly, and an iterative training approach is needed in which the network weights are optimized one step at a time.


### Model initialization
This optimization process predicts the output $\hat{y}$ given an input $x$ and a weight $w$. Subsequently the weight $w$ is updated such that the predicted output $\hat{y}$ becomes more similar to $y$. To start this optimization process, the model weight $w$ is initialized with a random value, let's say $0.8$. We can now calculate the current value (after zero epochs, i.e. at initialization) of $\hat{y}$, given that $x=1.5$ and $w=0.8$:

|Epoch | Input ($x$) | Desired output ($y$) | Weight ($w$) | Predicted output ($\hat{y}$) |
| :--- | :--- | :--- | :--- | :--- |
|0 (init) | 1.5 | 0.5 | 0.8 | 1.2 |
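
The initial prediction in the table can be reproduced with a few lines of Python. This is a minimal sketch; the function name `predict` is an illustrative choice and not part of the exercise.

```python
def predict(w, x):
    """Forward pass of the single-weight network: y_hat = w * x."""
    return w * x

x = 1.5   # input from the training set
w = 0.8   # initial weight chosen in the text

y_hat = predict(w, x)
print(round(y_hat, 2))  # 1.2
```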


### Training & loss function
Now the question arises of how the model needs to be trained such that the current output reaches the desired output of $0.5$. For this training process, a loss function is defined. The model will try to minimize the value of the loss function, and therefore the loss function gives the model feedback on how the network weights should be updated. The loss function of this example is defined as the squared difference between the predicted and desired output:

\begin{equation}
L = (\hat{y} - y)^2
\end{equation}

The loss function is visualized for the given training pair in the following figure. It can be seen that the loss, as a function of the weight $w$, is a parabola with a minimum around $w \approx 0.33$ (green dot), which is in line with the solution we calculated earlier.

<img src="assets/02_Loss_function.jpg" width="500" align="center"/>

The backpropagation algorithm seeks to minimize the loss by descending along the loss function (red arrow), which is called gradient descent. To take a descending step in the direction of the slope, the derivative of the loss function needs to be calculated.
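
The shape of the loss described above can also be checked numerically. The sketch below evaluates $L = (wx - y)^2$ on a grid of candidate weights and locates the smallest value; the grid range and resolution are arbitrary assumptions made for illustration.

```python
import numpy as np

x, y = 1.5, 0.5  # the single training pair from the text

def loss(w):
    """Squared-error loss L = (w*x - y)^2 for the fixed training pair."""
    return (w * x - y) ** 2

# Evaluate the loss on a grid of candidate weights (range chosen arbitrarily)
w_grid = np.linspace(-0.5, 1.5, 2001)
losses = loss(w_grid)

w_best = w_grid[np.argmin(losses)]
print(f"loss at w=0.8: {loss(0.8):.2f}")    # (1.2 - 0.5)^2 = 0.49
print(f"grid minimum near w={w_best:.2f}")  # close to y/x = 0.33
```

A grid search like this only works because there is a single weight; with many weights the grid grows too large, which is why gradient descent is used instead.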


#### EXERCISE A1
* Given the model $\hat{y}=wx$ and the loss function $L = (\hat{y} - y)^2$, find the derivative of the loss function with respect to the weight: $\frac{\partial L}{\partial w}$. (Tip: Use the chain rule: $\frac{\partial L}{\partial w}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w}$)

* Fill in the values of the training set: $x=1.5$ and $y=0.5$.


**ANSWER** 

* $\frac{\partial L}{\partial w} = 2x(wx-y)$

* $\frac{\partial L}{\partial w}(x=1.5, y=0.5) = 4.5w-1.5$
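
One way to sanity-check a hand-derived gradient is to compare it against a finite-difference approximation. The following is a minimal sketch of that check; the helper names and the step `eps` are assumptions made for illustration.

```python
x, y = 1.5, 0.5  # training pair from the text

def loss(w):
    """Squared-error loss L = (w*x - y)^2."""
    return (w * x - y) ** 2

def grad_analytic(w):
    """dL/dw = 2x(wx - y), the result from Exercise A1."""
    return 2 * x * (w * x - y)

def grad_numeric(w, eps=1e-6):
    """Central finite-difference approximation of dL/dw."""
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

w = 0.8
print(grad_analytic(w))  # 4.5 * 0.8 - 1.5, approximately 2.1
print(grad_numeric(w))   # should agree closely with the analytic value
```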


### Learning rate
After calculating the derivative, the model weight is updated by taking a step along the slope. The step size needs to be chosen with care. If the step size is too small, it will take many steps before the minimum is reached. But if the step size is too large, the minimum will not be reached exactly because the red dot 'bounces' around the minimum. This step size is usually called the learning rate. For this example, we take a learning rate ($r$) of $0.1$.

Now the weight can be updated according to gradient descent and the learning rate as follows:
\begin{equation}
w_{new} = w_{old} - r \frac{\partial L}{\partial w}
\end{equation}
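
To get a feel for how the learning rate affects this update, the sketch below applies the rule for ten steps with a few different values of $r$. Only $r = 0.1$ comes from the text; the other values, the number of steps, and the helper names are illustrative assumptions.

```python
x, y = 1.5, 0.5  # training pair from the text

def step(w, r):
    """One gradient-descent update with learning rate r."""
    return w - r * 2 * x * (w * x - y)

def run(r, w0=0.8, n_steps=10):
    """Return the list of weights visited over n_steps updates."""
    weights = [w0]
    for _ in range(n_steps):
        weights.append(step(weights[-1], r))
    return weights

# 0.1 is the value used in the text; the others are for comparison only
for r in (0.01, 0.1, 0.4, 0.5):
    final = run(r)[-1]
    print(f"r={r}: w after 10 steps = {final:.3f}")
```

For this particular loss, each update multiplies the distance to the minimum by $(1 - 4.5r)$. The weight therefore approaches $\frac{1}{3}$ only when $|1 - 4.5r| < 1$, that is for $0 < r < 0.44$; values of $r$ above $0.22$ overshoot and alternate around the minimum, and values above $0.44$ move further away with every step.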

After one step (i.e. after one epoch, since we have a training set size of 1), the weight is updated to 

\begin{equation}
w_{new} = 0.8 - 0.1 \frac{\partial L}{\partial w}(x=1.5, y=0.5, w=0.8) \approx 0.59,
\end{equation}

which means that the updated predicted value is: $\hat{y}=0.59\cdot 1.5 \approx 0.89$.
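
The single update step above can be written out in Python as follows. This is a sketch with illustrative variable names, using the gradient $2x(wx - y)$ from Exercise A1.

```python
x, y = 1.5, 0.5   # training pair
w_old = 0.8       # initial weight
r = 0.1           # learning rate

grad = 2 * x * (w_old * x - y)   # dL/dw at w = 0.8, approximately 2.1
w_new = w_old - r * grad         # gradient-descent update
y_hat = w_new * x                # prediction with the updated weight

print(f"{w_new:.3f} {y_hat:.3f}")  # 0.590 0.885
```

The prediction of $0.885$ is the value that the text rounds to $0.89$.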


#### EXERCISE A2
Calculate the weights and predicted outputs for the next epochs until the model converges (i.e. the weight is approximately 0.33). Fill in (and continue) the following table:

|Epoch | Input ($x$) | Desired output ($y$) | Weight ($w$) | Predicted output ($\hat{y}$) |
| :--- | :--- | :--- | :--- | :--- |
|0 (init) | 1.5 | 0.5 | 0.8 | 1.2 |
|1 | 1.5 | 0.5 | 0.59 | 0.89 |
|2 | 1.5 | 0.5 |  |  |
|3 | 1.5 | 0.5 |  |  |
|.. | .. | .. |  |  |
|n | 1.5 | 0.5 |  |  |
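
Hand calculations for this table can be checked with a short loop that repeats the update rule. This is a minimal sketch; the number of epochs (8) is an arbitrary assumption and not a statement about when the model has converged.

```python
x, y = 1.5, 0.5   # training pair
r = 0.1           # learning rate
w = 0.8           # initial weight
n_epochs = 8      # arbitrary choice for this sketch

history = []      # (epoch, weight, prediction) rows, mirroring the table columns
for epoch in range(n_epochs + 1):
    y_hat = w * x
    history.append((epoch, w, y_hat))
    w = w - r * 2 * x * (w * x - y)   # gradient-descent update for the next epoch

for epoch, w_e, y_hat_e in history:
    print(f"{epoch:>5} | {w_e:.4f} | {y_hat_e:.4f}")
```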

#### QUESTION A1
After how many epochs does the model converge?

### Following model training
Normally this process is done using Python because the model is much more complex. But you still want to follow the training process to know when the model is done training. For this you can plot the loss values during training. To mimic this, we are also going to plot the loss curve.

#### EXERCISE A3
Complete the list `w` (line 5 of the following Python script) with the values of the weights that you calculated in the previous exercise. When the list is complete, run the cell and inspect the training process.

In [10]:
import matplotlib.pyplot as plt
import numpy as np

# Fill in the values of the weights of all the epochs (4th column in the previous table)
w = [0.8, 0.59, ...]
# ANSWER: w = [0.8, 0.59, 0.4745, 0.4110, 0.3760, 0.3568]

# Define inputs
x = 1.5
y = 0.5

# Compute the loss for the weight of each epoch
weights = np.array(w)
epochs = np.arange(len(w))
L = (weights*x - y)**2

plt.figure()
plt.plot(epochs, L)
plt.title('Loss value over time')
plt.xlabel('Epochs')
plt.ylabel('Training loss')
plt.show()


#### QUESTION A2
What do you think of the training process when looking at the loss plot resulting from Exercise A3?
* Do you think the model trained long enough? Explain your answer. 
* Do you think the step size of 0.1 was appropriate for the given model? Explain your answer.

## B. Convolutional neural network implementation

### Hypothetical population
For the second part of this practical session we will implement a simple neural network using PyTorch. The goal of this model is to distinguish healthy people from sick people. Imagine a hypothetical population in which all people with a straight head are 'healthy' and all people with a head that is rotated 90 degrees to the left or the right are 'sick'. An example of brain MRI data from this population is given in the following figure:

<img src="assets/03_Brain_downsampled.jpg" width="1000" align="center"/>

As mentioned before, this will be a simple implementation, therefore all high-resolution images are downsampled to low-resolution 'MRI' images of 3x3. These low-resolution images are shown in the second row. The task for this exercise is to implement a neural network that can be trained on the 3x3 images to distinguish sick ('horizontal lines') from healthy ('vertical lines') subjects.


### Datasets
To train and test the model, a toy training and test dataset can be loaded by running the following Python cell.
* The training set consists of 100 3x3 'brain MRI' images with the corresponding label of healthy (0) or sick (1). These labels can be seen as the desired output ($y$) in part A of this practical session.
* The test set consists of 10 3x3 'brain MRI' images with the corresponding labels, which are used to compare the model performance to the ground truth (the labels).
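
To make the array shapes concrete, the sketch below builds a stand-in dataset with the same dimensions. It is not the loader used in this exercise: the helper `make_toy_set`, the label array names, the pixel values, and the position of each line are all assumptions made for illustration. Only the shapes, the 0/1 labels, and the vertical/horizontal distinction come from the description above.

```python
import numpy as np

def make_toy_set(n_samples, rng):
    """Hypothetical stand-in data: vertical line = healthy (0), horizontal line = sick (1)."""
    images = np.zeros((n_samples, 3, 3), dtype=np.float32)
    labels = rng.integers(0, 2, size=n_samples)
    for i, label in enumerate(labels):
        pos = rng.integers(0, 3)      # which column or row holds the line (assumed)
        if label == 0:
            images[i, :, pos] = 1.0   # vertical line: one bright column
        else:
            images[i, pos, :] = 1.0   # horizontal line: one bright row
    return images, labels

rng = np.random.default_rng(0)
train_set, train_labels = make_toy_set(100, rng)
test_set, test_labels = make_toy_set(10, rng)

print(train_set.shape, test_set.shape)  # (100, 3, 3) (10, 3, 3)
```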

#### EXERCISE B1
Run the following Python cell.

In [None]:
# LOAD datasets as large numpy arrays
# train_set = np array [100, 3, 3]
# test_set = np array [10, 3, 3]

#.... 