# Practical 2 How to learn with neural networks?
## <font color='green'>[+100 points in total, +30 optional points]</font>

This assignment contains several files that need to be completed or implemented. This Jupyter notebook serves only as "glue" that connects everything together and lets you run the code more easily. To run a particular file, type the following in a standard code cell:
```lua
dofile 'mynewfile.lua'
```
and it will run your code as if you had typed it in the terminal.

The rest of this notebook describes the assignment, presents the questions that you need to answer and guides you through the answers. You should not use this notebook to write the code for the assignment itself, other than calling "main" files to get, print and plot the results. Provide the textual as well as the numerical answers inline, after each question, by running the scripts that you have implemented. Once you complete the assignment and answer the questions inline, you can download the report as a PDF (File->Download as->PDF) and send it to us, together with the code.

To share your code, results and PDF with us, please make a Bitbucket repository and invite us to your project. We will then clone your project and correct your assignment.

# Instructions

Implement your code and answer all the questions. Make your own Bitbucket repository (Bitbucket allows private repositories). Commit your answers to the repository and invite us to access your solutions. Please send your answers by February 18, at 23:59.
The course has a 7 late day policy. This means that you are allowed 7 late days in total for delivering the assignments, and you can use them as you please. Beyond that, each extra late day incurs a 10% penalty on your final score.

# Summary

By the end of this practical you should:
<ul>
<li>be able to derive the correct backpropagation equations for an arbitrary neural network</li>
<li>be able to implement a standard multi-layer perceptron (MLP) with various architectures.
MLP is the most basic neural network and can be used for a variety of tasks, such as classification and regression.</li>
<li>be comfortable with picking the right hyperparameters for your neural network given a task.</li>
<li>be able to implement your own basic module (or layer) in Torch.</li>
<li>be able to implement your own optimization function with which you can train your neural network.</li>
<li>Optionally, you should be able to write your own backpropagation algorithm by implementing the basic equations. </li>
</ul>

# Section 2 Implement a multi-layer perceptron (MLP)

First, we will implement our very first multilayer perceptron, commonly referred to as MLP.
An MLP is a feedforward neural network.
This means that it is composed of multiple layers, or modules as they are referred to in Torch jargon.
The possible design choices for an MLP, outside the learning hyperparameters, are the following:

<ul>
<li>number of modules</li>
<li>number of units in each module</li>
<li>type of operation applied inside each module</li>
<li>type of loss function</li>
</ul>

Your starter code is composed of the following files: <tt>load_data_mnist.lua</tt>, <tt>preprocess_data_mnist.lua</tt>, <tt>define_model.lua</tt>, <tt>define_training.lua</tt>, <tt>define_testing.lua</tt> and <tt>assignment2.lua</tt>.
Inside these files there are <tt>TODO</tt> items that need to be completed before running any full scale experiment.

The <tt>assignment2.lua</tt> file is the wrapper that connects all the other files. 
It's split into 3 parts where the experiment parameters are configured and 5 parts, where the different functions are called.
Once all files are fully implemented, you can run an experiment with
```lua
dofile 'assignment2.lua'
```

Next, we guide you through the implementation.

## Section 2.1 Define the experiment setup

Before you start with anything, you need to define the setup of your experiments.
Here you can <tt>require</tt> all your Torch packages or define the Torch parameters for running the experiment.
Also, you can define the model and optimization hyper-parameters.

Open your <tt>assignment2.lua</tt> file and go to section 2.1.
Fill in the missing values in the TODO parts.

### <font color='green'>Question A.1 [+5 points]</font>

<font color='green'>In the experiment options you need to define
<ul>
<li>the loss function in <tt>opt['loss']</tt>.
You can check the different options here https://github.com/torch/nn/blob/master/doc/criterion.md. 
You can start with the <tt>Margin Criterion</tt>, which is essentially the SVM loss.</li>
<li>the optimization method that you are going to use in <tt>opt['optimization']</tt>.
You can check the different options here http://optim.readthedocs.org/en/latest/.
You can start with <tt>SGD</tt>.</li>
</ul></font>

<font color='green'>
In the model options you need to define
<ul>
<li> The number of hidden units.
You can start with 100 units.</li>
<li> The (mini-) batch size.
You can start with 1, meaning that after each new training sample you make a parameter update with gradient descent.</li>
<li> The learning rate.
You can start with <tt>1e-3</tt>, which is a reasonable choice, see lecture notes.</li>
<li> The weight decay, which is equivalent to adding a regularization term on your model parameters.
You can start with 0.0, in which case you have no regularization.</li>
<li> The momentum for computing more robust gradient updates.
You can start with the vanilla version and set the momentum to 0.0.</li>
</ul>
</font>
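As a rough sketch of how these starting values might be collected in one Lua table: only the <tt>opt['loss']</tt> and <tt>opt['optimization']</tt> keys are named in this notebook. All other key names and the string values below are hypothetical placeholders, so match them to whatever your starter <tt>assignment2.lua</tt> actually expects.

```lua
-- Hypothetical experiment options. Only the 'loss' and 'optimization'
-- keys are named in the assignment text; the other keys are assumptions.
local opt = {}
opt['loss']         = 'margin'  -- placeholder name for the margin (SVM-like) loss
opt['optimization'] = 'sgd'     -- placeholder name for plain SGD

-- Suggested starting values from the list above.
opt['hidden_units']  = 100
opt['batch_size']    = 1
opt['learning_rate'] = 1e-3
opt['weight_decay']  = 0.0
opt['momentum']      = 0.0

-- Print the configuration so it ends up in the notebook output.
for key, value in pairs(opt) do
   print(key, value)
end
```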



## Section 2.2 Define the dataset

Next, you need to define the dataset.
In this experiment we are going to use the MNIST dataset, which is very popular in the neural network and deep learning community.
It is not a very difficult dataset: state-of-the-art methods reach accuracies of up to 99.7-99.8%.
Still, it's a good dataset for studying the essentials of neural networks.

You will first need a wrapper for loading the data that you are going to process.
We have already prepared a script for loading the MNIST dataset, which is invoked with the command
```lua
dofile 'load_data_mnist.lua'
```

## Section 2.3 Perform data preprocessing and normalization
### <font color='green'> [Questions A.2-A.4: +10 points] </font>

As discussed in the lectures, neural networks and deep learning prefer input variables that stay as close to the raw data as possible.
Still, this does not mean that any input will do: in the vast majority of cases the data need to be preprocessed and normalized.

The following 2 questions will outline the importance of data normalization on a toy, 1-d example.
These questions are not directly related to the rest of the assignment though.

### <font color='green'>Question A.2 Simple data normalization</font>

<font color='green'>
Open the dataset of points from "points.txt".
When plotted with the x-axis range x=[0, 20] and the y-axis range [-15, 0], you will get the following image.

<img src="images/toydataset.png">

Implement Gaussian normalization so that each of the two variables has zero mean and unit variance, as in an $N(0, 1)$ distribution.
Plot the distributions of features before and after and visually verify that the features are indeed normalized.</font>
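The normalization above can be sketched in plain Lua, without Torch tensors. This is a minimal illustration under assumptions: the function name <tt>standardize</tt> and the sample values are made up (they are not taken from "points.txt"), and the population standard deviation (dividing by $n$) is used, whereas you may prefer the unbiased estimate.

```lua
-- Standardize a list of numbers to zero mean and unit standard deviation.
-- Returns the normalized list plus the mean and standard deviation used.
local function standardize(values)
   local n = #values
   local sum = 0
   for i = 1, n do sum = sum + values[i] end
   local mean = sum / n

   local sq = 0
   for i = 1, n do sq = sq + (values[i] - mean)^2 end
   local std = math.sqrt(sq / n)   -- population standard deviation

   local out = {}
   for i = 1, n do out[i] = (values[i] - mean) / std end
   return out, mean, std
end

-- Made-up x-coordinates, not taken from points.txt.
local xs = {2, 4, 4, 4, 5, 5, 7, 9}
local normalized, mean, std = standardize(xs)
print(mean, std)
```

For a 2-d dataset, the same function would be applied to each coordinate separately.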

### <font color='green'>Question A.4</font>

<font color='green'>
Now that you have implemented the normalization for this simple dataset, do the same for the input features loaded from <tt>load_data_mnist.lua</tt>.
Different from before, we have multiple pixels and therefore multiple dimensions per sample image.

Normalize the input image pixels per channel.
Namely, take all the pixel values per channel (red, green, blue) and compute the mean and standard deviation for each channel (3 means, 3 standard deviations).
Print the mean and standard deviation values after normalization for both training and test set.
Do you observe a difference between the normalized training and test features and why?</font>
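One common convention, assumed here rather than stated in the assignment, is to compute the statistics on the training set only and reuse them for the test set. The plain-Lua sketch below illustrates this for the flattened values of a single channel. The helper names and the pixel values are made up for illustration.

```lua
-- Mean and (population) standard deviation of a list of numbers.
local function mean_std(values)
   local n, sum = #values, 0
   for i = 1, n do sum = sum + values[i] end
   local mean = sum / n
   local sq = 0
   for i = 1, n do sq = sq + (values[i] - mean)^2 end
   return mean, math.sqrt(sq / n)
end

-- Shift and scale a list of numbers with the given statistics.
local function normalize(values, mean, std)
   local out = {}
   for i = 1, #values do out[i] = (values[i] - mean) / std end
   return out
end

-- Made-up pixel intensities for one channel, not real MNIST data.
local train_pixels = {0, 0, 4, 4}
local test_pixels  = {0, 4, 4, 4}

-- Statistics come from the training set only and are reused for the test set.
local train_mean, train_std = mean_std(train_pixels)
local train_norm = normalize(train_pixels, train_mean, train_std)
local test_norm  = normalize(test_pixels, train_mean, train_std)

print(mean_std(train_norm))
print(mean_std(test_norm))
```

With these toy values the normalized training set has exactly zero mean and unit standard deviation, while the normalized test set does not.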

## Section 2.2 Define training, optimization, etc.
### <font color='green'> [Questions A.5-A.6: +10 points] </font>

Before working on our model and trying to find the best possible architecture and hyperparameters, we need to prepare the routines for training, namely learning the optimal model parameters to obtain the best possible accuracy for our model.

### <font color='green'>Question A.5</font>

<font color='green'>
Open the <tt>define_training.lua</tt> file and go to <tt>TODO1</tt>.
Write the code to split the data into minibatches of the predefined size.
For this you can add the minibatches to Lua tables with <tt>table.insert(...)</tt>.</font>
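The splitting step can be sketched in plain Lua on sample indices. This is only an illustration: the function name <tt>make_minibatches</tt> is hypothetical, shuffling is left out, and in the actual file you would collect tensors of samples rather than integer indices.

```lua
-- Split indices 1..n into consecutive minibatches of at most batch_size
-- elements; the last batch may be smaller.
local function make_minibatches(n, batch_size)
   local batches = {}
   for first = 1, n, batch_size do
      local batch = {}
      for i = first, math.min(first + batch_size - 1, n) do
         table.insert(batch, i)
      end
      table.insert(batches, batch)
   end
   return batches
end

-- Ten samples in batches of four give batches of sizes 4, 4 and 2.
local batches = make_minibatches(10, 4)
for b, batch in ipairs(batches) do
   print(b, table.concat(batch, ' '))
end
```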

To complete the <tt>define_training.lua</tt> file you need to finish the implementation of the <tt>local feval = function(x)</tt> function.
The <tt>feval</tt> function performs a forward and backward pass to backpropagate the gradients computed from the minibatches.


### <font color='green'>Question A.6</font>

<font color='green'>
In the <tt>define_training.lua</tt> file finish the implementation of the feval function, see <tt>TODO2</tt>.
Given the input minibatches you first need to perform a forward pass and get the output of the network and the respective loss.
Then you perform a step of backpropagation to compute the gradient estimates as the sum of the gradients computed per sample in the minibatch.
Optionally, you might want to update the confusion matrix so that you have an overview of the training progress.
In the end do not forget to divide the minibatch error and the minibatch gradient by the number of samples in the minibatch.</font>

You do not need to add code on any lines other than the ones that are obviously incomplete, which are followed by a comment <tt>--  COMMENT</tt>.
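To make the accumulate-then-average pattern concrete, here is a plain-Lua stand-in for <tt>feval</tt>. It is not the starter code: it uses a made-up scalar model $wx$ with a squared loss instead of the network and criterion. It does follow the same contract, taking the parameters and returning the loss and its gradient.

```lua
-- Toy model: prediction = w * x, loss = 0.5 * (prediction - y)^2.
-- The minibatch below is made up for illustration.
local inputs  = {1, 2, 3}
local targets = {2, 4, 6}

-- Given the parameter w, return the minibatch loss and its gradient.
local feval = function(w)
   local loss, grad = 0, 0
   for i = 1, #inputs do
      local output = w * inputs[i]       -- forward pass
      local diff = output - targets[i]
      loss = loss + 0.5 * diff^2         -- accumulate the per-sample loss
      grad = grad + diff * inputs[i]     -- accumulate the per-sample gradient
   end
   -- Average over the minibatch so the scale does not depend on its size.
   return loss / #inputs, grad / #inputs
end

local loss, grad = feval(1)
print(loss, grad)
```

In the real <tt>feval</tt> the forward and backward calls of the model and the criterion replace the two hand-written lines, but the final division by the minibatch size stays the same.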


## Section 2.3 Define MLP model

Once we have loaded and preprocessed the data, we need to define the architecture of our model.
Defining the architecture of a neural network in Torch is pretty straightforward.
First, we need to define the container of the network, which can be either one of the following:
```lua
require 'nn'
...
nn.Sequential()
nn.Parallel()
nn.Concat()
```

The <tt>nn.Sequential()</tt> connects modules in a feedforward manner.
The <tt>nn.Parallel()</tt> feeds parts of data to different parts of the network resembling a parallel structure.
The <tt>nn.Concat()</tt> feeds the same input to each of its modules and concatenates their outputs.
As the MLP is a feedforward neural network, we opt for <tt>nn.Sequential()</tt>.
Note that after creating the <tt>nn.Sequential()</tt> container, we add

```lua
nn.Reshape(D)
```
where $D$ is the feature dimensionality.
This module flattens the input, so that there are no issues with the geometry of the features.
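A minimal sketch of this first step, assuming a single 28x28 grayscale input so that $D = 784$. Check the actual image size returned by your data loader, as it may differ. The random tensor only stands in for a real sample.

```lua
require 'nn'

-- Assumed input geometry: a single 28x28 grayscale image, so D = 784.
local D = 28 * 28

local model = nn.Sequential()
model:add(nn.Reshape(D))           -- flatten the image into a D-dimensional vector

local image = torch.rand(28, 28)   -- random stand-in for a real sample
local flat = model:forward(image)
print(flat:size())
```

The remaining modules are then appended to the same container with further <tt>model:add(...)</tt> calls.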

Next, we need to define the rest of the modules of the architecture <b>and</b> the loss function.
Open the <tt>define_model.lua</tt> file and fill in the code in the <tt>TODO</tt> sections according to the following questions.

### <font color='green'>Question A.7 [+5 points]</font>

<font color='green'>
Define a network composed of the modules: a linear module $\rightarrow$ a $tanh(\cdot)$ module $\rightarrow$ a linear module $\rightarrow$ a log-softmax module.
Print the model.
What kind of output does the log-softmax module return?
In the default code we are using the negative log-likelihood criterion.
Write down the loss function this criterion minimizes.</font>

Having defined your training model, you are almost ready to run your first experiment.
Before you do that, you first also need to define your testing script, where you test the accuracy of your model during training, or after the training for the final evaluations.
For your convenience we have implemented this script for you.
The script is implemented in the file <tt>define_testing.lua</tt>.

## Section 2.4 Tune the network and the hyper-parameters 
### <font color='green'> [Questions A.8-A.12: +30 points] </font>

If you have correctly completed all the parts so far, you should be able to run a full A to Z experiment by typing

```lua
dofile 'assignment2.lua'
```

Although the default settings might work reasonably well, other choices of architecture, loss function and hyperparameters are possible.

### <font color='green'>Question A.8</font>

<font color='green'>
Run your MLP using the default, suggested parameters.
Print the accuracy results per class and overall.
Plot and report the training and testing loss, as well as the training and testing accuracy during the different training epochs.</font>

### <font color='green'>Question A.9</font>

<font color='green'>
Experiment with different architectures.
In the original model we had a linear module $\rightarrow$ a $tanh(\cdot)$ module $\rightarrow$ a linear module $\rightarrow$ a log-softmax module.
Now try increasing the depth, namely define a new model composed of:
<ul>
<li>a linear module</li>
<li>a $tanh(\cdot)$ module</li>
<li>a linear module</li>
<li>a $tanh(\cdot)$ module</li>
<li>a log-softmax module.</li>
</ul>
</font>

<font color='green'>
Then, try replacing in the original network the $tanh(\cdot)$ with sigmoid modules, then with ReLU modules.
</font>

<font color='green'>
Try a combination of the two, namely replace the modules in the deeper network with the activation functions that worked best.
</font>

<font color='green'>
In this last network try increasing the number of neurons per module.
What are your observations?
</font>

### <font color='green'>Question A.10</font>

<font color='green'>
Experiment with different loss functions.
Try using the negative log likelihood instead of the margin loss.
What are your observations?
</font>

### <font color='green'>Question A.11</font>

<font color='green'>
Experiment with different optimization methods.
Try using the ASGD optimization and ADAM.
What are your observations?
</font>

### <font color='green'>Question A.12</font>

<font color='green'>
Tune the hyperparameters so that you have the best possible test accuracy.
For one, include regularization by increasing the weight decay hyperparameter.
What are your observations?
</font>



# Section 3 Backpropagation and implementing your own module

Until now you have made use of the already existing modules implemented in Torch.
However, the real power of deep learning and neural networks is that one can easily implement new modules and integrate them.

To implement a new module that is compatible with backpropagation, you must first define the forward pass of the module, $a=h(x;\theta)$, where $a$ are the activations and $h(\cdot)$ the module function.
Implementing the forward pass is relatively straightforward.
Then, you must define the backward pass used by backpropagation to compute the gradients of the module with respect to the input $x$ and the module parameters $\theta$, if the module has trainable parameters.
Implementing the backward pass is more complicated.
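In equation form, the backward pass is an application of the chain rule. The notation here is an assumption, since the text above does not fix it: $\mathcal{L}$ denotes the loss, gradients are column vectors, and $\frac{\partial a}{\partial x}$ is the Jacobian with entries $\frac{\partial a_i}{\partial x_j}$. Given the gradient $\frac{\partial \mathcal{L}}{\partial a}$ that arrives from the module above, the module has to produce

$$
\frac{\partial \mathcal{L}}{\partial x} = \left(\frac{\partial a}{\partial x}\right)^{\top} \frac{\partial \mathcal{L}}{\partial a},
\qquad
\frac{\partial \mathcal{L}}{\partial \theta} = \left(\frac{\partial a}{\partial \theta}\right)^{\top} \frac{\partial \mathcal{L}}{\partial a}.
$$

The first quantity is passed on to the module below, and the second is used to update the parameters of the module itself.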

### <font color='green'>Question B.1 [+10 points]</font>

<font color='green'>
Derive and write the backpropagation equations for a softmax module.</font>


## Section 3.1 Implement a non-parametric module

In Torch it is quite easy to implement a new module.
One simply needs to open a new file and overload the following functions.
```lua
require 'nn'

local MyModule, Parent = torch.class(...)
-- Let's assume your new module is of the form a=h(x)

function MyModule:__init(...)
   ...
end

function MyModule:updateOutput(...)
   ...
end

function MyModule:updateGradInput(...)
   ...
end

function MyModule:accGradParameters(...)
   ...
end
```

It is not necessary to overload <tt>accGradParameters</tt>; you only need to do so if your module is parametric, that is, if it contains trainable parameters.
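As an illustration of the skeleton, here is a sketch of a deliberately trivial non-parametric module that computes $a = 2x$. The class name <tt>nn.MyDouble</tt> is invented for this example and is not part of the starter code. Since the module has no trainable parameters, <tt>accGradParameters</tt> is not overloaded.

```lua
require 'nn'

-- Hypothetical example module: a = 2 * x, with no trainable parameters.
local MyDouble, Parent = torch.class('nn.MyDouble', 'nn.Module')

function MyDouble:__init()
   Parent.__init(self)   -- sets up self.output and self.gradInput
end

-- Forward pass: a = 2 * x.
function MyDouble:updateOutput(input)
   self.output:resizeAs(input):copy(input):mul(2)
   return self.output
end

-- Backward pass: dL/dx = 2 * dL/da, since da/dx = 2 elementwise.
function MyDouble:updateGradInput(input, gradOutput)
   self.gradInput:resizeAs(gradOutput):copy(gradOutput):mul(2)
   return self.gradInput
end

local module = nn.MyDouble()
local x = torch.Tensor({1, -2, 3})
print(module:forward(x))
print(module:backward(x, torch.ones(3)))
```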

### <font color='green'>Question B.2 [+10 points]</font>

<font color='green'>
Implement a rectified quadratic unit (ReQU), $f(x) = \text{ReQU}(x) = \begin{cases} x^2 &\mbox{if } x \geq 0 \\ 0 & \mbox{if } x < 0 \end{cases}$.
The starter code can be found in <tt>MyModules/MyRequ.lua</tt>.
The ReQU is not actually a module that is used in neural networks; however, it serves as a good example for implementing a new module.
You need to overload the <tt>__init(...)</tt>, <tt>updateOutput(...)</tt> and <tt>updateGradInput(...)</tt>.
</font>


## Section 3.2 Gradient checks and Jacobians

One of the most error-prone parts of designing your own modules is implementing the gradient formulations.
We can remove the doubt by comparing our analytical gradients against numerical gradients, as discussed in the lectures and the lecture notes.
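The idea can be sketched in plain Lua for a scalar function, using a central difference. The function $f(x) = x^3$, the step size and the helper names are chosen only for illustration. The provided checking scripts work on whole modules and may use a different procedure.

```lua
-- Central-difference estimate of df/dx at a point x.
local function numerical_gradient(f, x, eps)
   eps = eps or 1e-5
   return (f(x + eps) - f(x - eps)) / (2 * eps)
end

-- Example function and its hand-derived gradient (chosen for illustration).
local function f(x) return x^3 end
local function analytic_gradient(x) return 3 * x^2 end

local x = 1.5
local numeric = numerical_gradient(f, x)
local analytic = analytic_gradient(x)

-- The absolute difference should be tiny if the analytic gradient is right.
print(numeric, analytic, math.abs(numeric - analytic))
```

If the hand-derived gradient had a mistake, for example a missing factor, the difference would be far above any reasonable threshold.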

### <font color='green'>Question B.3 [+5 points]</font>

<font color='green'>
Open the <tt>check_module_gradients.lua</tt> file and fill in the missing code (one line).
Check the Jacobians of your ReQU module and make sure that the implementation of your module is correct.
Report your error differences, which must be below the provided error threshold to be correct.
</font>


### <font color='green'>Question B.4 [+10 points]</font>

<font color='green'>
By now you should have verified that the implementation of your fancy new ReQU module is correct, namely the gradients are computed correctly.
The rest of the modules are provided by Torch, so we can assume they have already been checked.
However, it is good practice to generally check our full network gradients, just to make sure.
Open the <tt>check_model_gradients.lua</tt> and fill in the missing code.
</font>


### <font color='green'>Question B.5 [+5 points]</font>

<font color='green'>
Now you know the full backpropagation of your neural network is correct.
Start from your default network of the previous subsection.
Replace the $tanh(\cdot)$ module with a ReQU module.
Run your new network.
Plot again the accuracies and the losses during the training.
What do you observe? 
Does the new module improve the results?
</font>

# <font color='purple'>Section 4 For enthusiastic students [Optional, +30 points]</font>

## <font color='purple'>Section 4.1 Implement backpropagation [+15 points]</font>

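Reimplement the backpropagation for the original network composed of a linear module $\rightarrow$ a $tanh(\cdot)$ module $\rightarrow$ a linear module $\rightarrow$ a log-softmax module.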
Check your gradients.
Are they correct?
Run the same experiment as for Question A.8.
Do you get similar results?

## <font color='purple'>Section 4.2 Implement your own modules [+15 points]</font>
Implement a parametric module whose activation is $a(x)=\sin(Wx)$.
Note that since this is a parametric module, you also have to overload the function <tt>accGradParameters</tt>.
Check the Jacobians and gradients and see if this module makes your network more accurate.
Also include the gradient formulas in your report.