# Overview

<img src="https://jackhaviland.com/posts/connecting-tensors-with-convolutions/img/fully-connected-nn.svg" alt="Illustration of different architectures using neural networks" align="right" style="width: 500px; float: right;"/>

In this notebook we will implement our own neural network step by step. To do so, we need to write the code for a **fully connected layer**, together with our choices of **activation function** and **loss function**, each with the partial derivatives needed for the backward propagation.

# 1. Libraries

We will use the [NumPy](https://numpy.org/) library to implement our neural network. Note that we could also use [PyTorch](https://pytorch.org/) if we wished to take advantage of its autograd engine to compute the derivatives for us.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# 2. Fully Connected Layer

Recall that a fully connected layer is an architecture that connects every input neuron to every output neuron. During this operation, the input ${\bf X}$ is multiplied by a weight matrix ${\bf W}$, and a bias vector ${\bf B}$ is then added.

For a fully connected layer with $m$ inputs $x_i$ and $n$ outputs $y_j, \forall\, j \in \{1,\, 2,\, 3,\,\cdots,\,n\}$, of the form
$$
\begin{eqnarray}
y_1 &=& \sum_{i=1}^{m} x_iw_{i,1} + b_1\\
y_2 &=& \sum_{i=1}^{m} x_iw_{i,2} + b_2\\
y_3 &=& \sum_{i=1}^{m} x_iw_{i,3} + b_3\\
&\vdots& \\
y_n &=& \sum_{i=1}^{m} x_iw_{i,n} + b_n
\end{eqnarray}
$$

the forward pass can be expressed compactly by defining the following notation

$$
{\bf X} = \begin{bmatrix} x_1 & \cdots & x_m \end{bmatrix}
$$

$$
{\bf B} = \begin{bmatrix} b_1 & \cdots & b_n\end{bmatrix}
$$

$$ {\bf W} =
\begin{bmatrix}
w_{1,1} & \cdots & w_{1,n}\\
\vdots & \ddots & \vdots\\
w_{m,1} & \cdots & w_{m,n}\\
\end{bmatrix}
$$

Hence, the forward propagation becomes ${\bf Y} = {\bf X W} + {\bf B}$.
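As a quick sanity check, the sketch below compares the matrix form of the forward pass with the per-output sums written out explicitly. The sizes `m = 3` and `n = 2`, the random seed, and all variable names are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, used only for reproducibility
m, n = 3, 2                      # assumed sizes: 3 inputs, 2 outputs

X = rng.normal(size=(1, m))      # input as a row vector
W = rng.normal(size=(m, n))      # weight matrix
B = rng.normal(size=(1, n))      # bias as a row vector

# Matrix form of the forward pass
Y = X @ W + B

# The same result from the explicit sums y_j = sum_i x_i * w_ij + b_j
Y_sums = np.array([[sum(X[0, i] * W[i, j] for i in range(m)) + B[0, j]
                    for j in range(n)]])

print(Y.shape)                   # (1, 2)
print(np.allclose(Y, Y_sums))    # True
```

Both forms agree, and the output has one entry per output neuron.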

## 2.1 Backward propagation

We start by identifying ${\bf W}$ and ${\bf B}$ as the parameters of our layer; we therefore need the derivatives of the loss with respect to these parameters, as well as with respect to the input ${\bf X}$, which is passed back to the previous layer. To this end, we define $L$ as the loss (error) of the network, which is a scalar. What each layer receives from the layer after it is the gradient $\partial L/\partial {\bf Y}$, which has the same shape as the layer's output ${\bf Y}$. The derivatives we need are

$$
\begin{eqnarray}
\frac{\partial L}{\partial \bf X} &=& \frac{\partial L}{\partial \bf Y}{\bf W}^T \\
\frac{\partial L}{\partial \bf W} &=& {\bf X}^T \frac{\partial L}{\partial \bf Y} \\
\frac{\partial L}{\partial \bf B} &=& \frac{\partial L}{\partial \bf Y} \\
\end{eqnarray}
$$

With these equations at hand, let's implement the code for a fully connected layer.
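One possible implementation is sketched below. The class name `FullyConnected`, the method names `forward` and `backward`, the random initialisation, and the choice to apply a plain gradient descent update inside `backward` (with an assumed `learning_rate` argument) are all illustrative assumptions, not something fixed by the equations above. When several samples are stacked as rows of ${\bf X}$, the bias gradient is summed over the rows; for a single row this reduces to $\partial L/\partial {\bf Y}$.

```python
import numpy as np

class FullyConnected:
    """Sketch of a fully connected layer computing Y = X W + B."""

    def __init__(self, n_inputs, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights and zero biases (an assumed initialisation)
        self.W = rng.normal(scale=0.1, size=(n_inputs, n_outputs))
        self.B = np.zeros((1, n_outputs))

    def forward(self, X):
        self.X = X                        # cache the input for the backward pass
        return X @ self.W + self.B

    def backward(self, dL_dY, learning_rate=0.1):
        dL_dX = dL_dY @ self.W.T          # dL/dX = dL/dY W^T
        dL_dW = self.X.T @ dL_dY          # dL/dW = X^T dL/dY
        # Sum over the rows (samples); equals dL/dY when there is one row
        dL_dB = dL_dY.sum(axis=0, keepdims=True)
        # Plain gradient descent step (an assumed update rule)
        self.W -= learning_rate * dL_dW
        self.B -= learning_rate * dL_dB
        return dL_dX

# Usage: 4 samples with 3 features each, mapped to 2 outputs
layer = FullyConnected(3, 2)
X = np.ones((4, 3))
Y = layer.forward(X)
dL_dX = layer.backward(np.ones_like(Y))
print(Y.shape, dL_dX.shape)               # (4, 2) (4, 3)
```

Note that `dL_dX` is computed before the weights are updated, so the gradient passed to the previous layer uses the same weights as the forward pass.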

# 3. Activation Function

The forward propagation for an activation layer consists of applying a function $f$ element-wise to a given input ${\bf X}$, that is, ${\bf Y} = f({\bf X})$. Note that this operation preserves the dimensions of the matrix ${\bf X}$.

## 3.1 Backward propagation

The gradient of the loss $L$ with respect to the input, $\partial L/\partial {\bf X}$, is the element-wise product of $\partial L/\partial {\bf Y}$ and the derivative of the activation function, $f'({\bf X})$. We can express this operation through the following expression

$$
\frac{\partial L}{\partial \bf X} = \frac{\partial L}{\partial \bf Y} \odot f'({\bf X})
$$

Let's now implement the code for the forward and backward propagation for an activation function.
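A minimal sketch follows, using the hyperbolic tangent as an assumed choice of activation function. The class name `Activation` and the helper names `tanh` and `tanh_prime` are hypothetical; the unused `learning_rate` argument is there only so that the layer can share an interface with layers that do have parameters.

```python
import numpy as np

class Activation:
    """Sketch of an activation layer applying f element-wise."""

    def __init__(self, f, f_prime):
        self.f = f                  # activation function
        self.f_prime = f_prime      # its derivative

    def forward(self, X):
        self.X = X                  # cache the input for the backward pass
        return self.f(X)

    def backward(self, dL_dY, learning_rate=None):
        # dL/dX = dL/dY (element-wise product) f'(X); no parameters to update
        return dL_dY * self.f_prime(self.X)

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2

# Usage
act = Activation(tanh, tanh_prime)
X = np.array([[0.0, 1.0, -2.0]])
Y = act.forward(X)
dL_dX = act.backward(np.ones_like(Y))
print(Y.shape == X.shape)           # True: the shape is preserved
print(dL_dX[0, 0])                  # 1.0, since tanh'(0) = 1
```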

# 4. Loss Function
The loss function measures the error of the predictions with respect to the reference values. We have already seen several loss functions that are commonly used in artificial intelligence. Here, for the sake of simplicity, we will focus on the mean squared error (MSE), which is expressed as

$$
L = \frac{1}{n} \sum{(\hat{\bf Y} - {\bf Y})^2}
$$

where $\hat{\bf Y}$ and ${\bf Y}$ are the reference and predicted values, respectively, and $n$ is the number of terms in the sum.

## 4.1 Backward propagation

As before, we need to compute the derivative of the MSE. Differentiating with respect to the predictions ${\bf Y}$, we have that

$$
\frac{\partial L}{\partial \bf Y} = \frac{2}{n}({\bf Y} - \hat{\bf Y})
$$

This is the last piece of information we need. Let's proceed to implement the code for the MSE loss.
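The sketch below is one way to write it. The class name `MSELoss` and the argument order (predictions first, references second) are assumptions chosen to resemble the PyTorch convention; `Y.size` plays the role of $n$.

```python
import numpy as np

class MSELoss:
    """Sketch of the mean squared error and its derivative."""

    def forward(self, Y, Y_ref):
        # L = (1/n) * sum((Y_ref - Y)^2)
        return np.mean((Y_ref - Y) ** 2)

    def backward(self, Y, Y_ref):
        # dL/dY = (2/n) * (Y - Y_ref)
        return 2.0 * (Y - Y_ref) / Y.size

# Usage: only the last prediction is off, by exactly 1
criterion = MSELoss()
Y = np.array([[1.0], [2.0], [4.0]])       # predictions
Y_ref = np.array([[1.0], [2.0], [3.0]])   # references
loss = criterion.forward(Y, Y_ref)        # (0 + 0 + 1) / 3
grad = criterion.backward(Y, Y_ref)       # non-zero only for the last entry
print(loss)
print(grad)
```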

# 5. Build Neural Network

Now that we have everything we need, we can define our neural network with as many fully connected layers as we wish. Since we used a notation analogous to that of [PyTorch](https://pytorch.org/), the implementation should look familiar to you. The only difference is that we have to define the backward propagation ourselves. For that, remember that you must **start from the last layer and continue until the first one**, that is, backwards, which is where the term *backward propagation* comes from.
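A container along these lines could look like the sketch below. The class name `Network` and the assumption that every layer exposes `forward(X)` and `backward(dL_dY, learning_rate)` are illustrative choices. The `Scale` layer is a hypothetical, parameter-free stand-in used only to exercise the container.

```python
import numpy as np

class Network:
    """Sketch of a container that chains layers."""

    def __init__(self, *layers):
        self.layers = list(layers)

    def forward(self, X):
        # First layer to last layer
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def backward(self, dL_dY, learning_rate=0.1):
        # Last layer to first layer
        for layer in reversed(self.layers):
            dL_dY = layer.backward(dL_dY, learning_rate)
        return dL_dY

class Scale:
    """Hypothetical layer that multiplies its input by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, X):
        return self.factor * X

    def backward(self, dL_dY, learning_rate):
        # d(factor * X)/dX = factor; there are no parameters to update
        return self.factor * dL_dY

# Usage: two scaling layers compose to a factor of 6 in both directions
network = Network(Scale(2.0), Scale(3.0))
X = np.ones((1, 2))
print(network.forward(X))                   # [[6. 6.]]
print(network.backward(np.ones((1, 2))))    # [[6. 6.]]
```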

We can run a quick test to evaluate the performance of our neural network. A simple task is learning to map one-hot encoded features to real numbers.

In [None]:
inputs = np.eye(8, dtype=float)
labels = np.arange(start=1.0, stop=9.0, step=1.0, dtype=float).reshape(-1,1)

network   = None  # TODO: define the network
criterion = None  # TODO: define the loss function

print(f"inputs are\n{inputs}\n\nlabels are\n{labels}")

Now for the customary training loop:

In [None]:
for epoch in range(10):
    # TODO: define the training loop
    pass
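For reference, here is one way the whole exercise might fit together, written as a single self-contained cell. It declares compact versions of a fully connected layer, the MSE loss, and a network container so that it runs by itself; all class names, the single-layer architecture, the learning rate of 0.5, and the 200 epochs (more than the 10 used above) are assumptions made for this sketch.

```python
import numpy as np

class FullyConnected:
    def __init__(self, n_inputs, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_inputs, n_outputs))
        self.B = np.zeros((1, n_outputs))

    def forward(self, X):
        self.X = X
        return X @ self.W + self.B

    def backward(self, dL_dY, learning_rate):
        dL_dX = dL_dY @ self.W.T                     # uses pre-update weights
        self.W -= learning_rate * (self.X.T @ dL_dY)
        self.B -= learning_rate * dL_dY.sum(axis=0, keepdims=True)
        return dL_dX

class MSELoss:
    def forward(self, Y, Y_ref):
        return np.mean((Y_ref - Y) ** 2)

    def backward(self, Y, Y_ref):
        return 2.0 * (Y - Y_ref) / Y.size

class Network:
    def __init__(self, *layers):
        self.layers = list(layers)

    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def backward(self, dL_dY, learning_rate):
        for layer in reversed(self.layers):          # last layer first
            dL_dY = layer.backward(dL_dY, learning_rate)
        return dL_dY

inputs = np.eye(8, dtype=float)
labels = np.arange(start=1.0, stop=9.0, step=1.0, dtype=float).reshape(-1, 1)

network = Network(FullyConnected(8, 1))   # assumed architecture: one layer
criterion = MSELoss()
learning_rate = 0.5                       # assumed value

losses = []
for epoch in range(200):
    outputs = network.forward(inputs)                 # forward pass
    losses.append(criterion.forward(outputs, labels)) # record the loss
    grad = criterion.backward(outputs, labels)        # dL/dY at the output
    network.backward(grad, learning_rate)             # backward pass + update

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.2e}")
print(np.round(network.forward(inputs), 3).ravel())
```

With one-hot inputs, a single linear layer is enough to fit the labels exactly, so the loss should fall close to zero. Hidden layers and activation functions can be stacked in the same container when the task requires them.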