# Multilayer neural networks and PyTorch

[Neural networks](https://en.wikipedia.org/wiki/Artificial_neural_network) are a class of machine learning models that are inspired by the structure and function of biological neural networks. They can learn complex functions from large amounts of data. We will start our journey into neural networks with the simplest one, [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), which can be viewed as a single-layer neural network and which we studied [previously](https://pykale.github.io/transparentML/03-logistic-reg/overview.html). We will then introduce multilayer neural networks and the [PyTorch](https://pytorch.org/) library to build our neural networks.

Watch the 16-minute video below for a visual explanation of neural networks.

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/CqOfi41LfDw?start=125&end=1090" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Explaining main ideas behind neural networks, by StatQuest](https://www.youtube.com/embed/CqOfi41LfDw?start=125&end=1090)
```

## Logistic regression as a neural network

Let us consider a simple neural network to classify data points, using the logistic regression model as an example:

* Each data point has one feature/variable, so we need one input node on the input layer.
* We are not going to use any hidden layer, for simplicity.
* We have two possible output classes so the output of the network will be a single value between 0 and 1, which is the estimated probability $\pi$ for a data point to belong to class 1. Then, the probability to belong to class 0 is simply $1-\pi$. Therefore, we have one single output neuron, the only neuron in the network.

If we use the [logistic (sigmoid) function](https://en.wikipedia.org/wiki/Logistic_function) as the [activation function](https://en.wikipedia.org/wiki/Activation_function) in the output neuron, this neuron will generate a value between 0 and 1, which can be used as a classification probability.

We can represent this simple network visually in the following figure:

```{figure} https://github.com/cbernet/maldives/raw/master/images/one_neuron.png
---
height: 250px
name: one_neuron
---
Neural network with one input node and one neuron, with no hidden layer. The neuron first computes the weighted input $z = wx + b$, where $w$ is the weight and $b$ is the bias, and then uses the sigmoid function $\sigma (z) = 1/(1+e^{-z})$ as the activation function to compute the output of the neuron.
```

In the output neuron: 

* The first box performs a change of variable and computes the **weighted input** $z$ of the neuron, $z = wx + b$, where $w$ is the weight and $b$ is the bias.
* The second box applies the **activation function**, the sigmoid $\sigma (z) = 1/(1+e^{-z})$, to the weighted input $z$.
* The output of the neuron is the value of the sigmoid function, which is a value between 0 and 1.

This simple network has only two parameters, the weight $w$ and the bias $b$, both used in the first box. We see in particular that when the bias $b$ is very large, the neuron will **always be activated**, whatever the input. On the contrary, for very negative biases, the neuron is **dead**. 

We can write the output simply as a function of $x$, 

$$f(x) = \sigma(z) = \sigma(wx+b).$$

This is exactly the **logistic regression** classifier. Thus, the logistic regression model can be viewed as a single-layer neural network with a single neuron. The neuron computes a linear function of the input feature(s) and then applies the sigmoid function as the activation function. Indeed, the logistic regression model and its multi-class extension, the softmax regression model, are standard units in neural networks.
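
The single neuron above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the values of `w` and `b` are hypothetical and chosen only to show how the output behaves, including the saturation caused by very large or very negative biases.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b):
    """Single neuron: weighted input z = w*x + b, then sigmoid activation."""
    z = w * x + b
    return sigmoid(z)


# Hypothetical parameters, chosen only for illustration.
w, b = 2.0, -1.0
for x in [-2.0, 0.5, 3.0]:
    print(f"x = {x:+.1f} -> P(class 1) = {neuron(x, w, b):.3f}")

# A very large bias saturates the neuron near 1 ("always activated"),
# a very negative bias saturates it near 0 ("dead"), whatever the input.
print(neuron(0.5, w, b=20.0))
print(neuron(0.5, w, b=-20.0))
```

With these values, the input $x = 0.5$ gives $z = 0$ and therefore a probability of exactly 0.5, which is the decision boundary of this classifier.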

## Shallow vs deep learning

Shallow learning models learn their parameters directly from the features of the data {cite}`burkov2019hundred`. Most models that we studied in the previous chapters are shallow learning models, such as linear regression, logistic regression, or support vector machines. They are called shallow because they are composed of a single _layer_ of learning units, or a single learning unit. We see a shallow learning model with a single neuron in {numref}`one_neuron`.

Deep learning models are neural networks with multiple (typically more than two) hidden layers. The parameters for such deep learning models are not learned directly from the features of the data. Instead, the features are used to compute the input of the first hidden layer, whose output is then used to compute the input of the second hidden layer, and so on. The output of the last hidden layer is used as the input of the output layer. For binary classification, the output layer is typically a single neuron with a sigmoid activation function, which computes a classification probability.

## Single-layer neural networks

{numref}`single_layer_nn` shows a single-layer neural network, i.e. a network with a single hidden layer. It has two ($D=2$) input nodes/units (features), one hidden layer with four ($K=4$) neurons as the hidden nodes/units (latent features), and one output node/unit (target/label). The input layer receives the input data. The hidden layer computes activations from the input data, which can be considered as latent features. The output layer computes the output of the network from the activations provided by the hidden layer.

```{figure} https://upload.wikimedia.org/wikipedia/commons/9/99/Neural_network_example.svg
---
height: 300px
name: single_layer_nn
---
A simple neural network with a single hidden layer. The hidden layer computes activations $a_1, \cdots, a_K$ ($K=4$ here) that are nonlinear transformations of linear combinations of the input features $x_1, \cdots, x_D$ ($D=2$ here). The output layer computes the output $\hat{y}$ from the activations $a_1, \ldots, a_K$ in a similar way.
```

As illustrated in {numref}`one_neuron`, the $k$th neuron in the hidden layer computes the weighted input $z_k$ from the input data $\mathbf{x}$ using the weights $\mathbf{w}_k = [w_{k1} \; \cdots \; w_{kD}]^{\top}$ and the bias $b_k$ (written as $w_{k0}$ below), and applies the activation function $g(\cdot)$ to the weighted input $z_k$ to compute the output of the neuron as follows:

$$a_k=h_k(\mathbf{x})=g(z_k)=g\left(w_{k0}+\sum_{d=1}^Dw_{kd}x_d\right).$$

The output of the hidden layer is the vector of the outputs of the $K$ neurons in the hidden layer. The output of the network is the output of the output layer, which is computed from the output of the hidden layer, $a_1, \cdots, a_K$, in a similar way as the output of a hidden neuron is computed from the input features.
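
The forward pass described above can be sketched in plain Python for $D=2$ and $K=4$. All weights and biases below are hypothetical values picked for illustration, and the choices of ReLU for the hidden layer and sigmoid for the output layer are assumptions of this sketch rather than something fixed by the figure.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b, v, c):
    """Forward pass through one hidden layer and one output neuron.

    x: D input features; W: K rows of D weights; b: K hidden biases;
    v: K output weights; c: output bias.
    """
    # Hidden layer: a_k = g(b_k + sum_d w_kd * x_d), with g = ReLU here.
    a = [relu(b_k + sum(w_kd * x_d for w_kd, x_d in zip(w_k, x)))
         for w_k, b_k in zip(W, b)]
    # Output layer: weighted sum of the activations, squashed by a sigmoid.
    y_hat = sigmoid(c + sum(v_k * a_k for v_k, a_k in zip(v, a)))
    return a, y_hat


# Hypothetical weights for D = 2 inputs and K = 4 hidden neurons.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0], [0.0, 1.0]]
b = [0.0, -1.0, 0.5, 0.0]
v = [1.0, -2.0, 0.5, 1.0]
c = -0.5

a, y_hat = forward([1.0, 2.0], W, b, v, c)
print(a)      # the four latent features a_1, ..., a_4
print(y_hat)  # estimated probability of class 1
```

For the input $[1 \;\; 2]^{\top}$, the first hidden neuron has a negative weighted input and is switched off by ReLU, so only three of the four latent features contribute to the output.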

### System transparency

This simple neural network derives four new features from the original two features by computing four differently weighted sums of the original two features and then squashing the weighted sums with an activation function for the hidden layer. It then uses these four new features to compute the output of the network by computing their weighted sum and then squashing the weighted sum with another activation function for the output layer. As mentioned in the overview, neural networks are semi-transparent systems where we can see the transformation of the input to the output, but it is difficult or too complicated to invert the transformation to obtain the input from the output.

```{admonition} System transparency
:class: important

- For any data point $ \mathbf{x} $, we can transform it through the hidden layer to obtain four latent features $ [a_1 \;\; a_2 \;\; a_3 \;\; a_4]^{\top} $, and then transform these latent features through the output layer to obtain the output $\hat{y}$. 

- Due to the presence of the hidden layer, it is difficult and complicated to invert the transformation from the input to the output, even for single-layer neural networks.
```


### Activation functions

The activation function $g(z)$ is a function that maps the weighted input $z_k$ to the output of the neuron. It is typically a **_nonlinear_** function. Early neural networks often used the sigmoid function or the hyperbolic tangent function ($\tanh(\cdot)$) as the activation function. In modern neural networks, the rectified linear unit (ReLU) function is often used as the activation function:

$$g(z)=\mathtt{ReLU}(z)=\begin{cases}z & \text{if } z>0\\0 & \text{otherwise.}\end{cases}$$

The ReLU function is not differentiable at $z=0$, but it is computationally more efficient than the sigmoid function and the hyperbolic tangent function.

There are many other activation functions, such as the Gaussian Error Linear Unit (GELU). You can refer to the [Table of activation functions](https://en.wikipedia.org/wiki/Activation_function#Table_of_activation_functions) in Wikipedia for a list of activation functions. The following figure shows the ReLU and GELU functions.

```{figure} https://upload.wikimedia.org/wikipedia/commons/4/42/ReLU_and_GELU.svg
---
height: 300px
name: activation_functions
---
Activation functions: ReLU (left) and GELU (right).
```
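
The two functions in the figure can be evaluated with a short plain-Python sketch. The GELU below uses the exact form $z\,\Phi(z)$, where $\Phi$ is the standard normal cumulative distribution function; libraries sometimes use an approximation instead, so treat this as an illustration rather than a reference implementation.

```python
import math


def relu(z):
    """ReLU(z) = z if z > 0, else 0."""
    return z if z > 0 else 0.0


def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"z = {z:+.1f}  ReLU = {relu(z):.4f}  GELU = {gelu(z):+.4f}")
```

Both functions are close to zero for very negative inputs and close to the identity for large positive inputs. They differ mainly around $z=0$, where GELU is smooth and takes small negative values.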

The _nonlinearity_ of the activation function is important for neural networks. Without the nonlinearity, i.e., if all the activation functions of a neural network are linear, this neural network will be equivalent to a simple linear model, no matter how many hidden layers it has (since composition of linear functions is linear). The nonlinearity allows the neural network to learn more complex functions. On the other hand, the nonlinearity also makes the neural network more difficult to train (needs more data), more prone to overfitting, and more difficult to interpret.
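
The collapse of purely linear layers can be checked directly. As a sketch, consider two consecutive layers with identity activations, written in matrix form with weight matrices $\mathbf{W}_1, \mathbf{W}_2$ and bias vectors $\mathbf{b}_1, \mathbf{b}_2$ (notation introduced here only for illustration):

$$\mathbf{W}_2\left(\mathbf{W}_1\mathbf{x}+\mathbf{b}_1\right)+\mathbf{b}_2 = \left(\mathbf{W}_2\mathbf{W}_1\right)\mathbf{x} + \left(\mathbf{W}_2\mathbf{b}_1+\mathbf{b}_2\right) = \mathbf{W}'\mathbf{x}+\mathbf{b}'.$$

The result is again a single linear layer with weights $\mathbf{W}' = \mathbf{W}_2\mathbf{W}_1$ and bias $\mathbf{b}' = \mathbf{W}_2\mathbf{b}_1+\mathbf{b}_2$. Repeating the argument shows that any number of stacked linear layers reduces to one.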

## Multi-layer neural networks

The single-layer neural network in {numref}`single_layer_nn` is a special case of a multi-layer neural network. It is easier to make conceptual connections between single-layer neural networks and our previous shallow learning models. In the following sections, we will study multi-layer neural networks, which have more than one hidden layer and are often more powerful in practice than single-layer neural networks. In theory, a single-layer neural network with a sufficiently large number of hidden units can approximate any continuous function on a bounded domain to arbitrary accuracy. However, in practice, discovering a good set of weights for such a wide single-layer neural network is often more difficult than discovering a good set of weights for a multi-layer neural network.

In a multi-layer neural network, the input layer receives the input features. Each hidden layer computes its activations from the outputs of the previous layer, in the same way as the hidden layer of a single-layer neural network computes its activations from the input features. Therefore, there are multiple levels of transformations of the input features, and these multiple levels of latent features represent multiple levels of abstraction of the input features. The output layer computes the output of the network from the activations provided by the last hidden layer.

Almost all modern neural networks are multi-layer neural networks, although with various architectures. The popularity of neural networks and deep learning has been accelerated by the availability of large datasets and the development of efficient algorithms and hardware for training and inference. Moreover, the availability of open-source software libraries for neural networks and deep learning has made it easier for researchers and practitioners to use these advanced tools in their applications. PyTorch is one such open-source software library.

## PyTorch basics

[PyTorch](https://en.wikipedia.org/wiki/PyTorch) is an open-source software library for machine learning and particularly deep learning. It was originally developed by Facebook (Meta) and is available under a modified BSD license. In September 2022, Facebook (Meta) donated PyTorch to the [Linux Foundation](https://en.wikipedia.org/wiki/Linux_Foundation), making it a fully community-driven open-source project.

### PyTorch installation

You should install PyTorch and TorchVision by selecting the appropriate [installation option](https://pytorch.org/get-started/locally/) that matches your hardware and software needs. For example, if you have a GPU, you should install PyTorch with GPU support (e.g. `conda install pytorch torchvision pytorch-cuda=11.6 -c pytorch -c nvidia` or `pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116`). If you do not have a GPU, you should install the CPU-only build of PyTorch (e.g. `conda install pytorch torchvision cpuonly -c pytorch` or `pip3 install torch torchvision`). The CUDA version in these commands is only an example, so check the installation page for the commands that match your system.

In [None]:
# !pip3 install torch torchvision # uncomment to install pytorch and torchvision if you haven't already.

### Tensor

`torch.Tensor` is the multidimensional array data structure of PyTorch. You may check out the full list of [tensor types](http://pytorch.org/docs/master/tensors.html) and various [tensor operations](https://pytorch.org/docs/stable/torch.html).

If you are not familiar with PyTorch yet, you should go over the first two modules of the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) on *Tensors* and *A gentle introduction to torch.autograd* to get a basic understanding of PyTorch and tensors.
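
As a quick orientation, the sketch below creates tensors of different dimensionality and applies a few common operations. The values are arbitrary examples, and only a handful of the many available tensor operations are shown.

```python
import torch

# Tensors of different dimensionality (values are arbitrary examples).
scalar = torch.tensor(3.0)                # 0D tensor
vector = torch.tensor([1.0, 2.0, 3.0])    # 1D tensor
matrix = torch.zeros(2, 3)                # 2D tensor of zeros

for t in (scalar, vector, matrix):
    print(t.dim(), tuple(t.shape), t.dtype)

# Element-wise arithmetic and matrix-vector multiplication.
print(vector * 2 + 1)
print(matrix @ vector)       # result has shape (2,)

# unsqueeze adds a dimension of size 1: shape (3,) -> (3, 1).
print(vector.unsqueeze(1).shape)
```

The number of dimensions reported by `dim()` and the `shape` are the two properties to check first when a tensor operation fails with a size mismatch.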

### Computational Graph
A computational graph defines and visualises a sequence of operations to go from input to model output. 

Consider a linear regression model $\hat{y} = \mathbf{W}\mathbf{x} + b$, where $\mathbf{x}$ is the input, $\mathbf{W}$ is a weight matrix, $b$ is a bias, and $\hat{y}$ is the predicted output. As a computational graph, this looks like:

![Linear Regression Computation Graph](https://imgur.com/IcBhTjS.png)

PyTorch builds the computational graph dynamically, as the operations are executed, for example:

![DynamicGraph.gif](https://raw.githubusercontent.com/pytorch/pytorch/master/docs/source/_static/img/dynamic_graph.gif)
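
The dynamic graph can be observed on a tiny linear model. The sketch below uses a weight vector `w` with a single output instead of a full weight matrix, and a squared-error loss. Both choices, as well as all numerical values, are assumptions made for illustration.

```python
import torch

# Inputs and parameters; requires_grad=True asks autograd to track them.
x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y = torch.tensor(4.0)

# The graph is recorded while these operations execute.
y_hat = (w * x).sum() + b      # linear model w.x + b
loss = (y_hat - y) ** 2        # squared error

print(loss.grad_fn)            # graph node that produced `loss`

# Backward pass: traverse the recorded graph to get the gradients.
loss.backward()
print(w.grad)                  # equals 2 * (y_hat - y) * x
print(b.grad)                  # equals 2 * (y_hat - y)
```

Because the graph is rebuilt on every forward pass, ordinary Python control flow (loops, conditionals) can change the computation from one iteration to the next.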

## Linear regression using PyTorch `nn` module

In this section, we will implement a basic linear regression model using the PyTorch `nn` module. The `nn` module provides a high-level API to define neural networks. It is based on the `autograd` module to automatically compute the gradients of the model parameters. The `nn` module defines a set of `Module` classes, which you can use to build and compose neural networks. A `Module` is a building block, such as a single layer or a whole network, that can hold parameters to be optimized during training. The `nn` module also defines a set of useful loss functions that are commonly used when training neural networks.
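
To illustrate how modules compose, the sketch below assembles the single-hidden-layer network discussed earlier ($D=2$ inputs, $K=4$ hidden units, one output) from `nn` building blocks. The choice of ReLU and sigmoid activations, the seed, and the random input batch are assumptions of this example.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only so the printout is repeatable

# D = 2 input features, K = 4 hidden units, one output probability.
model = nn.Sequential(
    nn.Linear(2, 4),   # hidden layer: weighted inputs z_1..z_4
    nn.ReLU(),         # activation g
    nn.Linear(4, 1),   # output layer: weighted sum of a_1..a_4
    nn.Sigmoid(),      # squash to (0, 1)
)
print(model)

# Each Linear layer registers a weight matrix and a bias vector.
for name, p in model.named_parameters():
    print(name, tuple(p.shape))

# Forward pass on a batch of 3 hypothetical data points.
x = torch.randn(3, 2)
y_hat = model(x)
print(y_hat.shape)
```

This model has $2 \times 4 + 4 = 12$ parameters in the hidden layer and $4 + 1 = 5$ in the output layer, 17 in total, all of which are exposed through `model.parameters()`.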

Implementing the fundamental linear regression model in PyTorch will help us study PyTorch concepts closely. This part follows the [PyTorch Linear regression example](https://github.com/pytorch/examples/tree/master/regression) that trains a **single fully-connected layer** to fit a 4th degree polynomial.

First, generate the target model parameters, the weights and the bias. Both are tensors: the weights form a 2D tensor of shape `(4, 1)` (a column vector), and the bias is a 1D tensor with a single element. We set a seed (2022) for **reproducibility**.

In [None]:
import torch
import torch.nn.functional as F

torch.manual_seed(2022)  # For reproducibility

POLY_DEGREE = 4
W_target = torch.randn(POLY_DEGREE, 1) * 5
b_target = torch.randn(1) * 5

Let us inspect the weight and bias tensors.

In [None]:
print(W_target)
print(b_target)

We can see that the weight tensor is a 2D tensor of shape `(4, 1)` with four elements, and the bias tensor is a 1D tensor with one element. Both have random values.

Next, define a number of functions to generate the input (features/variables) and output (target/response). 

In [None]:
def make_features(x):
    """Builds features i.e. a matrix with columns [x, x^2, x^3, x^4]."""
    x = x.unsqueeze(1)
    return torch.cat([x**i for i in range(1, POLY_DEGREE + 1)], 1)


def f(x):
    """Approximated function."""
    return x.mm(W_target) + b_target.item()


def poly_desc(W, b):
    """Creates a string description of a polynomial."""
    result = "y = "
    for i, w in enumerate(W):
        result += "{:+.2f} x^{} ".format(w, i + 1)
    result += "{:+.2f}".format(b[0])
    return result


def get_batch(batch_size=32):
    """Builds a batch i.e. (x, f(x)) pair."""
    random = torch.randn(batch_size)
    x = make_features(random)
    y = f(x)
    return x, y

Define a simple neural network, which is a **single fully connected** (**FC**) layer. See [`torch.nn.Linear`](https://pytorch.org/docs/master/nn.html#torch.nn.Linear).

In [None]:
fc = torch.nn.Linear(W_target.size(0), 1)
print(fc)

This is a *network* with four input units, one output unit, with a bias term.
    
Now generate the data. Let us try to get five pairs of (x,y) first to inspect.

In [None]:
sample_x, sample_y = get_batch(5)
print(sample_x)
print(sample_y)

Take a look at the FC layer weights (randomly initialised).

In [None]:
print(fc.weight)

Reset the gradients to zero, perform a forward pass to get prediction, and compute the loss.

In [None]:
fc.zero_grad()
output = F.smooth_l1_loss(fc(sample_x), sample_y)
loss = output.item()
print(loss)

Not surprisingly, the loss is large: the random initialisation does not give a good prediction. Let us run backpropagation and update the model parameters using the gradients.

In [None]:
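output.backward()  # backpropagation: fills param.grad for every parameter
with torch.no_grad():  # the update itself must not be tracked by autograd
    for param in fc.parameters():
        param -= 0.1 * param.grad  # gradient descent step, learning rate 0.1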

Check the updated weights and respective loss.

In [None]:
print(fc.weight)
output = F.smooth_l1_loss(fc(sample_x), sample_y)
loss = output.item()
print(loss)

We can see the loss is reduced and the weights are updated. 
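
The in-place update in the cell above is one step of plain gradient descent. Writing $\theta$ for any model parameter (a weight or the bias), $L$ for the loss on the current batch, and $\eta$ for the learning rate (symbols introduced here only for illustration), the update reads:

$$\theta \leftarrow \theta - \eta \frac{\partial L}{\partial \theta}, \qquad \eta = 0.1.$$

Calling `output.backward()` fills `param.grad` with $\partial L / \partial \theta$ for each parameter. Calling `fc.zero_grad()` resets these gradients before the next backward pass, because PyTorch accumulates gradients by default.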

Now keep feeding more data until the loss is small enough. 

In [None]:
from itertools import count

for batch_idx in count(1):
    # Get data
    batch_x, batch_y = get_batch()

    # Reset gradients
    fc.zero_grad()

    # Forward pass
    output = F.smooth_l1_loss(fc(batch_x), batch_y)
    loss = output.item()

    # Backward pass
    output.backward()

    # Apply gradients (gradient descent step, learning rate 0.1)
    with torch.no_grad():
        for param in fc.parameters():
            param -= 0.1 * param.grad

    # Stop criterion
    if loss < 1e-3:
        break

Examine the results.

In [None]:
print("Loss: {:.6f} after {} batches".format(loss, batch_idx))
print("==> Learned function:\t" + poly_desc(fc.weight.view(-1), fc.bias))
print("==> Actual function:\t" + poly_desc(W_target.view(-1), b_target))

We can see the loss is small and the weights are close to the true values.
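
In practice, the manual parameter update is usually delegated to an optimizer from `torch.optim`. The sketch below shows the same training pattern with `torch.optim.SGD` on a deliberately simpler, hypothetical problem (fitting $y = 2x + 1$ with a mean squared error loss and a fixed number of steps), so it illustrates the pattern rather than reproducing the polynomial example above.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed for repeatability

# Hypothetical toy data: y = 2x + 1 on 50 evenly spaced points.
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2 * x + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    optimizer.zero_grad()          # reset gradients
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # backward pass
    optimizer.step()               # param <- param - lr * grad

print(f"loss = {loss.item():.2e}")
print(f"w = {model.weight.item():.3f}, b = {model.bias.item():.3f}")
```

Using an optimizer object keeps the training loop unchanged when switching to other update rules, since only the line that constructs `optimizer` needs to be replaced.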

## Exercises


