Hopefully you enjoyed the Intro to Machine Learning. Let's dive into **neural networks!**


# What is Deep Learning? #

Okay, so if that's what machine learning is, what is **deep learning**? Deep learning is a kind of machine learning that takes inspiration from the brain. It takes simple **neurones** that communicate with each other to build a larger **neural network**. Despite each neurone being simple by itself, they combine to exhibit complex behaviour. Your brain 🧠 contains ~86 billion neurones, and you can do lots of complex things!

If you want to understand why neural networks are so powerful from a mathematical perspective, you can read about the *Universal Approximation Theorem* [here](https://towardsdatascience.com/neural-networks-and-the-universal-approximation-theorem-8a389a33d30a) or watch a short visualisation [here](https://www.youtube.com/watch?v=Ln8pV1AXAgQ).

Most of the hype around artificial intelligence in recent years has been about **deep learning**. Translation, summarisation, chatbots, and image generation and recognition are just some of the tasks where deep learning models have neared or even exceeded human-level performance.


# A Single Neurone ⚪

A single neurone is just a straight line function. It has two **parameters**: its **weight, w,** and its **bias, b**. The input is multiplied by the weight, and added to the bias, and the result is passed forwards. 

<figure style="padding: 1em;">
<img src="https://storage.googleapis.com/kaggle-media/learn/images/mfOlDR6.png" width="250" alt="Diagram of a linear unit.">

</figure>

Let's see how to make one in PyTorch 🔨


In [28]:
# Let's import torch as well as the nn module
import torch
from torch import nn

# A layer of neurones with 1 input and 1 output (i.e. a single neuron)
layer = nn.Linear(1, 1)

# Let's print the neuron and its parameters
print(layer)
print(layer.weight)
print(layer.bias)

Linear(in_features=1, out_features=1, bias=True)
Parameter containing:
tensor([[0.6375]], requires_grad=True)
Parameter containing:
tensor([-0.6713], requires_grad=True)


These are just random values. If you want to know how these values are decided, you can check out [the docs](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html). 

Try changing the arguments of `nn.Linear` so that the layer has more than one input and output.
**Think🤔**: What changed? What's the data type of `layer.weight`? And of `layer.bias`?

**Extension**: A bit tougher - any idea what `requires_grad` means? Why do you think it's set to `True`? Can you think of a reason you might want to set it to `False`?

Now let's run a value through the neuron by simply 'calling' the object. You could call `.forward()` directly instead, but this is worse practice because it skips any hooks registered on the module. We get the output as a *tensor*. Don't be scared - this is just a fancy name for a multi-dimensional array. We use this word like how we use the word matrix for 2D arrays, but tensors can have any number of dimensions. For a tensor containing one value, you can call `.item()` to get that value as a plain Python number.

In [45]:
# Create a tensor (fancy name for a multi-dimensional array) with a single value
x = torch.tensor([1.0])

# Let's apply the neuron to the tensor by calling the layer
y = layer(x)

# Let's print the result
print(y)

tensor([-0.0338], grad_fn=<ViewBackward0>)
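
You can check that output by hand. The sketch below copies the weight and bias printed in this notebook's run (your random values will differ) and applies the straight line formula directly, with no PyTorch involved:

```python
# Values copied from the printed output above; yours will be different
w = 0.6375   # weight
b = -0.6713  # bias
x = 1.0      # the input we passed to the layer

# A single neurone: multiply the input by the weight, then add the bias
y = w * x + b

print(round(y, 4))
```

The result matches the tensor that the layer returned, up to rounding.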


# Layers of Neurones

A single layer isn't very cool. Let's add more layers.

But first! - we can't just add layers upon layers of straight lines, or else we will only ever be making straight lines. We need an **activation function** to add some **non-linearity** to our model. We use `nn.ReLU` as it is the most common, but it is not the only choice.

**Think🤔**: Can you think of other ways you might add non-linearity to your model (without changing the neuron)?

*Hint: max(0, wx + b) is not the only function that makes straight lines not straight anymore.*

In [25]:
# We stack layers with torch.nn.Sequential
# We can also add activation functions
model = nn.Sequential(
    nn.Linear(1, 3),
    nn.ReLU(),
    nn.Linear(3, 1)
)

# Let's print the model
print(model)

Sequential(
  (0): Linear(in_features=1, out_features=3, bias=True)
  (1): ReLU()
  (2): Linear(in_features=3, out_features=1, bias=True)
)


**Extension**: Can you draw the model above? How many 'layers' are there really? How many parameters (weights and biases)? So how many 'neurones' are there? 
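
To see why the activation function matters, here is a small sketch (not from the lecture) that stacks two `nn.Linear` layers *without* a ReLU in between and checks that the result is still one straight line. The names `first`, `second`, `combined_weight` and `combined_bias` are made up for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Two linear layers stacked with NO activation function in between
first = nn.Linear(1, 3)
second = nn.Linear(3, 1)
stacked = nn.Sequential(first, second)

with torch.no_grad():
    # Collapse both layers into a single line y = W x + b, where
    # W = W2 @ W1 and b = W2 @ b1 + b2
    combined_weight = second.weight @ first.weight             # shape (1, 1)
    combined_bias = second.weight @ first.bias + second.bias   # shape (1,)

    # Compare the two on a handful of inputs
    x = torch.linspace(-2.0, 2.0, 5).unsqueeze(1)              # shape (5, 1)
    stacked_out = stacked(x)
    single_line_out = x @ combined_weight.T + combined_bias

same = torch.allclose(stacked_out, single_line_out, atol=1e-5)
print(same)
```

Putting `nn.ReLU()` between the two layers breaks this equivalence, which is exactly what lets the model bend.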

# Training A Neural Network 🎓

### Loss Function 📉

To train a neural network, you need 2 things: a **loss function** and an **optimiser**.

A loss function decides how good a job your model did. The lower it is the better. Through training, the loss should go down, indicating that the model is getting better at the task.

For example, in the lecture we had an input of 1 and a desired output of 2. We ran the model and got an output of 0. We could score this with a loss function of:

$\text{Loss}(\text{model output}, \text{desired output}) = |\text{model output} - \text{desired output}|$

or in more mathematical language:

$\text{L}(\hat{y}, y) = |\hat{y} - y|$ 

where $y$ is the desired output and $\hat{y}$ is the model's guess.
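
As a quick sketch, here is that loss applied to the lecture example (model output 0, desired output 2). The helper name `absolute_loss` is made up for this illustration. PyTorch's built-in `nn.L1Loss` computes the same absolute difference, averaged over all the values it is given.

```python
import torch
from torch import nn

def absolute_loss(y_pred, y_true):
    # |model output - desired output|
    return torch.abs(y_pred - y_true)

y_pred = torch.tensor([0.0])  # what the model said
y_true = torch.tensor([2.0])  # what we wanted

loss = absolute_loss(y_pred, y_true)
print(loss)     # tensor([2.])

# The built-in version returns the mean over all elements (here just one)
builtin_loss = nn.L1Loss()(y_pred, y_true)
print(builtin_loss)  # tensor(2.)
```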

**Think🤔**: Why is the absolute value there?

*Hint: Imagine the absolute value wasn't there. What would happen if the model outputted 5 and you wanted it to output 10?*

For a real world example, if we wanted our model to predict the next word, we could define a loss function that scores 0 if the next word is correct and 1 if it is incorrect. Minimising this would mean that the model always outputs the correct next word - ideal! This is not conceptually too far from what is actually done in practice.

### Optimiser 🏋️‍♀️

An optimiser decides how to change the parameters of the model to do a better job next time. After the loss has been calculated, it goes in and updates the parameters in the model so that the model gets closer to the desired output. 

It does this by going backwards through the model, looking at each parameter, and seeing whether you should increase or decrease that parameter in order to make the loss smaller. 

The optimiser doesn't just look at the direction, it also looks at how much to change each parameter by. It does this by determining how sensitive the loss is to changes in each parameter. For math nerds: it does this using derivatives and the chain rule, i.e. the gradient of the loss function w.r.t. the parameter. It then multiplies this gradient by a *learning rate*, which is generally set to somewhere between 0.01 and 0.00001, depending on the task, model and loss function. The learning rate is a *hyperparameter* of our NN - it's set by you to guide the ML algorithm to do a better job.
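
Here is a minimal sketch of one such update done by hand for a single weight (no bias, to keep it short). The starting weight of 0.5 and the learning rate of 0.1 are arbitrary choices for the illustration. This is the plain gradient descent rule, new weight = old weight - learning rate × gradient, and fancier optimisers build on it.

```python
import torch

# One weight that we want to learn; requires_grad tells PyTorch to track it
w = torch.tensor([0.5], requires_grad=True)
x = torch.tensor([1.0])       # input
y_true = torch.tensor([2.0])  # desired output
learning_rate = 0.1

# Forward pass and loss: |w*x - y_true|
loss = torch.abs(w * x - y_true).sum()

# Go backwards through the computation to get d(loss)/d(w)
loss.backward()
gradient = w.grad.clone()
print(gradient)  # tensor([-1.]) - increasing w would decrease the loss

# Gradient descent step: move against the gradient, scaled by the learning rate
with torch.no_grad():
    w -= learning_rate * w.grad

print(w.item())  # roughly 0.6
```

The gradient is negative because the prediction (0.5) is below the target (2), so the update pushes the weight up.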

**Think🤔**: Why do you think these processes are called **gradient descent** and **backpropagation**? Why is it called a **learning rate**?

**Think🤔**: What do you think changing the learning rate does? What happens if it's too small? Or too big?

*[Hint: Check out this picture](https://media.licdn.com/dms/image/D5612AQHEVVxj-OS1og/article-cover_image-shrink_720_1280/0/1695927263310?e=2147483647&v=beta&t=XHFLMNaRVcMTx_EG8twpMJeZNf5dgINmbXmYLzBa49U)*

**Extension**: How might you test what a good learning rate is for your specific task? How would you know? 

## Practical Side 🔨

Create:
- a NN
- a loss function
- an optimiser, here you choose the *learning rate*

Train the model:
- Reset the optimiser with `optimiser.zero_grad()`
- Run some data through the model. 
- Calculate the loss
- Backpropagate and calculate gradients with `loss.backward()`
- Update model parameters with `optimiser.step()`
- Repeat

Done!

Some lingo: passing all of your data through the model once is called one **epoch**. For some kinds of data, one epoch is enough. For other kinds of data, you will need many epochs (10s if not 100s!).

Let's try it in a simple example:

In [452]:
# Import the optimisers module
import torch.optim

# Create a single neuron
neuron = nn.Linear(1, 1)

# Create a simple custom loss function
def custom_loss(y_pred, y_true):
    return torch.abs(y_pred - y_true)

# Create an optimiser, with a learning rate ('lr') of 0.1
# We pass in the parameters of the neuron
# SGD stands for Stochastic Gradient Descent.
optimiser = torch.optim.SGD(neuron.parameters(), lr=0.1)

Now that we've got a model (a single neuron), a loss function and an optimiser, we can train the model. You can run the following cell repeatedly.

In [473]:
# Reset the gradients
optimiser.zero_grad()

# Create an input tensor with one value
x = torch.tensor([1.0])

# Create a target tensor with one value (the output we want)
y_true = torch.tensor([2.0])

# Print the weight and bias and prediction before training
print('Before training:')
print(f'y = {round(neuron.weight.item(), 2)}x + {round(neuron.bias.item(), 2)}')
# Compute the prediction by calling the neuron
y_pred = neuron(x)

print(f'Prediction: {round(y_pred.item(), 2)}')

# Compute the loss
loss = custom_loss(y_pred, y_true)

# Compute the gradients
loss.backward()

# Update the weights
optimiser.step()

# Print the updated weight and bias after training
print('After training:')
print(f'y = {round(neuron.weight.item(), 2)}x + {round(neuron.bias.item(), 2)}')

# Print the updated prediction
y_pred = neuron(x)

print(f'Prediction: {round(y_pred.item(), 2)}')

Before training:
y = 0.6x + 1.21
Prediction: 1.81
After training:
y = 0.7x + 1.31
Prediction: 2.01


**Think🤔**: Can you see what's going on here? Try setting different learning rates and seeing what happens.

**Extension**: Try changing the data between training runs to make the model fit the straight line $y=10x + 5$.

Running the cell by hand over and over is a bit tiresome. Generally, we loop over our dataset in a for loop, as you'll see below.

A slightly more advanced detail: we also *batch* our inputs. This means running several values through the model at once. We then *accumulate* the gradients, and update the parameters for those data points all at once. This is mostly for efficiency's sake, which is also the reason we don't generally run the entire dataset through the model at once.
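
Here is a rough sketch of what such a batched loop can look like. Everything specific in it is an assumption made for the illustration: the toy data follows the line y = 2x + 1, the batch size is 16, the learning rate is 0.05 and we train for 200 epochs. It uses PyTorch's `DataLoader` to split the data into batches and `nn.L1Loss`, the built-in absolute-difference loss averaged over the batch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Toy dataset: 64 points on the line y = 2x + 1 (made up for this sketch)
inputs = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)  # shape (64, 1)
targets = 2.0 * inputs + 1.0

# The DataLoader hands us the data in shuffled batches of 16
loader = DataLoader(TensorDataset(inputs, targets), batch_size=16, shuffle=True)

neuron = nn.Linear(1, 1)
loss_fn = nn.L1Loss()
optimiser = torch.optim.SGD(neuron.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(200):                   # one epoch = one pass over all data
    total = 0.0
    for x_batch, y_batch in loader:        # one step per batch
        optimiser.zero_grad()              # reset the gradients
        y_pred = neuron(x_batch)           # run the batch through the model
        loss = loss_fn(y_pred, y_batch)    # average loss over the batch
        loss.backward()                    # backpropagate
        optimiser.step()                   # update the parameters
        total += loss.item()
    epoch_losses.append(total / len(loader))

print(f'First epoch loss: {epoch_losses[0]:.3f}, last epoch loss: {epoch_losses[-1]:.3f}')
print(f'y = {neuron.weight.item():.2f}x + {neuron.bias.item():.2f}')
```

The steps inside the inner loop are the same five steps as in the single-value cell, just applied to 16 values at a time.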

# Your First Neural Network! 🔥

You're now ready for a real problem. We're going to tackle the Hello World of neural networks: the MNIST dataset. This is a dataset of handwritten digits, and the model should take in an image and predict which digit it shows.

So let's follow the plan. We'll start with a model, loss function and optimiser:

In [None]:
# TODO: Implement MNIST classification with a neural network

# This is an 'Adam' optimiser, which is a variant of stochastic gradient descent.
# If you want to know more about it, check out the paper: https://arxiv.org/abs/1412.6980,
# but it's basically a fancy version of gradient descent.
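
If you want a starting point for the TODO, here is one possible sketch of the model, loss function and optimiser. It is not the reference solution. The hidden layer size of 128 and the learning rate of 0.001 are arbitrary choices, and the batch is random noise standing in for real images so that the cell runs without downloading anything. Loading the actual dataset (for example with `torchvision.datasets.MNIST`) is left out.

```python
import torch
from torch import nn

# MNIST images are 28x28 greyscale pixels and there are 10 digit classes
model = nn.Sequential(
    nn.Flatten(),             # (batch, 1, 28, 28) -> (batch, 784)
    nn.Linear(28 * 28, 128),  # hidden size 128 is an arbitrary choice
    nn.ReLU(),
    nn.Linear(128, 10),       # one output score per digit
)

# Cross-entropy is the usual loss for picking one class out of several
loss_fn = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in batch: 8 random 'images' and 8 random labels (NOT real MNIST data)
images = torch.rand(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))

# One training step, following the same plan as before
optimiser.zero_grad()
scores = model(images)
loss = loss_fn(scores, labels)
loss.backward()
optimiser.step()

print(scores.shape)
print(loss.item())
```

With real data, the training step would sit inside a loop over batches and epochs, and the printed loss should go down over time.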