# Deep Learning with PyTorch III (training neural networks)
Previous parts showed how to build and represent a neural network, i.e. a multilayer perceptron. This part covers the actual training of that model. Training, or fitting the model to the data, consists of adjusting the weights so that the model captures the structure of the underlying data-generating process and is able to predict the outcome.

- Part 1
- Part 2

## 1. Introduction and Theory
The network from the previous part was untrained and therefore could not predict anything useful. Neural networks with non-linear activations work as function approximators that map the input to the output.

<img src="images/mapping.png">

To train the network, we usually start with random weights. In each step we assess the performance of the model with the current weights and then adjust the weights.

#### Loss function
This is done by calculating a loss using a **loss function** (also called cost function, [link](https://stats.stackexchange.com/questions/179026/objective-function-cost-function-loss-function-are-they-the-same-thing)), which is a measure of our prediction error. For example, the mean squared (MS) loss is often used in regression and binary classification problems. 

$$
\begin{align}
\large Loss_{MS} =  \ell = \frac{1}{2n}\sum_i^n{\left(y_i - \hat{y}_i\right)^2}
\end{align}
$$
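
As a small illustration, the formula above can be evaluated directly on tensors. The values of `y` and `y_hat` below are made up for the example. Note that PyTorch's built-in `nn.MSELoss` averages with $\frac{1}{n}$ rather than $\frac{1}{2n}$, so it returns twice the value of the convention used here.

```python
import torch
from torch import nn

# Made-up targets and predictions, purely for illustration
y = torch.tensor([1.0, 0.0, 1.0, 1.0])
y_hat = torch.tensor([0.5, 0.0, 1.0, 0.0])

# Mean squared loss with the 1/(2n) convention from the formula above
n = y.numel()
loss_ms = ((y - y_hat) ** 2).sum() / (2 * n)

# PyTorch's nn.MSELoss uses 1/n, i.e. twice the value above
loss_torch = nn.MSELoss()(y_hat, y)

print(loss_ms.item(), loss_torch.item())
```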

#### Gradient Descent: Minimize Loss
Our goal is then to minimize this loss iteratively with respect to the network parameters (the weights). We find this minimum using a process called [gradient descent (GD)](https://en.wikipedia.org/wiki/Gradient_descent). 

The gradient is the slope of the loss function and points in the direction of its steepest ascent. To get to the minimum as quickly as possible, we therefore move in the opposite direction, i.e. along the negative gradient. A widely used analogy is to imagine standing on a mountain, where the negative gradient points in the direction of steepest descent. We take steps in this direction to find a way down into the valley.

#### Backpropagation: Adjust weights of the network
Training multilayer networks is done through backpropagation, which is a repeated application of the chain rule. 
<img src="images/backward_pass.png">
3Blue1Brown provides a series of nice visualizations on neural networks, also covering [backpropagation](https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4&t=0s). In backpropagation we first pass the input forward through the layers to compute the loss (left side of the image). Afterwards, we adjust the weights and biases to reduce the loss. To adjust the weights with gradient descent, we propagate the gradient of the loss backwards through the network by repeatedly applying the chain rule: 

$$
\large \frac{\partial \ell}{\partial W_1} = \frac{\partial L_1}{\partial W_1} \frac{\partial S}{\partial L_1} \frac{\partial L_2}{\partial S} \frac{\partial \ell}{\partial L_2}
$$
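
The four factors of this chain rule can be checked numerically. The sketch below assumes a hypothetical scalar network in which $L_1 = W_1 x$ is a linear layer, $S$ is a sigmoid, $L_2 = W_2 S$ is a second linear layer and $\ell$ is the squared error. These concrete choices and all values are assumptions made for illustration and are not taken from the figure.

```python
import torch

# Hypothetical scalar network: x -> L1 = W1*x -> S = sigmoid(L1) -> L2 = W2*S -> loss
x, y = torch.tensor(1.5), torch.tensor(0.5)
W1 = torch.tensor(0.8, requires_grad=True)
W2 = torch.tensor(-1.2, requires_grad=True)

# Forward pass
L1 = W1 * x
S = torch.sigmoid(L1)
L2 = W2 * S
loss = 0.5 * (y - L2) ** 2

# Backward pass via autograd
loss.backward()

# The same gradient assembled by hand from the four chain-rule factors
with torch.no_grad():
    dL1_dW1 = x
    dS_dL1 = S * (1 - S)        # derivative of the sigmoid
    dL2_dS = W2
    dloss_dL2 = -(y - L2)
    manual_grad = dL1_dW1 * dS_dL1 * dL2_dS * dloss_dL2

print(W1.grad.item(), manual_grad.item())
```

Both printed numbers should agree up to floating-point rounding.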

We update the weights using this gradient, scaled by a learning rate $\alpha$, so that we take only a small step in the direction of steepest descent. 

$$
\large W^\prime_1 = W_1 - \alpha \frac{\partial \ell}{\partial W_1}
$$

The learning rate $\alpha$ is set such that the weight update steps are small enough that the iterative method settles in a minimum.
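
To make the role of $\alpha$ concrete, here is a minimal sketch of the update rule on a toy one-parameter loss $(w - 3)^2$. The loss, the starting point and both learning rates are assumptions chosen for illustration.

```python
def gradient_descent(alpha, steps=50, w=0.0):
    """Minimize the toy loss (w - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (w - 3)       # derivative of (w - 3)^2
        w = w - alpha * grad     # update rule W' = W - alpha * gradient
    return w

w_small = gradient_descent(alpha=0.1)   # settles close to the minimum at w = 3
w_large = gradient_descent(alpha=1.1)   # overshoots more on every step and diverges

print(w_small, w_large)
```

With the small learning rate the iterate ends up near 3, while the large one moves further away from the minimum with every step.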

## 2. Loss in PyTorch
- PyTorch provides losses such as the cross-entropy loss (`nn.CrossEntropyLoss`). The loss function is usually assigned to a variable named `criterion`. 
- To actually calculate the loss, you first define the criterion and then pass in the output of your network and the correct labels. 
- Note that we need to pass the raw output of our network into the loss, not the output of the softmax function. The raw outputs are usually called scores or logits.

One important point from [the documentation for `nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss):
- This criterion combines `nn.LogSoftmax()` and `nn.NLLLoss()` in one single class.
- The input is expected to contain scores for each class.

We use the logits because softmax yields probabilities that are often very close to zero or one, and floating-point numbers cannot represent such values accurately ([read more here](https://docs.python.org/3/tutorial/floatingpoint.html)). It is usually best to avoid doing calculations with probabilities directly; typically we use log-probabilities instead.
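
The following sketch illustrates the problem with deliberately extreme, made-up scores. Taking the logarithm of softmax probabilities loses information once a probability underflows to zero, whereas computing log-probabilities directly from the scores with `F.log_softmax` stays finite.

```python
import torch
import torch.nn.functional as F

# Made-up scores with a large gap between the two classes
logits = torch.tensor([1000.0, 0.0])

# Naive route: probabilities first, then the logarithm
probs = F.softmax(logits, dim=0)     # the second entry underflows to 0
naive_log_probs = torch.log(probs)   # log(0) is -inf

# Stable route: compute log-probabilities directly from the scores
stable_log_probs = F.log_softmax(logits, dim=0)

print(naive_log_probs)
print(stable_log_probs)
```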

In [1]:
# standard imports
import numpy as np

# Torch imports
import torch
from torch import nn, optim
import torch.nn.functional as F
from torchvision import datasets, transforms

#### Network in PyTorch and Forward Pass

In [2]:
# Define a transform to normalize the data
# (MNIST images have a single channel, so mean and std are 1-tuples)
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                              ])

# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# The network, no output activation function
model = nn.Sequential(nn.Linear(784, 128), 
                      nn.ReLU(), 
                      nn.Linear(128, 64), 
                      nn.ReLU(), 
                      nn.Linear(64, 10))

# Define Loss
criterion = nn.CrossEntropyLoss()

#### Load one Batch of images
Next, 
- we'll load one batch of 64 images and their respective labels, 
- flatten the images and 
- pass them through the untrained model to obtain the scores. 

In [3]:
# One batch of 64 images
images, labels = next(iter(trainloader))
print("Shape of images:\n", images.shape, \
      "\nShape of labels:\n",labels.shape)

# Flattening the images
images = images.reshape(images.shape[0],-1)
print("Reshaped images:\n", images.shape)

# Forward Pass: Images are passed through untrained network
scores = model(images)
print("Shape of scores:\n", scores.shape)

Shape of images:
 torch.Size([64, 1, 28, 28]) 
Shape of labels:
 torch.Size([64])
Reshaped images:
 torch.Size([64, 784])
Shape of scores:
 torch.Size([64, 10])


Now we use a loss function to map these inputs to a single value that indicates the loss of this batch given the current weights. The inputs are:
- 10 scores per image (i.e. one score per class)
- 1 label of the actual class. 

We'll be using [CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html#crossentropyloss), and the inputs are the 64x10 image scores and the 64 labels. By default, the cross-entropy loss is averaged across observations. For a single observation with score vector $x$, the loss can be written as:

$$
\begin{align}
loss \,(x, \,class) &= - \log \bigg( \frac{\exp(x[class])}{\sum_j \exp(x[j])} \bigg) \\
&= -x[class] + \log \bigg( \sum_j \exp(x[j]) \bigg)
\end{align}
$$
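
As a sanity check, the second line of the formula can be evaluated by hand and compared with `nn.CrossEntropyLoss`. The scores and labels below are random stand-ins, not the MNIST batch from above.

```python
import torch
from torch import nn

torch.manual_seed(0)
scores = torch.randn(4, 10)            # random stand-in for 4 images x 10 class scores
labels = torch.tensor([3, 0, 9, 1])    # made-up class labels

# -x[class] + log(sum_j exp(x[j])), one value per image
picked = scores[torch.arange(4), labels]
per_image = -picked + torch.logsumexp(scores, dim=1)

# nn.CrossEntropyLoss averages the per-image losses by default
manual_loss = per_image.mean()
builtin_loss = nn.CrossEntropyLoss()(scores, labels)

print(manual_loss.item(), builtin_loss.item())
```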

In [4]:
# Calculate the loss
loss = criterion(scores, labels)
print("Average loss:\n",loss)

Average loss:
 tensor(2.3298, grad_fn=<NllLossBackward>)


It is often convenient to build a model with a log-softmax output using `nn.LogSoftmax` or `F.log_softmax` ([documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.LogSoftmax)). We can then get the actual probabilities by taking the exponential, `torch.exp(output)`. With a log-softmax output, we want to use the negative log-likelihood loss, `nn.NLLLoss` ([documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.NLLLoss)).
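
A short sketch on random stand-in data shows that both routes lead to the same loss: applying `nn.NLLLoss` to log-softmax output matches applying `nn.CrossEntropyLoss` to the raw logits. It also shows that exponentiating the log-softmax output yields rows that sum to one.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # random stand-in for raw network output
labels = torch.tensor([3, 0, 9, 1])    # made-up class labels

log_probs = F.log_softmax(logits, dim=1)
probs = torch.exp(log_probs)           # each row is a probability distribution

loss_nll = nn.NLLLoss()(log_probs, labels)
loss_ce = nn.CrossEntropyLoss()(logits, labels)

print(probs.sum(dim=1))
print(loss_nll.item(), loss_ce.item())
```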

The next step is to build a model that returns the log-softmax as its output, so that we can calculate the loss using the negative log-likelihood loss. 

Note that for `nn.LogSoftmax` and `F.log_softmax` we'll need to set the `dim` keyword argument appropriately. `dim=0` calculates softmax across the rows, so each column sums to 1, while `dim=1` calculates across the columns so each row sums to 1.
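
The effect of `dim` is easy to verify on a small made-up matrix by checking which sums come out as one:

```python
import torch
import torch.nn.functional as F

# Made-up 2x3 matrix: 2 samples (rows), 3 classes (columns)
x = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 1.0]])

over_rows = F.softmax(x, dim=0)   # normalizes down each column
over_cols = F.softmax(x, dim=1)   # normalizes along each row

print(over_rows.sum(dim=0))   # column sums
print(over_cols.sum(dim=1))   # row sums
```

For a batch of shape `(batch_size, num_classes)`, as used throughout this part, `dim=1` is the choice that gives one probability distribution per image.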

In [5]:
# same network with output activation
model = nn.Sequential(nn.Linear(784, 128), 
                      nn.ReLU(), 
                      nn.Linear(128, 64), 
                      nn.ReLU(), 
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

# Loss function
criterion = nn.NLLLoss()

# Data 
images, labels = next(iter(trainloader))
images = images.reshape(images.shape[0], -1)

# Forward Pass
output_scores = model(images)

# Calculate loss
loss = criterion(output_scores, labels)

# Convert the log-probabilities of the first image to probabilities
probs = torch.exp(output_scores[0]).detach().numpy()
print("Probabilities for each number.:\n")
for i in range(len(probs)):
    print( str(i) + ": " + str((probs[i]*100).round(4)) + "%")

Probabilities for each number.:

0: 9.9516%
1: 9.2876%
2: 10.9505%
3: 8.5246%
4: 11.2233%
5: 10.5715%
6: 8.5179%
7: 10.7461%
8: 10.5533%
9: 9.6734%


## 3. Gradients in PyTorch: Autograd
Recall that in order to adjust the weights of the model we want to know the direction of steepest descent. For this we need the gradient, which is the vector of partial derivatives of the function. The gradient points in the direction of the greatest increase of the objective function, so taking a step in the direction of steepest descent corresponds to taking a step along the negative gradient. 

In PyTorch, the `autograd` module automatically calculates the gradients of tensors. It works by keeping track of the operations performed on tensors, then going backwards through these operations and calculating gradients along the way.
- `requires_grad=True` must be set on a tensor for its operations to be tracked.
- This can be done at creation of a tensor with the `requires_grad` keyword, or at any time with `x.requires_grad_(True)`.

You can turn off gradients for a block of code with the `torch.no_grad()` context manager:
```python
>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
```

The gradients of some tensor `z` with respect to the tensors it was computed from are calculated with `z.backward()`. This does a backward pass through the operations that created `z`.

#### Example of Autograd

In [6]:
# Create tensor with autograd
x = torch.randn(2,2, requires_grad = True)
y = x**2
z = y.mean()

# print
print("x:\n",x)
print("\ny:\n",y)
print("\ngrad object of y:\n",y.grad_fn)
print("\nz:\n",z)
print("\ngrad of x:\n",x.grad)  # still None: backward() has not been called yet

x:
 tensor([[ 0.0314, -0.4865],
        [ 1.2349, -0.6947]], requires_grad=True)

y:
 tensor([[9.8870e-04, 2.3671e-01],
        [1.5249e+00, 4.8260e-01]], grad_fn=<PowBackward0>)

grad object of y:
 <PowBackward0 object at 0x000001D15BA55470>

z:
 tensor(0.5613, grad_fn=<MeanBackward1>)

grad of x:
 None


To calculate the gradients, you need to run the `.backward()` method on a tensor, `z` for example. This will calculate the gradient of `z` with respect to `x`. Here `z` is the mean of the $n = 4$ squared entries of `x`, so:

$$
\frac{\partial z}{\partial x} = \frac{\partial}{\partial x}\left[\frac{1}{n}\sum_i^n x_i^2\right] = \frac{2x}{n} = \frac{x}{2}
$$

In [7]:
# backward()...
# ...
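
The cell above is only a placeholder. A minimal sketch of what the backward pass and a check against the analytic result $x/2$ might look like, using a freshly created tensor rather than the `x` from the earlier cell:

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 2, requires_grad=True)
z = (x ** 2).mean()

# Backward pass: fills x.grad with dz/dx
z.backward()

print(x.grad)
print(x.detach() / 2)   # analytic gradient x/2 for n = 4 elements
```

Both printed tensors should be identical up to floating-point rounding.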

#### Back to the example: Loss and Autograd together
When we create a network with PyTorch, all of the parameters are initialized with `requires_grad=True`. This means that when we calculate the loss and call `loss.backward()`, the gradients for the parameters are calculated. 

In [8]:
# Same model
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10), 
                      nn.LogSoftmax(dim=1))

# criterion
criterion = nn.NLLLoss()
images, labels = next(iter(trainloader))
images = images.reshape(images.shape[0], -1)

output_scores = model(images) # log-probabilities (the model ends with LogSoftmax)
loss = criterion(output_scores, labels)

# Gradients of weight matrix of first layer
print("Before backward:\n",model[0].weight.grad)
loss.backward()
print("After backward():\n",model[0].weight.grad)

Before backward:
 None
After backward():
 tensor([[ 0.0052,  0.0052,  0.0052,  ...,  0.0052,  0.0052,  0.0052],
        [-0.0003, -0.0003, -0.0003,  ..., -0.0003, -0.0003, -0.0003],
        [ 0.0004,  0.0004,  0.0004,  ...,  0.0004,  0.0004,  0.0004],
        ...,
        [ 0.0025,  0.0025,  0.0025,  ...,  0.0025,  0.0025,  0.0025],
        [ 0.0001,  0.0001,  0.0001,  ...,  0.0001,  0.0001,  0.0001],
        [ 0.0002,  0.0002,  0.0002,  ...,  0.0002,  0.0002,  0.0002]])


#### Optimizer
We need one more thing before training the network: an optimizer that uses the gradients to update the weights. We get these from PyTorch's [`optim` package](https://pytorch.org/docs/stable/optim.html). For example, we can use stochastic gradient descent with `optim.SGD`.

In [9]:
# Specify optimizer
optimizer = optim.SGD(model.parameters(), lr = 0.01)

Now we have seen all the individual parts and it is time to bring everything together. At first, we'll consider one learning step before looping through all the data.

In general, the steps in PyTorch are:

- Make a forward pass through the network: `output = model(images)`
- Use the network output to calculate the loss: `loss = criterion(output, labels)`
- Perform a backward pass through the network with `loss.backward()` to calculate the gradients
- Take a step with the optimizer to update the weights: `optimizer.step()`

Below we'll go through one training step and print out the weights and gradients:

Note the line `optimizer.zero_grad()` in the code below. Whenever we perform multiple backward passes with the same parameters, the gradients are accumulated. We therefore need to zero the gradients on each training pass; otherwise we would retain gradients from previous training batches when we repeat this process in a for loop.
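
The accumulation is easy to observe on a single tensor. In this sketch the tensor `w` and its toy objective are made up, and the gradient is reset by hand with `w.grad.zero_()`; `optimizer.zero_grad()` does this reset for all parameters managed by the optimizer.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Reset the gradient by hand, then run the backward pass again
w.grad.zero_()
(w ** 2).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```

The second result is twice the first, while the third matches the first again because the stale gradient was cleared beforehand.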

We'll see how this works by taking a look at the first layer of our network. We will focus on the first five weights of the first hidden unit (row 0 of the weight matrix) and their gradients. Recall that the weights are updated using the gradient with a learning rate $\alpha$ (we'll use 0.1):

$$
\large W^\prime_1 = W_1 - \alpha \frac{\partial \ell}{\partial W_1}
$$

In [10]:
optimizer = optim.SGD(model.parameters(), lr = 0.1)

# initial weights
print("Initial weights parameters:\n",model[0].weight[0,:5])

# data
images, labels = next(iter(trainloader))
images = images.reshape(images.shape[0], -1)

# clear gradients 
optimizer.zero_grad()

# forward pass
output = model(images)

# loss
loss = criterion(output, labels)

# backward pass
loss.backward()

# gradients
print("\ngradients:\n", model[0].weight.grad[0, :5])
print("\ngradients*learnrate:\n",model[0].weight.grad[0, :5]*0.1)

# take a step towards negative gradient
optimizer.step()

# new weights: w' = w - learnrate * gradient
print("\nNew weights parameters:\n", model[0].weight[0,:5])

Initial weights parameters:
 tensor([-0.0345, -0.0270, -0.0218, -0.0307, -0.0291], grad_fn=<SliceBackward>)

gradients:
 tensor([5.8640e-05, 5.8640e-05, 5.8640e-05, 5.8640e-05, 5.8640e-05])

gradients*learnrate:
 tensor([5.8640e-06, 5.8640e-06, 5.8640e-06, 5.8640e-06, 5.8640e-06])

New weights parameters:
 tensor([-0.0345, -0.0270, -0.0218, -0.0308, -0.0291], grad_fn=<SliceBackward>)
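
Because the printed tensors are rounded to four decimals, the change in the weights is barely visible in the output above. The following sketch checks the update rule numerically on a tiny stand-in model with random data. The layer size, inputs and targets are assumptions for illustration and are unrelated to the MNIST network.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)                       # tiny stand-in model
opt = optim.SGD(layer.parameters(), lr=0.1)   # plain SGD, no momentum

x = torch.randn(8, 3)                         # made-up inputs
y = torch.randn(8, 1)                         # made-up targets

opt.zero_grad()
loss = nn.MSELoss()(layer(x), y)
loss.backward()

w_before = layer.weight.detach().clone()
grad = layer.weight.grad.clone()
opt.step()
w_after = layer.weight.detach().clone()

# Largest deviation from the update rule W' = W - lr * gradient
print((w_after - (w_before - 0.1 * grad)).abs().max().item())
```

The printed deviation should be zero or on the order of floating-point rounding error.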


## Next
In the example above we have looked at only a small part of the network. Note that the update happens for all parameters of the model simultaneously.

The next step is to repeat this process iteratively in a for loop until we have found suitable parameters for our network. 