# Part 3: Initializing a PyTorch neural network model and training it

In this tutorial, we'll look at the structure of a neural network and train it. We'll also do some activities to see how long it takes to train it and see what *overtraining* looks like.

Below, you'll see some familiar code from part 1: all we are doing is loading the data that we downloaded previously. I could hide this code block in another file, but it is shown here so you can make sure that `dataset` and `testset` point to the same locations you used last time.

In [1]:
# import libraries and some functions that are written for you 
from handwrite_functions import *

# where you want to save your dataset
dataset = 'MY_DATASET'
testset = 'MY_TESTSET'

transform = transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,), (0.5,)),
                              ])

# load the training and test sets (already downloaded in part 1, hence download=False)
trainset = datasets.MNIST(dataset, download=False, train=True, transform=transform)
valset = datasets.MNIST(testset, download=False, train=False, transform=transform)

# load training sets and test sets in batch sizes of 64
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
valloader = torch.utils.data.DataLoader(valset, batch_size=64, shuffle=True)

# prepare loaded data sets to iterate over
dataiter = iter(trainloader)
images, labels = next(dataiter)

Did you successfully import the datasets? How do you know?

In [15]:
trainset, valset

(Dataset MNIST
     Number of datapoints: 60000
     Root location: MY_DATASET
     Split: Train
     StandardTransform
 Transform: Compose(
                ToTensor()
                Normalize(mean=(0.5,), std=(0.5,))
            ),
 Dataset MNIST
     Number of datapoints: 10000
     Root location: MY_TESTSET
     Split: Test
     StandardTransform
 Transform: Compose(
                ToTensor()
                Normalize(mean=(0.5,), std=(0.5,))
            ))
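
The `Normalize(mean=(0.5,), std=(0.5,))` line in the printout can be checked by hand. `ToTensor()` scales pixel values to the range 0 to 1, and `Normalize` then computes `(value - mean) / std`. The sketch below redoes that arithmetic in plain Python (the helper name `normalize_pixel` is made up for illustration and is not part of torchvision):

```python
# Plain-Python version of what Normalize((0.5,), (0.5,)) does to one pixel.
# `normalize_pixel` is a hypothetical helper, used only for illustration.
def normalize_pixel(value, mean=0.5, std=0.5):
    return (value - mean) / std

darkest = normalize_pixel(0.0)    # smallest value ToTensor() can produce
middle = normalize_pixel(0.5)     # a mid-gray pixel
brightest = normalize_pixel(1.0)  # largest value ToTensor() can produce

print(darkest, middle, brightest)
```

So after the transform, every pixel value lies between -1 and 1.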

## Initializing a model

We're going to start by initializing a model - we'll tell Python that we'd like to make a model that is structured like this: 

---
$$\text{784 input neurons to 128 neurons} \rightarrow \text{applies rectifier activation function } \rightarrow \text{128 neurons to 64 neurons} \rightarrow \text{applies rectifier activation function} \rightarrow \text{64 neurons to 10 output neurons}$$ 
---

This is shown in lines 9-14 below. Does this structure match what gets printed below?

In [293]:
# look at structure of neural network
# play with these values to make the NN better! 

# input number of neurons, two "hidden layers", output neuron size
input_size = 784
hidden_sizes = [128, 64]
output_size = 10

model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[1], output_size),
                      nn.LogSoftmax(dim=1))
print("(0): Neurons")

print(model)

(0): Neurons
Sequential(
  (0): Linear(in_features=784, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=64, bias=True)
  (3): ReLU()
  (4): Linear(in_features=64, out_features=10, bias=True)
  (5): LogSoftmax(dim=1)
)
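
One way to check the printed structure is to count the weights and biases it implies. Each `Linear` layer has `in_features * out_features` weights plus `out_features` biases. The sketch below does that count in plain Python; `count_parameters` is a made-up helper name, and the sizes are copied from the cell above:

```python
# Count the weights and biases implied by a chain of Linear layers.
# `count_parameters` is a hypothetical helper for illustration only.
def count_parameters(layer_sizes):
    total = 0
    # pair each layer size with the next one: (784, 128), (128, 64), (64, 10)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-output connection
        total += n_out         # one bias per output neuron
    return total

input_size = 784
hidden_sizes = [128, 64]
output_size = 10

n_params = count_parameters([input_size] + hidden_sizes + [output_size])
print(n_params)  # 109386
```

With the model from the cell above, `sum(p.numel() for p in model.parameters())` should give the same number, since `ReLU` and `LogSoftmax` have no weights of their own.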


In the last tutorial, our mini neural network was structured more like this: 

---
$$\text{4 input neurons} \rightarrow \text{sigmoid activation function } \rightarrow \text{1 output}$$
---

Using hidden layers lets us apply an activation function several times along the chain, so the same input data passes through more than one transformation before it reaches the output.

## The forward pass and the "negative log-likelihood loss"

At the end of the steps listed above in the model's structure, there is the `LogSoftmax(dim=1)` piece. It turns the 10 raw outputs into log-probabilities, one per digit, and it is the last step of something that we also did in the last tutorial: the "forward propagation" (or "forward pass").

In the last tutorial, we took the derivative of our activation function to minimize the difference between predicted value and truth value. Then, we added these weights to each neuron to get closer to the "best" solution.

$$\text{minimize} = \frac{d}{dx}\left[\sum_{n=1}^{4}(f(\text{inputs}) - y)^{2}\right]$$

PyTorch does something like this, but it works with the logarithms of the predicted probabilities to save computing effort. Think back to algebra when we added and multiplied logs:

$$\log{(a)} + \log{(b)} = \log{(ab)}$$

Is it easier computationally to multiply or add? Because our neural network in the last tutorial was so simple, there was no need to worry about memory.
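
Besides being cheaper, adding logs is also safer for the computer. As a rough illustration (the numbers here are invented, not taken from MNIST), multiplying many small probabilities together eventually rounds to zero, while adding their logs stays a perfectly ordinary number:

```python
import math

# 1000 made-up probabilities, each equal to 0.01
probabilities = [0.01] * 1000

# multiplying them underflows: the true value (1e-2000) is too small for a float
product = 1.0
for p in probabilities:
    product *= p

# adding their logs gives log(product) without ever forming the tiny number
log_sum = sum(math.log(p) for p in probabilities)

print(product)
print(log_sum)
```

The product prints as `0.0`, which would make any later comparison between models meaningless, while the sum of logs is still usable.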

In [14]:
# this cell is complicated but essentially... 
# maximize the likelihood of observing the data by minimizing negative log-likelihood 
# we'll let pytorch do this for us! 

criterion = nn.NLLLoss()
images, labels = next(iter(trainloader))
images = images.view(images.shape[0], -1)

logps = model(images) #log probabilities
loss = criterion(logps, labels) #calculate the NLL loss
logps.shape, loss

(torch.Size([64, 10]), tensor(0.0573, grad_fn=<NllLossBackward0>))
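
To see what `LogSoftmax` and `NLLLoss` compute, here is a sketch of the same arithmetic in plain Python for a single image. The raw scores are invented, and `log_softmax` and `nll_loss` are hand-written stand-ins, not the PyTorch functions. When all scores are equal, the network is guessing uniformly over the 10 digits, and the loss is $\ln(10) \approx 2.30$:

```python
import math

def log_softmax(scores):
    # subtract the max first so exp() cannot overflow
    biggest = max(scores)
    log_total = biggest + math.log(sum(math.exp(s - biggest) for s in scores))
    return [s - log_total for s in scores]

def nll_loss(log_probs, label):
    # negative log-probability assigned to the correct class
    return -log_probs[label]

# made-up raw outputs for one image: equal scores for all 10 digits
scores = [0.0] * 10
label = 3  # made-up "true" digit

log_probs = log_softmax(scores)
loss = nll_loss(log_probs, label)
print(loss)

# a confident, correct guess gives a much smaller loss
confident_scores = [0.0] * 10
confident_scores[label] = 10.0
confident_loss = nll_loss(log_softmax(confident_scores), label)
print(confident_loss)
```

`nn.NLLLoss()` does this for every image in the batch and, by default, averages the results.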

## Backwards passes

How much error does each weight contribute? PyTorch's `loss.backward()` function starts at the output and propagates the error backwards (output -> hidden layers -> input) to see which pathways contribute the most error to the final guesses. We'll calculate this here to see what it looks like, then use the function later in training.

The printed output below shows us the "changes in weights" (the *gradients*) for each pathway between the 784 inputs and the 128 neurons of the first layer. After this backwards pass, the NN has calculated which pathways contribute the most error and which contribute the least. When we start training the model, we will use an "optimizer," which will adjust each weight in the direction that reduces the error.

We didn't do this in our last neural network tutorial!

In [6]:
print("Before backward pass: \n", model[0].weight.grad)
loss.backward()
print("After backward pass: \n", model[0].weight.grad)
print("Shape of the data: \n", model[0].weight.grad.shape)

Before backward pass: 
 None
After backward pass: 
 tensor([[ 0.0002,  0.0002,  0.0002,  ...,  0.0002,  0.0002,  0.0002],
        [-0.0001, -0.0001, -0.0001,  ..., -0.0001, -0.0001, -0.0001],
        [ 0.0022,  0.0022,  0.0022,  ...,  0.0022,  0.0022,  0.0022],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0025,  0.0025,  0.0025,  ...,  0.0025,  0.0025,  0.0025],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])
Shape of the data: 
 torch.Size([128, 784])
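
The same before/after pattern is easier to follow on a network with a single weight. This is a minimal sketch with invented numbers (`w`, `x`, and `target` are not from the MNIST model): the loss is $(wx - \text{target})^2$, so calculus gives a gradient of $2(wx - \text{target})x$, and `backward()` should agree.

```python
import torch

# one made-up weight that PyTorch should track gradients for
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)        # made-up input
target = torch.tensor(10.0)  # made-up true value

loss = (w * x - target) ** 2

grad_before = w.grad          # nothing computed yet, so this is None
loss.backward()               # propagate the error back to w
grad_after = w.grad.item()

# gradient worked out by hand: 2 * (w*x - target) * x
by_hand = 2 * (2.0 * 3.0 - 10.0) * 3.0

print(grad_before, grad_after, by_hand)
```

A negative gradient means that increasing `w` would lower the loss, so an optimizer such as SGD would nudge `w` upward.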


## Putting everything together to train the NN

This structure should look very similar to the last tutorial's structure. 

In [7]:
# initialize our optimizer
optimizer = optim.SGD(model.parameters(), lr=0.003, momentum=0.9)

# define start time
time0 = time()

# How many times do we want our neural network to learn on the same data?
# Be careful not to do it too many times, since otherwise the NN will learn the anomalies in our data
epochs = 15
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten each 28x28 image into a row of 784 values to match input_size
        images = images.view(images.shape[0], -1)
    
        # train the neural network
        optimizer.zero_grad()
        
        output = model(images)
        loss = criterion(output, labels)
        
        #backward propagate like in the last cell
        loss.backward()
        
        #have pytorch adjust to improve from backwards propagation
        optimizer.step()
        
        # add this batch's loss (as a plain Python number, via "loss.item()") to "running_loss"
        running_loss += loss.item()

    # after all batches in this epoch, print the average training loss per batch
    print("Epoch {} - Training loss: {}".format(e, running_loss/len(trainloader)))
        
# print time it took to train
print("\nTraining Time (in minutes) =",(time()-time0)/60)

Epoch 0 - Training loss: 0.6196157126379674
Epoch 1 - Training loss: 0.2810084430743128
Epoch 2 - Training loss: 0.22019984526658998
Epoch 3 - Training loss: 0.17913566346266377
Epoch 4 - Training loss: 0.14946448202850596
Epoch 5 - Training loss: 0.1289271490771506
Epoch 6 - Training loss: 0.11250920073865954
Epoch 7 - Training loss: 0.0990299687806223
Epoch 8 - Training loss: 0.0878910678826066
Epoch 9 - Training loss: 0.08161229755667879
Epoch 10 - Training loss: 0.07441538415932611
Epoch 11 - Training loss: 0.06722809490300953
Epoch 12 - Training loss: 0.06210828351272917
Epoch 13 - Training loss: 0.056837323304416654
Epoch 14 - Training loss: 0.0531546590571254

Training Time (in minutes) = 1.910279428958893


## Activities:

- Can you plot the training loss for each epoch? (Remember to label the axes, add a title, and save all images!)
- Can you plot how long it takes to train each epoch? (Remember to label the axes, add a title, and save all images!)
- What happens if you don't do the backwards pass?
- Try adding another hidden layer to your NN initialization. How do you do this?
