# 01 CNN Training With Code Example - Neural Network Programming Course

## CNN Training Process
So far in this series, we learned about Tensors, and we've learned all about PyTorch neural networks. We are now ready to begin the **training process**.
* Prepare the data
* Build the model
* Train the model
  * **Calculate the loss, the gradient, and update the weights**
* Analyze the model's results

## Training: What We Do After The Forward Pass

During training, we do a forward pass, but then what? Suppose we get a batch and pass it forward through the network. Once the output is obtained, we compare the **predicted output** to the **actual labels**, and once we know **how close** the predicted values are to the actual labels, we **tweak** the weights inside the network so that the values the network predicts move closer to the true values (labels). In essence, we use the loss function to search for the optimal weights.

All of this is for **a single batch**, and we **repeat** this process for **every batch** until we have covered every sample in our training set. After we've completed this process for all of the batches and passed over every sample in our **training set**, we say that **an epoch** is complete. We use the word **epoch** to represent a **time period** in which our **entire training set** has been covered.

During the **entire training process**, we do as many **epochs** as necessary to reach our desired level of accuracy. With this, we have the following steps:
1. Get batch from the training set.
2. Pass batch to network.
3. Calculate the loss (difference between the predicted values and the true values).
4. Calculate the gradient of the loss function w.r.t the network's weights.
5. Update the weights using the gradients to reduce the loss.
6. Repeat steps 1-5 until one epoch is completed.
7. Repeat steps 1-6 for as many epochs as required to reach the minimum loss.

We already know exactly how to do steps `1` and `2`. We use a loss function to perform step `3`, and we use backpropagation and an optimization algorithm to perform steps `4` and `5`. Steps `6` and `7` are just standard **Python loops (the training loop)**. Let's see how this is done in code.
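
Before turning to PyTorch, the seven steps can be sketched in plain Python on a toy problem. This is only an illustration under stated assumptions: the "network" is a single weight `w`, the loss is mean squared error rather than cross entropy, and the data, `batch_size`, and `learning_rate` values are made up for the example.

```python
# Toy data: targets follow y = 2 * x, so the ideal weight is 2.0 (made-up data)
training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
batch_size = 2
learning_rate = 0.05  # hypothetical value chosen for this toy problem
w = 0.0               # the single weight of our toy "network"


def mean_squared_error(weight, samples):
    return sum((weight * x - y) ** 2 for x, y in samples) / len(samples)


initial_loss = mean_squared_error(w, training_set)

for epoch in range(20):                                     # step 7: repeat for several epochs
    for start in range(0, len(training_set), batch_size):   # step 6: cover every batch
        batch = training_set[start:start + batch_size]      # step 1: get a batch
        preds = [w * x for x, _ in batch]                    # step 2: forward pass
        # step 3: loss = mean squared difference between predictions and targets
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
        # step 4: gradient of the loss with respect to w
        grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
        w -= learning_rate * grad                            # step 5: update the weight

final_loss = mean_squared_error(w, training_set)
print(round(w, 3), final_loss < initial_loss)
```

In the PyTorch version that follows, steps 3 to 5 are handled by a loss function, `backward()`, and an optimizer rather than by hand-written formulas.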

## The Training Process

Since we disabled PyTorch's gradient tracking feature in a previous episode, we need to be sure to turn it back on (it is on by default).  
`torch.set_grad_enabled(True)`

In [1]:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms

torch.set_printoptions(linewidth=120) # Display options for output
torch.set_grad_enabled(True) # Already on by default
```

```python
<torch.autograd.grad_mode.set_grad_enabled at 0x1b6f9de6e80>
```

In [2]:
```python
print(torch.__version__)
print(torchvision.__version__)
```

```python
1.6.0
0.7.0
```


In [3]:
```python
def get_num_correct(preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()
```
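
To make the helper above less opaque, here is a small sketch that runs it step by step on hand-made values. The prediction scores and labels are invented for illustration (3 samples, 2 classes) and are not taken from the Fashion-MNIST data.

```python
import torch


def get_num_correct(preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()


# Made-up scores for 3 samples and 2 classes (illustrative values only)
preds = torch.tensor([[0.1, 0.9],
                      [0.8, 0.2],
                      [0.3, 0.7]])
labels = torch.tensor([1, 0, 0])

predicted_classes = preds.argmax(dim=1)        # index of the highest score per row
matches = predicted_classes.eq(labels)         # element-wise comparison with the labels
num_correct = get_num_correct(preds, labels)   # count of matches as a Python int

print(predicted_classes, matches, num_correct)
```

Here the first two samples are classified correctly and the third is not, so the helper returns `2`.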

### Preparing For The Forward Pass
We already know how to get a batch and pass it forward through the network. Let's see what we do after the forward pass is complete.

We'll begin by:
1. Creating an instance of our `Network` class.
2. Creating a data loader that provides batches of size 100 from our training set.
3. Unpacking the images and labels from one of these batches.

In [4]:
```python
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        # (1) input layer
        t = t

        # (2) hidden conv layer
        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        # (3) hidden conv layer
        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        # (4) hidden linear layer
        t = t.reshape(-1, 12 * 4 * 4)
        t = self.fc1(t)
        t = F.relu(t)

        # (5) hidden linear layer
        t = self.fc2(t)
        t = F.relu(t)

        # (6) output layer
        t = self.out(t)
        #t = F.softmax(t, dim=1)

        return t
```
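
The `12 * 4 * 4` used for `fc1` and the `reshape` call can be checked with a little arithmetic. The sketch below assumes the standard 28 x 28 Fashion-MNIST image size and the usual output-size formula for convolution and pooling without padding. `conv_output_size` is a hypothetical helper written for this note, not a PyTorch function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    # Standard output-size formula for a convolution or pooling window
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28                                                # height/width of a Fashion-MNIST image
size = conv_output_size(size, kernel_size=5)             # after conv1
size = conv_output_size(size, kernel_size=2, stride=2)   # after the first max pool
size = conv_output_size(size, kernel_size=5)             # after conv2
size = conv_output_size(size, kernel_size=2, stride=2)   # after the second max pool

flattened_features = 12 * size * size   # conv2 produces 12 output channels
print(size, flattened_features)
```

The spatial size shrinks from 28 to 24, 12, 8, and finally 4, which is where the `4 * 4` in `12 * 4 * 4` comes from.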

In [5]:
```python
train_set = torchvision.datasets.FashionMNIST(
    root = './data/FashionMNIST'
    ,train = True
    ,download = True
    ,transform = transforms.Compose([
        transforms.ToTensor()
    ])
)
```

In [6]:
```python
network = Network()
```

In [7]:
```python
train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
batch = next(iter(train_loader)) # Getting a batch
images, labels = batch
```

Next, we are ready to pass our batch of images forward through the network and obtain the output predictions. Once we have the prediction tensor, we can use the predictions and the true labels to calculate the loss.

### Calculating The Loss
To calculate the loss, we will use the `cross_entropy()` loss function that is available in PyTorch's `nn.functional` API. Once we have the loss, we can print it, and also check the number of correct predictions using the function we created in a [previous post](https://github.com/unclestrong/DeepLearning-code/blob/master/05%20Neural%20Networks%20and%20PyTorch%20Design-P2.ipynb).

In [8]:
```python
preds = network(images)
loss = F.cross_entropy(preds, labels) # Calculating the loss
```

In [9]:
```python
loss.item()
```

2.307081460952759

In [10]:
```python
get_num_correct(preds, labels)
```

11

The `cross_entropy()` function returned a scalar-valued tensor, so we used the `item()` method to print the `loss` as a Python number. We got `11` out of `100` correct, and since we have `10` prediction classes, this is about what we'd expect from guessing at random.
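
The starting loss of about `2.307` tells the same story as the accuracy. A model that assigns equal probability to all `10` classes has a cross-entropy loss of `ln(10)`, which is roughly `2.3026`. The sketch below checks this by hand in plain Python; the `cross_entropy` function here is a simplified single-sample stand-in written for this note, not PyTorch's implementation.

```python
import math


def cross_entropy(logits, label):
    # Softmax over the logits, then the negative log-probability of the true class
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]   # shift for numerical stability
    prob_of_label = exps[label] / sum(exps)
    return -math.log(prob_of_label)


uniform_logits = [0.0] * 10                 # equal scores for all 10 classes
random_guess_loss = cross_entropy(uniform_logits, label=3)

print(random_guess_loss, math.log(10))
```

An untrained network with small random weights produces nearly uniform predictions, which is why its loss lands close to this value.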

### Calculating The Gradients
Calculating the gradients is very easy using PyTorch. Since our network is a PyTorch `nn.Module`, PyTorch has created a **computation graph** under the hood. As our tensor flowed forward through our network, all of the computations were added to the graph. The computation graph is then used by PyTorch to calculate the gradients of the loss function with respect to the network's weights.

Before we calculate the gradients, let's verify that we **currently** have **no gradients** inside our `conv1` layer. The gradients are tensors that are accessible in the `grad` (short for gradient) attribute of the weight tensor of each layer.

In [12]:
```python
print(network.conv1.weight.grad)
```

None


To calculate the gradients, we call the `backward()` method on the loss tensor, like so:

In [13]:
```python
loss.backward() # Calculating the gradients
```

Now, the gradients of the loss function have been stored in the `grad` attribute of each weight tensor.

In [14]:
```python
network.conv1.weight.grad.shape
```

torch.Size([6, 1, 5, 5])
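
The gradient tensor has the same shape as the weight tensor it belongs to, because it holds one partial derivative per weight. The sketch below checks this on a standalone `Conv2d` layer with the same settings as `conv1`. The random input and the use of `sum()` as a stand-in scalar "loss" are assumptions made only to keep the example small.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
grad_before = conv.weight.grad             # None: backward() has not been called yet

fake_images = torch.randn(2, 1, 28, 28)    # random stand-in for a real batch
conv(fake_images).sum().backward()         # any scalar output works for this check

same_shape = conv.weight.grad.shape == conv.weight.shape
print(grad_before, conv.weight.grad.shape, same_shape)
```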

In [15]:
```python
network.conv1.weight.grad
```

```python
tensor([[[[ 8.0532e-04,  7.1517e-04,  5.4289e-04,  4.2453e-04,  2.2062e-04],
          [ 4.2473e-04,  3.6081e-04,  3.4775e-04,  3.3520e-04,  1.3792e-04],
          [ 1.8878e-04,  2.0218e-04,  1.3977e-04,  3.1463e-05, -1.6695e-04],
          [ 6.2114e-06,  1.1477e-04,  4.5907e-05, -2.9935e-05, -1.2944e-04],
          [-2.1969e-04, -1.8537e-04, -2.5566e-04, -2.2936e-04, -2.4350e-04]]],


        [[[ 1.0334e-03, -1.3228e-04, -4.6828e-04,  7.5834e-04,  1.1306e-03],
          [ 7.5262e-04, -4.0342e-04, -9.5859e-04,  1.9084e-04,  6.4649e-04],
          [ 6.9752e-04, -2.2768e-04, -8.4701e-04,  3.4626e-04,  4.3055e-04],
          [ 3.6175e-04, -7.0846e-04, -1.4202e-03, -3.4338e-04, -2.2465e-04],
          [ 3.8891e-04, -5.8086e-04, -1.4649e-03, -5.2291e-04, -2.1644e-04]]],


        [[[-2.7583e-03, -2.3309e-03, -2.3823e-03, -2.7402e-03, -2.4740e-03],
          [-2.3130e-03, -1.8277e-03, -2.0964e-03, -2.7168e-03, -2.2019e-03],
          [-2.1739e-03, -1.8778e-03, -2.1596e-03, -2.5166e-03, -1.98
```

These gradients are used by the optimizer to update the respective weights. To create our optimizer, we use the `torch.optim` package that has many optimization algorithm implementations that we can use. We'll use `Adam` for our example.

### Updating The Weights
To the `Adam` class constructor, we pass the network's parameters (this is how the optimizer is able to access the gradients) and the learning rate.

Finally, all we have to do to update the weights is to tell the optimizer to use the gradients to step in the direction of the loss function's minimum.

In [16]:
```python
optimizer = optim.Adam(network.parameters(), lr=0.01)
optimizer.step() # Updating the weights
```

When the `step()` function is called, the optimizer updates the weights using the gradients that are stored in the network's parameters. This means that we should expect our loss to be reduced if we pass the same batch through the network again. Checking this, we can see that this is indeed the case:

In [17]:
```python
preds = network(images)
loss.item() # Still the old value: the loss has not been recomputed from the new preds yet
```

2.307081460952759

In [19]:
```python
loss = F.cross_entropy(preds, labels)
loss.item()
```

2.2812142372131348

In [20]:
```python
get_num_correct(preds, labels)
```

11

## Train Using A Single Batch
We can summarize the code for training with a single batch in the following way:

In [21]:
```python
network = Network()

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)

batch = next(iter(train_loader)) # Get Batch
images, labels = batch

preds = network(images) # Pass Batch
loss = F.cross_entropy(preds, labels) # Calculate Loss

loss.backward() # Calculate Gradients
optimizer.step() # Update Weights

print('loss1:', loss.item())
preds = network(images)
loss = F.cross_entropy(preds, labels)
print('loss2:', loss.item())
```

```python
loss1: 2.300954818725586
loss2: 2.2833118438720703
```


## Quiz 01
Q1:During the training process, once the output is obtained, we compare the predicted output to the _______________.  
A1:labels

Q2:Once we know how close the predicted values are to the actual labels, we tweak the weights inside the network in such a way that the predicted values move _______________ the true values (labels).  
A2:closer to

Q3:After we've completed the training process for all the batches in our training set, we say that _______________ is complete.  
A3:an epoch

Q4:During the training process, we use the word _______________ to represent a time period for which the entire training set (every batch) has been passed to the network.  
A4:epoch  

Q5:During the entire training process, we do as many epochs as necessary to reach the _______________.  
A5:minimum loss

Q6:To begin the training process, the first step is to get a batch from the training set. What is the second step?  
A6:Pass the obtained batch to the network.

Q7:During the training process, after we pass a batch to the network, we use the predicted values and the labels to _______________.  
A7:calculate the loss

Q8:PyTorch's gradient tracking feature is turned on using which piece of code?  
A8:torch.set_grad_enabled(True)

Q9:Which piece of code makes the most sense for creating a PyTorch DataLoader?  
A9:torch.utils.data.DataLoader(train_set)

Q10:The cross_entropy() loss function lives in which PyTorch package?  
A10:torch.nn.functional

# 02 CNN Training Loop Explained - Neural Network Code Project
## CNN Training Loop - Teach A Neural Network
In the last episode, we learned that the [training process](https://deeplizard.com/learn/video/sZAlS3_dnk0) is an iterative process, and to train a neural network, we build what is called the training loop.
* Prepare the data
* Build the model
* Train the model
  * Build the training loop
* Analyze the model's results

### Training With A Single Batch
We can summarize the code for training with a single batch in the following way:
```python
network = Network()

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)

batch = next(iter(train_loader)) # Get Batch
images, labels = batch

preds = network(images) # Pass Batch
loss = F.cross_entropy(preds, labels) # Calculate Loss

loss.backward() # Calculate Gradients
optimizer.step() # Update Weights

print('loss1:', loss.item())
preds = network(images)
loss = F.cross_entropy(preds, labels)
print('loss2:', loss.item())
```

### Output:
```python
loss1: 2.300954818725586
loss2: 2.2833118438720703
```

One thing that you'll notice is that we get **different results each time** we run this code. This is because the model is created each time at the top, and we know from previous posts that the model weights are **randomly initialized**.

### Training With All Batches (Single Epoch)
Now, to train with all of the **batches** available inside our **data loader**, we need to make a few changes and add one additional line of code:

In [22]:
```python
network = Network()

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)

total_loss = 0
total_correct = 0

for batch in train_loader: # Get Batch
    images, labels = batch

    preds = network(images) # Pass Batch
    loss = F.cross_entropy(preds, labels) # Calculate Loss

    optimizer.zero_grad()
    loss.backward() # Calculate Gradients
    optimizer.step() # Update Weights

    total_loss += loss.item()
    total_correct += get_num_correct(preds, labels)

print(
    "epoch:", 0,
    "total_correct:", total_correct,
    "loss:", total_loss
)
```

epoch: 0 total_correct: 46957 loss: 347.39798778295517


Instead of getting a single batch from our data loader, we'll create a for loop that will **iterate** over **all of the batches**.

Since we have `60,000` samples in our training set, we will have `60,000 / 100 = 600` iterations. For this reason, we'll remove the print statement from within the loop, keep track of the total loss and the total number of correct predictions, and print them at the end.

Something to notice about these `600` iterations is that our `weights` will be `updated 600 times` by the end of the loop. If we **raise the batch_size** this number will **go down** and if we **lower the batch_size** this number will **go up**.
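
The relationship between batch size and the number of weight updates is simple arithmetic. The sketch below assumes the data loader keeps a final, smaller batch when the batch size does not divide the training set evenly (the `DataLoader` default, `drop_last=False`), which is why it rounds up. `updates_per_epoch` is a hypothetical helper written for this note.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    # One weight update per batch; a partial final batch still counts as a batch
    return math.ceil(num_samples / batch_size)


for batch_size in (100, 1000, 10000):
    print(batch_size, updates_per_epoch(60000, batch_size))
```

With `60,000` samples, batch sizes of `100`, `1000`, and `10000` give `600`, `60`, and `6` updates per epoch respectively.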

Finally, when we call the `backward()` method on our loss tensor, the gradients are calculated and **added** to the values already stored in the `grad` attributes of our network's parameters, rather than replacing them. For this reason, we need to zero out the gradients left over from the previous batch before each call to `backward()`. We can do this with a method called `zero_grad()` that comes with the optimizer.
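
The accumulation behaviour is easy to see on a single toy tensor. In this sketch the "loss" is just `3 * w`, whose gradient with respect to `w` is always `3`, and the gradient is cleared by hand with `w.grad.zero_()` to keep the example small. In the training loop, `optimizer.zero_grad()` plays this role for every parameter the optimizer manages.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # a single toy "weight"

(3 * w).backward()
first_grad = w.grad.item()          # gradient of 3 * w with respect to w

(3 * w).backward()                  # second backward pass without zeroing
accumulated_grad = w.grad.item()    # the new gradient was added to the old one

w.grad.zero_()                      # clear the stored gradient by hand
(3 * w).backward()
fresh_grad = w.grad.item()          # back to a single gradient

print(first_grad, accumulated_grad, fresh_grad)
```

Without the zeroing step, each batch's update would be computed from a running sum of all earlier gradients instead of the current batch's gradient.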

We are ready to run this code. This time the code will take longer because the loop is working on `600` batches.

```python
epoch: 0 total_correct: 46957 loss: 347.39798778295517
```

We get the results, and we can see that the total number correct out of 60,000 was 46,957.

In [23]:
```python
total_correct / len(train_set)
```

0.7826166666666666

That's pretty good after only one epoch (a single full pass over the data). Even though we did one epoch, we still have to keep in mind that the **weights** were updated `600` times, and this number depends on our batch size. If we made our batch size larger, say `10,000`, the weights would only be updated `6` times, and the results **wouldn't be quite as good**.
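
The summed loss is harder to read than the accuracy, because it grows with the number of batches. Assuming `F.cross_entropy` is called with its default `reduction='mean'`, each `loss.item()` is already an average over the `100` samples in a batch, so dividing the total by the number of batches gives a per-batch average that is comparable across batch sizes. The numbers below are copied from the epoch `0` output above.

```python
total_loss = 347.39798778295517   # summed loss reported for epoch 0 above
num_batches = 60000 // 100        # 600 batches of size 100

average_loss = total_loss / num_batches
print(round(average_loss, 4))
```

The result is well below the starting loss of about `2.3`, which matches the improvement seen in the accuracy.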

### Training With Multiple Epochs
To do **multiple epochs**, all we have to do is put this code into a **for loop**. We'll also add the epoch number to the print statement.

In [24]:
```python
network = Network()

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)

for epoch in range(10):

    total_loss = 0
    total_correct = 0

    for batch in train_loader: # Get Batch
        images, labels = batch

        preds = network(images) # Pass Batch
        loss = F.cross_entropy(preds, labels) # Calculate Loss

        optimizer.zero_grad()
        loss.backward() # Calculate Gradients
        optimizer.step() # Update Weights

        total_loss += loss.item()
        total_correct += get_num_correct(preds, labels)

    print(
        "epoch", epoch,
        "total_correct:", total_correct,
        "loss:", total_loss
    )
```

```python
epoch 0 total_correct: 46928 loss: 344.27279521524906
epoch 1 total_correct: 51277 loss: 232.71748647093773
epoch 2 total_correct: 52081 loss: 208.65398114919662
epoch 3 total_correct: 52609 loss: 194.9983945786953
epoch 4 total_correct: 52906 loss: 190.74674943089485
epoch 5 total_correct: 53021 loss: 186.76688426733017
epoch 6 total_correct: 53290 loss: 181.0335234105587
epoch 7 total_correct: 53226 loss: 180.50387901067734
epoch 8 total_correct: 53480 loss: 173.91857013106346
epoch 9 total_correct: 53588 loss: 170.3671340867877
```
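
The raw counts are easier to compare as accuracies. The short sketch below converts the `total_correct` values printed above into fractions of the `60,000` training samples; the list is copied by hand from that output.

```python
num_samples = 60000
# total_correct values copied from the ten-epoch run above
total_correct_per_epoch = [46928, 51277, 52081, 52609, 52906,
                           53021, 53290, 53226, 53480, 53588]

accuracies = [correct / num_samples for correct in total_correct_per_epoch]

for epoch, accuracy in enumerate(accuracies):
    print(f"epoch {epoch}: {accuracy:.2%}")
```

Training accuracy climbs from about 78% to about 89% over the ten epochs, with a small dip at epoch `7`, so the improvement is not strictly monotonic from one epoch to the next.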


## Complete Training Loop
Putting all of this together, we can pull the `network`, `optimizer`, and the `train_loader` out of the training loop cell.
```python
network = Network()
optimizer = optim.Adam(network.parameters(), lr=0.01)
train_loader = torch.utils.data.DataLoader(
    train_set
    ,batch_size=100
    ,shuffle=True
)
```
This makes it so that we can run the training loop without resetting the network's weights.
```python
for epoch in range(10):

    total_loss = 0
    total_correct = 0

    for batch in train_loader: # Get Batch
        images, labels = batch 

        preds = network(images) # Pass Batch
        loss = F.cross_entropy(preds, labels) # Calculate Loss

        optimizer.zero_grad()
        loss.backward() # Calculate Gradients
        optimizer.step() # Update Weights

        total_loss += loss.item()
        total_correct += get_num_correct(preds, labels)

    print(
        "epoch", epoch, 
        "total_correct:", total_correct, 
        "loss:", total_loss
    )
```

## Quiz 02
Q1:In the code below, what does the `lr` parameter do?
```python
optimizer = optim.Adam(network.parameters(), lr=0.01)
```
A1:sets the learning rate which tells the optimizer how far to step in the direction of the loss function's minimum

Q2:After we call the `backward()` method on our loss tensor, the gradients will be calculated and _______________ of our network's parameters.  
A2:added to the grad attributes

Q3:Using the code below, determine how many times optimizer.step() will be called during this training loop run.
```python
network = Network()

train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)

for epoch in range(10):
    for batch in train_loader: # Get Batch
        images, labels = batch
        # other stuff happens
        optimizer.step() # Update Weights
```
A3:6000

Q4:Suppose we have a fixed training set size. As batch size goes up, which of the following happens inside each epoch?
```python
for epoch in range(10):
    # what happens in here?
```
A4:the frequency of weight updates goes down

Q5:Suppose that our training set contains `60000` samples. If we are using the data loader below, how many times will our weights be updated during one epoch?
```python
train_loader = torch.utils.data.DataLoader(train_set, batch_size=10000)
```
A5:6

Q6:Suppose that our training set contains `60000` samples. If we are using the data loader below, how many iterations will occur inside our `for batch in train_loader:` loop?
```python
train_loader = torch.utils.data.DataLoader(train_set, batch_size=1000)
```
A6:60

Q7:Suppose that our training set contains `60000` samples. If we are using the data loader below, how many iterations will occur inside our `for batch in train_loader:` loop?
```python
train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
```
A7:600

Q8:What is the result of running the line of code below?  
```python
loss.item()
```
A8:the loss as a Python number