<a href="https://colab.research.google.com/github/zelal-Eizaldeen/deeplearning_course/blob/main/3_9pt_variationsOfGD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- In this programming example, we use some of the techniques just introduced to improve our digit classification network, implemented in PyTorch.

# PyTorch

This Google Colab notebook shows how to build an improved digit classification network using PyTorch, demonstrating some of the newly introduced techniques. Most of the code is the same as in the previous digit classification example, but we make changes to both the **network and the training process**.

In [None]:
"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""

In [1]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
import numpy as np
torch.manual_seed(7)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
EPOCHS = 20
BATCH_SIZE = 64

# Load training dataset into a single batch to compute mean and stddev.
transform = transforms.Compose([transforms.ToTensor()])
trainset = MNIST(root='./pt_data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=len(trainset), shuffle=True)
data = next(iter(trainloader))
mean = data[0].mean()
stddev = data[0].std()

# Helper function needed to standardize data when loading datasets.
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean, stddev)])

trainset = MNIST(root='./pt_data', train=True, download=True, transform=transform)
testset = MNIST(root='./pt_data', train=False, download=True, transform=transform)



**We now declare a batch size of 64 instead of 1**, so each training step processes 64 training examples in parallel.
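To make the effect of the batch size concrete, here is a minimal sketch that wraps random stand-in tensors in a `DataLoader` and prints the shape of each mini-batch. The data is not MNIST, and the 200-sample size is an arbitrary assumption chosen so that the last batch is only partially filled.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 200 random "images" shaped like MNIST digits.
images = torch.rand(200, 1, 28, 28)
labels = torch.randint(0, 10, (200,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

# Collect the shape of the input tensor in every mini-batch.
batch_shapes = [tuple(inputs.shape) for inputs, _ in loader]
for shape in batch_shapes:
    print(shape)
```

With 200 samples and a batch size of 64, the loader yields three full batches and a final batch of 8 samples, because `DataLoader` keeps the partial last batch by default.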

The network now uses ReLU instead of tanh as the activation function for the first layer. We also use the softmax activation function in the last layer, but the PyTorch implementation is a little unusual in that **the softmax function is included in the loss function itself**. So we only need to declare a linear layer here; switching to the cross-entropy loss function also provides the activation for the last layer.

In [2]:
# Define model. Final activation is omitted since softmax is part of
# cross-entropy loss function in PyTorch.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 25),
    nn.ReLU(),
    nn.Linear(25, 10)
)
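As a rough check of the claim that softmax lives inside the loss, the sketch below compares `nn.CrossEntropyLoss` applied to raw outputs with a hand-computed log-softmax followed by the negative log-probability of the correct class. The logits and targets are made-up values for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw outputs of a final linear layer
targets = torch.tensor([3, 0, 9, 1])  # class indices, not one-hot vectors

# Built-in loss applied directly to the raw outputs.
builtin_loss = nn.CrossEntropyLoss()(logits, targets)

# Same quantity by hand: log-softmax, then the mean negative
# log-probability assigned to the correct class of each example.
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(4), targets].mean()
print(builtin_loss.item(), manual_loss.item())

# At inference time, softmax can be applied explicitly to get probabilities.
probs = F.softmax(logits, dim=1)
```

The two loss values should agree up to floating-point error, which is why the model itself can end with a plain linear layer.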

Next we **retrieve the two linear layers** so that we can initialize their weights. We do this a little differently than in the previous example: we index layers 1 and 3 of the model. **Layer 1 is the first linear layer** (`nn.Linear(784, 25)`) and layer 3 is the output layer (`nn.Linear(25, 10)`).

In [3]:
# Retrieve layers for custom weight initialization.
layers = next(model.modules())
hidden_layer = layers[1]
output_layer = layers[3]

For the hidden layer we use **Kaiming initialization**, which is the same thing **as He initialization**, discussed in the previous video, and we set the bias weights to zero. For the second layer, the output layer, we use **Xavier uniform initialization, which is the same thing as Glorot initialization**, for **the regular weights** and again set the bias weights to zero.

In [4]:
# Kaiming (He) initialization.
nn.init.kaiming_normal_(hidden_layer.weight)
nn.init.constant_(hidden_layer.bias, 0.0)

# Xavier (Glorot) initialization.
nn.init.xavier_uniform_(output_layer.weight)
nn.init.constant_(output_layer.bias, 0.0)

Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
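To get a feel for what these initializers do, the following sketch compares the resulting weights with the values one would expect from the usual formulas: a standard deviation of sqrt(2 / fan_in) for He normal initialization and a bound of sqrt(6 / (fan_in + fan_out)) for Xavier uniform initialization. It assumes the default arguments of `kaiming_normal_` and `xavier_uniform_` and uses freshly created layers rather than the notebook's model.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = nn.Linear(784, 25)  # same shapes as the layers in the model
output = nn.Linear(25, 10)

nn.init.kaiming_normal_(hidden.weight)  # defaults: fan_in mode, gain sqrt(2)
nn.init.xavier_uniform_(output.weight)  # default gain of 1.0

he_std = math.sqrt(2.0 / 784)             # expected std of hidden weights
xavier_bound = math.sqrt(6.0 / (25 + 10)) # expected max |w| of output weights

observed_std = hidden.weight.std().item()
observed_max = output.weight.abs().max().item()
print(f"He: expected std {he_std:.4f}, observed {observed_std:.4f}")
print(f"Xavier: bound {xavier_bound:.4f}, observed max |w| {observed_max:.4f}")
```

The observed standard deviation should land close to the expected one, and no output weight should exceed the Xavier bound.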

We have created our network. Now, looking at **the training process**, we use the **Adam optimizer** instead of stochastic gradient descent. We don't need to provide a learning rate; we just use Adam's default learning rate to start with. <br><br>

As mentioned, we use the **cross-entropy loss** as the loss function instead of the mean squared error.

In [6]:
# Use the Adam optimizer
# Cross-entropy loss as loss function.
optimizer = torch.optim.Adam(model.parameters())
loss_function = nn.CrossEntropyLoss()
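For reference, the default learning rate that Adam falls back to can be read from the optimizer itself. The sketch below uses a hypothetical tiny model, only so that there are parameters to optimize, and shows that omitting `lr` is equivalent to passing `lr=1e-3`.

```python
import torch
import torch.nn as nn

tiny_model = nn.Linear(4, 2)  # hypothetical model, used only for its parameters

default_opt = torch.optim.Adam(tiny_model.parameters())
explicit_opt = torch.optim.Adam(tiny_model.parameters(), lr=1e-3)

# Each parameter group stores the hyperparameters it was created with.
default_lr = default_opt.param_groups[0]["lr"]
explicit_lr = explicit_opt.param_groups[0]["lr"]
print(default_lr, explicit_lr)
```

If the default turns out not to work well for a given problem, passing `lr` explicitly as in the second optimizer is the place to tune it.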

We no longer have to compute the one-hot encoded version of the targets, because the cross-entropy loss function **expects the class index rather than the one-hot version**. So we can pass the targets to the loss function directly, together with the outputs.

Now that we have **a batch size of 64**, the **indices and targets** are each **an array of 64 elements**. The equality operation checks element-wise whether they are equal, and we then sum up the result. **This tells us for how many of these 64 examples the output equals the label.**
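The counting step can be illustrated on a tiny made-up batch. The sketch below uses four examples and three classes instead of 64 examples and ten classes; the output values and targets are invented for illustration.

```python
import torch

# Made-up network outputs for a batch of 4 examples and 3 classes.
outputs = torch.tensor([[0.1, 2.0, -1.0],
                        [1.5, 0.2, 0.3],
                        [0.0, 0.1, 3.0],
                        [2.2, 0.4, 0.1]])
targets = torch.tensor([1, 0, 1, 0])

# torch.max along dimension 1 returns (max values, index of the max).
_, indices = torch.max(outputs, 1)
matches = indices == targets        # element-wise comparison
num_correct = matches.sum().item()  # number of True entries

print(indices.tolist())   # [1, 0, 2, 0]
print(matches.tolist())   # [True, True, False, True]
print(num_correct)        # 3
```

Only the third example is misclassified, so three of the four predictions count as correct.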

In [7]:
# Transfer model to GPU
model.to(device)

# Create DataLoader objects that will help create mini-batches.
trainloader = DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True)
testloader = DataLoader(testset, batch_size=BATCH_SIZE, shuffle=False)

# Train the model. In PyTorch we have to implement the training loop ourselves.
for i in range(EPOCHS):
    model.train() # Set model in training mode.
    train_loss = 0.0
    train_correct = 0
    train_batches = 0
    for inputs, targets in trainloader:
        # Move data to GPU.
        inputs, targets = inputs.to(device), targets.to(device)

        # Zero the parameter gradients.
        optimizer.zero_grad()

        # Forward pass.
        outputs = model(inputs)
        # Cross-entropy loss does not need one-hot targets in PyTorch.
        loss = loss_function(outputs, targets)

        # Accumulate metrics.
        _, indices = torch.max(outputs, 1)
        train_correct += (indices == targets).sum().item()
        train_batches +=  1
        train_loss += loss.item()

        # Backward pass and update.
        loss.backward()
        optimizer.step()

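    train_loss = train_loss / train_batches
    # Note: the last batch can hold fewer than BATCH_SIZE examples, so this
    # denominator slightly overstates the number of examples seen.
    train_acc = train_correct / (train_batches * BATCH_SIZE)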

    # Evaluate the model on the test dataset. Identical to loop above but without
    # weight adjustment; torch.no_grad() skips gradient tracking during inference.
    model.eval() # Set model in inference mode.
    test_loss = 0.0
    test_correct = 0
    test_batches = 0
    with torch.no_grad():
        for inputs, targets in testloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            _, indices = torch.max(outputs, 1)
            test_correct += (indices == targets).sum().item()
            test_batches +=  1
            test_loss += loss.item()

    test_loss = test_loss / test_batches
    test_acc = test_correct / (test_batches * BATCH_SIZE)

    print(f'Epoch {i+1}/{EPOCHS} loss: {train_loss:.4f} - acc: {train_acc:0.4f} - val_loss: {test_loss:.4f} - val_acc: {test_acc:0.4f}')

Epoch 1/20 loss: 0.3698 - acc: 0.8896 - val_loss: 0.2090 - val_acc: 0.9332
Epoch 2/20 loss: 0.1901 - acc: 0.9435 - val_loss: 0.1698 - val_acc: 0.9442
Epoch 3/20 loss: 0.1510 - acc: 0.9547 - val_loss: 0.1506 - val_acc: 0.9490
Epoch 4/20 loss: 0.1303 - acc: 0.9604 - val_loss: 0.1395 - val_acc: 0.9525
Epoch 5/20 loss: 0.1159 - acc: 0.9649 - val_loss: 0.1304 - val_acc: 0.9570
Epoch 6/20 loss: 0.1051 - acc: 0.9675 - val_loss: 0.1313 - val_acc: 0.9567
Epoch 7/20 loss: 0.0975 - acc: 0.9697 - val_loss: 0.1194 - val_acc: 0.9615
Epoch 8/20 loss: 0.0899 - acc: 0.9716 - val_loss: 0.1262 - val_acc: 0.9580
Epoch 9/20 loss: 0.0837 - acc: 0.9739 - val_loss: 0.1303 - val_acc: 0.9576
Epoch 10/20 loss: 0.0788 - acc: 0.9757 - val_loss: 0.1244 - val_acc: 0.9615
Epoch 11/20 loss: 0.0746 - acc: 0.9763 - val_loss: 0.1202 - val_acc: 0.9602
Epoch 12/20 loss: 0.0706 - acc: 0.9774 - val_loss: 0.1137 - val_acc: 0.9625
Epoch 13/20 loss: 0.0674 - acc: 0.9781 - val_loss: 0.1228 - val_acc: 0.9606
Epoch 14/20 loss: 0.0

Given that we are running 64 examples in parallel instead of one at a time, the training process is faster. You will see that each epoch goes significantly faster than it did in the previous programming example.

Not only did training go faster than before, but we also got a better result: the training accuracy is at about 98% and the test accuracy at about 96%, compared to the 95% and 94% we had with the previous network.
Running **multiple training examples in parallel made things faster**, and **changing the network and using the Adam optimizer improved the accuracy.**
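One detail is worth keeping in mind when reading the accuracy figures: the training loop divides the number of correct predictions by the number of batches times `BATCH_SIZE`, which assumes every batch is full. The sketch below, assuming MNIST's standard split of 60,000 training and 10,000 test images and the `DataLoader` default of keeping the last partial batch, shows how that denominator compares with the true dataset size. The helper `batch_denominators` is a hypothetical name introduced for this illustration.

```python
import math

BATCH_SIZE = 64

def batch_denominators(num_samples, batch_size=BATCH_SIZE):
    """Return (number of batches, batches * batch_size) for a dataset."""
    num_batches = math.ceil(num_samples / batch_size)
    return num_batches, num_batches * batch_size

for name, n in [("train", 60000), ("test", 10000)]:
    batches, assumed = batch_denominators(n)
    print(f"{name}: {batches} batches, loop divides by {assumed}, dataset has {n}")
```

The difference is small (under 0.5% in both cases), so the reported accuracies are slightly understated; dividing by `len(trainset)` or `len(testset)` instead would give the exact fraction.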

# Reference
Learning Deep Learning Book by Magnus Ekman
