<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-II-Computer-Vision/04-transfer-learning/03_residual_connections.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transfer Learning: Residual Connections

So, what is transfer learning?

The idea is quite simple. First, some big tech company, which has access to virtually
infinite amounts of data and computing power, develops and trains a huge model
for their own purpose. 

Next, once it is trained, its architecture and the corresponding trained weights (the pre-trained model) are released. Finally,
everyone else can use these weights as a starting point and fine-tune them
further for a different (but similar) purpose.

That’s transfer learning in a nutshell.

## Setup

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [2]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    open('config.py', 'wb').write(r.content)    
except ModuleNotFoundError:
    pass

from config import *
config_chapter7()
# This is needed to render the plots in this chapter
from plots.chapter7 import *

Downloading files from GitHub repo to Colab...
Finished!


In [3]:
import numpy as np
from PIL import Image

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import DataLoader, Dataset, random_split, TensorDataset
from torchvision.transforms import Compose, ToTensor, Normalize, Resize, ToPILImage, CenterCrop, RandomResizedCrop
from torchvision.datasets import ImageFolder
from torchvision.models import alexnet, resnet18, inception_v3
#from torchvision.models.alexnet import model_urls
try:
  from torchvision.models.utils import load_state_dict_from_url
except ImportError:
  from torch.hub import load_state_dict_from_url

from stepbystep.v3 import StepByStep
from data_generation.rps import download_rps

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import os
# /content/gdrive/MyDrive/kaggle-keys is the folder in Google Drive where kaggle.json is stored
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/kaggle-keys"

In [None]:
%%shell

# Download the dataset from Kaggle (URL: https://www.kaggle.com/datasets/sanikamal/rock-paper-scissors-dataset)
kaggle datasets download -d sanikamal/rock-paper-scissors-dataset

unzip -qq rock-paper-scissors-dataset.zip
rm -rf rock-paper-scissors-dataset.zip

Downloading rock-paper-scissors-dataset.zip to /content
 99% 449M/452M [00:04<00:00, 122MB/s]
100% 452M/452M [00:04<00:00, 98.1MB/s]




## Data Preparation

The data preparation step will be a bit more demanding this time since we’ll be
standardizing the images. Besides, we can use the `ImageFolder` dataset now.

The Rock Paper Scissors dataset is organized like this:

```
rps/paper/paper01-000.png
rps/paper/paper01-001.png

rps/rock/rock01-000.png
rps/rock/rock01-001.png

rps/scissors/scissors01-000.png
rps/scissors/scissors01-001.png
```

The dataset is also perfectly balanced, with each sub-folder containing 840 images
of its particular class.

In [None]:
ROOT_FOLDER = "Rock-Paper-Scissors"

Since we’re using a pre-trained model, we need to use the standardization
parameters used to train the original model. 

In other words, we need to use the
statistics of the original dataset used to train that model.
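To make "using the statistics of the original dataset" concrete, here is a minimal sketch of per-channel standardization done by hand: subtract each channel's mean and divide by its standard deviation. The mean and standard deviation values are the ones used in the data preparation code of this section; the random image is just a stand-in.

```python
import torch

torch.manual_seed(0)

# A stand-in for one RGB image already converted to a tensor in [0, 1]
image = torch.rand(3, 224, 224)

# Per-channel statistics of the dataset the original model was trained on
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Standardizes each channel: (pixel - channel mean) / channel std
standardized = (image - mean) / std

# A pixel equal to the channel mean maps to zero after standardization
mean_pixel = ((mean - mean) / std).flatten()
print(standardized.shape, mean_pixel)
```

Reshaping the statistics to `(3, 1, 1)` lets broadcasting apply one mean and one standard deviation to every pixel of the corresponding channel.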

So, the data preparation step for the Rock Paper Scissors dataset looks like this now:

In [None]:
normalizer = Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
composer = Compose([
  Resize(256),
  CenterCrop(224),
  ToTensor(),
  normalizer
])

train_data = ImageFolder(root=f"{ROOT_FOLDER}/train", transform=composer)
val_data = ImageFolder(root=f"{ROOT_FOLDER}/test", transform=composer)

# Builds a loader of each set
train_loader = DataLoader(train_data, batch_size=16, shuffle=True)
val_loader = DataLoader(val_data, batch_size=16)

## Residual Connections

The idea of a residual connection is quite simple, actually: After passing the input
through a layer and activation function, the input itself is added to the result.

Why would I want to add the input to the result?



### Learning the Identity

Neural networks and their nonlinearities (activation functions) are great!

But nonlinearities are both a blessing and a curse: They
make it extremely hard for a model to learn the identity function.

To illustrate this, let’s start with a dummy dataset containing 100 random data
points with a single feature. 

But this feature isn’t simply a feature—it is also the
label.

In [4]:
torch.manual_seed(23)

dummy_points = torch.randn((100, 1))

dummy_dataset = TensorDataset(dummy_points, dummy_points)
dummy_loader = DataLoader(dummy_dataset, batch_size=16, shuffle=True)

If we were using a simple linear model, that would be a no-brainer, right?

But what happens if we introduce a nonlinearity? 

Let’s
configure the model and train it to see what happens:

In [5]:
class Dummy(nn.Module):
  def __init__(self):
    super(Dummy, self).__init__()

    self.linear = nn.Linear(1, 1)
    self.activation = nn.ReLU()

  def forward(self, x):
    output = self.linear(x)
    output = self.activation(output)
    return output

In [6]:
torch.manual_seed(555)

dummy_model = Dummy()
dummy_loss_fn = nn.MSELoss()
dummy_optimizer = optim.SGD(dummy_model.parameters(), lr=0.1)

In [7]:
dummy_sbs = StepByStep(dummy_model, dummy_loss_fn, dummy_optimizer)
dummy_sbs.set_loaders(dummy_loader)
dummy_sbs.train(200)

If we compare the actual labels with the model’s predictions, we’ll see that it failed
to learn the identity function:

In [8]:
np.concatenate([
  dummy_points[:5].numpy(),
  dummy_sbs.predict(dummy_points)[:5]
], axis=1)

array([[-0.9012059 ,  0.        ],
       [ 0.56559485,  0.56559485],
       [-0.48822638,  0.        ],
       [ 0.75069577,  0.7506957 ],
       [ 0.58925384,  0.58925384]], dtype=float32)

Since the `ReLU` can only return non-negative values (zero or positive), it will
never be able to produce the points with negative values.
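A tiny check makes this behavior explicit. The input values below are arbitrary, chosen only for illustration:

```python
import torch
import torch.nn as nn

relu = nn.ReLU()
values = torch.tensor([-0.9, -0.5, 0.0, 0.6, 0.75])

# Negative inputs are clipped to zero; non-negative inputs pass through unchanged
activated = relu(values)
print(activated)
```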

Wait, that doesn’t look right … where is the output layer?

I suppressed the output layer on purpose to make a point here.

Please bear with me a little bit longer while I add a residual connection to the
model:

In [9]:
class DummyResidual(nn.Module):
  def __init__(self):
    super(DummyResidual, self).__init__()

    self.linear = nn.Linear(1, 1)
    self.activation = nn.ReLU()

  def forward(self, x):
    identity = x
    output = self.linear(x)
    output = self.activation(output)
    output = output + identity
    return output

In [10]:
torch.manual_seed(555)

dummy_model = DummyResidual()
dummy_loss_fn = nn.MSELoss()
dummy_optimizer = optim.SGD(dummy_model.parameters(), lr=0.1)

In [12]:
dummy_sbs = StepByStep(dummy_model, dummy_loss_fn, dummy_optimizer)
dummy_sbs.set_loaders(dummy_loader)
dummy_sbs.train(100)

Let’s double-check it.

In [13]:
np.concatenate([
  dummy_points[:5].numpy(),
  dummy_sbs.predict(dummy_points)[:5]
], axis=1)

array([[-0.9012059 , -0.9012059 ],
       [ 0.56559485,  0.56559485],
       [-0.48822638, -0.48822638],
       [ 0.75069577,  0.75069577],
       [ 0.58925384,  0.58925384]], dtype=float32)

It looks like the model actually learned the identity function … or did it? 

Let’s check its parameters:

In [14]:
dummy_model.state_dict()

OrderedDict([('linear.weight', tensor([[0.1490]])),
             ('linear.bias', tensor([-0.3329]))])

For an input value equal to zero, the output of the linear layer will be -0.3329 (the bias),
which, in turn, will be chopped off by the ReLU activation.

Which input values produce outputs greater than zero?

The answer: Only input values above roughly `2.234 (≈0.3329/0.1490)`, using the rounded parameters printed above, will produce positive
outputs, which, in turn, will pass through the ReLU activation. 

But I have another question for you:

Guess what is the highest input value in our dataset?



In [15]:
dummy_points.max()

tensor(2.2347)

So what? Does it actually mean anything?

It means the model learned to stay out of the way of the inputs! 

Now that the model has the ability to use the raw inputs directly, its linear layer learned to
produce only negative values for the inputs in our dataset, so its nonlinearity (ReLU) produces only zeros.
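The reasoning above can be replayed by hand. The sketch below plugs the rounded parameters from the printed state dict into the same computation as `DummyResidual.forward`; the input values are a few illustrative points, all comfortably below the threshold.

```python
import torch
import torch.nn.functional as F

# Rounded parameters copied from the state dict printed above
weight, bias = 0.1490, -0.3329

# The linear layer's output turns positive only above this input value
threshold = -bias / weight
print(round(threshold, 3))

# A few inputs comfortably below the threshold
x = torch.tensor([[-0.9012], [0.5656], [-0.4882], [0.7507], [2.0]])

# Same computation as DummyResidual.forward: ReLU(w * x + b) + x
branch = F.relu(weight * x + bias)
output = branch + x

# The ReLU branch contributes nothing, so the output is the input itself
print(branch.abs().max().item(), torch.equal(output, x))
```

Because the parameters are rounded to four decimals, the threshold is only approximate; the point is that the residual branch is silent for these inputs, leaving the shortcut to carry the identity.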

### The Power of Shortcuts

In [None]:
batch_normalizer.eval()

normed3 = batch_normalizer(batch3[0])
normed3.mean(axis=0), normed3.var(axis=0, unbiased=False)

(tensor([0.1350, 0.1450]), tensor([1.0134, 1.2981]))

Since it is standardizing unseen data using statistics computed on
training data, the results above are expected. 

The mean will be around zero and
the standard deviation will be around one.
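Here is a self-contained sketch of the same idea. It uses random data as a stand-in for the mini-batches (an assumption of this example) and checks that, in evaluation mode, the layer standardizes with its running statistics rather than with the statistics of the batch it receives.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Random stand-ins for training mini-batches and one unseen mini-batch
train_batches = [torch.randn(64, 2) * 2 + 1 for _ in range(2)]
unseen_batch = torch.randn(64, 2) * 2 + 1

# momentum=None makes the layer keep a simple (cumulative) average
bn = nn.BatchNorm1d(num_features=2, affine=False, momentum=None)

# Training mode: each forward pass updates the running statistics
bn.train()
for batch in train_batches:
    bn(batch)

# Evaluation mode: the running statistics are used instead of the batch's own
bn.eval()
normed = bn(unseen_batch)
manual = (unseen_batch - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)

print(torch.allclose(normed, manual, atol=1e-6))
print(normed.mean(axis=0), normed.var(axis=0, unbiased=False))
```

The printed mean and variance of the unseen batch should be close to, but not exactly, zero and one, since the statistics come from other data.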

### Momentum

There is an alternative way of computing running statistics: Instead of using a
simple average, it uses an exponentially weighted moving average (EWMA) of the
statistics.

So, to make it abundantly clear what is being computed, I present the formulas
below:

$$
\large
\begin{aligned}
\text{EWMA}_t(\alpha, x) &= \alpha \, x_t + (1-\alpha) \, \text{EWMA}_{t-1}(\alpha, x)
\\
\text{running stat}_t &= \text{"momentum"} \, \text{stat}_t + (1-\text{"momentum"}) \, \text{running stat}_{t-1}
\end{aligned}
$$
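The recursion is short enough to write in plain Python. The function name `ewma` and its `initial` argument are hypothetical helpers introduced only for this illustration:

```python
def ewma(values, alpha, initial=0.0):
    """Returns the EWMA after each value, starting from `initial`."""
    averages = []
    current = initial
    for value in values:
        # alpha weighs the newest value; (1 - alpha) weighs the previous average
        current = alpha * value + (1 - alpha) * current
        averages.append(current)
    return averages

# With alpha=0.1, each new value closes only 10% of the gap to the average
result = ewma([1.0, 1.0, 1.0], alpha=0.1)
print([round(r, 3) for r in result])  # [0.1, 0.19, 0.271]
```

In batch normalization, the "momentum" argument plays the role of `alpha`, and the running statistic plays the role of the previous average.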

Let’s try it out in practice.

In [None]:
batch_normalizer_mom = nn.BatchNorm1d(num_features=2, affine=False, momentum=0.1)
batch_normalizer_mom.state_dict()

OrderedDict([('running_mean', tensor([0., 0.])),
             ('running_var', tensor([1., 1.])),
             ('num_batches_tracked', tensor(0))])

What happens if we run
the first mini-batch through it?

In [None]:
normed1_mom = batch_normalizer_mom(batch1[0])
batch_normalizer_mom.state_dict()

OrderedDict([('running_mean', tensor([-0.0228, -0.0212])),
             ('running_var', tensor([1.2743, 1.3761])),
             ('num_batches_tracked', tensor(1))])

We can easily verify the results for the running
means:

In [None]:
running_mean = torch.zeros((1, 2))
# EWMA update: momentum * batch mean + (1 - momentum) * previous running mean
running_mean = 0.1 * batch1[0].mean(axis=0) + (1 - 0.1) * running_mean
running_mean

tensor([[-0.0228, -0.0212]])
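The running variance can be checked in the same way, with one caveat: PyTorch updates it using the unbiased variance of the mini-batch. The sketch below uses a random mini-batch as a stand-in for `batch1` (an assumption of this example) and compares the layer's buffers against the manual EWMA update.

```python
import torch
import torch.nn as nn

torch.manual_seed(7)

# Random stand-in for a mini-batch of 64 points with two features
batch = torch.randn(64, 2) * 1.5

bn = nn.BatchNorm1d(num_features=2, affine=False, momentum=0.1)
bn(batch)  # one forward pass in training mode updates the running statistics

momentum = 0.1
# Running statistics start at zero (mean) and one (variance)
manual_mean = momentum * batch.mean(axis=0) + (1 - momentum) * torch.zeros(2)
# The running variance is updated with the unbiased mini-batch variance
manual_var = momentum * batch.var(axis=0, unbiased=True) + (1 - momentum) * torch.ones(2)

print(torch.allclose(bn.running_mean, manual_mean, atol=1e-6))
print(torch.allclose(bn.running_var, manual_var, atol=1e-6))
```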

### BatchNorm2d

The difference between the one-dimension and the two-dimension batch
normalization is actually quite simple: The former standardizes features (columns),
while the latter standardizes channels (pixels).



In [None]:
torch.manual_seed(39)

dummy_images = torch.rand((200, 3, 10, 10))
dummy_labels = torch.randint(2, (200, 1))

dummy_dataset = TensorDataset(dummy_images, dummy_labels)
dummy_loader = DataLoader(dummy_dataset, batch_size=64, shuffle=True)

iterator = iter(dummy_loader)
batch1 = next(iterator)
batch1[0].shape

torch.Size([64, 3, 10, 10])

Batch normalization is done per channel (the C dimension), so it computes statistics
over the remaining dimensions N, H, and W (`axis=[0, 2, 3]`), that is, over all
pixels of a given channel from every image in the mini-batch.

The `nn.BatchNorm2d` layer has the same arguments as its one-dimension
counterpart, but its `num_features` argument must match the number of channels
of the input instead:

In [None]:
batch_normalizer = nn.BatchNorm2d(num_features=3, affine=False, momentum=None)
normed1 = batch_normalizer(batch1[0])
normed1.mean(axis=[0, 2, 3]), normed1.var(axis=[0, 2, 3], unbiased=False)

(tensor([ 2.3283e-08, -2.3693e-08,  8.8960e-08]),
 tensor([0.9999, 0.9999, 0.9999]))

As expected, each channel in the output has its pixel values with zero mean and
unit standard deviation.
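To see where those numbers come from, the per-channel standardization can be reproduced by hand. This is a minimal sketch on random images (a stand-in, not the mini-batch used above): it computes the mean and biased variance over N, H, and W and compares the result with the layer's output in training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(39)

# Random stand-in for a mini-batch of 64 three-channel 10x10 images
images = torch.rand(64, 3, 10, 10)

# Per-channel statistics over N, H, and W; keepdim allows broadcasting back
mean = images.mean(dim=[0, 2, 3], keepdim=True)
var = images.var(dim=[0, 2, 3], unbiased=False, keepdim=True)

bn = nn.BatchNorm2d(num_features=3, affine=False, momentum=None)
# Training mode standardizes with the (biased) statistics of the mini-batch
manual = (images - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(bn(images), manual, atol=1e-5))
```

The small `eps` added to the variance is why the variances reported by the layer come out as 0.9999 rather than exactly one.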

## Summary

This section goes over a lot
of information while only scratching the surface of this topic. So, here is a
small summary of the main points we’ve addressed:

* During training time, batch normalization computes statistics (mean and
variance) for each individual mini-batch and uses these statistics to produce
standardized outputs.

* The fluctuations in the statistics from one mini-batch to the next introduce
randomness into the process and thus have a regularizing effect.

* Due to the regularizing effect of batch normalization, it may not work well if combined with other regularization techniques (like dropout).

* During evaluation time, batch normalization uses a (smoothed) average of the
statistics computed during training.

* Its original motivation was to address the so-called internal covariate shift by
producing similar distributions across different layers, but it was later found
that it actually improves model training by making the loss surface smoother.

* The batch normalization may be placed either before or after the activation
function; there is no "right" or "wrong" way.

* The layer preceding the batch normalization layer should be created with `bias=False`:
batch normalization subtracts the mean, which cancels the bias, so keeping it is useless computation.

* Even though batch normalization works for a different reason than initially
thought, addressing the internal covariate shift may still bring benefits, like
solving the vanishing gradients problem.
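The point about `bias=False` is easy to verify. In the sketch below (layer sizes and data are arbitrary choices for illustration), the same linear layer is run through batch normalization with and without its bias, and the outputs match because subtracting the mini-batch mean removes any constant added to a feature.

```python
import torch
import torch.nn as nn

torch.manual_seed(11)

x = torch.randn(32, 4)

linear = nn.Linear(4, 3)
bn = nn.BatchNorm1d(num_features=3, affine=False)

# Output using the layer's (randomly initialized) bias
with_bias = bn(linear(x))

# Same weights, but with the bias zeroed out
with torch.no_grad():
    linear.bias.zero_()
without_bias = bn(linear(x))

# Subtracting the mini-batch mean removes any constant added per feature
print(torch.allclose(with_bias, without_bias, atol=1e-5))
```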

