
## Vanishing and Exploding Gradients

The gradients, together with the learning rate, are what make the model tick or,
better yet, learn. So far, we have always assumed that the gradients were well
behaved, as long as our learning rate was sensible.

Unfortunately, this is not necessarily true, and sometimes the gradients
may go awry: They can either vanish or explode. Either way, we need to rein them
in, so let’s see how we can accomplish that.

Backpropagation works fine for
models with a few hidden layers, but as models grow deeper, the gradients
computed for the weights in the initial layers become smaller and smaller. That’s
the so-called vanishing gradients problem, and it has always been a major obstacle
for training deeper models.

If gradients vanish—that is, if they are close to zero—updating the weights will
barely change them. In other words, the model is not learning anything; it gets
stuck.
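
A rough way to see how small these updates can get is to chain a few scalar "layers" and apply the chain rule by hand. The sketch below is purely illustrative and not part of the original notebook: the function `first_weight_gradient`, the sigmoid activation, and the weights fixed at 1.0 are all assumptions made for this example. Since the derivative of the sigmoid never exceeds 0.25, every extra layer in this setup multiplies the gradient of the first weight by a factor of at most 0.25.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def first_weight_gradient(n_layers, x=1.0, w=1.0):
    """Gradient of the output w.r.t. the first weight in a chain of
    n_layers scalar layers, each computing a = sigmoid(w * a_prev).
    Hypothetical helper, for illustration only."""
    # Forward pass: inputs[0] is x, inputs[k] is the output of layer k
    inputs = [x]
    for _ in range(n_layers):
        inputs.append(sigmoid(w * inputs[-1]))

    # Backward pass: multiply the local derivatives from the last layer
    # down to the second one (chain rule)
    grad = 1.0
    for layer in range(n_layers, 1, -1):
        a = inputs[layer]
        grad *= a * (1.0 - a) * w  # d a_layer / d a_(layer-1)

    # Finally, the derivative of the first layer's output w.r.t. its weight
    a1 = inputs[1]
    grad *= a1 * (1.0 - a1) * inputs[0]
    return grad


lr = 1e-2
for depth in (1, 5, 10):
    grad = first_weight_gradient(depth)
    print(f"depth={depth:2d}  gradient={grad:.3e}  update={lr * grad:.3e}")
```

In this toy chain, the update applied to the first weight shrinks roughly geometrically with depth, which is the "barely change" behavior described above.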

Why does it happen?



## Setup

In [1]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    # Use a context manager so the file is closed after writing
    with open('config.py', 'wb') as f:
        f.write(r.content)
except ModuleNotFoundError:
    pass

from config import *
config_chapterextra()
# This is needed to render the plots in this chapter
from plots.chapterextra import *

Downloading files from GitHub repo to Colab...
Finished!


In [2]:
import torch
import torch.optim as optim
import torch.nn as nn
from sklearn.datasets import make_regression

from torch.utils.data import DataLoader, TensorDataset
from stepbystep.v3 import StepByStep

from data_generation.ball import load_data

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

## Ball Dataset

Let’s use a dataset of 1,000 random points drawn from a ten-dimensional ball, such
that each feature has zero mean and unit standard deviation.

In this dataset, points situated within half of the radius of the ball are
labeled as negative cases, while the remaining points are labeled as positive cases.

In [3]:
X, y = load_data(n_points=1000, n_dims=10)

In [4]:
ball_dataset = TensorDataset(torch.as_tensor(X).float(), torch.as_tensor(y).float())
ball_loader = DataLoader(ball_dataset, batch_size=len(X))
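
Setting `batch_size=len(X)` means the loader yields the whole dataset as a single batch, so each epoch performs exactly one parameter update. The short sketch below illustrates this with stand-in random tensors (the names `X_demo`, `y_demo`, and the batch size of 32 are assumptions for this example, not part of the original notebook).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data with the same shape as the ball dataset (1,000 points, 10 features)
X_demo = torch.randn(1000, 10)
y_demo = torch.randint(0, 2, (1000, 1)).float()
demo_dataset = TensorDataset(X_demo, y_demo)

# Full-batch loader: one batch containing every point
full_batch_loader = DataLoader(demo_dataset, batch_size=len(demo_dataset))
# Mini-batch loader, for comparison
mini_batch_loader = DataLoader(demo_dataset, batch_size=32)

# The length of a loader is its number of batches per epoch
print(len(full_batch_loader), len(mini_batch_loader))

x_batch, y_batch = next(iter(full_batch_loader))
print(x_batch.shape, y_batch.shape)
```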

## Block Model

To illustrate the vanishing gradients problem, we need a deeper model.

Let’s call it the "block" model: It is a block of several hidden
layers (and activation functions) stacked together, every layer containing the same
number of hidden units (neurons).

In [5]:
torch.manual_seed(11)

n_layers = 5
n_features = X.shape[1]
hidden_units = 100
activation_fn = nn.ReLU

model = build_model(n_features, n_layers, hidden_units, activation_fn, use_bn=False)

In [6]:
print(model)

Sequential(
  (h1): Linear(in_features=10, out_features=100, bias=True)
  (a1): ReLU()
  (h2): Linear(in_features=100, out_features=100, bias=True)
  (a2): ReLU()
  (h3): Linear(in_features=100, out_features=100, bias=True)
  (a3): ReLU()
  (h4): Linear(in_features=100, out_features=100, bias=True)
  (a4): ReLU()
  (h5): Linear(in_features=100, out_features=100, bias=True)
  (a5): ReLU()
  (o): Linear(in_features=100, out_features=1, bias=True)
)
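
The `build_model` helper is not defined in this section. As a minimal sketch, a model with the printed structure above could be assembled as follows, assuming plain `Linear` layers, no batch normalization, and the layer names `h1`, `a1`, ..., `o` seen in the output. The function name `build_block_model` is hypothetical, and the actual helper may differ.

```python
import torch.nn as nn


def build_block_model(n_features, n_layers, hidden_units, activation_fn):
    """Stack n_layers hidden layers of equal width, each followed by an
    activation, and finish with a single-logit output layer.
    Hypothetical stand-in for the notebook's build_model helper."""
    model = nn.Sequential()
    in_features = n_features
    for i in range(1, n_layers + 1):
        model.add_module(f"h{i}", nn.Linear(in_features, hidden_units))
        model.add_module(f"a{i}", activation_fn())
        in_features = hidden_units
    # Output layer: one logit for binary classification
    model.add_module("o", nn.Linear(in_features, 1))
    return model


block_sketch = build_block_model(
    n_features=10, n_layers=5, hidden_units=100, activation_fn=nn.ReLU
)
print(block_sketch)
```

Passing the activation as a class (such as `nn.ReLU`) rather than an instance lets each hidden layer get its own activation module.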


In [7]:
# We’re only missing a loss function and an optimizer
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)
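
With a loss function in place, one way to check for vanishing gradients is to run a single backward pass and compare the size of the gradients layer by layer. The sketch below does this on stand-in data. It is only an illustration: the random inputs, the sigmoid activation (the model above uses ReLU), and the names `demo_model` and `grad_means` are assumptions for this example. In a deep sigmoid network with default initialization, the layers closest to the input typically show the smallest gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in data: 64 points with 10 features and binary labels
X_demo = torch.randn(64, 10)
y_demo = (X_demo.sum(dim=1, keepdim=True) > 0).float()

# Five hidden layers of 100 units, each followed by a sigmoid
layers = []
in_features = 10
for _ in range(5):
    layers += [nn.Linear(in_features, 100), nn.Sigmoid()]
    in_features = 100
layers.append(nn.Linear(in_features, 1))
demo_model = nn.Sequential(*layers)

# One forward and backward pass to populate the .grad attributes
loss = nn.BCEWithLogitsLoss()(demo_model(X_demo), y_demo)
loss.backward()

# Mean absolute gradient of each Linear layer's weights, first layer to last
grad_means = {
    name: param.grad.abs().mean().item()
    for name, param in demo_model.named_parameters()
    if name.endswith("weight")
}
for name, value in grad_means.items():
    print(f"{name}: {value:.2e}")
```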

## Weights, Activations, and Gradients

In [None]:
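from torchvision.models import resnet18

torch.manual_seed(42)

# weights="DEFAULT" replaces the deprecated weights=True
model = resnet18(weights="DEFAULT")
# New "top" layer with three output classes
model.fc = nn.Linear(512, 3)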

There is no freezing since fine-tuning entails the training of all the weights, not only
those from the "top" layer.

In [None]:
multi_loss_fn = nn.CrossEntropyLoss(reduction="mean")
optimizer_model = optim.Adam(model.parameters(), lr=3e-4)

We have everything set to train.

In [None]:
sbs_transfer = StepByStep(model, multi_loss_fn, optimizer_model)
sbs_transfer.set_loaders(train_loader, val_loader)
sbs_transfer.train(1)

Let’s see what the model can accomplish after training for a single epoch.

In [None]:
StepByStep.loader_apply(val_loader, sbs_transfer.correct)

tensor([[124, 124],
        [124, 124],
        [124, 124]])

If we had frozen the layers in the model above, it would have been a case of
feature extraction that is still compatible with data augmentation, since we would
be training the "top" layer while it was attached to the rest of the model.

## Feature Extraction

Here, we modify the model (replacing the "top" layer with an identity layer) to
generate a dataset of features first, and then use that dataset to train the real
"top" layer independently.

In [None]:
# Model Configuration
model = resnet18(weights="DEFAULT").to(device)
model.fc = nn.Identity()  # the model now outputs 512 features per image
freeze_model(model)
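
The `freeze_model` helper is not defined in this section. A minimal sketch, assuming it simply turns off gradient tracking for every parameter, might look like the code below. The names `freeze_parameters`, `count_trainable`, and `stand_in` are hypothetical, and a small stand-in network is used instead of ResNet18 so the example runs without downloading weights.

```python
import torch.nn as nn


def freeze_parameters(model):
    """Disable gradient tracking for every parameter of the model."""
    for parameter in model.parameters():
        parameter.requires_grad = False


def count_trainable(model):
    """Number of individual weights that would be updated during training."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Small stand-in network (hypothetical, not the pretrained model)
stand_in = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 3))

before = count_trainable(stand_in)
freeze_parameters(stand_in)
after = count_trainable(stand_in)
print(before, after)
```

Since an `nn.Identity` layer has no parameters of its own, a model whose "top" layer was replaced by it has nothing left to train once it is frozen.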

In [None]:
# Data Preparation — Preprocessing
train_preproc = preprocessed_dataset(model, train_loader)
val_preproc = preprocessed_dataset(model, val_loader)
train_preproc_loader = DataLoader(train_preproc, batch_size=16, shuffle=True)
val_preproc_loader = DataLoader(val_preproc, batch_size=16)
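
The `preprocessed_dataset` helper is also not shown here. The sketch below illustrates the general idea under stated assumptions: run every batch through the frozen model without tracking gradients, collect the outputs and labels, and wrap them in a `TensorDataset`. The function name `extract_features`, the tiny stand-in model, and the random "images" are all hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def extract_features(model, loader, device="cpu"):
    """Pass every batch through the model and return a dataset of
    (features, labels). Hypothetical stand-in for preprocessed_dataset."""
    model.eval()
    model.to(device)
    features, labels = [], []
    with torch.no_grad():  # no gradients needed, the model is not trained
        for x, y in loader:
            features.append(model(x.to(device)).cpu())
            labels.append(y)
    return TensorDataset(torch.cat(features), torch.cat(labels))


torch.manual_seed(0)

# Stand-in "backbone" producing 512 features per image
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 512))
# Stand-in data: 20 random 3x8x8 images with labels from three classes
images = torch.randn(20, 3, 8, 8)
targets = torch.randint(0, 3, (20,))
image_loader = DataLoader(TensorDataset(images, targets), batch_size=8)

feature_dataset = extract_features(backbone, image_loader)
print(feature_dataset.tensors[0].shape, feature_dataset.tensors[1].shape)
```

Because the features are computed once and stored, the expensive model only runs a single pass over the data, and the "top" layer can then be trained on the much smaller feature tensors.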

Once the dataset of features and its corresponding loaders are ready, we only need
to create a model corresponding to the "top" layer and train it in the usual way.

In [None]:
# Model Configuration — Top Model
torch.manual_seed(42)

top_model = nn.Sequential(nn.Linear(512, 3))

multi_loss_fn = nn.CrossEntropyLoss(reduction="mean")
optimizer_top = optim.Adam(top_model.parameters(), lr=3e-4)

In [None]:
# Model Training — Top Model
sbs_top = StepByStep(top_model, multi_loss_fn, optimizer_top)
sbs_top.set_loaders(train_preproc_loader, val_preproc_loader)
sbs_top.train(10)

In [None]:
# We surely can evaluate the model now
StepByStep.loader_apply(val_preproc_loader, sbs_top.correct)

tensor([[ 98, 124],
        [124, 124],
        [104, 124]])
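
Assuming each row of this tensor corresponds to one class and holds the number of correct predictions followed by the number of points in that class, accuracy figures can be derived as in the sketch below (the variable names are illustrative).

```python
import torch

# Result copied from the output above: one row per class,
# assumed to be (correct predictions, total points)
results = torch.tensor([[98, 124],
                        [124, 124],
                        [104, 124]])

# Accuracy for each class
per_class_accuracy = results[:, 0] / results[:, 1]
# Accuracy over the whole validation set
overall_accuracy = results[:, 0].sum() / results[:, 1].sum()

print(per_class_accuracy)
print(overall_accuracy)
```

Under that assumption, 326 out of 372 validation points are classified correctly, which is roughly 87.6%.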

But if we want to try it out on the original dataset (containing the images), we need to reattach the "top" layer.

In [None]:
model.fc = top_model
sbs_temp = StepByStep(model, None, None)

In this case, both the loss function and the optimizer are set to None since we
won’t be training the model anymore.

In [None]:
StepByStep.loader_apply(val_loader, sbs_temp.correct)

tensor([[ 98, 124],
        [124, 124],
        [104, 124]])

We got the same results, as expected.
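
The reason the results match can be checked on a small scale: applying the "top" layer to precomputed features is the same computation as running the inputs through the model with that layer reattached. The sketch below uses a hypothetical stand-in class, `TinyBackbone`, with an `fc` attribute like the one used above. It is an illustration under these assumptions, not the pretrained model itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained model with an `fc` "top" layer."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 512), nn.ReLU())
        self.fc = nn.Identity()

    def forward(self, x):
        return self.fc(self.body(x))


backbone = TinyBackbone().eval()
top = nn.Sequential(nn.Linear(512, 3)).eval()
images = torch.randn(4, 3, 8, 8)

with torch.no_grad():
    # Two-step route: extract features first, then apply the "top" layer
    features = backbone(images)
    logits_from_features = top(features)

    # One-step route: reattach the "top" layer and feed the images directly
    backbone.fc = top
    logits_end_to_end = backbone(images)

print(torch.allclose(logits_from_features, logits_end_to_end))
```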