<a href="https://colab.research.google.com/github/whitestones011/deep_learning/blob/master/pytorch_base_gradients.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vanishing & exploding gradients

*Vanishing* -> gradients shrink toward zero as they are propagated backward through the network, so the earlier layers receive only tiny updates and the model stops learning.

*Exploding* -> gradients grow larger and larger during backpropagation, producing very large parameter updates, so training diverges.
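The vanishing case is easy to observe on a toy model. The sketch below is only an illustration: the 10-layer sigmoid stack, the layer width, and the squared-output loss are assumptions chosen to make the effect visible, not part of the original notebook. It prints the gradient norm of each linear layer after one backward pass.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model: 10 small sigmoid layers stacked on top of each other
depth, width = 10, 16
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*layers)

x = torch.randn(32, width)
loss = model(x).pow(2).mean()  # arbitrary scalar loss, only used to get gradients
loss.backward()

# Gradient norm of each linear layer's weights, from the earliest layer to the last
grad_norms = [m.weight.grad.norm().item() for m in model if isinstance(m, nn.Linear)]
for i, g in enumerate(grad_norms, start=1):
    print(f"layer {i:2d}: grad norm = {g:.2e}")
```

In this setup the norms should drop by orders of magnitude toward the first layer, which is the pattern the definition above describes.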

Solutions for unstable gradients:

* Weight initialization;

* Activation function;

* Batch normalization.

## Weight initialization

A good weight initialization ensures that:

* The variance of a layer's outputs is comparable to the variance of its inputs;

* The variance of the gradients is the same before and after a layer.
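The first property can be checked numerically. The following sketch assumes a plain stack of 20 ReLU layers without biases (a setup invented for this illustration) and compares He-style weights with standard deviation `sqrt(2 / fan_in)` against a naive fixed standard deviation of 0.01. It covers only the forward pass, not the gradient property.

```python
import math
import torch

torch.manual_seed(0)

def forward_stds(weight_std_fn, depth=20, width=256, batch=512):
    """Push random data through `depth` ReLU layers and record each output std."""
    x = torch.randn(batch, width)
    stds = []
    for _ in range(depth):
        w = torch.randn(width, width) * weight_std_fn(width)
        x = torch.relu(x @ w)
        stds.append(x.std().item())
    return stds

# He/Kaiming normal: std = sqrt(2 / fan_in)
he = forward_stds(lambda fan_in: math.sqrt(2.0 / fan_in))
# Naive choice: small fixed std, independent of the layer width
naive = forward_stds(lambda fan_in: 0.01)

print(f"He    : first layer std = {he[0]:.3f}, last layer std = {he[-1]:.3f}")
print(f"naive : first layer std = {naive[0]:.3e}, last layer std = {naive[-1]:.3e}")
```

With the He scaling the activation scale should stay in the same range from the first layer to the last, while the naive scaling should collapse toward zero.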


The choice of initialization method depends on the activation function.

For example: ReLU -> He/Kaiming initialization.
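One way to express this dependency in code is through the gain that `nn.init.calculate_gain` returns for each nonlinearity. The helper `init_linear` below is hypothetical, and pairing tanh or sigmoid with Xavier/Glorot initialization is a common convention assumed here rather than something the notebook states.

```python
import torch
from torch import nn

# Gains PyTorch recommends for a few common nonlinearities
for name in ["linear", "sigmoid", "tanh", "relu"]:
    print(f"{name:8s} gain = {nn.init.calculate_gain(name):.4f}")

def init_linear(layer: nn.Linear, activation: str) -> None:
    """Hypothetical helper: pick an initializer based on the activation that follows."""
    if activation in ("relu", "leaky_relu"):
        # He/Kaiming initialization, designed for ReLU-like activations
        nn.init.kaiming_uniform_(layer.weight, nonlinearity=activation)
    else:
        # Xavier/Glorot initialization (assumed convention for tanh and sigmoid)
        nn.init.xavier_uniform_(layer.weight, gain=nn.init.calculate_gain(activation))
    nn.init.zeros_(layer.bias)  # zero biases are an assumption of this sketch

torch.manual_seed(0)
relu_layer = nn.Linear(64, 32)
tanh_layer = nn.Linear(64, 32)
init_linear(relu_layer, "relu")
init_linear(tanh_layer, "tanh")
```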

In [55]:
import torch
from torch import nn

In [44]:
torch.manual_seed(0)
# default
linear = nn.Linear(2, 4)
linear.weight

Parameter containing:
tensor([[-0.0053,  0.3793],
        [-0.5820, -0.5204],
        [-0.2723,  0.1896],
        [-0.0140,  0.5607]], requires_grad=True)

In [47]:
# kaiming
torch.manual_seed(0)
nn.init.kaiming_uniform_(linear.weight)

Parameter containing:
tensor([[-0.0130,  0.9291],
        [-1.4256, -1.2747],
        [-0.6671,  0.4645],
        [-0.0343,  1.3733]], requires_grad=True)
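The two printed tensors differ only in scale, which follows from the sampling bounds. As a sketch: the default `nn.Linear` initialization draws from a uniform distribution bounded by `1 / sqrt(fan_in)`, while `kaiming_uniform_` with its default gain of `sqrt(2)` uses the bound `sqrt(2) * sqrt(3 / fan_in)`. The variable names below are illustrative.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
linear = nn.Linear(2, 4)  # fan_in = 2, as in the cells above
fan_in = linear.in_features

# Default nn.Linear initialization: uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)]
default_bound = 1.0 / math.sqrt(fan_in)
default_max = linear.weight.abs().max().item()

# kaiming_uniform_ with its default gain of sqrt(2): bound = sqrt(2) * sqrt(3 / fan_in)
nn.init.kaiming_uniform_(linear.weight)
kaiming_bound = math.sqrt(2.0) * math.sqrt(3.0 / fan_in)
kaiming_max = linear.weight.abs().max().item()

print(f"default: max |w| = {default_max:.4f} <= bound {default_bound:.4f}")
print(f"kaiming: max |w| = {kaiming_max:.4f} <= bound {kaiming_bound:.4f}")
```

For `fan_in = 2` the bounds are about 0.707 and 1.732, consistent with the largest magnitudes in the outputs above (0.5820 and 1.4256).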

In [None]:
# Example: a small binary classifier with He-initialized linear layers
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

        # Apply He initialization
        nn.init.kaiming_uniform_(self.fc1.weight)
        nn.init.kaiming_uniform_(self.fc2.weight)
        # The output layer feeds a sigmoid, so use the sigmoid gain here
        nn.init.kaiming_uniform_(self.fc3.weight, nonlinearity="sigmoid")

    def forward(self, x):
        # ELU activations in the hidden layers (used here in place of ReLU)
        x = nn.functional.elu(self.fc1(x))
        x = nn.functional.elu(self.fc2(x))
        # Sigmoid squashes the single output into (0, 1)
        x = torch.sigmoid(self.fc3(x))
        return x
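The per-layer calls above can also be written once and applied to every linear layer with `nn.Module.apply`. This is only a sketch: the `init_weights` helper is hypothetical, it zeroes the biases (an assumption), and for simplicity it uses the same default gain for every layer, including the output layer that the class above initializes with `nonlinearity="sigmoid"`.

```python
import torch
from torch import nn

def init_weights(module: nn.Module) -> None:
    """Hypothetical helper: He-initialize every linear layer it is applied to."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight)
        nn.init.zeros_(module.bias)  # assumption of this sketch

torch.manual_seed(0)
# Same layer sizes as the Net class above, written as a Sequential for brevity
model = nn.Sequential(
    nn.Linear(9, 16), nn.ELU(),
    nn.Linear(16, 8), nn.ELU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)
model.apply(init_weights)  # visits every submodule recursively

x = torch.randn(4, 9)      # batch of 4 samples with 9 features each
y = model(x)
print(y.shape)             # torch.Size([4, 1])
```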

## Batch normalization
Batch normalization tends to accelerate training convergence and helps protect the model against vanishing and exploding gradients.

It normalizes a layer's output in three steps:

* Subtract the batch mean;
* Divide by the batch standard deviation;
* Scale and shift the normalized values using learned parameters.
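The three steps can be written out by hand and compared with `nn.BatchNorm1d` in training mode. This is a minimal sketch on random data. The batch size, feature count, and input scaling are arbitrary choices for the illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(32, 8) * 3.0 + 5.0   # 32 samples, 8 features, far from zero mean and unit std

bn = nn.BatchNorm1d(8)
bn.train()                            # training mode: normalize with the statistics of this batch

# Manual version of the three steps
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)    # batch norm normalizes with the biased variance
x_hat = (x - mean) / torch.sqrt(var + bn.eps)  # subtract the mean, divide by the std
manual = x_hat * bn.weight + bn.bias  # scale (gamma) and shift (beta), both learnable

print(torch.allclose(manual, bn(x), atol=1e-4))
```

The small constant `bn.eps` is added to the variance for numerical stability. A freshly created layer starts with a scale of 1 and a shift of 0, so at this point the output is simply the normalized input.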



In [54]:
# Example: the same network with batch normalization after each hidden linear layer
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)
        # Add two batch normalization layers, sized to match the fc1 and fc2 outputs
        self.bn1 = nn.BatchNorm1d(16)
        self.bn2 = nn.BatchNorm1d(8)

        # Apply He initialization
        nn.init.kaiming_uniform_(self.fc1.weight)
        nn.init.kaiming_uniform_(self.fc2.weight)
        nn.init.kaiming_uniform_(self.fc3.weight, nonlinearity="sigmoid")

    def forward(self, x):
        # Pass x through the first set of layers: linear -> batch norm -> ELU
        x = self.fc1(x)
        x = self.bn1(x)
        x = nn.functional.elu(x)

        # Pass x through the second set of layers
        x = self.fc2(x)
        x = self.bn2(x)
        x = nn.functional.elu(x)

        # Output layer: sigmoid squashes the single output into (0, 1)
        x = torch.sigmoid(self.fc3(x))
        return x
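A practical consequence of adding batch normalization is that the model behaves differently in training and evaluation mode. The sketch below uses a hypothetical compact stand-in for the first block of the class above (linear, batch norm, ELU) and compares the output for the same sample when it is part of two different batches.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical compact stand-in for the first block of the Net class above
block = nn.Sequential(nn.Linear(9, 16), nn.BatchNorm1d(16), nn.ELU())
x = torch.randn(8, 9)

# Training mode: each sample is normalized with the statistics of its batch,
# so the output for sample 0 depends on which other samples are in the batch.
block.train()
train_full = block(x)[0]
train_half = block(x[:4])[0]

# Evaluation mode: the running mean and variance are used instead,
# so the output for sample 0 does not depend on the rest of the batch.
block.eval()
with torch.no_grad():
    eval_full = block(x)[0]
    eval_half = block(x[:4])[0]

print("train outputs match:", torch.allclose(train_full, train_half, atol=1e-5))
print("eval outputs match: ", torch.allclose(eval_full, eval_half, atol=1e-5))
```

For this reason it is usual to call `model.eval()` before validation or inference and `model.train()` before resuming training.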