# 5. Dropout Regularization

In [3]:
import time
import math
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torchvision
from torch import nn
from torch.utils import data
from torchvision import transforms

## Dropout

A good model that **does not overfit** the training data should **not be sensitive** to tiny changes in the inputs. 

Therefore, we can inject **random noise** into each layer of a neural network during training to reduce overfitting. This method is called **`dropout`** because the noise effectively **drops out (sets to zero) some neurons** of the current layer.

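The noise must be added in an **unbiased** manner, so that the **expected output** of each layer remains the same. For example, adding zero-mean Gaussian noise to an input $x$ gives $x' = x + \epsilon$ with $E[x'] = x$, where: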

$$\epsilon \sim \mathcal{N}(0,\sigma^2)$$

In dropout, each **activation** $h$ of a layer is replaced by a random variable $h'$:

$$
\begin{aligned}
h' =
\begin{cases}
    0 & \text{with probability } p \\
    \frac{h}{1-p} & \text{otherwise}
\end{cases}
\end{aligned}
$$

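where $p$ is the **probability** that a neuron is dropped. Surviving activations are scaled up by $\frac{1}{1-p}$, so the expectation is unchanged: $E[h'] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h$.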
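The rule above can be sketched directly in PyTorch. This is a minimal illustration, not the implementation behind `nn.Dropout`; the function name `dropout_layer` and the test tensor are assumptions made for the example.

```python
import torch

def dropout_layer(h: torch.Tensor, p: float) -> torch.Tensor:
    """Zero each element of h with probability p and rescale the survivors."""
    assert 0.0 <= p <= 1.0
    if p == 1.0:
        return torch.zeros_like(h)  # every activation is dropped
    if p == 0.0:
        return h                    # nothing is dropped
    # mask is 1 with probability 1 - p and 0 with probability p
    mask = (torch.rand_like(h) > p).float()
    return mask * h / (1.0 - p)

torch.manual_seed(0)
h = torch.ones(100_000)             # hypothetical activations, all equal to 1
h_prime = dropout_layer(h, p=0.5)

print(h_prime.mean())    # close to 1, consistent with E[h'] = h
print(h_prime.unique())  # only two values occur: 0 and 1 / (1 - p)
```

Dividing by $1-p$ during training means no rescaling is needed later when dropout is switched off.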

![](http://d2l.ai/_images/dropout2.svg)

With dropout, the model can no longer **depend too heavily on** any single neuron.

Dropout is sometimes also applied at **test time** to evaluate the **stability** of a neural network model. A model is considered more stable if its predictions remain similar under different random dropout masks.
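One way to probe this kind of stability is to keep dropout active, run several forward passes on the same inputs, and look at how much the predictions vary. The sketch below assumes a small toy model and a random batch, both hypothetical; it uses `model.train()` to keep dropout on, which is only safe here because the model has no other mode-dependent layers such as batch normalization.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy classifier and input batch, for illustration only
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(),
                      nn.Dropout(0.5), nn.Linear(16, 3))
x = torch.randn(8, 4)

model.train()  # keep dropout active during prediction
with torch.no_grad():
    # 20 stochastic forward passes, each with a fresh dropout mask
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(20)])

mean_probs = probs.mean(dim=0)  # averaged prediction per example, shape (8, 3)
spread = probs.std(dim=0)       # variation across masks, shape (8, 3)
print(spread.max())             # smaller values suggest more stable predictions
```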

## Implementing Dropout

In [4]:
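# Dropout is placed after the activation of each hidden layer
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Dropout(0.2),   # drop 20% of the first hidden layer
                    nn.Linear(256, 256),
                    nn.ReLU(),
                    nn.Dropout(0.5),   # drop 50% of the second hidden layer
                    nn.Linear(256, 10))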

In [5]:
def init_weights(m):
    # Initialize the weights of every linear layer from N(0, 0.01^2)
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)

In [6]:
net.apply(init_weights)

Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=256, bias=True)
  (2): ReLU()
  (3): Dropout(p=0.2, inplace=False)
  (4): Linear(in_features=256, out_features=256, bias=True)
  (5): ReLU()
  (6): Dropout(p=0.5, inplace=False)
  (7): Linear(in_features=256, out_features=10, bias=True)
)
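`nn.Dropout` only drops activations while the module is in training mode; after `net.eval()` it passes its input through unchanged. The sketch below rebuilds the same network so that it runs on its own and compares repeated forward passes in both modes. The input batch `X` is a hypothetical stand-in for 28x28 images.

```python
import torch
from torch import nn

torch.manual_seed(0)

net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
                    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.5),
                    nn.Linear(256, 10))
X = torch.randn(2, 1, 28, 28)  # hypothetical batch of two images

net.train()                    # dropout active: each pass samples new masks
with torch.no_grad():
    train_same = torch.equal(net(X), net(X))

net.eval()                     # dropout disabled: the forward pass is deterministic
with torch.no_grad():
    out = net(X)
    eval_same = torch.equal(out, net(X))

print(train_same, eval_same)
```

Forgetting to call `net.eval()` before evaluation leaves dropout on and makes test metrics noisy.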