<a href="https://colab.research.google.com/github/staerkjoe/NLP_colab/blob/main/E04_dropout_mlp_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Task 4.1: Theory and Practice of Regularization - Dropout

ITU KSAMLDS1KU - Advanced Machine Learning for Natural Language Processing 2025

by Stefan Heinrich, Eisuke Okuda, Jonathan Tiedchen,
& material by Kevin Murphy and Chris Bishop.

This notebook is based on http://d2l.ai/chapter_multilayer-perceptrons/dropout.html and further adaptations by Kevin Murphy in https://colab.research.google.com/github/probml/probml-notebooks/blob/main/notebooks/dropout_MLP_torch.ipynb

All info and static material: https://learnit.itu.dk/course/view.php?id=3024752

-------------------------------------------------------------------------------

Note: A major difficulty in deep learning is that hyperparameters depend on each other in a high-dimensional way. A setting that works for one specific task is often brittle, and it is not guaranteed to work well on a (slightly) changed task. We therefore also test different random seeds.
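
Testing several random seeds, as the note suggests, means reseeding every random number generator involved before each run. A minimal sketch, assuming a hypothetical helper `set_seed` (not part of the original notebook) and an arbitrary choice of seeds:

```python
import random

import numpy as np
import torch


def set_seed(seed):
    """Hypothetical helper: seed the Python, NumPy and PyTorch RNGs together."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


# Illustrative seeds; each one would normally wrap a full training run.
draws = {}
for seed in (0, 1, 2):
    set_seed(seed)
    draws[seed] = torch.rand(3)  # stand-in for "initialize and train a model"

# Reseeding with the same value reproduces the same random draw.
set_seed(0)
repeat = torch.rand(3)
print(torch.equal(draws[0], repeat))
```

Reporting the spread of the final accuracy over such seeds gives a rough idea of how brittle a hyperparameter setting is.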


In [1]:
# @title #### Import dependencies

import numpy as np
import matplotlib.pyplot as plt
import math
import torch
from torch import nn
from torch.nn import functional as F

np.random.seed(seed=0)

!mkdir -p figures # for saving plots; -p avoids an error if the folder already exists

!wget https://raw.githubusercontent.com/d2l-ai/d2l-en/master/d2l/torch.py -q -O d2l.py
import d2l


In [None]:
# Use a GPU if one is available; otherwise fall back to the CPU.
# You can also set a fixed device here.
def try_gpu(i=0):
    """Return gpu(i) if it exists, otherwise return cpu()."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

device = try_gpu()

#### Add dropout layer by hand to an MLP
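
The hand-written layer in this section follows the scheme commonly called inverted dropout. As a short derivation (the symbols $h$, $h'$, $m$ and $p$ are introduced here for illustration and do not appear in the notebook): each activation $h$ is multiplied by a Bernoulli mask $m$ that keeps it with probability $1 - p$, and the result is rescaled by $1 / (1 - p)$. For $0 \le p < 1$ the expected activation is therefore unchanged:

$$
\begin{aligned}
h' &= \frac{m \, h}{1 - p}, \qquad m \sim \mathrm{Bernoulli}(1 - p), \\
\mathbb{E}[h'] &= \frac{\mathbb{E}[m] \, h}{1 - p} = \frac{(1 - p) \, h}{1 - p} = h .
\end{aligned}
$$

This is why no rescaling is needed at test time, when dropout is switched off.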

In [None]:
def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # In this case, all elements are dropped out
    if dropout == 1:
        return torch.zeros_like(X)
    # In this case, all elements are kept
    if dropout == 0:
        return X
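    # Sample the mask on X's own device instead of relying on the global `device`
    mask = (torch.rand(X.shape, device=X.device) > dropout).float()
    # Rescale the kept elements so that the expected value of the output equals X
    return mask * X / (1.0 - dropout)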

In [None]:
# quick test
torch.manual_seed(0)
X = torch.arange(16, dtype=torch.float32).reshape((2, 8))
X = X.to(device)
print(X)
print(dropout_layer(X, 0.))
print(dropout_layer(X, 0.5))
print(dropout_layer(X, 1.))
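
The effect of the rescaling is easier to see on a large tensor than on the 2 x 8 example. The following is a minimal sketch, assuming a stand-in function `inverted_dropout` that mirrors `dropout_layer` for `0 <= p < 1`; the tensor size and seed are arbitrary choices:

```python
import torch


def inverted_dropout(x, p):
    """Stand-in for the notebook's dropout_layer, valid for 0 <= p < 1."""
    mask = (torch.rand_like(x) > p).float()
    return mask * x / (1.0 - p)


torch.manual_seed(0)
x = torch.ones(100_000)
p = 0.5
out = inverted_dropout(x, p)

# Surviving entries are scaled to 1 / (1 - p) = 2; the rest are exactly 0.
print(torch.unique(out))
# The fraction of zeros is close to p, and the mean stays close to 1.
print((out == 0).float().mean().item(), out.mean().item())
```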


In [None]:
# A common practice is to use a lower dropout probability closer to the input layer
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2,
                 is_training=True, dropout1=0.2, dropout2=0.5):
        super(Net, self).__init__()
        self.dropout1 = dropout1
        self.dropout2 = dropout2
        self.num_inputs = num_inputs
        # `training` is the attribute that nn.Module.train() / nn.Module.eval() set
        self.training = is_training
        self.lin1 = nn.Linear(num_inputs, num_hiddens1)
        self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
        self.lin3 = nn.Linear(num_hiddens2, num_outputs)
        self.relu = nn.ReLU()

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((-1, self.num_inputs))))
        # Use dropout only when training the model
        if self.training:
            # Add a dropout layer after the first fully connected layer
            H1 = dropout_layer(H1, self.dropout1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            # Add a dropout layer after the second fully connected layer
            H2 = dropout_layer(H2, self.dropout2)
        out = self.lin3(H2)
        return out
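
`Net` keeps its mode in `self.training`, which is the same attribute that `nn.Module.train()` and `nn.Module.eval()` set. Assuming the evaluation code calls `net.eval()`, dropout is then switched off at test time without further changes. A minimal sketch with a hypothetical `TinyNet` (the layer size and the dropout probability of 0.5 are illustrative):

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical one-layer module that branches on self.training."""

    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 4)

    def forward(self, x):
        h = self.lin(x)
        if self.training:
            # Inverted dropout with p = 0.5, applied only in training mode
            mask = (torch.rand_like(h) > 0.5).float()
            h = mask * h / 0.5
        return h


torch.manual_seed(0)
net = TinyNet()
x = torch.ones(2, 4)

net.train()   # sets net.training = True
print(net.training)
net.eval()    # sets net.training = False
print(net.training)
# In evaluation mode the forward pass is deterministic.
print(torch.equal(net(x), net(x)))
```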


#### Fit to FashionMNIST

This time we use the [d2l.load_data_fashion_mnist](https://github.com/d2l-ai/d2l-en/blob/master/d2l/torch.py#L200) function to load the familiar FashionMNIST dataset. For the purpose of understanding dropout, a vision task is easier to work with than an NLP task.

In [None]:
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=256)

Fit the model using SGD with the [d2l.train_ch6](https://github.com/d2l-ai/d2l-en/blob/master/d2l/torch.py#L326) function.

In [None]:
torch.manual_seed(0)
# We pick a wide model to cause overfitting without dropout
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2,
          dropout1=0.5, dropout2=0.5)
# Note: `loss` and `trainer` are not passed to d2l.train_ch6 below, which receives `lr` directly
loss = nn.CrossEntropyLoss()
lr = 0.5
trainer = torch.optim.SGD(net.parameters(), lr=lr)
num_epochs = 10
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, device=device)
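
For readers who want to see what fitting with SGD involves without the helper, the following is a minimal sketch of a mini-batch training loop. It is not the implementation of `d2l.train_ch6`; the synthetic data, layer sizes, learning rate and the names `loss_fn`, `optimizer` and `eval_loss` are all assumptions made for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic, learnable toy data (an assumption; the notebook uses FashionMNIST).
X = torch.randn(256, 20)
y = (X[:, 0] > 0).long()

net = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Dropout(0.2), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)


def eval_loss():
    """Loss on the full toy set with dropout disabled."""
    net.eval()
    with torch.no_grad():
        return loss_fn(net(X), y).item()


initial_loss = eval_loss()
for epoch in range(30):
    net.train()                       # enable dropout for the updates
    for i in range(0, len(X), 64):    # mini-batches of 64 examples
        xb, yb = X[i:i + 64], y[i:i + 64]
        optimizer.zero_grad()
        loss_fn(net(xb), yb).backward()
        optimizer.step()
final_loss = eval_loss()
print(f'loss before: {initial_loss:.3f}, after: {final_loss:.3f}')
```

Switching between `net.train()` for the updates and `net.eval()` for the measurement is the part that matters for dropout.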

When we turn dropout off, we notice a slightly larger gap between train and test accuracy.

In [None]:
torch.manual_seed(0)
net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2,
          dropout1=0.0, dropout2=0.0)
loss = nn.CrossEntropyLoss()
trainer = torch.optim.SGD(net.parameters(), lr=lr)
num_epochs = 10
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, device=device)
#d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
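
The gap between train and test accuracy can also be measured explicitly instead of being read off the plot. A minimal sketch, assuming a hypothetical helper `accuracy` that takes any iterable of `(X, y)` batches; the toy scores and labels are made up for the check:

```python
import torch
from torch import nn


def accuracy(net, data_iter):
    """Hypothetical helper: fraction of correct predictions over (X, y) batches."""
    net.eval()   # disable dropout while evaluating
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in data_iter:
            correct += (net(X).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


# Toy check: an identity "model" whose input rows act directly as class scores.
scores = torch.tensor([[2.0, 1.0], [0.0, 3.0], [5.0, 4.0], [1.0, 0.0]])
labels = torch.tensor([0, 1, 1, 0])
toy_iter = [(scores[:2], labels[:2]), (scores[2:], labels[2:])]
print(accuracy(nn.Identity(), toy_iter))  # 3 of 4 rows are classified correctly
```

With the notebook's objects, the gap would be `accuracy(net, train_iter) - accuracy(net, test_iter)`, provided each batch is first moved to the same device as the model.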

#### Dropout using PyTorch layer

In [None]:
dropout1 = 0.5
dropout2 = 0.5
net = nn.Sequential(
    nn.Flatten(), nn.Linear(num_inputs, num_hiddens1), nn.ReLU(),
    # Add a dropout layer after the first fully connected layer
    nn.Dropout(dropout1), nn.Linear(num_hiddens1, num_hiddens2), nn.ReLU(),
    # Add a dropout layer after the second fully connected layer
    nn.Dropout(dropout2), nn.Linear(num_hiddens2, num_outputs))

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)

torch.manual_seed(0)
net.apply(init_weights);

In [None]:
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, device=device)

#### Visualize some predictions

In [None]:
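def display_predictions(net, test_iter, n=6):
    # Extract the first batch from the iterator
    X, y = next(iter(test_iter))
    # Predict in evaluation mode so that dropout is disabled
    net.eval()
    with torch.no_grad():
        y_hat = net(X.to(device))
    # Get the true and the predicted labels
    trues = d2l.get_fashion_mnist_labels(y)
    preds = d2l.get_fashion_mnist_labels(d2l.argmax(y_hat, axis=1))
    # Plot each image with its true label (top) and predicted label (bottom)
    titles = [true + '\n' + pred for true, pred in zip(trues, preds)]
    d2l.show_images(d2l.reshape(X[0:n], (n, 28, 28)), 1, n,
                    titles=titles[0:n])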

In [None]:
display_predictions(net, test_iter)