# Homework 04 - Valentina Blasone

## Deep Learning - A.A. 2020/2021

> 1. Now that you have all the tools to train an MLP with high performance on MNIST, try reaching 0-loss on the training data (with a small epsilon, e.g. 99.99% training performance -- don't worry if you overfit!).
The implementation is completely up to you. You just need to keep it an MLP without using fancy layers (e.g., keep the `Linear` layers, don't use `Conv1d` or something like this, don't use attention). You are free to use any LR scheduler or optimizer, any one of batchnorm/groupnorm, regularization methods... If you use something we haven't seen during lectures, please motivate your choice and explain (as briefly as possible) how it works.
> 2. Try reaching 0-loss on the training data with **permuted labels**. Assess the model on the test data (without permuted labels) and comment. Help yourself with [3](https://arxiv.org/abs/1611.03530).
*Tip*: To permute the labels, act on the `trainset.targets` with an appropriate torch function.
Then, you can pass this "permuted" `Dataset` to a `DataLoader` like so: `trainloader_permuted = torch.utils.data.DataLoader(trainset_permuted, batch_size=batch_size_train, shuffle=True)`. You can now use this `DataLoader` inside the training function.
Additional view for motivating this exercise: ["The statistical significance perfect linear separation", by Jared Tanner (Oxford U.)](https://www.youtube.com/watch?v=vl2QsVWEqdA).

In [1]:
import torch
import os
from torch import nn
from matplotlib import pyplot as plt

from scripts import mnist
from scripts.train_utils import accuracy, AverageMeter

## 1

In [2]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 256),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=256),
            nn.Linear(256, 32),
            nn.ReLU(),

            nn.BatchNorm1d(num_features=32),
            nn.Linear(32, 10)
        )
    
    def forward(self, X):
        return self.layers(X)
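
For a sense of the capacity of the network above, its trainable parameters can be counted by hand. This is a minimal sketch in plain Python, assuming the layer sizes of the `MLP` class above and that each `BatchNorm1d` layer has a learnable scale and shift per feature (the PyTorch default, `affine=True`). The helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    # learnable scale (gamma) and shift (beta), one of each per feature
    return 2 * n_features


# Same layer order as in the MLP class: Linear, BatchNorm1d, Linear, BatchNorm1d, Linear
layer_counts = [
    linear_params(28 * 28, 256),
    batchnorm_params(256),
    linear_params(256, 32),
    batchnorm_params(32),
    linear_params(32, 10),
]
total_params = sum(layer_counts)
print(layer_counts, total_params)
```

Under these assumptions the model has about 210k trainable parameters, roughly 3.5 per image for the 60,000 MNIST training images.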

In [3]:
def train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance, device): # note: I've added a generic performance to replace accuracy and the device
    for X, y in dataloader:
        # TRANSFER X AND y TO GPU IF SPECIFIED
        X = X.to(device)
        y = y.to(device)
        # standard training step: reset gradients, forward pass, loss, backward pass, parameter update
        optimizer.zero_grad() 
        y_hat = model(X)
        loss = loss_fn(y_hat, y)
        loss.backward()
        optimizer.step()
        acc = performance(y_hat, y)
        loss_meter.update(val=loss.item(), n=X.shape[0])
        performance_meter.update(val=acc, n=X.shape[0])
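
`AverageMeter` and `accuracy` are imported from `scripts.train_utils`, which is not shown here. Below is a minimal sketch of what such a meter might look like, assuming it keeps a sample-weighted running sum and average (consistent with the `update(val, n)` calls and the `.sum` / `.avg` attributes used in this notebook). It is an illustration, not the actual course implementation.

```python
class AverageMeter:
    """Tracks a sample-weighted running sum and average (hypothetical sketch)."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        # `val` is a per-sample mean over a minibatch of size `n`
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


meter = AverageMeter()
meter.update(val=0.5, n=256)  # full minibatch
meter.update(val=1.0, n=128)  # smaller last minibatch
print(meter.sum, meter.avg)
```

Weighting each update by `n` keeps the epoch average correct even when the last minibatch is smaller than the others.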

In [4]:
def train_model(model, dataloader, loss_fn, optimizer, num_epochs, checkpoint_loc=None, checkpoint_name="checkpoint.pt", performance=accuracy, lr_scheduler=None, epoch_start_scheduler=1, device=None):
    # added lr_scheduler

    # create the folder for the checkpoints (if it's not None)
    if checkpoint_loc is not None:
        os.makedirs(checkpoint_loc, exist_ok=True)
    
    # establish device
    if device is None:
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(f"Training on {device}")   
    
    model.to(device)
    model.train()

    # epoch loop
    for epoch in range(num_epochs):

        loss_meter = AverageMeter()
        performance_meter = AverageMeter()

        # added print for LR
        print(f"Epoch {epoch+1} --- learning rate {optimizer.param_groups[0]['lr']:.5f}")

        train_epoch(model, dataloader, loss_fn, optimizer, loss_meter, performance_meter, performance, device=device)

        print(f"Epoch {epoch+1} completed. Loss - total: {loss_meter.sum} - average: {loss_meter.avg}; Performance: {performance_meter.avg}")

        # produce checkpoint dictionary -- but only if the name and folder of the checkpoint are not None
        if checkpoint_name is not None and checkpoint_loc is not None:
            checkpoint_dict = {
                "parameters": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch
            }
            torch.save(checkpoint_dict, os.path.join(checkpoint_loc, checkpoint_name))
        
        if lr_scheduler is not None:
            if epoch >= epoch_start_scheduler:
                lr_scheduler.step()
            # or you can use a MultiStepLR with milestones=[6, 11] thus deleting the `if` construct for the epoch   

    return loss_meter.sum, performance_meter.avg
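
The comment in `train_model` mentions `MultiStepLR` as an alternative to the manual `epoch_start_scheduler` check. Below is a minimal sketch of how that scheduler behaves, assuming a dummy parameter in place of a real model and `gamma=0.1` (the value in the commented-out `StepLR` lines). The milestones are the ones from the comment.

```python
import torch

# A dummy parameter stands in for a real model (assumption for illustration)
dummy_param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([dummy_param], lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6, 11], gamma=0.1)

lrs = []
for epoch in range(13):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used during this epoch
    optimizer.step()   # the training steps for the epoch would go here
    scheduler.step()   # one scheduler step per epoch, no epoch check needed

print([round(lr, 5) for lr in lrs])
```

In this sketch the learning rate is multiplied by `gamma` once the scheduler has been stepped 6 times and again after 11 steps, so the schedule is encoded in the milestones rather than in an `if` inside the training loop.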

In [5]:
def test_model(model, dataloader, performance=accuracy, loss_fn=None, device=None):
    # establish device
    if device is None:
        device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # create an AverageMeter for the loss if passed
    if loss_fn is not None:
        loss_meter = AverageMeter()
    
    performance_meter = AverageMeter()

    model.to(device)
    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X = X.to(device)
            y = y.to(device)

            y_hat = model(X)
            loss = loss_fn(y_hat, y) if loss_fn is not None else None
            acc = performance(y_hat, y)
            if loss_fn is not None:
                loss_meter.update(loss.item(), X.shape[0])
            performance_meter.update(acc, X.shape[0])
    # get final performances
    fin_loss = loss_meter.sum if loss_fn is not None else None
    fin_perf = performance_meter.avg
    print(f"TESTING - loss {fin_loss if fin_loss is not None else '--'} - performance {fin_perf}")
    return fin_loss, fin_perf

In [6]:
minibatch_size_train = 256
minibatch_size_test = 512

trainloader, testloader, trainset, testset = mnist.get_data(batch_size_train=minibatch_size_train, batch_size_test=minibatch_size_test)

learn_rate = 0.1
num_epochs = 30

model = MLP()
loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=learn_rate, momentum=0.9)
#scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=.1)

In [7]:
train_loss, train_acc = train_model(model, trainloader, loss_fn, optimizer, num_epochs, device="cuda:0")

Training on cuda:0
Epoch 1 --- learning rate 0.10000
Epoch 1 completed. Loss - total: 15269.52510690689 - average: 0.25449208511511484; Performance: 0.92635
Epoch 2 --- learning rate 0.10000
Epoch 2 completed. Loss - total: 4758.294134140015 - average: 0.07930490223566691; Performance: 0.9756333333333334
Epoch 3 --- learning rate 0.10000
Epoch 3 completed. Loss - total: 3037.1740114688873 - average: 0.05061956685781479; Performance: 0.9846166666666667
Epoch 4 --- learning rate 0.10000
Epoch 4 completed. Loss - total: 2128.6691996455193 - average: 0.03547781999409199; Performance: 0.9891666666666666
Epoch 5 --- learning rate 0.10000
Epoch 5 completed. Loss - total: 1453.3751927614212 - average: 0.02422291987935702; Performance: 0.99265
Epoch 6 --- learning rate 0.10000
Epoch 6 completed. Loss - total: 1143.6829409003258 - average: 0.019061382348338762; Performance: 0.99425
Epoch 7 --- learning rate 0.10000
Epoch 7 completed. Loss - total: 815.8666926026344 - average: 0.01359777821004390

In [8]:
final_loss, final_perf = test_model(model, testloader, loss_fn=loss_fn)

TESTING - loss 9.037457719445229 - performance 1.0


## 2

In [9]:
import copy
# Deep-copy so the original `trainset` keeps its true labels
trainset_permuted = copy.deepcopy(trainset)
trainset_permuted.targets = trainset_permuted.targets[torch.randperm(trainset_permuted.targets.size(0))]
trainloader_permuted = torch.utils.data.DataLoader(trainset_permuted, batch_size=minibatch_size_train, shuffle=True)
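
Permuting the labels keeps the class frequencies unchanged but destroys the association between each image and its label. Below is a minimal sketch that checks both properties, using a synthetic label tensor as a stand-in for `trainset.targets` (an assumption, so the snippet runs without downloading MNIST).

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for `trainset.targets`: 60000 labels in {0, ..., 9}
targets = torch.randint(low=0, high=10, size=(60000,))

# Same operation as in the cell above: index the labels with a random permutation
permuted = targets[torch.randperm(targets.size(0))]

# Class counts are identical before and after the permutation
same_counts = torch.equal(
    torch.bincount(targets, minlength=10),
    torch.bincount(permuted, minlength=10),
)

# Fraction of samples whose label happens to be unchanged by the shuffle
agreement = (targets == permuted).float().mean().item()
print(same_counts, round(agreement, 3))
```

With ten roughly balanced classes, about one label in ten still matches its original value by chance, so a network that fits the permuted training set has to rely mostly on memorization rather than on digit shapes.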

In [10]:
learn_rate = 0.1
num_epochs = 30

model = MLP()
loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=learn_rate, momentum=0.9)
#scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=.1)

In [11]:
train_loss, train_acc = train_model(model, trainloader_permuted, loss_fn, optimizer, num_epochs, device="cuda:0")

Training on cuda:0
Epoch 1 --- learning rate 0.10000
Epoch 1 completed. Loss - total: 139515.97901916504 - average: 2.325266316986084; Performance: 0.1023
Epoch 2 --- learning rate 0.10000
Epoch 2 completed. Loss - total: 138454.85108184814 - average: 2.307580851364136; Performance: 0.10558333333333333
Epoch 3 --- learning rate 0.10000
Epoch 3 completed. Loss - total: 138118.67540740967 - average: 2.301977923456828; Performance: 0.11113333333333333
Epoch 4 --- learning rate 0.10000
Epoch 4 completed. Loss - total: 137895.93157196045 - average: 2.2982655261993408; Performance: 0.1165
Epoch 5 --- learning rate 0.10000
Epoch 5 completed. Loss - total: 137695.47243499756 - average: 2.2949245405832928; Performance: 0.12113333333333333
Epoch 6 --- learning rate 0.10000
Epoch 6 completed. Loss - total: 137525.7042236328 - average: 2.2920950703938803; Performance: 0.12551666666666667
Epoch 7 --- learning rate 0.10000
Epoch 7 completed. Loss - total: 137271.4722290039 - average: 2.2878578704833
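
The average losses in the log above stay close to the cross-entropy of a uniform guess over ten classes. A short check in plain Python, assuming ten equally likely classes:

```python
import math

num_classes = 10
# Cross-entropy of predicting probability 1/num_classes for every class
chance_loss = -math.log(1.0 / num_classes)
print(round(chance_loss, 4))  # 2.3026
```

An average loss of about 2.29 after seven epochs therefore means the model is still barely better than guessing on the permuted training labels at that point.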

In [12]:
final_loss, final_perf = test_model(model, testloader, loss_fn=loss_fn)

TESTING - loss 120631.57796859741 - performance 0.2811
