## Lab 2
### Part 2: Dealing with overfitting

Today we work with the [Fashion-MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) (*hint: it is available in `torchvision`*).

Your goal for today:
1. Train an FC (fully-connected) network that achieves >= 0.885 test accuracy.
2. Cause considerable overfitting by modifying the network (e.g. increasing the number of network parameters and/or layers) and demonstrate it in an appropriate way (e.g. plot loss and accuracy on the train and validation sets w.r.t. network complexity).
3. Try to deal with overfitting (at least partially) by using regularization techniques (Dropout/Batchnorm/...) and demonstrate the results.

__Please write a small report at the end of this file describing your ideas, attempts, and achieved results.__

*Note*: Tasks 2 and 3 are interrelated: in task 3 your goal is to make the network from task 2 less prone to overfitting. Task 1 is independent of tasks 2 and 3.

*Note 2*: We recommend using Google Colab or another machine with GPU acceleration.

In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import torchsummary
from IPython.display import clear_output
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
import numpy as np
import os


device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

In [None]:
# Helper: create the directory `path` if it does not exist yet
def mkdir(path):
    if not os.path.exists(path):
        os.mkdir(path)
        print('Directory', path, 'is created!')
    else:
        print('Directory', path, 'already exists!')

root_path = 'fmnist'
mkdir(root_path)

Directory fmnist already exists!


In [None]:
download = True
train_transform = transforms.ToTensor()
test_transform = transforms.ToTensor()


fmnist_dataset_train = torchvision.datasets.FashionMNIST(root_path,
                                                        train=True,
                                                        transform=train_transform,
                                                        target_transform=None,
                                                        download=download)
fmnist_dataset_test = torchvision.datasets.FashionMNIST(root_path,
                                                       train=False,
                                                       transform=test_transform,
                                                       target_transform=None,
                                                       download=download)

In [None]:
train_loader = torch.utils.data.DataLoader(fmnist_dataset_train,
                                           batch_size=128,
                                           shuffle=True,
                                           num_workers=2)
test_loader = torch.utils.data.DataLoader(fmnist_dataset_test,
                                          batch_size=256,
                                          shuffle=False,
                                          num_workers=2)

In [None]:
len(fmnist_dataset_test)

10000

In [None]:
for img, label in train_loader:
    print(img.shape)
#     print(img)
    print(label.shape)
    print(label.size(0))
    break

torch.Size([128, 1, 28, 28])
torch.Size([128])
128


### Task 1
Train a network that achieves $\geq 0.885$ test accuracy. It's fine to use only Linear (`nn.Linear`) layers and activations/dropout/batchnorm. Convolutional layers could help a lot here, but we will meet them a bit later.

In [None]:
class TinyNeuralNetwork(nn.Module):
    def __init__(self, input_shape=28*28, num_classes=10, input_channels=1):
        super().__init__()  # super(self.__class__, self) recurses infinitely if this class is subclassed
        self.model = nn.Sequential(
            nn.Flatten(), # This layer converts image into a vector to use Linear layers afterwards
            # Your network structure comes here
            nn.Linear(input_shape, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.Linear(128, num_classes)
        )

    def forward(self, inp):
        out = self.model(inp)
        return out

In [None]:
torchsummary.summary(TinyNeuralNetwork().to(device), (28*28,))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
           Flatten-1                  [-1, 784]               0
            Linear-2                  [-1, 256]         200,960
       BatchNorm1d-3                  [-1, 256]             512
              ReLU-4                  [-1, 256]               0
            Linear-5                  [-1, 128]          32,896
            Linear-6                   [-1, 10]           1,290
Total params: 235,658
Trainable params: 235,658
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.90
Estimated Total Size (MB): 0.91
----------------------------------------------------------------


Your experiments come here:

In [None]:
def train(model, dataloader, criterion, optimizer, num_epochs):
    model.train()
    for epoch in range(num_epochs):
        running_loss = 0.0
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(dataloader):.4f}')

def evaluate(model, dataloader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            predicted = outputs.argmax(dim=1)  # index of the largest logit = predicted class
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total
    return accuracy


In [None]:
model = TinyNeuralNetwork().to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_func = nn.CrossEntropyLoss()

In [None]:
train(model, train_loader, loss_func, opt, num_epochs=35)

Epoch [1/35], Loss: 0.8662
Epoch [2/35], Loss: 0.4947
Epoch [3/35], Loss: 0.4308
Epoch [4/35], Loss: 0.3975
Epoch [5/35], Loss: 0.3730
Epoch [6/35], Loss: 0.3541
Epoch [7/35], Loss: 0.3400
Epoch [8/35], Loss: 0.3276
Epoch [9/35], Loss: 0.3161
Epoch [10/35], Loss: 0.3047
Epoch [11/35], Loss: 0.2954
Epoch [12/35], Loss: 0.2858
Epoch [13/35], Loss: 0.2769
Epoch [14/35], Loss: 0.2689
Epoch [15/35], Loss: 0.2605
Epoch [16/35], Loss: 0.2533
Epoch [17/35], Loss: 0.2467
Epoch [18/35], Loss: 0.2396
Epoch [19/35], Loss: 0.2325
Epoch [20/35], Loss: 0.2266
Epoch [21/35], Loss: 0.2198
Epoch [22/35], Loss: 0.2144
Epoch [23/35], Loss: 0.2085
Epoch [24/35], Loss: 0.2025
Epoch [25/35], Loss: 0.1971
Epoch [26/35], Loss: 0.1909
Epoch [27/35], Loss: 0.1846
Epoch [28/35], Loss: 0.1815
Epoch [29/35], Loss: 0.1752
Epoch [30/35], Loss: 0.1700
Epoch [31/35], Loss: 0.1656
Epoch [32/35], Loss: 0.1613
Epoch [33/35], Loss: 0.1586
Epoch [34/35], Loss: 0.1524
Epoch [35/35], Loss: 0.1484


In [None]:
evaluate(model, test_loader)

0.8866

### Task 2: Overfit it.
Build a network that will overfit to this dataset. Demonstrate the overfitting in an appropriate way (e.g. plot loss and accuracy on the train and test sets w.r.t. network complexity).
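
One way to get the curves this task asks for is to record loss and accuracy on both splits after every epoch. The sketch below is a minimal version under stated assumptions: `run_epoch` and `fit_with_history` are hypothetical helper names (they are not part of this notebook or of PyTorch), and the random tensors only stand in for the Fashion-MNIST loaders so that the snippet runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def run_epoch(model, loader, criterion, device, optimizer=None):
    """One pass over `loader`: trains if an optimizer is given, otherwise evaluates."""
    training = optimizer is not None
    model.train(training)  # train(False) is equivalent to eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Weight the batch-mean loss by batch size to get a per-sample average
            total_loss += loss.item() * labels.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return total_loss / total, correct / total


def fit_with_history(model, train_loader, val_loader, criterion, optimizer,
                     num_epochs, device="cpu"):
    """Train for `num_epochs` and return per-epoch metrics for both splits."""
    history = {"train_loss": [], "train_acc": [], "val_loss": [], "val_acc": []}
    for _ in range(num_epochs):
        train_loss, train_acc = run_epoch(model, train_loader, criterion, device, optimizer)
        val_loss, val_acc = run_epoch(model, val_loader, criterion, device)
        history["train_loss"].append(train_loss)
        history["train_acc"].append(train_acc)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)
    return history


torch.manual_seed(0)
# Synthetic stand-in for Fashion-MNIST (assumption: same shapes, random content)
fake_x = torch.rand(64, 1, 28, 28)
fake_y = torch.randint(0, 10, (64,))
demo_train = DataLoader(TensorDataset(fake_x[:48], fake_y[:48]), batch_size=16, shuffle=True)
demo_val = DataLoader(TensorDataset(fake_x[48:], fake_y[48:]), batch_size=16)

demo_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32), nn.ReLU(), nn.Linear(32, 10))
demo_opt = torch.optim.SGD(demo_model.parameters(), lr=0.01, momentum=0.9)
history = fit_with_history(demo_model, demo_train, demo_val,
                           nn.CrossEntropyLoss(), demo_opt, num_epochs=3)
```

With the real loaders, plotting `history["train_loss"]` against `history["val_loss"]` on one axis (for example with `plt.plot`) should make overfitting visible as a training curve that keeps falling while the validation curve flattens or rises. Note that the training metrics here are averaged over an epoch during which the weights are still changing, so they are only an approximation of the end-of-epoch training performance.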

*Note:* you might also decrease the size of the `train` dataset to enforce overfitting and speed up the computations.
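
Shrinking the training set, as the note suggests, could be sketched with `torch.utils.data.Subset`. The helper name `make_subset_loader`, the subset size, and the seed are assumptions chosen for illustration; the random `TensorDataset` is only a stand-in so the snippet runs without downloading anything, and in the notebook `fmnist_dataset_train` would take its place.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


def make_subset_loader(dataset, num_samples, batch_size=128, seed=0):
    """Return a shuffled DataLoader over a fixed random subset of `dataset`."""
    generator = torch.Generator().manual_seed(seed)  # fixed seed -> reproducible subset
    indices = torch.randperm(len(dataset), generator=generator)[:num_samples].tolist()
    return DataLoader(Subset(dataset, indices), batch_size=batch_size, shuffle=True)


# Stand-in dataset with Fashion-MNIST-like shapes (assumption: random content)
fake_train = TensorDataset(torch.rand(1000, 1, 28, 28), torch.randint(0, 10, (1000,)))
small_loader = make_subset_loader(fake_train, num_samples=200, batch_size=50)

first_images, first_labels = next(iter(small_loader))
```

Fewer training samples make it easier for a large network to memorize the data, so the train/test gap should appear after fewer epochs.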

In [None]:
class OverfittingNeuralNetwork(nn.Module):
    def __init__(self, input_shape=28*28, num_classes=10, input_channels=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Flatten(), # This layer converts image into a vector to use Linear layers afterwards
            nn.Linear(input_shape, 4096),
            nn.ReLU(),
            nn.Linear(4096, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.Linear(128, num_classes)
        )

    def forward(self, inp):
        out = self.model(inp)
        return out

In [None]:
model = OverfittingNeuralNetwork().to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_func = nn.CrossEntropyLoss()

In [None]:
train(model, train_loader, loss_func, opt, num_epochs=35)

Epoch [1/35], Loss: 2.1524
Epoch [2/35], Loss: 1.1488
Epoch [3/35], Loss: 0.7970
Epoch [4/35], Loss: 0.6809
Epoch [5/35], Loss: 0.6115
Epoch [6/35], Loss: 0.5668
Epoch [7/35], Loss: 0.5345
Epoch [8/35], Loss: 0.5091
Epoch [9/35], Loss: 0.4882
Epoch [10/35], Loss: 0.4704
Epoch [11/35], Loss: 0.4524
Epoch [12/35], Loss: 0.4388
Epoch [13/35], Loss: 0.4234
Epoch [14/35], Loss: 0.4122
Epoch [15/35], Loss: 0.4013
Epoch [16/35], Loss: 0.3915
Epoch [17/35], Loss: 0.3829
Epoch [18/35], Loss: 0.3745
Epoch [19/35], Loss: 0.3698
Epoch [20/35], Loss: 0.3601
Epoch [21/35], Loss: 0.3545
Epoch [22/35], Loss: 0.3470
Epoch [23/35], Loss: 0.3443
Epoch [24/35], Loss: 0.3334
Epoch [25/35], Loss: 0.3305
Epoch [26/35], Loss: 0.3216
Epoch [27/35], Loss: 0.3160
Epoch [28/35], Loss: 0.3139
Epoch [29/35], Loss: 0.3072
Epoch [30/35], Loss: 0.3005
Epoch [31/35], Loss: 0.2961
Epoch [32/35], Loss: 0.2898
Epoch [33/35], Loss: 0.2870
Epoch [34/35], Loss: 0.2810
Epoch [35/35], Loss: 0.2767


In [None]:
evaluate(model, test_loader)

0.8737

In [None]:
evaluate(model, train_loader)

0.8994666666666666

In [None]:
class OverfittingNeuralNetwork(nn.Module):
    def __init__(self, input_shape=28*28, num_classes=10, input_channels=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Flatten(), # This layer converts image into a vector to use Linear layers afterwards
            nn.Linear(input_shape, 4096),
            nn.BatchNorm1d(4096),
            nn.ReLU(),
            nn.Linear(4096, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.Linear(128, num_classes)
        )

    def forward(self, inp):
        out = self.model(inp)
        return out

In [None]:
torchsummary.summary(OverfittingNeuralNetwork().to(device), (28*28,))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
           Flatten-1                  [-1, 784]               0
            Linear-2                 [-1, 4096]       3,215,360
       BatchNorm1d-3                 [-1, 4096]           8,192
              ReLU-4                 [-1, 4096]               0
            Linear-5                  [-1, 512]       2,097,664
       BatchNorm1d-6                  [-1, 512]           1,024
              ReLU-7                  [-1, 512]               0
            Linear-8                  [-1, 256]         131,328
       BatchNorm1d-9                  [-1, 256]             512
             ReLU-10                  [-1, 256]               0
           Linear-11                  [-1, 128]          32,896
           Linear-12                   [-1, 10]           1,290
Total params: 5,488,266
Trainable params: 5,488,266
Non-trainable params: 0
---------------------------

In [None]:
model = OverfittingNeuralNetwork().to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_func = nn.CrossEntropyLoss()



In [None]:
train(model, train_loader, loss_func, opt, num_epochs=35)

Epoch [1/35], Loss: 0.6402
Epoch [2/35], Loss: 0.3492
Epoch [3/35], Loss: 0.2907
Epoch [4/35], Loss: 0.2509
Epoch [5/35], Loss: 0.2183
Epoch [6/35], Loss: 0.1893
Epoch [7/35], Loss: 0.1627
Epoch [8/35], Loss: 0.1393
Epoch [9/35], Loss: 0.1194
Epoch [10/35], Loss: 0.1022
Epoch [11/35], Loss: 0.0883
Epoch [12/35], Loss: 0.0779
Epoch [13/35], Loss: 0.0626
Epoch [14/35], Loss: 0.0528
Epoch [15/35], Loss: 0.0464
Epoch [16/35], Loss: 0.0416
Epoch [17/35], Loss: 0.0381
Epoch [18/35], Loss: 0.0273
Epoch [19/35], Loss: 0.0292
Epoch [20/35], Loss: 0.0237
Epoch [21/35], Loss: 0.0193
Epoch [22/35], Loss: 0.0167
Epoch [23/35], Loss: 0.0139
Epoch [24/35], Loss: 0.0126
Epoch [25/35], Loss: 0.0145
Epoch [26/35], Loss: 0.0101
Epoch [27/35], Loss: 0.0093
Epoch [28/35], Loss: 0.0058
Epoch [29/35], Loss: 0.0048
Epoch [30/35], Loss: 0.0046
Epoch [31/35], Loss: 0.0033
Epoch [32/35], Loss: 0.0028
Epoch [33/35], Loss: 0.0026
Epoch [34/35], Loss: 0.0028
Epoch [35/35], Loss: 0.0017


In [None]:
evaluate(model, test_loader)

0.9028

In [None]:
evaluate(model, train_loader)

1.0

### Task 3: Fix it.
Fix the overfitted network from the previous step (at least partially) by using regularization techniques (Dropout/Batchnorm/...) and demonstrate the results.
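
Besides Dropout and Batchnorm, the `...` in the task leaves room for other regularizers. As one illustrative option (it is not used in the experiments in this notebook), L2 weight decay can be enabled through the `weight_decay` argument of `torch.optim.SGD`. The value `5e-4` and the small stand-in model are assumptions, not tuned settings.

```python
import torch
import torch.nn as nn

# Small stand-in model (assumption); in the notebook this would be the overfitting network
wd_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10))

# weight_decay adds weight_decay * parameter to each gradient, pulling weights toward zero
wd_opt = torch.optim.SGD(wd_model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
```

The rest of the training loop stays unchanged, which makes this an easy baseline to compare against Dropout.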

In [None]:
class FixedNeuralNetwork(nn.Module):
    def __init__(self, input_shape=28*28, num_classes=10, input_channels=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Flatten(), # This layer converts image into a vector to use Linear layers afterwards
            nn.Linear(input_shape, 4096),
            nn.BatchNorm1d(4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.Linear(128, num_classes)
        )

    def forward(self, inp):
        out = self.model(inp)
        return out

In [None]:
torchsummary.summary(FixedNeuralNetwork().to(device), (28*28,))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
           Flatten-1                  [-1, 784]               0
            Linear-2                 [-1, 4096]       3,215,360
       BatchNorm1d-3                 [-1, 4096]           8,192
              ReLU-4                 [-1, 4096]               0
           Dropout-5                 [-1, 4096]               0
            Linear-6                  [-1, 512]       2,097,664
       BatchNorm1d-7                  [-1, 512]           1,024
              ReLU-8                  [-1, 512]               0
            Linear-9                  [-1, 256]         131,328
      BatchNorm1d-10                  [-1, 256]             512
             ReLU-11                  [-1, 256]               0
           Linear-12                  [-1, 128]          32,896
           Linear-13                   [-1, 10]           1,290
Total params: 5,488,266
Trainable param

In [None]:
model = FixedNeuralNetwork().to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_func = nn.CrossEntropyLoss()

In [None]:
train(model, train_loader, loss_func, opt, num_epochs=35)

Epoch [1/35], Loss: 0.8206
Epoch [2/35], Loss: 0.4500
Epoch [3/35], Loss: 0.3923
Epoch [4/35], Loss: 0.3621
Epoch [5/35], Loss: 0.3405
Epoch [6/35], Loss: 0.3238
Epoch [7/35], Loss: 0.3119
Epoch [8/35], Loss: 0.2985
Epoch [9/35], Loss: 0.2894
Epoch [10/35], Loss: 0.2781
Epoch [11/35], Loss: 0.2701
Epoch [12/35], Loss: 0.2600
Epoch [13/35], Loss: 0.2527
Epoch [14/35], Loss: 0.2456
Epoch [15/35], Loss: 0.2365
Epoch [16/35], Loss: 0.2310
Epoch [17/35], Loss: 0.2242
Epoch [18/35], Loss: 0.2173
Epoch [19/35], Loss: 0.2108
Epoch [20/35], Loss: 0.2065
Epoch [21/35], Loss: 0.2006
Epoch [22/35], Loss: 0.1967
Epoch [23/35], Loss: 0.1898
Epoch [24/35], Loss: 0.1829
Epoch [25/35], Loss: 0.1775
Epoch [26/35], Loss: 0.1725
Epoch [27/35], Loss: 0.1683
Epoch [28/35], Loss: 0.1656
Epoch [29/35], Loss: 0.1605
Epoch [30/35], Loss: 0.1528
Epoch [31/35], Loss: 0.1492
Epoch [32/35], Loss: 0.1460
Epoch [33/35], Loss: 0.1419
Epoch [34/35], Loss: 0.1382
Epoch [35/35], Loss: 0.1340


In [None]:
evaluate(model, test_loader)

0.9055

In [None]:
evaluate(model, train_loader)

0.9786
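
To compare the three large networks on one scale, the train/test accuracy gap can be computed directly from the values printed by `evaluate`. The dictionary keys are hypothetical labels for the runs; the numbers are copied from the cell outputs in this notebook.

```python
# (train accuracy, test accuracy) copied from the evaluate() outputs in this notebook
results = {
    "overfit, no batchnorm": (0.8994666666666666, 0.8737),
    "overfit, batchnorm": (1.0, 0.9028),
    "fixed, batchnorm + dropout": (0.9786, 0.9055),
}

# Generalization gap = train accuracy minus test accuracy
gaps = {name: round(train_acc - test_acc, 4)
        for name, (train_acc, test_acc) in results.items()}

for name, gap in gaps.items():
    print(f"{name:28s} gap = {gap:.4f}")
```

Adding Dropout narrows the gap of the batchnorm network from 0.0972 to 0.0731 while test accuracy moves from 0.9028 to 0.9055. Each configuration was run only once, so differences of this size may partly reflect run-to-run noise.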

### Conclusions:
Overfitting arises when the model is more complex than the variety of dependencies present in the data. It shows up as a gap between the metrics on the train and test sets. A reasonable workflow may therefore be to overfit first and then "simplify" the model. That said, there is the grokking effect: the metrics first indicate overfitting, and then, after some number of epochs, the test metrics start to improve. This does not happen for every model, but it is worth keeping in mind. Dropout can be viewed as a rough (not exact) analogue of bootstrapping. Batchnorm, after all, is less about regularization and more about stabilizing and speeding up training.
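
The bootstrap analogy for Dropout can be made concrete: in training mode each forward pass samples a fresh random mask, which amounts to a different "thinned" network, while in evaluation mode the layer is the identity. A minimal sketch, with the input tensor and `p=0.5` chosen only for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

dropout.train()
# Each call draws a new mask; kept units are scaled by 1 / (1 - p) = 2
train_out_a = dropout(x)
train_out_b = dropout(x)

dropout.eval()
# In eval mode Dropout passes the input through unchanged
eval_out = dropout(x)

print(train_out_a)
print(train_out_b)
print(eval_out)
```

This is also why `evaluate` calls `model.eval()` first: without it, test predictions would be made by a randomly thinned network and Batchnorm would use batch statistics instead of its running averages.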