This notebook is an assignment given by [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) as part of the [Machine Learning class](https://github.com/erachelson/MLclass).

This notebook is based on [Andrew Ng's courses](https://www.coursera.org/learn/machine-learning), on [Andrew Ng's book](https://www.deeplearning.ai/content/uploads/2018/09/Ng-MLY01-12.pdf), on https://www.youtube.com/watch?v=F1ka6a13S9I&feature=youtu.be and on http://scott.fortmann-roe.com/docs/BiasVariance.html.

This notebook was written by [Robin Escallier](https://github.com/rescallier)
##### First, let's import some prerequisites while you read the introduction.

In [None]:
%matplotlib inline

import math
import time
import os
from IPython.display import Image
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as transforms
import torch.optim as optim
from torch.autograd import Variable
import tqdm

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">The bias - variance tradeoff</div>

Here we analyze the error made by machine learning predictions on a test set. To do so, we split the error into bias and variance. Increasing the size of the training data is not always a good way to decrease the error: the right approach depends on whether bias or variance is the main source of error.
1. [Splitting the error](#sec1)
    1. [Conceptual Definition](#sec1-1)
    2. [Mathematical Definition](#sec1-2)
2. [Spotting Bias or Variance](#sec2)
3. [Model Size Impact on Bias and Variance](#sec3)
4. [Other Techniques to Deal with Bias and Variance](#sec4)
    1. [Regularization](#sec4-1)
    2. [Early Stopping](#sec4-2)

<div class="alert alert-success"><b>In a nutshell:</b>
<ul>
<li> Test set errors can be split into variance and bias
<li> Variance is due to overfitting the training set
<li> Bias is spotted with the training set error
<li> Variance is spotted by comparing training error and test error
<li> While decreasing variance, there is a risk of increasing bias
<li> While decreasing bias, there is a risk of increasing variance
<li> The best way to address bias is to increase model size
<li> The best way to address variance is to get new training data
</ul>
</div>

# 1. <a id="sec1"></a>Splitting the error
Suppose that we try to predict y from X. We are going to define bias and variance in two ways:

## <a id="sec1-1"></a> 1.A Conceptual Definition

- The error due to bias is the difference between the expected (or average) prediction of our model and the true value of y. Imagine you train different models using different subsets of your full dataset. Due to randomness in the underlying data sets, the resulting models will have different predictions. Bias measures how far off, on average, these models' predictions are from the correct value.

- The error due to variance is the variability of a model's prediction for a given data point. Again, imagine you can train your model multiple times. The variance is how much the predictions for a given point vary between different trainings of the model.




We can illustrate this using a bulls-eye diagram. The center of the target is the perfect model, and each hit is a different realization of the model.
    
<img src="img/Graphical_illustration_of_bias_and_variance.png" width="800px"></img>

## <a id="sec1-2"></a> 1.B Mathematical Definition

We try to predict y using X. To do so, we suppose that there is a relationship between X and y: $y = f(X) + \epsilon$, where the error term $\epsilon$ is normally distributed with zero mean and variance $\sigma_e^2$: $\epsilon \sim N(0,\sigma_e^2)$

We estimate a model $\hat{f}(X)$ using a neural network or a linear regression for example.
Using this model, the expected squared error at a point x is: $Err(x) = E[(y-\hat{f}(x))^2]$

We can decompose this error into bias and variance: $Err(x)=(E[\hat{f}(x)]-f(x))^2+E[(\hat{f}(x)-E[\hat{f}(x)])^2]+\sigma_e^2$

$Err(x)=Bias^2+Variance+Irreducible\ error$

The irreducible error is a noise term that cannot fundamentally be reduced by any model.
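The decomposition can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions, not part of the assignment: it takes $f = \sin$, Gaussian noise, and a k-nearest-neighbour average as the model (small k plays the role of a large model, large k of a small one). The function `knn_bias_variance` and all its parameters are hypothetical names, and the training inputs are kept fixed so that only the noise changes between training sets.

```python
import math
import random
import statistics

def knn_bias_variance(k, x0=1.0, n_train=50, n_runs=2000, sigma=0.5, seed=0):
    """Estimate Bias^2 and Variance at x0 for a k-nearest-neighbour average."""
    rng = random.Random(seed)
    # Fixed training inputs (an assumption): only the noise differs between runs
    xs = [2 * math.pi * i / (n_train - 1) for i in range(n_train)]
    nearest = sorted(range(n_train), key=lambda i: abs(xs[i] - x0))[:k]
    preds = []
    for _ in range(n_runs):
        # Draw a new training set: y = f(x) + eps with f = sin
        ys = [math.sin(x) + rng.gauss(0.0, sigma) for x in xs]
        # The model's prediction at x0 is the mean target of the k nearest points
        preds.append(statistics.fmean(ys[i] for i in nearest))
    bias_sq = (statistics.fmean(preds) - math.sin(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance

for k in (1, 50):
    bias_sq, variance = knn_bias_variance(k)
    print("k=%2d  Bias^2=%.4f  Variance=%.4f" % (k, bias_sq, variance))
```

With k=1 the prediction follows the noise of a single point (low bias, variance close to the noise variance), while with k=50 it averages the whole training set (low variance, high bias). The noise variance itself (`sigma**2`) is the irreducible part.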

# 2. <a id="sec2"></a>Spotting Bias or Variance
Suppose we want to predict a variable y from X. We split our data into a training set and a test set, so we have a training set error and a test set error. We want a low error on the test set, and we look at two quantities:

- First, the algorithm’s error rate on the training set. We think of this informally as the algorithm’s bias.

- Second, how much worse the algorithm does on the test set than the training set. We think of this informally as the algorithm’s variance.
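These two informal quantities can be computed directly from the accuracies on each set. The helper below is a minimal sketch; `split_error` is a hypothetical name, and the accuracies passed to it are made-up values for illustration only.

```python
def split_error(train_accuracy, test_accuracy):
    """Split the test error into informal bias and variance estimates.

    Both accuracies are fractions between 0 and 1.
    """
    train_error = 1.0 - train_accuracy
    test_error = 1.0 - test_accuracy
    bias = train_error                   # error already present on the training set
    variance = test_error - train_error  # extra error when moving to the test set
    return bias, variance

# Made-up accuracies, for illustration only
bias, variance = split_error(train_accuracy=0.95, test_accuracy=0.90)
print("bias: %.1f%%, variance: %.1f%%" % (100 * bias, 100 * variance))
```

The function takes fractions between 0 and 1, which is the form returned by the `accuracy` function defined later in this notebook.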

To illustrate this idea, we will work on the Fashion-MNIST dataset using a neural network.

The first model that we are going to use is a simple linear model. We will look at the bias and variance of this model.

<div class="alert alert-danger">
From now until the end of the notebook, you will be asked to implement models without training them, then to look at the provided solution and to load trained models.
The models used here take too long to train within one hour.

If you want to, you can uncomment the training lines and run the full notebook, but try to do it on a GPU; it will take more than one hour.
</div>

We download and preprocess our data:

In [None]:
# We define some constants
batch_size = 100
num_epochs = 30
learning_rate = 0.001

In [None]:
# function to preprocess the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
# Load both train and test set
train_dataset = torchvision.datasets.FashionMNIST(root='../data',
                                             train=True, 
                                             transform=transform,
                                             download=True)

test_dataset = torchvision.datasets.FashionMNIST(root='../data',
                                             train=False, 
                                             transform=transform,
                                             download=True)

# Define loader to access data by batch
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=True)



In [None]:
def train(cnn,train_loader,num_epochs,optimizer,criterion):
    '''
    Train the model
    -------
    
    Param:
        cnn : torch.nn.module, model to train
        train_loader : torch.utils.data.DataLoader, loader with the data to train the model on
        num_epochs : int, number of epoch 
        optimizer : torch.optim, optimizer to use during the training
        criterion: torch.nn, loss function used here
    '''
    cnn.train()  # accuracy() switches the model to eval mode, so switch back before training
    losses = []
    for epoch in range(num_epochs):
        for i, (images, labels) in enumerate(train_loader):

            images = images.float()

            # Forward + Backward + Optimize
            optimizer.zero_grad()
            outputs = cnn(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # .item() returns a Python float and also works when training on a GPU
            losses.append(loss.item())

            if (i+1) % 100 == 0:
                print('Epoch : %d/%d, Iter : %d/%d,  Loss: %.4f'
                      % (epoch+1, num_epochs, i+1, len(train_loader), loss.item()))

In [None]:
# A function to compute accuracy of our model on a loader
def accuracy(cnn,loader,kind_of_set):
    '''
    Compute and print accuracy of the model on a dataset
    --------
    Param : 
        cnn : torch.nn.module, model used
        loader : torch.utils.data.DataLoader, loader with the data to compute the accuracy on
        kind_of_set: str, 'train' or 'test' just to print the kind of set working on 
    -------
    
    Return :
        accuracy : float, fraction of correct predictions (between 0 and 1)
    '''
    cnn.eval()
    correct = 0
    total = 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for images, labels in loader:
            outputs = cnn(images.float())
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    print('Accuracy of the model on the ' + str(total) + ' ' + kind_of_set +
          ' images: ' + str(100 * correct / total))
    return correct / total

<div class="alert alert-warning">
    
**Exercise**<br>
Implement a class which inherits from `nn.Module`. This class has to compute a simple linear model using a neural network.

</div>

In [None]:
class CNN(nn.Module):
    '''
    Define the model used
    '''
    def __init__(self):
        super(CNN, self).__init__()
        # To do initialize layers of the network
        
    def forward(self, x):
        # To do compute forward propagation of the model
        return out


Try to train the model for one epoch to see if it works

In [None]:
torch.manual_seed(0)
cnn_linear = CNN()
# CrossEntropyLoss as loss because of no softmax in the last layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn_linear.parameters(), lr=learning_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train(cnn_linear,train_loader,1,optimizer,criterion)

Load the solution, then load an already trained model

In [None]:
# %load solutions/linear_model.py
class CNN(nn.Module):
    '''
    Define the model 
    '''
    def __init__(self):
        super(CNN, self).__init__()
        
        self.fc = nn.Linear(28*28*1, 10)
        
    def forward(self, x):
        out = x.view(x.size(0), -1)
        out = self.fc(out)
        return out

In [None]:
# set the seed to get the same results at each run
torch.manual_seed(0)
cnn_linear = CNN()
# CrossEntropyLoss as loss because of no softmax in the last layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn_linear.parameters(), lr=learning_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## To train the model uncomment lines below, 
#train(cnn_linear,train_loader,num_epochs,optimizer,criterion)
#torch.save(cnn_linear, 'models/linear_model.pt')

# Load the trained model, obtained with the code above on a Google Colab GPU over 30 epochs
cnn_linear = torch.load('models/linear_model.pt')
cnn_linear.eval()
# 

accuracy_train_linear = accuracy(cnn_linear,train_loader,'train')
accuracy_test_linear  = accuracy(cnn_linear,test_loader,'test')

Here our test error is 16%, which is high. We can separate it into bias and variance:
the bias is about 13% while the variance is 4.9%, so our algorithm has a high bias. We are going to see how to address it.

# 3. <a id="sec3"></a>Model Size Impact on Bias and Variance

The first ideas to address bias and variance issues are the following:

- If the bias is high (the algorithm is underfitting), increase the size of the model

- If the variance is high (the algorithm is overfitting), add data to the training set

But we cannot get an infinite amount of data (here, for example, we cannot add data to the training set), and increasing the size of the model will eventually cause computational problems.

Increasing the size of the model can also easily lead to overfitting: this is what is called the bias-variance tradeoff. Here is a graph to illustrate it:
<img src="img/biasvariance.png" width="800px"></img>


We are now going to observe this tradeoff by increasing the size of the model and looking at the impact on both training and test error.

We are going to use a big convolutional neural network with the following structure:
- Convolutional layer with a kernel of 5, a padding of 2 and 16 output channels
- Batch normalization
- ReLU
- Max pooling by blocks of size 2×2

- Convolutional layer with a kernel of 5, a padding of 2 and 32 output channels
- Batch normalization
- ReLU
- Max pooling by blocks of size 2×2

- Convolutional layer with a kernel of 5, a padding of 2 and 64 output channels
- Batch normalization
- ReLU
- Max pooling by blocks of size 2×2

- A linear layer with no activation (not needed because we use `CrossEntropyLoss`) with an output of size 10
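To size the final linear layer, it helps to follow the spatial size of a 28×28 Fashion-MNIST image through the three blocks described above. The sketch below does this arithmetic in plain Python; `conv_output_size` and `pool_output_size` are hypothetical helper names, and the pooling rule assumes the default behaviour of a 2×2 max pooling (stride equal to the kernel size, no padding, rounding down).

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial size after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, kernel):
    """Spatial size after max pooling with stride == kernel and no padding."""
    return size // kernel

size = 28  # Fashion-MNIST images are 28x28 with 1 channel
for channels in (16, 32, 64):
    size = conv_output_size(size, kernel=5, padding=2)  # padding 2 keeps the size
    size = pool_output_size(size, kernel=2)             # pooling halves it (rounded down)
    print("%d channels -> %dx%d" % (channels, size, size))

flat_features = size * size * 64  # input size of the final linear layer
print(flat_features)
```

The spatial size goes 28 → 14 → 7 → 3, so the flattened input of the last layer has 3 × 3 × 64 = 576 features.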

<div class="alert alert-warning">
    
**Exercise**<br>
Implement a class which inherits from `nn.Module`. This class has to define the model described above.

</div>

In [None]:
class CNN(nn.Module):
    '''
    Define the model used
    '''
    def __init__(self):
        super(CNN, self).__init__()
        # To do initialize layers of the network
        
    def forward(self, x):
        # To do compute forward propagation of the model
        return out


Try to train the model for one epoch to see if it works

In [None]:
torch.manual_seed(0)
cnn_3conv = CNN()
# CrossEntropyLoss as loss because of no softmax in the last layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn_3conv.parameters(), lr=learning_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train(cnn_3conv,train_loader,1,optimizer,criterion)

Load the pretrained model

In [None]:
# %load solutions/3conv_model.py
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.layer3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.fc = nn.Linear(3*3*64, 10)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

In [None]:
# set the seed to get the same results at each run
torch.manual_seed(0)
cnn_3conv = CNN()
# CrossEntropyLoss as loss because of no softmax in the last layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn_3conv.parameters(), lr=learning_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## To train the model uncomment lines below, 
#train(cnn_3conv,train_loader,num_epochs,optimizer,criterion)
#torch.save(cnn_3conv, 'models/3steps_models.pt')

# Load the trained model, obtained with the code above on a Google Colab GPU over 30 epochs
cnn_3conv = torch.load('models/3steps_models.pt')
cnn_3conv.eval()
accuracy_train_3conv = accuracy(cnn_3conv,train_loader,'train')
accuracy_test_3conv  = accuracy(cnn_3conv,test_loader,'test')

Here our test error is 9.3%, which is a bit high. We can separate it into bias and variance:
the bias is 0.7% while the variance is 8.6%, so our algorithm has a high variance. We managed to decrease the bias of the algorithm, but the variance increased because of the size of the model.

### Reducing the size of the model
Here we have a very high variance, which is the main component of our error. To address it, we are going to decrease the size of the model.

We are going to use a smaller convolutional neural network with the following structure:
- Convolutional layer with a kernel of 5, a padding of 2 and 16 output channels
- Batch normalization
- ReLU
- Max pooling by blocks of size 2×2

- Convolutional layer with a kernel of 5, a padding of 2 and 32 output channels
- Batch normalization
- ReLU
- Max pooling by blocks of size 2×2

- A linear layer with no activation (not needed because we use `CrossEntropyLoss`) with an output of size 10

<div class="alert alert-warning">
    
**Exercise**<br>
Implement a class which inherits from `nn.Module`. This class has to define the model described above.

</div>

In [None]:
class CNN(nn.Module):
    '''
    Define the model used
    '''
    def __init__(self):
        super(CNN, self).__init__()
        # To do initialize layers of the network
        
    def forward(self, x):
        # To do compute forward propagation of the model
        return out


Try to train the model for one epoch to see if it works

In [None]:
torch.manual_seed(0)
cnn_2conv = CNN()
# CrossEntropyLoss as loss because of no softmax in the last layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn_2conv.parameters(), lr=learning_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train(cnn_2conv,train_loader,1,optimizer,criterion)

Load the pretrained model

In [None]:
%load solutions/2conv_model.py


In [None]:
# set the seed to get the same results at each run
torch.manual_seed(0)
cnn_2conv = CNN()
# CrossEntropyLoss as loss because of no softmax in the last layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn_2conv.parameters(), lr=learning_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## To train the model uncomment lines below, 
#train(cnn_2conv,train_loader,num_epochs,optimizer,criterion)
#torch.save(cnn_2conv, 'models/2steps_model.pt')

# Load the trained model, obtained with the code above on a Google Colab GPU over 30 epochs
cnn_2conv = torch.load('models/2steps_model.pt')
cnn_2conv.eval()
accuracy_train_2conv = accuracy(cnn_2conv,train_loader,'train')
accuracy_test_2conv  = accuracy(cnn_2conv,test_loader,'test')

Here our test error is 9.6%, which is a bit high. We can separate it into bias and variance:
the bias is 1.4% while the variance is 8.2%, so our algorithm still has a high variance. We managed to decrease the variance of the algorithm a bit, but not the overall error, because the smaller model has a higher bias.

### Reducing the size of the model (again)
Here we still have a very high variance, which is the main component of our error. To address it, we are going to decrease the size of the model once more.

We are going to use a smaller convolutional neural network with the following structure:
- Convolutional layer with a kernel of 5, a padding of 2 and 16 output channels
- Batch normalization
- ReLU
- Max pooling by blocks of size 2×2

- A linear layer with no activation (not needed because we use `CrossEntropyLoss`) with an output of size 10

Load the pretrained model

In [None]:
 %load solutions/1conv_model.py

In [None]:
# set the seed to get the same results at each run
torch.manual_seed(0)
cnn_1conv = CNN()
# CrossEntropyLoss as loss because of no softmax in the last layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn_1conv.parameters(), lr=learning_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## To train the model uncomment lines below, 
#train(cnn_1conv,train_loader,num_epochs,optimizer,criterion)
#torch.save(cnn_1conv, 'models/1step_model.pt')

# Load the trained model, obtained with the code above on a Google Colab GPU over 30 epochs
cnn_1conv = torch.load('models/1step_model.pt')
cnn_1conv.eval()
accuracy_train_1conv = accuracy(cnn_1conv,train_loader,'train')
accuracy_test_1conv  = accuracy(cnn_1conv,test_loader,'test')

Here our test error is 9.8%, which is a bit high. We can separate it into bias and variance:
the bias is 3.9% while the variance is 5.9%, so our algorithm still has a high variance.

We managed to decrease the variance of the algorithm a bit, but not the overall error, because the smaller model has a higher bias.

### Conclusion on the size of the model
By increasing the size of our model, we improved our overall error, but the variance increased a lot. We therefore need to decrease it, yet we have seen that slightly reducing the size of the model does not improve the overall error: the variance decreases, but the bias increases.
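The figures reported above for the three convolutional models can be put side by side to make this visible. The snippet below is just a small bookkeeping sketch: the numbers are the ones quoted in this notebook, while the dictionary layout and variable names are hypothetical.

```python
# (bias %, variance %) as reported in this notebook for the convolutional models
results = {
    "3 conv layers": (0.7, 8.6),
    "2 conv layers": (1.4, 8.2),
    "1 conv layer": (3.9, 5.9),
}

test_errors = {}
dominant = {}
for name, (bias, variance) in results.items():
    # The test error is the sum of the two informal components
    test_errors[name] = round(bias + variance, 1)
    dominant[name] = "variance" if variance > bias else "bias"
    print("%-14s bias=%.1f%%  variance=%.1f%%  test error=%.1f%%  (mainly %s)"
          % (name, bias, variance, test_errors[name], dominant[name]))
```

Going from three convolutional layers to one trades variance for bias almost one for one, which is why the test error barely moves.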

##### This is what is called the bias-variance tradeoff: some tuning choices in machine learning algorithms reduce bias errors at the cost of increasing variance, and vice versa. We have seen here that increasing the size of the model (adding neurons/layers in a neural network) reduces bias but can increase variance.

We are now going to see other ways of dealing with bias and variance.

# 4. <a id="sec4"></a>Other Techniques to Deal with Bias and Variance


### Dealing with bias

<div class="alert alert-success">
To deal with bias we can:
<ul>
<li> Increase model size
<li> Modify input features based on insights from error analysis (looking closely at your errors using information about mispredicted samples). It will not be done here because we only have images as inputs
<li> Reduce or eliminate regularization. It will not be done here because we don't have any regularization for now
<li> Modify model architecture. That's what we have done going from a linear model to convolutional neural networks
</ul>
</div>

### Dealing with variance

<div class="alert alert-success">
To deal with variance, we can:
<ul>
<li> Add more training data. We don't have more here
<li> Add regularization (L2 regularization, L1 regularization, dropout). It reduces variance but increases bias. We are going to try to implement it
<li> Add early stopping (stop gradient descent early, using the validation error). It reduces variance but increases bias. We are going to try to implement it
<li> Use feature selection to decrease the number/type of input features. It works well on small datasets, but it can increase bias
<li> Decrease the model size. We have seen it, but it increases bias
<li> Modify model architecture. It can also help to decrease variance
</ul>
</div>

Here the bias of the model with 3 convolutional layers seems good, so we are going to try to decrease its variance using 2 techniques: regularization and early stopping.


## <a id="sec4-1"></a> 4.A Regularization
Because the variance of the model with 3 convolutions is very high, we will try to decrease it using regularization. We are going to use L2 regularization.
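As a reminder of what an L2 penalty does, here is a minimal sketch on a toy one-parameter problem, not the solution of the exercise. It minimizes $\frac{1}{2}(w - target)^2 + \frac{\lambda}{2} w^2$ by gradient descent; the function `fit_with_l2`, the value of `target` and the step settings are all hypothetical choices for illustration.

```python
def fit_with_l2(target, lambda_, lr=0.1, steps=500):
    """Minimize 0.5*(w - target)**2 + 0.5*lambda_*w**2 by gradient descent."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the data term plus gradient of the L2 penalty
        grad = (w - target) + lambda_ * w
        w -= lr * grad
    return w

w_plain = fit_with_l2(target=3.0, lambda_=0.0)  # converges to target
w_reg = fit_with_l2(target=3.0, lambda_=0.5)    # converges to target / (1 + lambda_)
print(w_plain, w_reg)
```

The penalty pulls the weight toward zero, so the regularized fit matches the training target less closely (more bias) but depends less on it (less variance). In PyTorch, optimizers such as `torch.optim.Adam` expose a `weight_decay` argument that adds this kind of penalty term to the gradient.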

In [None]:
 %load solutions/3conv_model.py


Change the optimizer to add L2 regularization

In [None]:
torch.manual_seed(0)
#regularization parameter lambda
lambda_ = 0.005
cnn_regularization = CNN()
criterion = nn.CrossEntropyLoss()
# To do set the optimizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train(cnn_regularization,train_loader,1,optimizer,criterion)


Use the pretrained model

In [None]:
 %load solutions/regularization.py

Here our test error is 8.7%, which is a bit better. We can separate it into bias and variance:
the bias is 5.8% while the variance is 2.9%. We managed to decrease the variance at the cost of increasing our bias, because of the bias-variance tradeoff, but the overall error is slightly better.

Let's see if we can get better results using early stopping.

## <a id="sec4-2"></a> 4.B Early Stopping
Let's try to implement early stopping and see the results on both bias and variance

To do so, we need to define a validation loader and an early stopping class that compares the validation loss at each epoch.
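The stopping rule itself is independent of PyTorch: keep track of the best validation loss seen so far and stop once it has not improved for `patience` consecutive epochs. The sketch below applies this rule to a list of validation losses; `early_stopping_epoch` is a hypothetical helper and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best is None or loss <= best:
            # The validation loss did not get worse: remember it, reset the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None  # patience was never exhausted

# Made-up validation losses, for illustration only
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.58]
print(early_stopping_epoch(val_losses, patience=3))
print(early_stopping_epoch(val_losses, patience=5))
```

With `patience=3` training stops at epoch 6, just before the loss would have improved again at epoch 7; with `patience=5` it runs to the end. A larger patience is more tolerant of noisy validation losses but trains for longer.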

In [None]:
full_trainset = torchvision.datasets.FashionMNIST(root='../data', train=True, download=False, transform=transform)
trainset, full_validset = torch.utils.data.random_split(full_trainset, (10000, 50000))
validset, _ = torch.utils.data.random_split(full_validset, (1000, 49000))

trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)
validloader = torch.utils.data.DataLoader(validset, batch_size=batch_size, shuffle=False)

def validation(cnn,validloader,criterion):
    '''
    compute the loss on a validset
    -------
    
     Param:
        cnn : torch.nn.module, model to evaluate
        validloader : torch.utils.data.DataLoader, loader with the data to valid the model on
        criterion: torch.nn, loss function used here
    -------
    
    Return:
        valid_loss: loss on the validation set
    '''
    valid_loss = 0
    with torch.no_grad():
        for data in validloader:
            images, labels = data
            outputs = cnn(images)
            loss = criterion(outputs, labels)
            valid_loss += loss.item()
    return valid_loss

# We define an early stopping class
class EarlyStopping:
    
    def __init__(self, patience=5, delta=0):
        self.patience = patience
        self.counter = 0
        self.best_score = None
        self.delta = delta
        self.early_stop = False

    def step(self, val_loss):
        score = -val_loss
        if self.best_score is None:
            self.best_score = score
        elif score < self.best_score + self.delta:
            self.counter += 1
            print('EarlyStopping counter: %d / %d' % (self.counter, self.patience))
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            self.counter = 0               



<div class="alert alert-warning">
    
**Exercise**<br>
Change the train function to add early stopping
</div>

In [None]:
# We define new train function using validation set and early stopping
def train(cnn,train_loader,num_epochs,optimizer,criterion,validloader,patience):
    '''
    Train the model
    -------
    
    Param:
        cnn : torch.nn.module, model to train
        train_loader : torch.utils.data.DataLoader, loader with the data to train the model on
        num_epochs : int, number of epoch 
        optimizer : torch.optim, optimizer to use during the training
        criterion: torch.nn, loss function used here,
        validloader: torch.utils.data.DataLoader, loader with the data to validate the model on
        patience: int, number of epochs to wait without improvement of the validation loss before stopping the training
    '''
    losses = []
    validlosses = []
    estop = EarlyStopping(patience=patience)
    for epoch in range(num_epochs):
        for i, (images, labels) in enumerate(train_loader): 
        
            images = Variable(images.float())
            labels = Variable(labels)

            # Forward + Backward + Optimize
            optimizer.zero_grad()
            outputs = cnn(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            # .item() returns a Python float and also works when training on a GPU
            losses.append(loss.item())
            
            if (i+1) % 100 == 0:
                print('Epoch : %d/%d, Iter : %d/%d,  Loss: %.4f'
                      % (epoch+1, num_epochs, i+1, len(train_loader), loss.item()))
        # To do : compute valid loss and break the for loop if necessary
       

In [None]:
%load solutions/earlystopping.py

In [None]:
%load solutions/3conv_model.py

In [None]:
patience=3
torch.manual_seed(0)

cnn_early_stopping = CNN()
# CrossEntropyLoss as loss because of no softmax in the last layer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn_early_stopping.parameters(),
                             lr=learning_rate)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## To train the model uncomment lines below, 
#train(cnn_early_stopping,train_loader,num_epochs,optimizer,criterion,validloader,patience)
#torch.save(cnn_early_stopping, 'models/early_stopping.pt')

# Load the trained model, obtained with the code above on a Google Colab GPU (up to 30 epochs, with early stopping)

cnn_early_stopping = torch.load('models/early_stopping.pt')
cnn_early_stopping.eval()
accuracy_train_early_stopping = accuracy(cnn_early_stopping,train_loader,'train')
accuracy_test_early_stopping  = accuracy(cnn_early_stopping,test_loader,'test')

The training stopped after 23 epochs.

Here our test error is 8.9%, which is a bit better. We can separate it into bias and variance: the bias is 0.6% while the variance is 8.3%. We managed to decrease the variance a bit without increasing the bias too much, and the overall error is slightly better than for the model without early stopping. But the lowest test error is reached using regularization.