# Homework 2, *part 2*
### (60 points total)

In this part, you will build a convolutional neural network (CNN) to solve (yet another) image classification problem: the Tiny ImageNet dataset (200 classes, 100K training images, 10K validation images). Try to achieve as high accuracy as possible.

**Unlike part 1**, you are now free to use the full power of PyTorch and its subpackages.

## Deliverables

* This file.
* A "checkpoint file" `"checkpoint.pth"` that contains your CNN's weights (you get them from `model.state_dict()`). Obtain it with `torch.save(..., "checkpoint.pth")`. When grading, we will load it to evaluate your accuracy.

**Should you decide to put your `"checkpoint.pth"` on Google Drive, update (edit) the following cell with the link to it:**

### [Dear TAs, I've put my "checkpoint.pth" on Google Drive, download it here](http://your-link-here)

## Grading

* 9 points for reproducible training code and a filled report below.
* 11 points for building a network that gets above 25% accuracy.
* 4 points for using an **interactive** (please don't reinvent the wheel with `plt.plot`) tool for viewing progress, for example Tensorboard ([with this library](https://github.com/lanpa/tensorboardX) and [an extra hack for Colab](https://stackoverflow.com/a/57791702)). In this notebook, insert screenshots of accuracy and loss plots (training and validation) over iterations/epochs/time.
* 6 points for beating each of these accuracy milestones on the private **test** set:
  * 30%
  * 34%
  * 38%
  * 42%
  * 46%
  * 50%
  
*Private test set* means that you won't be able to evaluate your model on it. Rather, after you submit code and checkpoint, we will load your model and evaluate it on that test set ourselves, reporting your accuracy in a comment to the grade.

Note that there is an important formatting requirement, see below near "`DO_TRAIN = True`".

## Restrictions

* No pretrained networks.
* Don't enlarge images (e.g. don't resize them to $224 \times 224$ or $256 \times 256$).

## Tips

* **One change at a time**: never test several new things at once (unless you are super confident). Train a model, introduce one change, train again.
* Google a lot: try to reinvent as few wheels as possible (unlike in part 1 of this assignment).
* Use GPU.
* Use regularization: L2, batch normalization, dropout, data augmentation...
* Pay much attention to accuracy and loss graphs (e.g. in Tensorboard). Track failures early, stop bad experiments early.

In [1]:
# Detect if we are in Google Colaboratory
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

from pathlib import Path
# Determine the locations of auxiliary libraries and datasets.
# `AUX_DATA_ROOT` is where 'notmnist.py', 'animation.py' and 'tiny-imagenet-2020.zip' are.
if IN_COLAB:
    google.colab.drive.mount("/content/drive")
    
    # Change this if you created the shortcut in a different location
    AUX_DATA_ROOT = Path("/content/drive/My Drive/Deep Learning 2020 -- Home Assignment 2")
    
    assert AUX_DATA_ROOT.is_dir(), "Have you forgotten to 'Add a shortcut to Drive'?"
else:
    AUX_DATA_ROOT = Path(".")

The below cell puts training and validation images in `./tiny-imagenet-200/train` and `./tiny-imagenet-200/val`:

In [2]:
# Extract the dataset into the current directory
if not Path("tiny-imagenet-200/train/class_000/00000.jpg").is_file():
    import zipfile
    with zipfile.ZipFile(AUX_DATA_ROOT / 'tiny-imagenet-2020.zip', 'r') as archive:
        archive.extractall()

**You are required** to format your notebook cells so that `Run All` on a fresh notebook:
* trains your model from scratch, if `DO_TRAIN is True`;
* loads your trained model from `"./checkpoint.pth"`, then **computes** and prints its validation accuracy, if `DO_TRAIN is False`.

In [3]:
DO_TRAIN = False

## Train the model

In [4]:
# Your code here (feel free to add cells)

In [5]:
# The validation directory is needed in both modes; the training one only when training.
VAL_DIR = './tiny-imagenet-200/val'
if DO_TRAIN:
    TRAIN_DIR = './tiny-imagenet-200/train'

In [6]:
if DO_TRAIN:
    import os
    CLASSES_NUMBER = len(os.listdir(TRAIN_DIR))
    print('Number of classes: {}'.format(CLASSES_NUMBER))

In [7]:
import torch
torch.manual_seed(0) # for reproducibility
import torchvision

### Dataset Download

In [8]:
BATCH_SIZE = 100
if DO_TRAIN:
    augmentation_transforms = torchvision.transforms.Compose([
        torchvision.transforms.RandomHorizontalFlip(p=1.0),
        torchvision.transforms.RandomAffine(10, translate=(0.1, 0.1)),
        torchvision.transforms.ColorJitter(brightness=(0.9, 2.0), contrast=(0.9, 2.0)),
        torchvision.transforms.ToTensor()
    ])
    train_dataset = torchvision.datasets.ImageFolder(TRAIN_DIR, transform=torchvision.transforms.ToTensor())
    augmented_dataset = torchvision.datasets.ImageFolder(TRAIN_DIR, augmentation_transforms)
    
    train_dataset = torch.utils.data.ConcatDataset([train_dataset, augmented_dataset])
    train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                                   batch_size=BATCH_SIZE,
                                                   shuffle=True, num_workers=4)

val_dataset = torchvision.datasets.ImageFolder(VAL_DIR, transform=torchvision.transforms.ToTensor())
val_dataloader = torch.utils.data.DataLoader(val_dataset,
                                             batch_size=BATCH_SIZE,
                                             shuffle=False, num_workers=4)

### Dataset Examples

In [9]:
if DO_TRAIN:
    import matplotlib.pyplot as plt

    def show_input(ax, input_tensor, title=''):
        image = input_tensor.permute(1, 2, 0).numpy()
        ax.imshow(image)
        ax.set_title(title)

In [10]:
if DO_TRAIN:
    fig, axes = plt.subplots(ncols=4, nrows=2, figsize=(16, 8))
    X_batch, y_batch = next(iter(train_dataloader))
    
    INPUT_SHAPE = X_batch[0].shape
    print('Image size: {}'.format(INPUT_SHAPE))
    for ax, x_item, y_item in zip(axes.flat, X_batch[:8], y_batch[:8]):
        show_input(ax, x_item, title='class {}'.format(str(y_item.numpy())))
    plt.show()

### Model

In [11]:
class complex_conv(torch.nn.Module):
    def __init__(self, input_channels, mid_channels, output_channels, kernel_size=3, stride=1, padding=1):
        super(complex_conv, self).__init__()

        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(input_channels, mid_channels, kernel_size, stride, padding),
            torch.nn.BatchNorm2d(mid_channels),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(mid_channels, output_channels, kernel_size, stride, padding),
            torch.nn.BatchNorm2d(output_channels),
            torch.nn.ReLU(inplace=True),
            torch.nn.MaxPool2d(kernel_size=2)
        )

    def forward(self, input):
        output = self.conv(input)
        return output

class flatten(torch.nn.Module):
    def __init__(self):
        super(flatten, self).__init__()
    
    def forward(self, input):
        return input.view(input.size(0), -1)

class CNN(torch.nn.Module):
    def __init__(self, input_shape, classes_number):
        super(CNN, self).__init__()

        C, H, W = input_shape
        
        self.complex_conv1 = complex_conv(C, 64, 64)
        self.complex_conv2 = complex_conv(64, 128, 128)
        self.complex_conv3 = complex_conv(128, 256, 256)
        self.complex_conv4 = complex_conv(256, 512, 512)
        self.complex_conv5 = complex_conv(512, 512, 512)
        
        self.flatten = flatten()
        
        self.fc1 = torch.nn.Linear(512*H*W//32//32, 1024, bias=True)
        self.dropout = torch.nn.Dropout(p=0.2)
        self.relu = torch.nn.ReLU()

        self.fc2 = torch.nn.Linear(1024, 512, bias=True)

        self.fc3 = torch.nn.Linear(512, classes_number, bias=True)

    def forward(self, input):
        x = self.complex_conv1(input)
        x = self.complex_conv2(x)
        x = self.complex_conv3(x)
        x = self.complex_conv4(x)
        x = self.complex_conv5(x)
        
        flattened = self.flatten(x)
        
        flattened = self.fc1(flattened)
        flattened = self.dropout(flattened)
        flattened = self.relu(flattened)

        flattened = self.fc2(flattened)
        flattened = self.dropout(flattened)
        flattened = self.relu(flattened)

        flattened = self.fc3(flattened)
        
        return flattened

In [12]:
# class complex_2conv(torch.nn.Module):
#     def __init__(self, input_channels, mid_channels, output_channels, kernel_size=3, stride=1, padding=1):
#         super(complex_2conv, self).__init__()

#         self.conv = torch.nn.Sequential(
#             torch.nn.Conv2d(input_channels, mid_channels, kernel_size, stride, padding),
#             torch.nn.BatchNorm2d(mid_channels),
#             torch.nn.ReLU(inplace=True),
#             torch.nn.Conv2d(mid_channels, output_channels, kernel_size, stride, padding),
#             torch.nn.BatchNorm2d(output_channels),
#             torch.nn.ReLU(inplace=True),
#             torch.nn.MaxPool2d(kernel_size=2)
#         )

#     def forward(self, input):
#         output = self.conv(input)
#         return output

# class complex_3conv(torch.nn.Module):
#     def __init__(self, input_channels, mid_channels, output_channels, kernel_size=3, stride=1, padding=1):
#         super(complex_3conv, self).__init__()

#         self.conv = torch.nn.Sequential(
#             torch.nn.Conv2d(input_channels, mid_channels, kernel_size, stride, padding),
#             torch.nn.BatchNorm2d(mid_channels),
#             torch.nn.ReLU(inplace=True),
#             torch.nn.Conv2d(mid_channels, output_channels, kernel_size, stride, padding),
#             torch.nn.BatchNorm2d(output_channels),
#             torch.nn.ReLU(inplace=True),
#             torch.nn.Conv2d(output_channels, output_channels, kernel_size, stride, padding),
#             torch.nn.BatchNorm2d(output_channels),
#             torch.nn.ReLU(inplace=True),
#             torch.nn.MaxPool2d(kernel_size=2)
#         )

#     def forward(self, input):
#         output = self.conv(input)
#         return output

# class flatten(torch.nn.Module):
#     def __init__(self):
#         super(flatten, self).__init__()
    
#     def forward(self, input):
#         return input.view(input.size(0), -1)

# class CNN(torch.nn.Module):
#     def __init__(self, input_shape, classes_number):
#         super(CNN, self).__init__()

#         C, H, W = input_shape
        
#         self.complex_conv1 = complex_2conv(C, 64, 128)
#         self.complex_conv2 = complex_2conv(128, 256, 256)
#         self.complex_conv3 = complex_3conv(256, 256, 256)
        
#         self.flatten = flatten()
        
#         self.fc1 = torch.nn.Linear(256*H*W//8//8, 4096, bias=True)
#         self.dropout = torch.nn.Dropout(p=0.2)
#         self.relu = torch.nn.ReLU()

#         self.fc2 = torch.nn.Linear(4096, 1024, bias=True)

#         self.fc3 = torch.nn.Linear(1024, classes_number, bias=True)

#     def forward(self, input):
#         x = self.complex_conv1(input)
#         x = self.complex_conv2(x)
#         x = self.complex_conv3(x)
        
#         flattened = self.flatten(x)
        
#         flattened = self.fc1(flattened)
#         flattened = self.dropout(flattened)
#         flattened = self.relu(flattened)

#         flattened = self.fc2(flattened)
#         flattened = self.dropout(flattened)
#         flattened = self.relu(flattened)

#         flattened = self.fc3(flattened)
        
#         return flattened

In [13]:
# class BottleneckBlock(torch.nn.Module):
#     def __init__(self, input_channels, output_channels, kernel_size=3, stride=1, padding=1):
#         super(BottleneckBlock, self).__init__()
        
#         mid_channels = input_channels*2
#         self.bn1 = torch.nn.BatchNorm2d(input_channels)
#         self.relu = torch.nn.ReLU(inplace=True)
#         self.conv1 = torch.nn.Conv2d(input_channels, mid_channels, kernel_size, stride, padding)

#         self.bn2 = torch.nn.BatchNorm2d(mid_channels)
#         self.conv2 = torch.nn.Conv2d(mid_channels, output_channels, kernel_size, stride, padding)

#     def forward(self, input):
#         output = self.conv1(self.relu(self.bn1(input)))
#         output = self.conv2(self.relu(self.bn2(output)))
        
#         return torch.cat([input, output], dim=1)

# class BasicBlock(torch.nn.Module):
#     def __init__(self, input_channels, output_channels, kernel_size=3, stride=1, padding=1):
#         super(BasicBlock, self).__init__()

#         self.bn = torch.nn.BatchNorm2d(input_channels)
#         self.relu = torch.nn.ReLU(inplace=True)
#         self.conv = torch.nn.Conv2d(input_channels, output_channels, kernel_size, stride, padding)
    
#     def forward(self, input):
#         output = self.conv(self.relu(self.bn(input)))
#         return torch.cat([input, output], dim=1)

# class DenseBlock(torch.nn.Module):
#     def __init__(self, layers_number, block, input_channels, growth_rate):
#         super(DenseBlock, self).__init__()

#         self.layer = self.concat_layers(layers_number, block, input_channels, growth_rate)
    
#     def concat_layers(self, layers_number, block, input_channels, growth_rate):
#         layers = []
#         for i in range(layers_number):
#             layers.append(block(input_channels + i*growth_rate, growth_rate))
#         return torch.nn.Sequential(*layers)

#     def forward(self, input):
#         return self.layer(input)

# class TransitionBlock(torch.nn.Module):
#     def __init__(self, input_channels, output_channels, kernel_size=3, stride=1, padding=1):
#         super(TransitionBlock, self).__init__()

#         self.bn1 = torch.nn.BatchNorm2d(input_channels)
#         self.relu = torch.nn.ReLU(inplace=True)
#         self.conv1 = torch.nn.Conv2d(input_channels, output_channels, kernel_size, stride, padding)
#         self.pool = torch.nn.MaxPool2d(kernel_size=2)
    
#     def forward(self, input):
#         output = self.conv1(self.relu(self.bn1(input)))
#         return self.pool(output)

# class DenseNet(torch.nn.Module):
#     def __init__(self, input_shape, classes_number, growth_rate=24, reduction=0.8):
#         super(DenseNet, self).__init__()

#         C, H, W = input_shape
#         in_channels = 2*growth_rate
#         layers_number = 3

#         self.conv1 = torch.nn.Conv2d(C, in_channels, kernel_size=3, stride=1, padding=1)

#         self.block1 = DenseBlock(layers_number, BasicBlock, in_channels, growth_rate)
#         in_channels += layers_number*growth_rate
#         self.transition1 = TransitionBlock(in_channels, int(in_channels*reduction))
#         in_channels = int(in_channels*reduction)

#         self.block2 = DenseBlock(layers_number, BasicBlock, in_channels, growth_rate)
#         in_channels += layers_number*growth_rate
#         self.transition2 = TransitionBlock(in_channels, int(in_channels*reduction))
#         in_channels = int(in_channels*reduction)

#         self.block3 = DenseBlock(layers_number, BasicBlock, in_channels, growth_rate)
#         in_channels += layers_number*growth_rate
#         self.in_channels = in_channels

#         self.bn = torch.nn.BatchNorm2d(in_channels)
#         self.relu = torch.nn.ReLU(inplace=True)
#         self.pool = torch.nn.AvgPool2d(kernel_size=H//2//2)
#         self.fc = torch.nn.Linear(in_channels, classes_number)
    
#     def forward(self, input):
#         output = self.conv1(input)

#         output = self.transition1(self.block1(output))
#         output = self.transition2(self.block2(output))

#         output = self.block3(output)
#         output = self.relu(self.bn(output))

#         output = self.pool(output)
#         output = output.view(-1, self.in_channels)

#         return self.fc(output)

In [14]:
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Device: {}'.format(DEVICE))

Device: cuda:0


In [15]:
if DO_TRAIN:
    from torchsummary import summary

    model = CNN(INPUT_SHAPE, CLASSES_NUMBER)
    print('Model summary:')
    summary(model.to(DEVICE), INPUT_SHAPE, BATCH_SIZE)

### Training Method

In [16]:
if DO_TRAIN:
    LEARNING_RATE = 1.0e-3
    criterion = torch.nn.CrossEntropyLoss().to(DEVICE)
    
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

### Metric

In [17]:
from tqdm.notebook import tqdm

if DO_TRAIN:
    def accuracy_evaluating(predictions, labels):
        logits = torch.nn.LogSoftmax(dim=1)(predictions)
        prediction_labels = logits.max(dim=1).indices

        return (labels == prediction_labels).float().mean()

if not DO_TRAIN:
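    def accuracy(model, dataloader):
        model.to(DEVICE)
        model.eval()

        running_accuracy = 0.0
        print('accuracy evaluating...')
        # Gradients are not needed for evaluation; disabling them saves memory.
        with torch.no_grad():
            for images, labels in tqdm(dataloader):
                images, labels = images.to(DEVICE), labels.to(DEVICE)

                predictions = model(images)
                # Log-softmax is monotonic, so this equals the argmax of the raw outputs.
                logits = torch.nn.LogSoftmax(dim=1)(predictions)
                prediction_labels = logits.max(dim=1).indices

                running_accuracy += (labels == prediction_labels).float().mean()

        # Mean of per-batch accuracies (exact when all batches have the same size).
        return running_accuracy / len(dataloader)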

### Model Training

In [18]:
# TensorBoard interactive
if DO_TRAIN:
    from torch.utils.tensorboard import SummaryWriter
    LOG_DIR = "./logs"

In [19]:
if DO_TRAIN:
    def train(model, train_dataloader, val_dataloader,
              criterion,
              optimizer, scheduler,
              epochs,
              device,
              experiment_name):
        writer = SummaryWriter('{}/{}'.format(LOG_DIR, experiment_name))
        model.to(device)

        print('Train model for {} epochs\n'.format(epochs))
        for epoch in range(12, epochs + 12):
            print('Epoch {} train stage...'.format(epoch))
            model.train()
            train_running_loss = 0.0
            train_running_accuracy = 0.0
            
            for images, labels in tqdm(train_dataloader):
                images, labels = images.to(device), labels.to(device)
                
                optimizer.zero_grad()
                predictions = model(images)
                
                train_loss = criterion(predictions, labels)
                train_running_loss += train_loss.item()

                train_running_accuracy += accuracy_evaluating(predictions,
                                                              labels)
                
                train_loss.backward()
                optimizer.step()
            scheduler.step()

            writer.add_scalar('train loss',
                              train_running_loss / len(train_dataloader),
                              epoch)
            
            writer.add_scalar('train accuracy',
                              train_running_accuracy / len(train_dataloader),
                              epoch)
            
            print('Epoch {} validation stage...'.format(epoch))
            model.eval()
            val_running_loss = 0.0
            val_running_accuracy = 0.0
            
            # No gradients are needed during validation.
            with torch.no_grad():
                for images, labels in tqdm(val_dataloader):
                    images, labels = images.to(device), labels.to(device)

                    predictions = model(images)

                    val_loss = criterion(predictions, labels)
                    val_running_loss += val_loss.item()

                    val_running_accuracy += accuracy_evaluating(predictions,
                                                                labels)
            
            writer.add_scalar('val loss',
                              val_running_loss / len(val_dataloader),
                              epoch)
            
            writer.add_scalar('val accuracy',
                              val_running_accuracy / len(val_dataloader),
                              epoch)

In [20]:
if DO_TRAIN:
    # Your code here (train your model)
    # etc.
    EPOCHS = 2
    
    train(model, train_dataloader, val_dataloader,
          criterion,
          optimizer, scheduler,
          epochs=EPOCHS,
          device=DEVICE,
          experiment_name='CNN_3_{}_{}'.format(BATCH_SIZE, 12))

### Save Model

In [21]:
PATH_TO_MODEL = "./checkpoint.pth"
if DO_TRAIN:
    torch.save(model, PATH_TO_MODEL)

## Load and evaluate the model

In [22]:
# Your code here (load the model from "./checkpoint.pth")
# Please use `torch.load("checkpoint.pth", map_location='cpu')`

### Load Model

In [23]:
DO_TRAIN = False
if not DO_TRAIN:
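    # The checkpoint stores the whole model object (see `torch.save(model, ...)` above);
    # `map_location` lets it load on machines without a GPU.
    model = torch.load(PATH_TO_MODEL, map_location='cpu')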



In [24]:
if not DO_TRAIN:
    val_accuracy = 100*accuracy(model, val_dataloader)
    assert 0 <= val_accuracy <= 100
    print("Validation accuracy: %.2f%%" % val_accuracy)

accuracy evaluating...


Validation accuracy: 44.73%


### Results

Using **TensorBoard**, I obtained the following plots:

On all plots, the $\color{orange}{\bf{orange}}$ line corresponds to the penultimate model, and the $\color{blue}{\bf{blue}}$ line corresponds to the last model.

**Train loss**
![train loss](./train_loss.svg)

**Train accuracy**
![train accuracy](./train_accuracy.svg)

**Validation loss**
![val loss](./val_loss.svg)

**Validation accuracy**
![val accuracy](./val_accuracy.svg)

# Report

Below, please mention:

* A brief history of tweaks and improvements.
* Which network architectures have you tried? What is the final one and why?
* What is the training method (batch size, optimization algorithm, number of iterations, ...) and why?
* Which techniques have you tried to prevent overfitting? What were their effects? Which of them worked well?
* Any other insights you learned.

For example, start with:

"I have analyzed these and those conference papers/sources/blog posts. \
I tried this and that to adapt them to my problem. \
The conclusions this task taught me are ..."

###### [Dataset Preprocessing](#Dataset-Download)

First, I downloaded the data and looked at the images. Data preprocessing is one of the most important parts of machine learning, and of computer vision in particular. To improve the model's performance and reduce overfitting, I use data **augmentation** $\color{blue}{[1, 2]}$. Of course, augmentation should be reasonable: a transformed image must still plausibly belong to its class. In this case I use the following types:

* Horizontal flipping
* Slight rotation
* Slight translation
* Color space transformations
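
As a rough illustration of what this augmentation means for the amount of training data, the sketch below assumes the setup from the data-loading cell: the 100,000 training images are concatenated with one augmented copy of each. The function names are hypothetical, and the batch sizes are the two values used in this report.

```python
import math

# Assumptions taken from the assignment text and the data-loading cell:
# 100,000 training images, concatenated with one augmented copy of each.
ORIGINAL_TRAIN_IMAGES = 100_000
AUGMENTED_COPIES = 1


def samples_per_epoch(original: int, augmented_copies: int) -> int:
    """Number of samples in one pass over the concatenated dataset."""
    return original * (1 + augmented_copies)


def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of mini-batches per epoch; the last batch may be smaller."""
    return math.ceil(num_samples / batch_size)


total = samples_per_epoch(ORIGINAL_TRAIN_IMAGES, AUGMENTED_COPIES)
for batch_size in (100, 32):
    print(batch_size, iterations_per_epoch(total, batch_size))
```

Under these assumptions one epoch covers twice as many samples as the raw training set, which is worth keeping in mind when comparing epoch counts with other setups.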

###### [CNN Architecture](#Model)

As a baseline, I took a mixture of lightweight (with fewer parameters) classical CNN architectures: LeNet $\color{blue}{[3]}$, AlexNet $\color{blue}{[4]}$ and VGGNet $\color{blue}{[5]}$. To this model I added the following modules used in modern Deep Learning:

* **Batch Normalization**. Batch normalization allows much higher learning rates and makes training less sensitive to initialization; the original paper demonstrates this on a state-of-the-art image classification model $\color{blue}{[6]}$.

* **ReLU**. The rectified linear unit (ReLU) is the most widely used activation function in deep learning applications, with state-of-the-art results to date $\color{blue}{[7]}$. ReLU is piecewise linear and therefore preserves many of the properties that make linear models easy to optimize with gradient-based methods.

Thus, the following model was obtained:

**Model**: `complex_conv ==> complex_conv ==> Flatten ==> FC ==> FC ==> FC`,

where `complex_conv = (Conv3x3 ==> BatchNorm ==> ReLU ==> Conv3x3 ==> BatchNorm ==> ReLU ==> MaxPool2x2)`.
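
To see how many features reach the first fully connected layer, it helps to track the spatial size through the blocks. The sketch below is only an illustration: it assumes 64x64 input images (the usual Tiny ImageNet size), uses the standard convolution output-size formula, and the helper names are hypothetical.

```python
def conv2d_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output side length of a convolution (standard formula, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def maxpool_out(size: int, kernel: int = 2) -> int:
    """Output side length of a max-pool whose stride equals its kernel size."""
    return size // kernel


def complex_conv_out(size: int) -> int:
    """Side length after Conv3x3 -> Conv3x3 -> MaxPool2x2."""
    size = conv2d_out(size)   # 3x3 conv with padding 1 keeps the size
    size = conv2d_out(size)
    return maxpool_out(size)  # pooling halves it


def side_after(side: int, blocks: int) -> int:
    """Side length after a stack of `blocks` complex_conv blocks."""
    for _ in range(blocks):
        side = complex_conv_out(side)
    return side


def flattened_features(channels: int, side: int, blocks: int) -> int:
    """Number of inputs to the first fully connected layer."""
    final_side = side_after(side, blocks)
    return channels * final_side * final_side


IMAGE_SIDE = 64  # assumption: Tiny ImageNet images are 64x64

print(side_after(IMAGE_SIDE, 2))               # two blocks, as in the baseline
print(flattened_features(512, IMAGE_SIDE, 5))  # five blocks with 512 output channels
```

For five blocks with 512 output channels this agrees with the expression `512*H*W//32//32` used for `fc1` in the `CNN` class.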

###### [Model Training](#Training-Method)

Training the model requires a loss function and an optimization method. I chose:

* **Cross Entropy Loss**. The softmax function is widely adopted by many CNNs due to its simplicity and probabilistic interpretation. Together with the cross-entropy loss, they form arguably one of the most commonly used components in CNN architectures $\color{blue}{[8]}$.

* **Adam**. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods $\color{blue}{[9]}$.
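
For intuition about the loss, here is a minimal plain-Python sketch of softmax cross-entropy for a single sample. It is not the PyTorch implementation; the function names and the example outputs are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum_exp for z in logits]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -log_softmax(logits)[target]


# Hypothetical raw network outputs for a 3-class problem.
logits = [2.0, 0.5, -1.0]
print(cross_entropy(logits, target=0))  # small: class 0 has the largest output
print(cross_entropy(logits, target=2))  # large: class 2 has the smallest output

# With 200 classes and no information, the loss equals ln(200).
uniform = [0.0] * 200
print(cross_entropy(uniform, target=7), math.log(200))
```

The last line gives a useful sanity check: a freshly initialized 200-class classifier should start with a loss close to ln(200), roughly 5.3.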

As for any stochastic gradient descent method (including the mini-batch case), it is important for efficiency of the estimator that each example or minibatch be sampled approximately independently. In this context, it is safer if the examples or mini-batches are first put in a random order ([**shuffled**](#Dataset-Download)) $\color{blue}{[10]}$.
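
The shuffling described above can be sketched without any framework. This is only an illustration of the idea, not of how `DataLoader` is implemented; the function name and the tiny dataset size are hypothetical.

```python
import random
from typing import Iterator, List


def shuffled_batches(num_samples: int, batch_size: int,
                     rng: random.Random) -> Iterator[List[int]]:
    """Yield mini-batches of sample indices in a fresh random order."""
    indices = list(range(num_samples))
    rng.shuffle(indices)  # a new permutation on every call, i.e. every epoch
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)  # fixed seed for reproducibility
epoch_1 = list(shuffled_batches(10, 4, rng))
epoch_2 = list(shuffled_batches(10, 4, rng))
print(epoch_1)
print(epoch_2)
```

Every sample index appears exactly once per epoch, but the grouping into mini-batches is redrawn each time the function is called.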

###### [Model Evaluation](#Metric)

To evaluate the performance of the model, I will use the **accuracy** metric (required by this assignment).
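
One detail is worth checking when accuracy is accumulated batch by batch, as in this notebook: averaging per-batch accuracies matches the overall accuracy only if all batches have the same size. The sketch below uses hypothetical predictions and helper names to show the difference.

```python
from typing import List, Tuple

Batch = List[Tuple[int, int]]  # (predicted label, true label) pairs


def batch_accuracy(batch: Batch) -> float:
    """Fraction of correct predictions within one batch."""
    return sum(pred == true for pred, true in batch) / len(batch)


def mean_of_batch_means(batches: List[Batch]) -> float:
    """Average of per-batch accuracies (every batch gets equal weight)."""
    return sum(batch_accuracy(b) for b in batches) / len(batches)


def overall_accuracy(batches: List[Batch]) -> float:
    """Fraction of correct predictions over all samples."""
    correct = sum(pred == true for b in batches for pred, true in b)
    total = sum(len(b) for b in batches)
    return correct / total


# Hypothetical predictions: a full batch of 4 and a smaller last batch of 2.
batches = [
    [(0, 0), (1, 1), (2, 0), (3, 1)],  # 2 of 4 correct
    [(5, 5), (7, 7)],                  # 2 of 2 correct
]
print(mean_of_batch_means(batches))  # the smaller batch is over-weighted
print(overall_accuracy(batches))     # 4 correct out of 6
```

With 10,000 validation images and a batch size of 100 every batch is full, so both quantities coincide; with a batch size that does not divide the dataset size they can differ slightly.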

###### [Learning Process](#Model-Training)

When you want to train a model, **do not forget to set your paths!**

* [LOG_DIR](#Model-Training)
* [PATH_TO_MODEL](#Save-Model)

To train the model I set the following **parameters**:

* [Batch size](#Dataset-Download) = 100
* [Learning Rate](#Training-Method) = 0.001
* [Epochs number](#Model-Training) = 20

The following best **results** were obtained with the baseline model:

* train phase:
    * loss value: 1.79
    * accuracy: 0.567
* validation phase:
    * loss value: 2.9
    * accuracy: 0.35

After that I decided to improve my model.

###### [Model Improvements](#Model)

1. First, I tried a completely different architecture: **DenseNet** $\color{blue}{[11]}$, which builds on the ideas of ResNet. I used a fairly lightweight variant, which nevertheless required a lot of resources and a long training time. The following best **results** were obtained with this model:

  * train phase:
      * loss value: 1.12
      * accuracy: 0.695
  * validation phase:
      * loss value: 2.305
      * accuracy: 0.457

2. Then I decided to improve the baseline model. I used:

  * the **dropout** technique for regularization and for preventing the co-adaptation of neurons $\color{blue}{[12]}$,
  * activation functions after the fully connected layers,
  * three more `complex_conv` blocks.

  Thus, I got a model that has the same skeleton as **VGGNet-B**:

  **Model**: `complex_conv ==> complex_conv ==> complex_conv ==> complex_conv ==> complex_conv ==> Flatten ==> FC ==> Dropout ==> ReLU ==> FC ==> Dropout ==> ReLU ==> FC`,

  where `complex_conv = (Conv3x3 ==> BatchNorm ==> ReLU ==> Conv3x3 ==> BatchNorm ==> ReLU ==> MaxPool2x2)`.

  The following best **results** were obtained with this model:

  * train phase:
      * loss value: 1.97
      * accuracy: 0.51
  * validation phase:
      * loss value: 2.211
      * accuracy: 0.483
  
  In this case, I got the highest accuracy and the lowest value of the loss function, but the model overfitted: the validation loss began to increase monotonically from the 14th epoch.

  Since this model showed the best results on validation, and also turned out to be lighter (fewer parameters) and faster to train than the previous one, I tried to overcome overfitting using a **learning rate scheduler** $\color{blue}{[13]}$:

  * [Scheduler](#Training-Method): step size = 10, gamma = 0.1
  
  The idea is that if training reaches a plateau, reducing the gradient-descent step size can let the optimizer settle into a narrower region of the loss surface and reach a lower local minimum.
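
To make the schedule concrete, the sketch below evaluates the closed form of a step decay with the values used here (initial learning rate 0.001, step size 10, gamma 0.1). It is a plain-Python illustration rather than the PyTorch scheduler itself; `step_lr` is a hypothetical helper, and `epoch` counts how many times the scheduler has been stepped.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 10, gamma: float = 0.1) -> float:
    """Learning rate after `epoch` scheduler steps under a step decay."""
    return base_lr * gamma ** (epoch // step_size)


BASE_LR = 1.0e-3  # the learning rate used in this notebook

for epoch in (0, 9, 10, 19, 20):
    print(epoch, step_lr(BASE_LR, epoch))
```

The learning rate stays constant within each block of 10 epochs and is divided by 10 at every block boundary.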
  
Next, I tried two different variants of this setup:

3. For the first one I also reduced the batch size:
  
  * [Batch size](#Dataset-Download) = 32
  
  Smaller values of batch size may benefit from more exploration in parameter space and a form of regularization both due to the "noise" injected in the gradient estimator, which may explain the better test results sometimes observed with smaller batch size $\color{blue}{[10]}$.

  The following best **results** were obtained with this model:

  * train phase:
      * loss value: 1.944
      * accuracy: 0.503
  * validation phase:
      * loss value: 2.282
      * accuracy: 0.447

4. For the second one I changed the architecture (again VGGNet-like) and increased the number of feature maps in the first convolutions:

  **Model**: `complex_2conv ==> complex_2conv ==> complex_3conv ==> Flatten ==> FC ==> Dropout ==> ReLU ==> FC ==> Dropout ==> ReLU ==> FC`,

  where `complex_2conv = (Conv3x3 ==> BatchNorm ==> ReLU ==> Conv3x3 ==> BatchNorm ==> ReLU ==> MaxPool2x2)` and `complex_3conv = (Conv3x3 ==> BatchNorm ==> ReLU ==> Conv3x3 ==> BatchNorm ==> ReLU ==> Conv3x3 ==> BatchNorm ==> ReLU ==> MaxPool2x2)`

  The following best **results** were obtained with this model (**launched on 12 epochs**):
  
  * train phase:
      * loss value: 1.82
      * accuracy: 0.533
  * validation phase:
      * loss value: 2.234
      * accuracy: 0.459
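
The two VGGNet-like variants differ a lot in the size of their fully connected part. The sketch below counts those parameters, taking the layer widths from the two model cells (`512*H*W//32//32 -> 1024 -> 512` for the five-block model and `256*H*W//8//8 -> 4096 -> 1024` for the three-block one). The 64x64 input size and the 200 output classes are assumptions based on the dataset description, and the helper names are hypothetical.

```python
from typing import List, Tuple


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def head_params(layers: List[Tuple[int, int]]) -> int:
    """Total parameters of a stack of fully connected layers."""
    return sum(linear_params(i, o) for i, o in layers)


H = W = 64     # assumption: 64x64 Tiny ImageNet images
CLASSES = 200  # assumption: 200 classes, as stated in the assignment

five_block_in = 512 * H * W // 32 // 32  # five 2x2 poolings
three_block_in = 256 * H * W // 8 // 8   # three 2x2 poolings

five_block_head = [(five_block_in, 1024), (1024, 512), (512, CLASSES)]
three_block_head = [(three_block_in, 4096), (4096, 1024), (1024, CLASSES)]

print(head_params(five_block_head))
print(head_params(three_block_head))
```

Under these assumptions the three-block variant has a classifier head more than 25 times larger, because fewer pooling stages leave a much larger feature map to flatten.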

###### [Results](#Results)

Using **TensorBoard**, I obtained the following plots:

* train accuracy
* train loss
* val accuracy
* val loss

Then I downloaded the plots in SVG format and inserted them into the notebook.

###### [References](#Report):

$\color{blue}{[1]}$ Shorten, Connor & Khoshgoftaar, Taghi. (2019). [A survey on Image Data Augmentation for Deep Learning](https://www.researchgate.net/publication/334279066_A_survey_on_Image_Data_Augmentation_for_Deep_Learning). Journal of Big Data. 6. 10.1186/s40537-019-0197-0.

$\color{blue}{[2]}$ Perez, Luis & Wang, Jason. (2017). [The Effectiveness of Data Augmentation in Image Classification using Deep Learning](https://arxiv.org/pdf/1712.04621.pdf).

$\color{blue}{[3]}$ Lecun, Yann & Bottou, Leon & Bengio, Y. & Haffner, Patrick. (1998). [Gradient-Based Learning Applied to Document Recognition](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf). Proceedings of the IEEE. 86. 2278 - 2324. 10.1109/5.726791.

$\color{blue}{[4]}$ Krizhevsky, Alex & Sutskever, Ilya & Hinton, Geoffrey. (2012). [ImageNet Classification with Deep Convolutional Neural Networks](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf). Neural Information Processing Systems. 25. 10.1145/3065386.

$\color{blue}{[5]}$ Simonyan, Karen & Zisserman, Andrew. (2014). [Very Deep Convolutional Networks for Large-Scale Image Recognition](http://www.robots.ox.ac.uk/~vgg/publications/2015/Simonyan15/simonyan15.pdf). arXiv 1409.1556.

$\color{blue}{[6]}$ Ioffe, Sergey & Szegedy, Christian. (2015). [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf).

$\color{blue}{[7]}$ Nair, Vinod & Hinton, Geoffrey. (2010). [Rectified Linear Units Improve Restricted Boltzmann Machines](https://www.cs.toronto.edu/~fritz/absps/reluICML.pdf). Proceedings of ICML. 27. 807-814.

$\color{blue}{[8]}$ Liu, Weiyang & Wen, Yandong & Yu, Zhiding & Yang, Meng. (2016). [Large-Margin Softmax Loss for Convolutional Neural Networks](https://arxiv.org/pdf/1612.02295.pdf). Proc. Int. Conf. Mach. Learn.

$\color{blue}{[9]}$ Kingma, Diederik & Ba, Jimmy. (2014). [Adam: A Method for Stochastic Optimization](https://arxiv.org/pdf/1412.6980.pdf). International Conference on Learning Representations.

$\color{blue}{[10]}$ Bengio, Y. (2012). [Practical recommendations for gradient-based training of deep architectures](https://arxiv.org/pdf/1206.5533v2.pdf). arXiv.

$\color{blue}{[11]}$ Huang, Gao & Liu, Zhuang & van der Maaten, Laurens & Weinberger, Kilian. (2017). [Densely Connected Convolutional Networks](https://arxiv.org/pdf/1608.06993.pdf). 10.1109/CVPR.2017.243.

$\color{blue}{[12]}$ Srivastava, Nitish & Hinton, Geoffrey & Krizhevsky, Alex & Sutskever, Ilya & Salakhutdinov, Ruslan. (2014). [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf). Journal of Machine Learning Research. 15. 1929-1958.

$\color{blue}{[13]}$ Darken, Christian & Moody, John. (1990). [Note on Learning Rate Schedules for Stochastic Optimization](https://pdfs.semanticscholar.org/713f/55820406c9540428ae5ec2a0428010d6800c.pdf). 832-838.