<a id="toc"></a>
# Table of contents

* [Introduction](#introduction)
* [Setup, imports, constants](#sic)
    * [Setting up the environment](#sic.env)
    * [Imports](#sic.imports)
    * [Constants](#sic.consts)
* [EDA (Exploratory Data Analysis)](#EDA)
* [Model](#model)
    * [Dataloaders](#model.dls)
    * [CNN learner](#model.cnn)
    * [Fine tuning the model](#model.finetune)
* [Make predictions](#predictions)
* [Conclusion](#conclusion)

<a id="introduction"></a>
# Introduction
[[back to top]](#toc)

In this notebook I implement a convolutional neural network that classifies pictures of dogs by breed. The data comes from the competition named [Dog Breed Identification](https://www.kaggle.com/c/dog-breed-identification/overview/description). The aim of this notebook is to set up a convolutional neural network in the simplest possible way using [fast.ai](https://fast.ai), hence there are no optimizations in the code. The goal is not to create a model that predicts the breeds with state-of-the-art accuracy but to give a high-level overview of how to easily set up a convolutional neural network and use it to make predictions.

In the [first chapter](#sic) the environment is set up, the required libraries are imported and the constants are defined.

[Next](#EDA), a basic exploration of the provided input is performed.

The model is created and trained in the [third section](#model). First, the data is loaded in the necessary format, then the model is initialized with weights from ResNet. Then the fine-tuning takes place based on the input images.

The [fourth chapter](#predictions) describes how to make predictions based on the test data and how to save the outcome in the expected format.

<a id="sic"></a>
# Setup, imports, constants
[[back to top]](#toc)

This chapter consists of three sections. The [first](#sic.env) describes how to set up the environment, which libraries to install, etc. The [second](#sic.imports) explains what to import and why. The [last subchapter](#sic.consts) introduces the constants used throughout this notebook.

<a id="sic.env"></a>
## Setting up the environment
[[back to top]](#toc)

This notebook uses fast.ai 2.5.2, which for some reason fails to work with PyTorch 1.9.1. Certain operations throw the following error: `RuntimeError: solve: MAGMA library not found in compilation. Please rebuild with MAGMA.` As of October 2021, one workaround is to downgrade PyTorch from the latest version (1.9.1) to 1.9.0. See [this discussion](https://www.kaggle.com/product-feedback/279990) for further details.

The following cell downgrades PyTorch and some related packages. Torchvision, torchaudio and torchtext need to be downgraded as well to prevent conflicting versions from being installed.

In [None]:
!pip install --user torch==1.9.0 torchvision==0.10.0 torchaudio==0.9.0 torchtext==0.10.0

<a id="sic.imports"></a>
## Imports
[[back to top]](#toc)

One of the convenient aspects of fast.ai is that it makes importing the necessary packages really easy: all we need to do is import everything from its data & vision packages, as the first two lines of code show.

pandas, `os`, `stats` and the other packages are imported to gain better insight into our data. See the [EDA chapter](#EDA).

In [None]:
from fastai.data.all import *
from fastai.vision.all import *

import numpy as np
import pandas as pd
import os

from functools import cmp_to_key
from PIL import Image
from scipy import stats

print("Import finished.")

<a id="sic.consts"></a>
## Constants
[[back to top]](#toc)

The constants mostly store paths such as path to the working folder (`path`), path to the csv storing the labels (`labels_csv_path`) and the path to the train folder (`train_path`) & test folder (`test_path`).

The number of epochs is also set here (`number_of_epochs`).

In [None]:
# paths
path = '../input/dog-breed-identification/'
labels_csv_path = path + 'labels.csv'
sample_submission_csv_path = path + 'sample_submission.csv'
submission_csv_path =  './submission.csv'
train_path = path + 'train'
test_path = path + 'test'

number_of_epochs = 10

print(f'Constants are set. Fine tuning takes {number_of_epochs} epochs.')

<a id="EDA"></a>
# EDA (Exploratory Data Analysis)
[[back to top]](#toc)

The aim of any exploratory data analysis is to better understand the data to which the model needs to be fit, e.g. the shape, size, amount and distribution of the input. Obviously, the analysis depends on the task at hand (what we want to get out of the model, i.e. image classification in this case) and on the model which we plan to deploy.

The very first step is to check the [data section](https://www.kaggle.com/c/dog-breed-identification/data) of the competition.

After that, I'm checking the content of the `labels.csv` using pandas.

In [None]:
labels_df = pd.read_csv(labels_csv_path)
print(f'The shape of the labels: {labels_df.shape}')

Pandas' `read_csv` reads everything from the given file (`labels.csv` in this case) and stores the content of the csv file in a so-called dataframe. Then, we check the shape (number of rows & number of columns) of the dataframe.

In [None]:
labels_df.head()

The `head` method returns the first five entries of the dataframe. The first column contains the unique ID, the other column stores the breed of the dog in the given picture. Next, I'm checking the number of pictures in the train folder...

In [None]:
print(len(os.listdir(train_path)))

...which is (unsurprisingly) 10222, the same as the number of rows in the `labels.csv` file.

The number of images in the test folder is...

In [None]:
print(len(os.listdir(test_path)))

That's how many images need to be predicted.

Next, I'm checking the number of breeds (the unique occurrences in the breed column):

In [None]:
unique_breeds = pd.unique(labels_df['breed'])
print(len(unique_breeds))

It's 120, just like it's written in the [data section](https://www.kaggle.com/c/dog-breed-identification/data).

In [None]:
print(stats.describe(labels_df.value_counts('breed')))

The dataframe's `value_counts` returns the number of images for each breed. The code snippet above prints some summary statistics of these counts. For example, the breed with the smallest number of images has 66 images, the one with the largest has 126, the mean is 85, etc. This data can be important and might require further investigation depending on the model deployed.
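The same kind of summary can be reproduced without SciPy. The following is a minimal sketch on a made-up list of labels (the breed names and counts are illustrative, not taken from `labels.csv`); it counts the images per breed and reports the minimum, maximum and mean of those counts.

```python
from collections import Counter
from statistics import mean

# Hypothetical labels standing in for labels_df['breed'].
toy_breeds = ["beagle", "pug", "beagle", "collie", "pug", "beagle"]

# Number of images per breed, analogous to labels_df.value_counts('breed').
counts_per_breed = Counter(toy_breeds)

summary = {
    "nobs": len(counts_per_breed),            # number of distinct breeds
    "min": min(counts_per_breed.values()),    # size of the rarest breed
    "max": max(counts_per_breed.values()),    # size of the most common breed
    "mean": mean(counts_per_breed.values()),  # average number of images per breed
}
print(summary)
```

A large gap between `min` and `max` would hint at class imbalance, which is the kind of issue that might require further investigation.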

Next, I would like to know if the images are the same size, so I pick two randomly and check...

In [None]:
a_img_path = '../input/dog-breed-identification/train/000bec180eb18c7604dcecc8fe0dba07.jpg'
b_img_path = '../input/dog-breed-identification/train/002a283a315af96eaea0e28e7163b21b.jpg'

a_img = Image.open(a_img_path)
b_img = Image.open(b_img_path)

print(a_img.size)
print(b_img.size)

... it turns out that the dimensions might differ. This is important because certain models require the input images to be the same size, therefore it will be necessary to resize them.
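One common way to bring images of different dimensions to a common square size is to cut out the largest centered square and then scale it. The sketch below only shows the arithmetic of the cropping step; the function name `center_crop_box` and the example dimensions are made up for illustration, and this is not how fast.ai's `Resize` is implemented internally.

```python
def center_crop_box(width, height):
    """Return the (left, top, right, bottom) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


# Hypothetical image dimensions (width, height); not measured from the dataset.
for size in [(500, 375), (200, 300)]:
    print(size, "->", center_crop_box(*size))
```

With Pillow, such a box could be passed to `Image.crop` and the result scaled with `Image.resize((256, 256))`.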

Lastly, I'm curious to understand the content of `sample_submission.csv`.

In [None]:
sample_df = pd.read_csv(sample_submission_csv_path)
sample_df.shape

The number of rows is 10357, so apparently there is an entry for each image in the test folder. The number of columns is 121: the first is the image ID and the rest store the probabilities for each breed.

Are the IDs in `sample_submission.csv` in alphabetical order?

In [None]:
sample_df['id'].is_monotonic_increasing

Yes, they are. It means that if the sample submission file is to be reused, we have to make sure that when the predictions are made the test input is in alphabetical order too.
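For readers unfamiliar with `is_monotonic_increasing`: it checks that no element is smaller than the one before it (equal neighbours are allowed). A minimal plain-Python sketch of the same check follows; the helper name `is_sorted` and the IDs are hypothetical.

```python
def is_sorted(values):
    """True if every element is <= its successor (non-decreasing order)."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical IDs standing in for sample_df['id'].
print(is_sorted(["000a", "00b1", "0c22"]))
print(is_sorted(["00b1", "000a", "0c22"]))
```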

<a id="model"></a>
# Model
[[back to top]](#toc)

This chapter is divided into three subchapters.

The first step of creating a model is to read the available data from the disk. Fast.ai provides [an API to handle data named datablock](https://docs.fast.ai/tutorial.datablock). Using that API `DataLoaders` can be created which are able to read, store and make the data available to the model (Fast.ai's `DataLoaders` (plural) are not to be confused with PyTorch's `DataLoader` (singular), though the two are related). Loading the data is the subject of the [first subchapter](#model.dls).

The [second subchapter](#model.cnn) describes how to create a convolutional network and how to gain insight into the structure of the model.

The [third subchapter](#model.finetune) explains how to train the model easily and how to evaluate the results.

<a id="model.dls"></a>
## Dataloaders
[[back to top]](#toc)

Fast.ai introduces the concept of `DataLoaders` (plural). They are explained in their excellent tutorial [here](https://colab.research.google.com/github/fastai/fastbook/blob/master/02_production.ipynb#scrollTo=b4NxWFFV6CNq) (see the other chapters of their book [here](https://course.fast.ai/start_colab#Opening-a-chapter-of-the-book)). The idea is that it makes sense to handle the training and validation `DataLoader` (singular) together. Apart from holding the training & validation dataloaders together fast.ai's `DataLoaders` also provide additional functionality and there are classes inherited from them.

One of the inherited classes is `ImageDataLoaders` which has a factory method named `from_csv` (see its documentation [here](https://docs.fast.ai/vision.data.html#ImageDataLoaders.from_csv)). By default the `from_csv` method expects a csv file which has two columns: the first contains the IDs of the images and the second contains the labels. Luckily we have exactly this, but it is possible to set the behaviour of the method to handle a different setup as well (see documentation).

`ImageDataLoaders#from_csv` expects a csv file whose name is provided in the `csv_fname` parameter (`labels.csv` by default) and whose directory is given in the `path` parameter (so the location of the csv file is `path/csv_fname`). The first column of the csv file contains the names of the images and `ImageDataLoaders` will search for the images in the directory provided in the `folder` parameter (note that `folder` is relative to `path`, just like the csv file).

The csv file contains the IDs, but not the filenames of the images. Luckily, the filenames are in the format of `ID.jpg`. That's when the `suff` parameter comes in handy because it can add a suffix (most likely the file extension) to the ID of an image. So if the ID of an image is `IMAGE_ID`, `from_csv` reads that image from `path/folder/IMAGE_ID` with `suff` appended (`.jpg` in our case).
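To make the path composition concrete, here is a minimal sketch that only mimics how the pieces are put together. The helper `image_path` is hypothetical and `IMAGE_ID` is a placeholder; fast.ai does this internally in its own way.

```python
from pathlib import PurePosixPath


def image_path(path, folder, image_id, suff):
    """Compose path/folder/<image_id><suff>, e.g. .../train/abc.jpg."""
    return PurePosixPath(path) / folder / f"{image_id}{suff}"


# 'IMAGE_ID' is a placeholder, not a real file in the dataset.
example = image_path("../input/dog-breed-identification/", "train", "IMAGE_ID", ".jpg")
print(example)
```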


The `item_tfms` parameter defines the transformation which has to be performed on each image before feeding it into the model. In our case it is resizing the images, because our model will expect that each item has the same size and it turned out during the data analysis in the previous paragraphs that the dimensions of the input images might differ from each other.

Calling `ImageDataLoaders#from_csv` with these parameters will read all the images from `path/folder` with the IDs provided in the csv file and resize them to the same dimensions. It will also create and store the training and the validation set. The validation set is 20% of the total data by default and its members are selected randomly.

In [None]:
dls = ImageDataLoaders.from_csv(
    path=path,
    csv_fname='labels.csv',
    folder='train',
    suff='.jpg',
    item_tfms=Resize(256)
)

print("ImageDataloaders initialized.")
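The random 20% validation split mentioned above can be pictured as shuffling the item indices and cutting the list in two. The sketch below is only an illustration with a hypothetical helper (`random_split`) and made-up IDs; in fast.ai the split is controlled through the `valid_pct` and `seed` arguments of `from_csv`.

```python
import random


def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle indices and split the items into training and validation lists."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_valid = int(valid_pct * len(items))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    return [items[i] for i in train_idx], [items[i] for i in valid_idx]


# Hypothetical image IDs; the real ones come from labels.csv.
ids = [f"img_{i:03d}" for i in range(10)]
train_ids, valid_ids = random_split(ids)
print(len(train_ids), len(valid_ids))
```

Fixing the seed is useful when two training runs should be compared on the same validation set.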

<a id="model.cnn"></a>
## CNN learner
[[back to top]](#toc)

Once the `DataLoaders` (`dls`) are ready it is fairly simple to initialize a convolutional neural network using fast.ai.

In [None]:
learner = cnn_learner(dls, resnet34, metrics=error_rate)

The function named `cnn_learner` creates a convolutional neural network which reads the training and validation set from `dls`.

The model is based on ResNet-34. ResNet-34 is a pre-trained model, which is trained on the ImageNet dataset to be capable of distinguishing images (see more details [here](https://www.kaggle.com/pytorch/resnet34/home)). The idea is that it might be more effective to further improve a model which is already capable of recognizing everyday objects, rather than training one from scratch (transfer learning).

The `metrics` parameter defines how the performance of the model should be evaluated. In this case it's `error_rate` (the fraction of validation images whose breed is predicted incorrectly).

There are plenty of other parameters of `cnn_learner`, but they either have a default value or are deduced from other parameters. For example, one could wonder what the loss function is...

In [None]:
print(learner.loss_func)

...well, it's cross-entropy loss, which makes sense in the case of image classification problems.
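To see what these two quantities measure, here is a minimal plain-Python sketch for a single sample and a tiny batch. The function names and the numbers are made up for illustration; the fast.ai and PyTorch versions operate on batches of tensors instead. Cross-entropy is the negative logarithm of the probability the model assigns to the correct class, and the error rate is the share of wrong predictions.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


def error_rate(predicted, actual):
    """Fraction of samples whose predicted class differs from the label."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


# Hypothetical scores for three classes; the true class is index 0.
print(cross_entropy([2.0, 1.0, 0.1], target=0))
# Hypothetical predicted and actual class indices for four images.
print(error_rate([0, 2, 1, 1], [0, 1, 1, 1]))
```

The loss drops towards zero as the probability of the correct class approaches 1, which is why it is a natural training objective for classification.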

The model itself is also accessible:

In [None]:
print(learner.model)

<a id="model.finetune"></a>
## Fine tuning the model
[[back to top]](#toc)

Once the model is ready, fast.ai makes it fairly straightforward to train it. The most basic option is to call the [`fine_tune`](https://docs.fast.ai/callback.schedule.html#Learner.fine_tune) method on the learner. The provided parameter determines the number of epochs (10 in this case).

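(Note that `fine_tune` does more than a single [`fit_one_cycle`](https://docs.fast.ai/callback.schedule.html#Learner.fit_one_cycle) call: it first trains with the pre-trained layers frozen for `freeze_epochs` epochs (one by default), then unfreezes the whole model and trains it for the requested number of epochs. The easiest way to understand the difference between the two is to check the code itself.)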

In [None]:
learner.fine_tune(number_of_epochs)

The data above shows how the model trains, i.e. how the loss changes on the training and validation set and how the error_rate improves. Note that while the loss on the training set typically declines epoch by epoch, the validation loss and the error_rate might get worse compared to a previous iteration.

It is possible to have a more detailed overview of the performance of the learner than what we get during training. Fast.ai provides a class named `ClassificationInterpretation` which is able to show how accurately the model guesses a certain breed. (See Wikipedia for [F1-score](https://en.wikipedia.org/wiki/F-score#Definition), [precision, recall](https://en.wikipedia.org/wiki/Precision_and_recall) and [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision).)

It can happen that the model barely recognizes a certain breed. In that case it might be worth looking at the input data, because that specific breed may be underrepresented and thus the model doesn't have the chance to learn to recognize it. In such cases the modification of the input data set might be necessary.

In [None]:
interp = ClassificationInterpretation.from_learner(learner)
interp.print_classification_report()
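The per-breed numbers in such a report can be computed by hand from the true and predicted labels. The following is a minimal sketch with a hypothetical helper and made-up labels, intended only to show what precision, recall and F1 mean for one class.

```python
def precision_recall_f1(actual, predicted, label):
    """Per-class precision, recall and F1 from true/false positives and false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)  # correctly predicted as label
    fp = sum(a != label and p == label for a, p in pairs)  # wrongly predicted as label
    fn = sum(a == label and p != label for a, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical validation labels and predictions.
actual = ["pug", "pug", "beagle", "beagle", "pug"]
predicted = ["pug", "beagle", "beagle", "beagle", "pug"]
print(precision_recall_f1(actual, predicted, "pug"))
```

A breed with low recall is one the model often fails to recognize, which is the situation described above.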

We also have the chance to deep dive into the actual operations performed by `fine_tune` using the `show_training_loop` method on the learner. This call prints the callback functions called during training.

In [None]:
learner.show_training_loop()

It is even possible to plot the images with the worst losses. It is useful because that way it's possible to filter out invalid data. For example if the task is to identify the breed of a dog on a picture and there is no dog on the image, then that input is invalid.

In [None]:
interp.plot_top_losses(5, nrows=5)

<a id="predictions"></a>
# Make predictions
[[back to top]](#toc)

Once the model is trained, the next step is to make actual predictions. Bulk predictions can be made by calling the `get_preds` method of the learner. The function `get_preds` expects a dataloader which stores the test data.

A `Learner` always holds a `DataLoaders` object which by default has two dataloaders: one for training and one for validation. 

The `DataLoaders` class has a method named `test_dl`. As the pydoc describes:

> "Create a test dataloader from `test_items` using validation transforms of `dls`"

What this means is that `test_dl` reads in the data provided in `test_items` and applies to the input images the transformations with which the original `DataLoaders` object was created. In our case it means that `test_items` are resized to 256\*256 just like the training items were. Probably the easiest way to see this is to check the [related code](https://github.com/fastai/fastai/blob/f8b74ef5b320512a2bb4a6c3cb17a5e917b7d6a3/fastai/data/core.py#L394) in fast.ai.

It seems to be reasonable to provide the model with the input files sorted alphabetically. This is what the next code snippet does.

In [None]:
def get_id_from_image_path(image_path):
    image_id, _ = os.path.splitext(os.path.basename(image_path))
    return image_id
        

def cmp_path(item1, item2):
    id1 = get_id_from_image_path(item1)
    id2 = get_id_from_image_path(item2)
    
    if id1 < id2:
        return -1
    if id1 > id2:
        return 1
    
    return 0


sorted_test_image_files = sorted(get_image_files(test_path), key=cmp_to_key(cmp_path))
print("Test images are sorted alphabetically.")
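Since the comparator only compares the IDs, the same ordering can also be expressed with a key function. This is a minimal sketch with hypothetical paths and a hypothetical helper name (`image_id`), shown as an alternative rather than a replacement for the cell above.

```python
from pathlib import PurePosixPath


def image_id(image_path):
    """File name without directory and extension."""
    return PurePosixPath(image_path).stem


# Hypothetical test image paths.
paths = ["test/00c.jpg", "test/00a.jpg", "test/00b.jpg"]
sorted_paths = sorted(paths, key=image_id)
print(sorted_paths)
```

A key function is called once per item, whereas a comparator is called for every comparison made during sorting.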

The `test_dataloader` is created using the previously described `test_dl` method and then the prediction takes place.

In [None]:
test_dataloader = learner.dls.test_dl(sorted_test_image_files)
preds, _ = learner.get_preds(dl=test_dataloader)

print("Predictions are ready.")

In [None]:
preds.shape

The dimensions of the predictions are 10357 \* 120. 10357 is the number of the input images and 120 is the number of breeds ([see the EDA section](#EDA)). This is exactly the expected format (see `sample_submission.csv`).

The code snippet in the next cell puts the end results together: it stores the IDs of the input pictures in an array, transposes it (changes the dimensions from (1, n) to (n, 1)), and then puts the output tensor next to the ID column (`np.hstack`). Finally, the results are written into `submission.csv`.

In [None]:
submission = pd.read_csv(sample_submission_csv_path)

sorted_ids = list(map(lambda x: get_id_from_image_path(x) ,sorted_test_image_files))
id_col = np.transpose([np.asarray(sorted_ids)])

res = np.hstack((id_col, preds.detach().cpu().numpy()))
res_df = pd.DataFrame(res, columns=submission.columns)

res_df.to_csv(submission_csv_path, index=False)

print(f'Predictions are saved to {submission_csv_path}')
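The layout of the resulting file can also be illustrated without NumPy or pandas. The sketch below writes one row per image (the ID followed by one probability per breed) into an in-memory buffer; the helper `write_submission`, the two-breed header and all values are hypothetical.

```python
import csv
import io


def write_submission(out_file, header, ids, probabilities):
    """Write one row per image: the ID followed by one probability per breed."""
    writer = csv.writer(out_file)
    writer.writerow(header)
    for image_id, row in zip(ids, probabilities):
        if len(row) != len(header) - 1:
            raise ValueError("each row needs one probability per breed column")
        writer.writerow([image_id, *row])


# Hypothetical header, IDs and probabilities (two breeds only).
header = ["id", "beagle", "pug"]
buffer = io.StringIO()
write_submission(buffer, header, ["00a", "00b"], [[0.9, 0.1], [0.2, 0.8]])
print(buffer.getvalue())
```

The length check guards against a mismatch between the number of predicted classes and the number of breed columns in the header.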

<a id ="conclusion"></a>
# Conclusion
[[back to top]](#toc)

In this notebook a convolutional neural network was trained to predict dog breeds from pictures. The aim was to understand the simplest way to initialize and train a CNN using fast.ai. The aim was not to achieve state-of-the-art results or to apply advanced optimization techniques.

The code itself is indeed concise:

1. Create a dataloaders object:

    `dls = ImageDataLoaders.from_csv(
    path=path, csv_fname='labels.csv', folder='train', suff='.jpg', item_tfms=Resize(256))`
        
        
2. Initialize a convolutional neural network based on ResNet:

    `learner = cnn_learner(dls, resnet34, metrics=error_rate)`

3. Fine tune the model for a couple of epochs:

    `learner.fine_tune(number_of_epochs)`

4. Make predictions for the test data:

    `test_dataloader = learner.dls.test_dl(sorted_test_image_files)
    preds, _ = learner.get_preds(dl=test_dataloader)`


These few lines of code are still capable of achieving a score of \~0.64 on the leaderboard (\~80% accuracy). The model is far from state-of-the-art, but it works well as a baseline against which further improvements can be compared.

Further fine tuning of the model is outside the scope of this notebook. However, there is plenty of room for improvement. One way forward would be to apply transformations to the input images (data augmentation). This could be implemented by passing the `batch_tfms` parameter to the `ImageDataLoaders`. It would also make sense to try transfer learning based on other pre-trained models. Another idea would be to use `fit_one_cycle` and experiment with its parameters instead of simply calling `fine_tune`.

However, this code is a good start to get a baseline model which performs reasonably well.