# Table of Contents
1. [Data Extraction & Label Generation](#section_1)
2. [{Training Set + Validation Set} Creation](#section_2)
3. [Mixed Precision Training](#section_3)
4. [Validation Set Results](#section_4)
5. [Improving the Classifier](#section_5)
    1. [Data Cleaning](#section_5_1)
    2. [Progressive Image Re-sizing](#section_5_2)
    3. [Unfreezing & Discriminative Layer Training](#section_5_3)
6. [Test Set Predictions & Submission](#section_6)

---

This notebook attempts to replicate a few tricks mentioned by <a href="https://www.kaggle.com/jhoward" target="_blank">Jeremy Howard</a> in <a href="https://course.fast.ai/" target="_blank">Practical Deep Learning for Coders</a> to improve the performance of an image classifier.

Module imports and other preliminaries:

In [None]:
import os # Used below for os.chdir() and os.listdir().
import warnings
import zipfile
from fastai.vision import *
from fastai.metrics import error_rate
from fastai.widgets import *
import pandas as pd
import base64
from IPython.display import HTML
import re

warnings.filterwarnings('ignore') # Suppress warning messages.

%matplotlib inline

---

<a id="section_1"></a>
# 1. Data Extraction & Label Generation

Let's take a look at the competition data files.

In [None]:
os.chdir('/kaggle/input/dogs-vs-cats-redux-kernels-edition')

os.listdir()

Let's extract the files in `'train.zip'` and `'test.zip'` to the `'/kaggle/working/'` directory.

In [None]:
with zipfile.ZipFile('/kaggle/input/dogs-vs-cats-redux-kernels-edition/train.zip', 'r') as zip_ref:
    zip_ref.extractall('/kaggle/working/')

In [None]:
with zipfile.ZipFile('/kaggle/input/dogs-vs-cats-redux-kernels-edition/test.zip', 'r') as zip_ref:
    zip_ref.extractall('/kaggle/working/')

Next, let's change our working directory to `'/kaggle/working'` and take a look at its contents.

In [None]:
os.chdir('/kaggle/working/')

os.listdir()

The folders `'train'` and `'test'` contain the images.

Let's get the filenames in `'train'`.

In [None]:
train_fnames = get_image_files('/kaggle/working/train')

len(train_fnames)

There are `25000` files. Let's examine the first `5` filenames.

In [None]:
train_fnames[:5]

Finally, let's use a list comprehension to generate the labels.

In [None]:
labels = [('cat' if 'cat.' in str(fname) else 'dog') for fname in train_fnames]

labels[:5]
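
The labelling rule relies on the competition's file naming scheme (`cat.<n>.jpg` / `dog.<n>.jpg`). As a minimal sketch, the same rule can be written so that it looks only at the file name and ignores the parent folders. The helper `label_from_fname` and the example paths are hypothetical and are not part of the original workflow.

```python
from pathlib import Path

def label_from_fname(fname):
    """Return 'cat' or 'dog' based on the file name only (not its parent folders)."""
    return 'cat' if Path(fname).name.startswith('cat.') else 'dog'

# Hypothetical paths that follow the competition's naming scheme.
examples = [Path('/kaggle/working/train/cat.0.jpg'),
            Path('/kaggle/working/train/dog.0.jpg')]
example_labels = [label_from_fname(f) for f in examples]
print(example_labels)
```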

---

<a id="section_2"></a>
# 2. {Training Set + Validation Set} Creation

Let's create an `ImageDataBunch` object (containing both the training set and the validation set).

In [None]:
np.random.seed(123) # Ensure reproducibility.
data = ImageDataBunch.from_lists(
    path='/kaggle/working/train', 
    fnames=train_fnames, 
    labels=labels, 
    valid_pct=0.2, # Put 20% of the images in the validation set.
    ds_tfms=get_transforms(flip_vert=False), # Perform data augmentation.
    size=224, # Resize all images to the same size (224px by 224px).
    bs=32 # Set the batch size for training.
).normalize(imagenet_stats) # Normalize all images with ImageNet statistics.
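
For intuition, normalizing with ImageNet statistics means subtracting a per-channel mean and dividing by a per-channel standard deviation. The sketch below reproduces that arithmetic in plain NumPy on a made-up image; `normalize_image` is a hypothetical helper, and fastai applies the equivalent step to each batch internally.

```python
import numpy as np

# Per-channel ImageNet mean and standard deviation (RGB, pixel values scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_image(img):
    """Normalize a (3, H, W) image whose pixel values are in [0, 1]."""
    return (img - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]

# A made-up image in which every pixel equals its channel mean normalizes to zeros.
mean_img = np.ones((3, 4, 4)) * IMAGENET_MEAN[:, None, None]
normalized = normalize_image(mean_img)
print(np.abs(normalized).max())
```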

In [None]:
len(data.train_ds), len(data.valid_ds)

The training set contains `20000` images and the validation set contains `5000` images.
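
These counts follow directly from `valid_pct=0.2`. The sketch below mimics a seeded random split with NumPy to show where the numbers come from. It is an approximation for illustration and is not fastai's actual code path, so the selected indices need not match the ones in `data`.

```python
import numpy as np

n_images, valid_pct = 25000, 0.2

np.random.seed(123)                         # Same seed -> same shuffle -> same split.
shuffled = np.random.permutation(n_images)  # Random ordering of the image indices.
cut = int(valid_pct * n_images)             # Number of validation images.
valid_idx, train_idx = shuffled[:cut], shuffled[cut:]

print(len(train_idx), len(valid_idx))
```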

Classes:

In [None]:
data.classes

A random sample of observations:

In [None]:
data.show_batch(rows=3, figsize=(12, 12))

---

<a id="section_3"></a>
# 3. Mixed Precision Training

A couple of checks:

In [None]:
torch.cuda.is_available()

In [None]:
torch.backends.cudnn.enabled

Let's create our learner by specifying:

- `data` {Training Set + Validation Set}
- `models.resnet50` (model with ResNet50 architecture pre-trained on ImageNet images)
- `error_rate` (metric to show during training)
- `to_fp16()` (mixed precision training)

In [None]:
# Make sure Internet is on (the pre-trained weights are downloaded on first use).
learner = cnn_learner(data, models.resnet50, metrics=error_rate).to_fp16()
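
One reason mixed precision needs special handling is that half precision has a much narrower range than single precision, so very small gradients can round to zero. A common remedy is loss scaling: scale the loss (and hence the gradients) up before the backward pass, then scale back down in single precision. The NumPy sketch below shows the effect on one made-up value. The gradient and the scale factor are illustrative assumptions, not values taken from fastai.

```python
import numpy as np

tiny_grad = 1e-8       # Made-up gradient value, too small for float16.
loss_scale = 65536.0   # Hypothetical scale factor (2**16).

unscaled = np.float16(tiny_grad)              # Underflows to 0 in half precision.
scaled = np.float16(tiny_grad * loss_scale)   # Representable once scaled up.
recovered = np.float32(scaled) / loss_scale   # Unscale in single precision.

print(unscaled, scaled, recovered)
```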

Next, let's run `fastai`'s learning rate finder.

In [None]:
learner.lr_find(start_lr=1e-6)

In [None]:
learner.recorder.plot()

After eyeballing the graph, let's choose a *maximum learning rate*.

In [None]:
max_lr_choice = 5e-4
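
For reference, a learning rate finder trains for a short while as the learning rate grows exponentially, and records the loss at each step. The sketch below generates only the learning rate sequence. `lr_finder_schedule` is a hypothetical helper, and the `end_lr` and `num_it` values are assumptions about typical defaults rather than values read from the library.

```python
import numpy as np

def lr_finder_schedule(start_lr=1e-6, end_lr=10.0, num_it=100):
    """Learning rates that grow exponentially from start_lr to end_lr."""
    pct = np.linspace(0.0, 1.0, num_it)           # Fraction of the run completed.
    return start_lr * (end_lr / start_lr) ** pct

lrs = lr_finder_schedule()
print(lrs[0], lrs[-1])
```

Because the steps are evenly spaced on a log scale, the loss plot is easiest to read with a logarithmic x-axis.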

Now, let's train our model for `4` epochs.

In [None]:
learner.fit_one_cycle(4, max_lr=max_lr_choice)

Judging by the validation error rate reported for each epoch of the training run, that's a pretty good fit.
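
The learning rate schedule behind `fit_one_cycle` can be sketched as a warm-up to `max_lr` followed by a long anneal to a much smaller value. The function below is a simplified stand-in. The `div_factor`, `pct_start` and `final_div` values are our understanding of fastai v1's defaults and should be treated as assumptions, and the momentum schedule that runs alongside is left out.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=5e-4, div_factor=25.0,
                 pct_start=0.3, final_div=25.0 * 1e4):
    """Learning rate at `step` (0-based) of a one-cycle schedule with cosine annealing."""
    warmup_steps = round(total_steps * pct_start)

    def cos_anneal(start, end, pct):
        # Cosine interpolation from start (pct=0) to end (pct=1).
        return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

    if step < warmup_steps:
        return cos_anneal(max_lr / div_factor, max_lr, step / warmup_steps)
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    return cos_anneal(max_lr, max_lr / final_div, pct)

total = 4 * 625  # 4 epochs x 625 batches (20000 training images / batch size 32).
schedule = [one_cycle_lr(s, total) for s in range(total)]
print(schedule[0], max(schedule), schedule[-1])
```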

Plot of training & validation losses:

In [None]:
learner.recorder.plot_losses()

Let's save our model's weights.

In [None]:
learner.save('imgsize224-stage1')

---

<a id="section_4"></a>
# 4. Validation Set Results

Let's create a `ClassificationInterpretation` object.

In [None]:
interp = ClassificationInterpretation.from_learner(learner)

Let's plot the confusion matrix.

In [None]:
interp.plot_confusion_matrix()

Accuracy:

In [None]:
cm = interp.confusion_matrix() # Compute the confusion matrix once and reuse it.
accuracy = (cm[0, 0] + cm[1, 1]) / len(data.valid_ds)

accuracy
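
To make the arithmetic explicit, here is the same calculation on a made-up confusion matrix (rows are actual classes, columns are predicted classes). The numbers are invented for illustration and are not results from this notebook.

```python
import numpy as np

# Made-up confusion matrix: rows = actual class, columns = predicted class.
toy_cm = np.array([[48,  2],    # Actual 'cat': 48 predicted 'cat', 2 predicted 'dog'.
                   [ 3, 47]])   # Actual 'dog': 3 predicted 'cat', 47 predicted 'dog'.

toy_accuracy = np.trace(toy_cm) / toy_cm.sum()  # Correct predictions / all predictions.
toy_error = 1 - toy_accuracy                    # Counterpart of the error_rate metric.

print(toy_accuracy, toy_error)
```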

Finally, let's use the `plot_top_losses()` method to examine the images with the largest losses. Each image is shown along with its:

- predicted class
- actual class
- loss
- probability assigned by model to actual class

In [None]:
interp.plot_top_losses(20, figsize=(16, 16))

As we can see, some of the misclassified images are noisy / irrelevant.
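
The link between the loss and the probability assigned to the actual class can be sketched as follows, assuming the learner uses cross-entropy loss (our understanding of the default for classification). The probabilities are made up.

```python
import numpy as np

# Made-up probabilities the model assigned to the *actual* class of four images.
p_actual = np.array([0.99, 0.60, 0.05, 0.30])

losses = -np.log(p_actual)          # Cross-entropy loss per image.
top_idx = np.argsort(losses)[::-1]  # Image indices, largest loss first.

print(top_idx, losses[top_idx].round(3))
```

The lower the probability assigned to the actual class, the larger the loss, so confident mistakes (and mislabeled images) rise to the top.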

---

<a id="section_5"></a>
# 5. Improving the Classifier

<a id="section_5_1"></a>
## 5.1. Data Cleaning

Now, we'll use `fastai`'s `ImageCleaner` Jupyter widget to re-label / delete images which are mislabeled / noisy / irrelevant.

First, let's create a new `ImageDataBunch` without a training/validation split.

In [None]:
data_no_split = ImageDataBunch.from_lists(
    path='/kaggle/working/train', 
    fnames=train_fnames, 
    labels=labels, 
    valid_pct=0, # Don't put any images in the validation set.
    ds_tfms=get_transforms(flip_vert=False),
    size=224,
    bs=32
).normalize(imagenet_stats)

Next, let's create a new learner with `data_no_split` and load the `'imgsize224-stage1'` weights.

In [None]:
%%capture
learner_no_split = cnn_learner(data_no_split, models.resnet50).to_fp16()
learner_no_split.load('imgsize224-stage1')

Let's pass the new learner to `DatasetFormatter.from_toplosses()`. It returns a *formatted* dataset and the file indices sorted in descending order of loss.

In [None]:
dataset, file_indices = DatasetFormatter.from_toplosses(learner_no_split)

Now, we can use `ImageCleaner` to:

1. **Re-label:** Mislabeled images (`'cat'` as `'dog'` or `'dog'` as `'cat'`).
2. **Delete:**
    - Images containing both a cat and a dog (since this is not a multi-label classification problem).
    - Images where it isn't clear whether the animal is a cat or a dog (e.g., due to the animal's posture / image blur).
    - Clip art / cartoons.
    - Irrelevant images (e.g., house pictures, landscapes, company logos).

In [None]:
# Use in interactive mode only. When committing notebook, cleaning isn't possible.
ImageCleaner(dataset, file_indices, Path('/kaggle/working/train'))

A new CSV file `'cleaned.csv'` has been created in the `'/kaggle/working/train'` folder. Let's read it in and take a look.

In [None]:
# Use in interactive mode only.
cleaned = pd.read_csv('/kaggle/working/train/cleaned.csv')

cleaned.head()

**Note:** Since it isn't possible to use the `ImageCleaner` Jupyter widget when committing the notebook, we need to save the `cleaned` data frame to a persistent storage location (e.g., Google Cloud Storage / AWS S3 / Dropbox). We'll then continue our workflow with this persistent file (instead of the interactive one above).

Let's create a function to download the data frame.

In [None]:
# Source: https://www.kaggle.com/rtatman/download-a-csv-file-from-a-kernel
def create_download_link(df, title="Download CSV file", filename="dogs_vs_cats_cleaned.csv"):
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

We can now download the data frame.

In [None]:
# Use in interactive mode only.
create_download_link(cleaned)

# See 'Download CSV file' link below.

Next, let's download the file, and upload it to a bucket in Google Cloud Storage (or AWS S3 / Dropbox / other similar service).

After the file has been uploaded, we can read it in.

In [None]:
# Make sure Internet is on.
cleaned = pd.read_csv('https://storage.googleapis.com/cleaned-data/dogs_vs_cats_cleaned.csv')

cleaned.head()

Let's recreate our `ImageDataBunch` using the `cleaned` data frame.

In [None]:
np.random.seed(123)
data_cleaned = ImageDataBunch.from_df(
    path='/kaggle/working/train', 
    df=cleaned, 
    valid_pct=0.2, # Put 20% of the images in the validation set.
    ds_tfms=get_transforms(flip_vert=False),
    size=224,
    bs=32
).normalize(imagenet_stats)

Let's create a new learner with `data_cleaned`.

In [None]:
learner = cnn_learner(data_cleaned, models.resnet50, metrics=error_rate).to_fp16()

Next, let's run `fastai`'s learning rate finder.

In [None]:
learner.lr_find(start_lr=1e-6)

In [None]:
learner.recorder.plot()

After eyeballing the graph, let's choose a maximum learning rate.

In [None]:
max_lr_choice = 5e-4

Now, let's train our model for `4` epochs.

In [None]:
learner.fit_one_cycle(4, max_lr=max_lr_choice)

Let's save our model's weights.

In [None]:
learner.save('imgsize224-stage2')

<a id="section_5_2"></a>
## 5.2. Progressive Image Re-sizing

Now we have a model (`'imgsize224-stage2'`) that is pretty good at classifying dogs vs. cats.

**Trick to create an even better model:** 

1. Re-size all images to 300px by 300px and create a new `ImageDataBunch`.
2. Perform transfer learning on this new `ImageDataBunch` using `'imgsize224-stage2'` as our pre-trained model.

To the network, the larger images are effectively a new dataset. Training on them should reduce any overfitting to the 224px images while keeping the features the previous model has already learned.
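
Weights trained at one image size can be reused at another because convolutional layers do not depend on the input size, and the pooling step before the classifier head collapses whatever spatial size arrives into a fixed-length vector. The sketch below traces the side length of the feature map through the downsampling steps of a standard ResNet. The kernel, stride and padding values are those of the usual torchvision-style architecture, which we assume is what `models.resnet50` provides, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet_feature_side(image_side):
    """Side length of the final ResNet feature map for a square input."""
    side = conv_out(image_side, kernel=7, stride=2, padding=3)  # Stem convolution.
    side = conv_out(side, kernel=3, stride=2, padding=1)        # Max pooling.
    for _ in range(3):  # The last three residual stages each halve the map.
        side = conv_out(side, kernel=3, stride=2, padding=1)
    return side

for image_side in (224, 300):
    print(image_side, '->', resnet_feature_side(image_side))
```

Only the spatial size of the final feature map changes between the two image sizes. The number of channels, and therefore the shape of every weight tensor, stays the same.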

Let's create the new `ImageDataBunch`.

In [None]:
np.random.seed(123)
data = ImageDataBunch.from_df(
    path='/kaggle/working/train', 
    df=cleaned, 
    valid_pct=0.2,
    ds_tfms=get_transforms(flip_vert=False),
    size=300, # Re-size all images to 300px by 300px.
    bs=32
).normalize(imagenet_stats)

Let's create a new learner with the new `ImageDataBunch` and load the previous model's weights.

In [None]:
%%capture
learner = cnn_learner(data, models.resnet50, metrics=error_rate).to_fp16()
learner.load('imgsize224-stage2')

Next, let's run `fastai`'s learning rate finder.

In [None]:
learner.lr_find(start_lr=1e-6, end_lr=1e-2, stop_div=False)

In [None]:
learner.recorder.plot()

After eyeballing the graph, let's choose a maximum learning rate.

In [None]:
max_lr_choice = 5e-5

Now, let's train our model for `4` epochs.

In [None]:
learner.fit_one_cycle(4, max_lr=max_lr_choice)

Let's save our model's weights.

In [None]:
learner.save('imgsize300-stage1')

<a id="section_5_3"></a>
## 5.3. Unfreezing & Discriminative Layer Training

Let's unfreeze our model and run `fastai`'s learning rate finder.

In [None]:
learner.unfreeze()
learner.lr_find(start_lr=1e-6, end_lr=1e-2, stop_div=False)

In [None]:
learner.recorder.plot()

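We'll pass `slice(1e-5, 1e-4)` as our range of learning rates.

Let's perform *discriminative layer training* for `4` epochs. This applies a different learning rate to each *layer group*: the earliest group trains at `1e-5`, the last group at `1e-4`, and any groups in between get rates spread between the two.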

In [None]:
learner.fit_one_cycle(4, max_lr=slice(1e-5, 1e-4))
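
As a rough sketch of how a `slice` expands into per-group learning rates, assume three layer groups and rates evenly spaced on a log scale. Both are assumptions about fastai v1's behaviour for this learner, and `layer_group_lrs` is a hypothetical helper.

```python
import numpy as np

def layer_group_lrs(start, stop, n_groups):
    """Learning rates from `start` to `stop`, evenly spaced on a log scale."""
    return np.geomspace(start, stop, num=n_groups)

group_lrs = layer_group_lrs(1e-5, 1e-4, n_groups=3)
print(group_lrs)
```

The earliest layers, which hold the most general features, change the least, while the newly added head trains at the highest rate.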

Let's save our model's weights.

In [None]:
learner.save('imgsize300-stage2')

Finally, let's export our model as a pickle.

In [None]:
learner = learner.to_fp32() # Convert back to default precision for safe export.
learner.export()

---

<a id="section_6"></a>
# 6. Test Set Predictions & Submission

Let's get the filenames in the `'test'` folder.

In [None]:
test_fnames = get_image_files('/kaggle/working/test')

len(test_fnames)

There are `12500` files. Let's examine the first `5` filenames.

In [None]:
test_fnames[:5]

Let's use a list comprehension to extract the ids.

In [None]:
ids = [int(re.findall(r'\d+', str(fname))[0]) for fname in test_fnames]

ids[:5]
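
The regular expression above takes the first run of digits anywhere in the path, which works because `'/kaggle/working/test'` contains no digits. A slightly more defensive variant, sketched below, reads the id from the file name alone. `id_from_fname` and the example paths are hypothetical.

```python
from pathlib import Path

def id_from_fname(fname):
    """Return the integer id encoded in a test file name such as '123.jpg'."""
    return int(Path(fname).stem)

# Hypothetical paths; note the digit in a parent folder of the second one.
print(id_from_fname('/kaggle/working/test/123.jpg'))
print(id_from_fname('/data/run2/test/45.jpg'))
```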

Next, let's load our learner from the exported pickle (specifying the test set this time).

In [None]:
learner.path # Location of pickle.

In [None]:
learner = load_learner(path=learner.path, test=test_fnames)

Now, let's obtain our test set predictions.

In [None]:
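preds, _ = learner.get_preds(ds_type=DatasetType.Test) # Test images have no labels, so discard the second value (this also keeps `labels` from Section 1 intact).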

preds[:5]

The first column is the probability of `'cat'` and the second column is the probability of `'dog'`.

Let's create our submission data frame and sort its rows by `id`.

In [None]:
d = {'id': ids, 'label': preds[:, 1].numpy()} # Probability of 'dog', converted from a tensor to a NumPy array.
submission = pd.DataFrame(data=d)
submission = submission.sort_values(by='id')

submission.head()
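
The order of operations matters here: the ids and the probabilities are paired row by row first, and only then sorted, so each probability stays attached to its image. A toy version with made-up values:

```python
import pandas as pd

# Made-up ids and P('dog') values, in the arbitrary order the files were read.
toy_ids = [3, 1, 2]
toy_dog_probs = [0.98, 0.02, 0.75]

# Build the frame first so each probability stays attached to its id, then sort.
toy_submission = (pd.DataFrame({'id': toy_ids, 'label': toy_dog_probs})
                    .sort_values(by='id')
                    .reset_index(drop=True))
print(toy_submission)
```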

Finally, let's write it to disk.

In [None]:
submission.to_csv('submission.csv', index=False)