**Kernel description:**

This kernel demonstrates the application of transfer learning with VGG-16 to the given whale multi-class classification problem. Please note that this is a solution to my university course assignment, whose objective differs from the originally stated problem. The difference is that the original training data set is filtered by removing all whale individuals for which the number of images does not exceed the given threshold `NUM_IMAGES_THRESHOLD`. Also, all pictures annotated with the `new_whale` label are excluded from the dataset. Finally, this kernel does not produce a submission file.

Anyway, please consider upvoting this kernel if you liked it!

PS. The Table of Contents was generated using the ToC2 extension for Jupyter Notebook.

Articles about transfer learning:
 * https://machinelearningmastery.com/how-to-use-transfer-learning-when-developing-convolutional-neural-network-models/
 * https://keras.io/guides/transfer_learning/

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-preprocessing" data-toc-modified-id="Data-preprocessing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data preprocessing</a></span><ul class="toc-item"><li><span><a href="#Read-in-the-training-set" data-toc-modified-id="Read-in-the-training-set-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Read in the training set</a></span></li><li><span><a href="#Plot-a-few-images-along-with-the-corresponding-whale-IDs" data-toc-modified-id="Plot-a-few-images-along-with-the-corresponding-whale-IDs-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Plot a few images along with the corresponding whale IDs</a></span></li><li><span><a href="#Check-how-many-images-there-are-for-all-whale-IDs" data-toc-modified-id="Check-how-many-images-there-are-for-all-whale-IDs-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Check how many images there are for all whale IDs</a></span></li><li><span><a href="#Filter-whale-IDs" data-toc-modified-id="Filter-whale-IDs-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Filter whale IDs</a></span></li><li><span><a href="#Split-the-filtered-data-into-the-training-and-test-(validation)-set" data-toc-modified-id="Split-the-filtered-data-into-the-training-and-test-(validation)-set-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Split the filtered data into the training and test (validation) set</a></span></li></ul></li><li><span><a href="#Data-augmentation" data-toc-modified-id="Data-augmentation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data augmentation</a></span><ul class="toc-item"><li><span><a href="#Define-a-Keras-data-generator-for-the-data-augmentation-and-apply-this-generator-to-the-splitted-data" data-toc-modified-id="Define-a-Keras-data-generator-for-the-data-augmentation-and-apply-this-generator-to-the-splitted-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Define a Keras data generator for the data augmentation and apply this generator 
to the splitted data</a></span></li><li><span><a href="#Plot-some-augmented-data" data-toc-modified-id="Plot-some-augmented-data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Plot some augmented data</a></span></li></ul></li><li><span><a href="#Model-training" data-toc-modified-id="Model-training-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model training</a></span><ul class="toc-item"><li><span><a href="#Define-a-model" data-toc-modified-id="Define-a-model-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Define a model</a></span></li><li><span><a href="#Train-only-the-new-top-layers-of-the-model" data-toc-modified-id="Train-only-the-new-top-layers-of-the-model-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Train only the new top layers of the model</a></span></li><li><span><a href="#Apply-fine-tuning,-that-is,-unfreeze-the-base-model-and-train-the-whole-model-with-a-small-learning-rate" data-toc-modified-id="Apply-fine-tuning,-that-is,-unfreeze-the-base-model-and-train-the-whole-model-with-a-small-learning-rate-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Apply fine-tuning, that is, unfreeze the base model and train the whole model with a small learning rate</a></span></li><li><span><a href="#Plot-the-training-history" data-toc-modified-id="Plot-the-training-history-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Plot the training history</a></span></li></ul></li></ul></div>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    print(dirname)
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set parameters for plotting.
# plt.rc('figure', figsize=(8, 6))
sns.set(font_scale=1)

# Data preprocessing

## Read in the training set

In [None]:
df = pd.read_csv('../input/whale-categorization-playground/train.csv')
df

In [None]:
# Print the number of missing entries.
df.isna().sum()

## Plot a few images along with the corresponding whale IDs

In [None]:
IMAGES_DIR = '../input/whale-categorization-playground/train/train/'
NUM_IMAGES_TO_PLOT = 3
image_filenames = os.listdir(IMAGES_DIR)

for i in range(NUM_IMAGES_TO_PLOT):
    image_filename = image_filenames[i]
    image_path = os.path.join(IMAGES_DIR, image_filename)
    image_np = plt.imread(image_path)
    whale_id = df.query(f"Image == '{image_filename}'").Id.item()
    plt.subplots(figsize=(8, 6))
    plt.imshow(image_np)
    plt.title(whale_id)

## Check how many images there are for all whale IDs

In [None]:
df2 = df.groupby('Id').agg('count').rename({'Image': 'NumImages'}, axis=1)
df2.sort_values('NumImages', ascending=False, inplace=True)
df2

Plot a histogram of the number of cases for various numbers of whale images (that is, how many whale IDs have a given number of images).

In [None]:
plt.figure(figsize=(6, 4))
df2.NumImages.hist(bins=list(range(0, 11, 2)))
plt.title('Number of cases for various numbers of whale images')
plt.xlabel('Number of images')
plt.ylabel('Number of cases')
plt.show()

In [None]:
plt.figure(figsize=(6, 4))
df2.NumImages.hist(bins=list(range(10, 71, 10)))
plt.title('Number of cases for various numbers of whale images')
plt.xlabel('Number of images')
plt.ylabel('Number of cases')
plt.show()
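
A side note on how the `bins` lists above are interpreted: they are bin edges, every bin is half-open on the right, and only the last bin also includes its right edge. Values outside the edges are not counted. The following minimal sketch uses `np.histogram`, which follows the same edge convention as the plotting call, on made-up image counts (`toy_counts` is purely illustrative and not taken from the data set).

```python
import numpy as np

# Made-up numbers of images per whale ID (illustrative only).
toy_counts = np.array([1, 1, 2, 3, 5, 8, 10, 12])

# Same edges as in the first histogram above: [0, 2, 4, 6, 8, 10].
edges = list(range(0, 11, 2))
cases_per_bin, _ = np.histogram(toy_counts, bins=edges)

# Print how many toy whale IDs fall into each bin.
for left, right, cases in zip(edges[:-1], edges[1:], cases_per_bin):
    print(f'{left}-{right}: {cases}')
```

In this toy example the value 10 lands in the last bin, whereas 12 lies beyond the last edge and is ignored.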

## Filter whale IDs

Remove the `new_whale` entry from consideration.

In [None]:
df2.drop('new_whale', inplace=True)
df2

Keep only those whale IDs for which the number of corresponding images is greater than `NUM_IMAGES_THRESHOLD`.

In [None]:
NUM_IMAGES_THRESHOLD = 20

df3 = df2.query(f'NumImages > {NUM_IMAGES_THRESHOLD}')

print('shape:', df3.shape)
print('total number of images:', df3.NumImages.sum())
df3

Obtain a dataframe containing entries only for the filtered whale IDs.

In [None]:
ids_to_leave = list(df3.index)

filtered_df = df.query(f'Id in {ids_to_leave}')
filtered_df
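
The filtering steps above can also be illustrated on a toy dataframe. This is only a sketch of an equivalent formulation based on `value_counts` and `isin`; the names `toy_df`, `TOY_THRESHOLD` and the whale IDs are hypothetical and chosen for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv (file names and IDs are made up).
toy_df = pd.DataFrame({
    'Image': [f'img{i}.jpg' for i in range(8)],
    'Id': ['w_a', 'w_a', 'w_a', 'w_b', 'w_b',
           'new_whale', 'new_whale', 'w_c'],
})
TOY_THRESHOLD = 1  # hypothetical threshold for the toy data

# Count images per ID, keep the IDs above the threshold,
# and discard the `new_whale` label.
counts = toy_df.Id.value_counts()
ids_to_keep = counts[counts > TOY_THRESHOLD].index.drop(
    'new_whale', errors='ignore')

# Keep only the rows that belong to the remaining IDs.
toy_filtered_df = toy_df[toy_df.Id.isin(ids_to_keep)]
print(sorted(toy_filtered_df.Id.unique()))
```

Here `w_c` is dropped because it has only one image, and `new_whale` is dropped regardless of its count.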

Get rid of the image with the ship.

In [None]:
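# Reassign instead of dropping in place: `filtered_df` was derived from `df`,
# so an in-place modification may trigger a SettingWithCopyWarning.
filtered_df = filtered_df.drop(
    filtered_df.query("Image == '496b52ff.jpg'").index, axis=0)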

filtered_df

## Split the filtered data into the training and test (validation) set

For splitting the data, we could use the `train_test_split` function from `sklearn`. Alternatively, we could specify the `validation_split` parameter of an `ImageDataGenerator` instance and use it to create two augmented image generators: one for the training set and one for the test (validation) set. However, by default neither approach guarantees a balanced split of the data, i.e. it could happen that not all the whale IDs present in the training set are present in the test set as well (`train_test_split` stratifies only when its `stratify` parameter is set). To avoid this problem, we obtain a stratified split of the data by using a `StratifiedShuffleSplit` object from `sklearn.model_selection`.

Note that this splitter can throw exceptions in the following cases:
* if the `test_size` is too small to include all the whale IDs into the test set (thus, it is not possible to accomplish a truly stratified splitting of the data)
* if it is not possible to include some whale IDs into the test set because there is only one image for such whale IDs

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

X, y = filtered_df.Image, filtered_df.Id
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_indices, test_indices = list(splitter.split(X, y))[0]

print('shapes:', train_indices.shape, test_indices.shape)
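
The two failure cases noted above can be reproduced on toy labels. The sketch below is illustrative only: `try_split` is a hypothetical helper, the labels are made up, and the features are dummy zeros because only the labels matter for stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def try_split(labels, test_size):
    """Return 'ok' if a stratified split succeeds, else the error text."""
    labels = np.asarray(labels)
    dummy_features = np.zeros(len(labels))
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                      random_state=0)
    try:
        # The split is lazy, so `next` is needed to trigger the checks.
        next(splitter.split(dummy_features, labels))
    except ValueError as error:
        return f'ValueError: {error}'
    return 'ok'


balanced = ['w_a'] * 4 + ['w_b'] * 4 + ['w_c'] * 4

print(try_split(balanced, 0.25))            # enough test samples per class
print(try_split(balanced, 2))               # 2 test samples for 3 classes
print(try_split(['w_a'] * 4 + ['w_b'], 0.25))  # 'w_b' has a single image
```

The first call succeeds, while the other two report a `ValueError`, which is why the data set is filtered by `NUM_IMAGES_THRESHOLD` before splitting.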

Check that the training and test sets contain the same number of unique whale IDs.

In [None]:
train_df = filtered_df.iloc[train_indices]
test_df = filtered_df.iloc[test_indices]

train_df.Id.nunique() == test_df.Id.nunique()

For each whale ID, print the ratio of the number of its entries in the test dataframe to the number of its entries in the training dataframe.

In [None]:
train_entries_counts = train_df.Id.value_counts().sort_index()
test_entries_counts = test_df.Id.value_counts().sort_index()

print('test/train ratio of entries:')
print()
print(test_entries_counts / train_entries_counts)
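
With `test_size=0.25`, a well-stratified split puts about a quarter of each whale's images into the test set and three quarters into the training set, so the test/train ratio per ID should be close to 0.25 / 0.75 ≈ 0.33 rather than 0.25. A minimal sketch on toy data (the dataframe `toy_split_df` and its whale IDs are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 3 hypothetical whale IDs with 8 images each.
toy_split_df = pd.DataFrame({
    'Image': [f'img{i}.jpg' for i in range(24)],
    'Id': ['w_a'] * 8 + ['w_b'] * 8 + ['w_c'] * 8,
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25,
                                  random_state=0)
train_idx, test_idx = next(splitter.split(toy_split_df.Image,
                                          toy_split_df.Id))

# Per-ID counts on both sides of the split, aligned by whale ID.
train_counts = toy_split_df.iloc[train_idx].Id.value_counts().sort_index()
test_counts = toy_split_df.iloc[test_idx].Id.value_counts().sort_index()

toy_ratio = test_counts / train_counts
print(toy_ratio)
```

Each toy ID ends up with 6 training and 2 test images, i.e. a ratio of one third.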

# Data augmentation

## Define a Keras data generator for the data augmentation and apply this generator to the splitted data

Check the list of all possible image transformations in Keras here: https://keras.io/api/preprocessing/image/#imagedatagenerator-class

Also, you can find a description of all the parameters of the `flow_from_dataframe` method under the following link: https://keras.io/api/preprocessing/image/#flow_from_dataframe-method

In [None]:
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator

TRAIN_DIR = '/kaggle/input/whale-categorization-playground/train/train'

# Specify image transformations for the data augmentation here. 
datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rotation_range=30,
    brightness_range=(0.5, 1.5),
    fill_mode='nearest',
    horizontal_flip=True
)

BATCH_SIZE = 32
TARGET_SIZE = (224, 224)

train_set_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory=TRAIN_DIR,
    x_col='Image',
    y_col='Id',
    target_size=TARGET_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical'  # ensures one-hot encoding of class labels
)

test_set_generator = datagen.flow_from_dataframe(
    dataframe=test_df,
    directory=TRAIN_DIR,
    x_col='Image',
    y_col='Id',
    target_size=TARGET_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical'  # ensures one-hot encoding of class labels
)

## Plot some augmented data

In [None]:
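NUM_IMAGES_TO_PLOT = 5  # must not exceed BATCH_SIZE

# Get one batch of the training data.
X, y = next(train_set_generator)

# Get the whale IDs ordered by the class indices assigned by the generator,
# so that position i corresponds to index i of the one-hot encoding
# (the order returned by `train_df.Id.unique()` does not match it).
class_indices = train_set_generator.class_indices
unique_whale_ids = sorted(class_indices, key=class_indices.get)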

images_subbatch = X[:NUM_IMAGES_TO_PLOT]
one_hot_class_labels_subbatch = y[:NUM_IMAGES_TO_PLOT]

for image, one_hot_class_labels in zip(images_subbatch,
                                       one_hot_class_labels_subbatch):
    plt.subplots()
    plt.imshow(image)
    whale_id = unique_whale_ids[np.argmax(one_hot_class_labels)]
    plt.title(whale_id)
    plt.show()
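
A note on decoding the one-hot labels: according to the Keras documentation, `flow_from_dataframe` infers the classes from `y_col` and assigns label indices in alphanumeric order, and the resulting mapping is exposed through the generator's `class_indices` attribute. This order generally differs from the order of first appearance returned by `Series.unique()`. The toy sketch below (with made-up whale IDs) shows how the two orders can disagree when a one-hot vector is decoded.

```python
import numpy as np
import pandas as pd

# Made-up whale IDs in the order they appear in a toy dataframe.
toy_ids = pd.Series(['w_c', 'w_a', 'w_c', 'w_b'])

order_of_appearance = list(toy_ids.unique())   # ['w_c', 'w_a', 'w_b']
alphanumeric_order = sorted(toy_ids.unique())  # ['w_a', 'w_b', 'w_c']

# One-hot vector for 'w_c' under alphanumeric class indexing.
one_hot = np.array([0.0, 0.0, 1.0])
class_index = int(np.argmax(one_hot))

decoded_by_appearance = order_of_appearance[class_index]
decoded_alphanumerically = alphanumeric_order[class_index]

print(decoded_by_appearance)     # wrong label for this encoding
print(decoded_alphanumerically)  # label that matches the encoding
```

Only the alphanumeric lookup recovers `w_c`; the lookup by order of appearance returns `w_b` instead.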

# Model training

## Define a model

In [None]:
from tensorflow.keras.applications.vgg16 import VGG16

# Build the input shape from `TARGET_SIZE` so that non-square target
# sizes are handled as well; the last dimension is the 3 RGB channels.
INPUT_SHAPE = (TARGET_SIZE[0], TARGET_SIZE[1], 3)

base_model = VGG16(
    weights='imagenet',  # load weights pretrained on ImageNet
    include_top=False,  # do not include the ImageNet classifier at the top
    input_shape=INPUT_SHAPE,
    pooling='max'  # add a global max pooling layer after the base model
)

base_model.summary()

In [None]:
from tensorflow.keras.layers import Dropout, Dense

# Freeze the base model so that only the new top layers are trained.
base_model.trainable = False

num_classes = len(unique_whale_ids)

model = keras.Sequential([
    base_model,
    Dropout(0.2),
    Dense(128, activation='relu'),
    # Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(num_classes, name='predictions')
])

model.summary()

## Train only the new top layers of the model

In [None]:
from tensorflow.keras.losses import CategoricalCrossentropy

EPOCHS = 30

# The last layer has no softmax activation, so the model outputs raw
# logits; hence set the `from_logits` parameter of the loss to True.
model.compile(
    optimizer='adam',
    loss=CategoricalCrossentropy(from_logits=True),  
    metrics=['accuracy']
)

history = model.fit(train_set_generator, 
                    epochs=EPOCHS,
                    validation_data=test_set_generator,
                    verbose=2   # don't display the progress bar
) 
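
To illustrate what `from_logits=True` means, the following NumPy sketch computes the categorical cross-entropy in two ways: directly from raw logits, and from probabilities obtained by applying softmax first. It is a simplified re-implementation for illustration, not the Keras code itself, and the logits and labels are made up.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy_from_logits(one_hot, logits):
    """Mean cross-entropy computed from raw scores (log-softmax inside)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(one_hot * log_probs).sum(axis=-1).mean()


def cross_entropy_from_probs(one_hot, probs):
    """Mean cross-entropy computed from probabilities."""
    return -(one_hot * np.log(probs)).sum(axis=-1).mean()


# Made-up batch of 2 samples and 3 classes.
toy_logits = np.array([[2.0, -1.0, 0.5], [0.1, 0.2, 3.0]])
toy_one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

loss_a = cross_entropy_from_logits(toy_one_hot, toy_logits)
loss_b = cross_entropy_from_probs(toy_one_hot, softmax(toy_logits))
print(np.isclose(loss_a, loss_b))
```

Both routes give the same loss; passing logits to a loss that expects probabilities (or vice versa) would not.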

## Apply fine-tuning, that is, unfreeze the base model and train the whole model with a small learning rate 

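Note that VGG-16 contains no batch normalization layers, so after unfreezing, all of its parameters become trainable (check the number of non-trainable params in the summary below). For base models that do contain batch normalization layers, the Keras guide recommends keeping those layers in inference mode during fine-tuning (by calling the base model with `training=False`) so that their learned statistics (i.e. the moving mean and variance) are not "destroyed" by the updates. You can learn more about various nuances related to transfer learning here: https://keras.io/guides/transfer_learning/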

In [None]:
base_model.trainable = True

model.summary()

In [None]:
from tensorflow.keras.optimizers import Adam

FINE_TUNING_EPOCHS = 30

model.compile(optimizer=Adam(1e-5),  # set the learning rate to a low value
              loss=CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
)

fine_tuning_history = model.fit(train_set_generator, 
                                epochs=FINE_TUNING_EPOCHS,
                                validation_data=test_set_generator,
                                verbose=2  # don't display the progress bar
)

## Plot the training history

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

fine_tuning_acc = fine_tuning_history.history['accuracy']
fine_tuning_val_acc = fine_tuning_history.history['val_accuracy']

top_layers_training_epochs = list(range(1, EPOCHS + 1))
fine_tuning_epochs = list(range(EPOCHS + 1, 
                                EPOCHS + FINE_TUNING_EPOCHS + 1))

plt.figure(figsize=(10, 6))

plt.plot(top_layers_training_epochs, acc, label='acc', color='orange')
plt.plot(top_layers_training_epochs, val_acc, label='val_acc', 
         color='cornflowerblue')

plt.plot(fine_tuning_epochs, fine_tuning_acc, color='orange')
plt.plot(fine_tuning_epochs, fine_tuning_val_acc, color='cornflowerblue')
plt.plot([EPOCHS, EPOCHS + 1], [acc[-1], fine_tuning_acc[0]], 
         color='orange')

plt.plot([EPOCHS, EPOCHS + 1], [val_acc[-1], fine_tuning_val_acc[0]], 
         color='cornflowerblue')

plt.vlines(EPOCHS, ymin=0, ymax=1, linestyles='dashed',
           label='fine-tuning started')

plt.legend(loc='best')
plt.title('Accuracy during training')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
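
As an alternative to drawing separate bridging segments between epoch `EPOCHS` and epoch `EPOCHS + 1`, the two histories could be concatenated into one series per metric and plotted with a single call. A minimal sketch, assuming a hypothetical helper `join_histories` and made-up accuracy values:

```python
def join_histories(first_stage, second_stage):
    """Concatenate two per-epoch metric lists; return (epochs, values)."""
    values = list(first_stage) + list(second_stage)
    epochs = list(range(1, len(values) + 1))  # epochs are numbered from 1
    return epochs, values


# Made-up values standing in for `acc` and `fine_tuning_acc`.
toy_acc = [0.20, 0.35, 0.40]
toy_fine_tuning_acc = [0.45, 0.60]

joined_epochs, joined_acc = join_histories(toy_acc, toy_fine_tuning_acc)
print(joined_epochs)
print(joined_acc)
```

The joined series can then be passed to one `plt.plot` call, with the vertical line at `len(toy_acc)` marking where fine-tuning starts.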

In [None]:
loss = history.history['loss']
val_loss = history.history['val_loss']

fine_tuning_loss = fine_tuning_history.history['loss']
fine_tuning_val_loss = fine_tuning_history.history['val_loss']

top_layers_training_epochs = list(range(1, EPOCHS + 1))

plt.figure(figsize=(10, 6))

plt.plot(top_layers_training_epochs, loss, label='loss', color='orange')
plt.plot(top_layers_training_epochs, val_loss, label='val_loss', 
         color='cornflowerblue')

plt.plot(fine_tuning_epochs, fine_tuning_loss, color='orange')
plt.plot(fine_tuning_epochs, fine_tuning_val_loss, color='cornflowerblue')
plt.plot([EPOCHS, EPOCHS + 1], [loss[-1], fine_tuning_loss[0]], 
         color='orange')

plt.plot([EPOCHS, EPOCHS + 1], [val_loss[-1], fine_tuning_val_loss[0]], 
         color='cornflowerblue')

max_loss = max(max(loss), max(val_loss), 
               max(fine_tuning_loss), max(fine_tuning_val_loss))

plt.vlines(EPOCHS, ymin=0, ymax=max_loss, linestyles='dashed', 
           label='fine-tuning started')

plt.legend(loc='best')
plt.title('Loss during training')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()