<a href="https://colab.research.google.com/github/vrvarma/humpback-whale/blob/master/udacity_capstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Domain Background

After centuries of intense whaling, recovering whale populations still struggle to adapt to warming oceans and must compete every day with industrial fishing for food.
To aid whale conservation efforts, scientists use photo surveillance systems to monitor ocean activity. They use the shape of whales’ tails and the unique markings found in footage to identify what species of whale they’re analyzing and to meticulously log whale pod dynamics and movements. For the past 40 years, most of this work has been done manually by individual scientists, leaving a huge trove of data untapped and underutilized.
There has been prior research on identifying whales from photographs. That work is similar to this effort, except that it works from pictures of the whale itself rather than only the tail fin.
I chose this dataset because it looked interesting, matched my interest in image classification, and allowed me to focus on deep-learning techniques. The challenge is to identify individual whales from images of their tail fins. We will analyze Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors.


# Download the Kaggle dataset

Download the dataset from Kaggle by first installing the [Kaggle API](https://github.com/Kaggle/kaggle-api). Once the Kaggle API is installed, run the following commands:

* cd humpback-whale
* kaggle competitions download -c humpback-whale-identification
* mkdir -p input
* unzip -d input/train train.zip

In [0]:
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json
# !kaggle competitions download -c humpback-whale-identification 
# !mkdir -p input 
# !unzip -q -d input/train train.zip

# Import Python packages

In [0]:
import numpy as np 
import pandas as pd 
import os
import gc
import math
import matplotlib.pyplot as plt
import matplotlib.image as mplimg
from matplotlib.pyplot import imshow

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split


from keras import layers
from keras.utils import np_utils
from keras.preprocessing import image
from keras.applications.vgg16 import VGG16
from keras.applications import ResNet50, InceptionV3, Xception,VGG19
from keras.applications.imagenet_utils import preprocess_input
from keras.layers import Input, Dense, Activation, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalAveragePooling2D
from keras.models import Model
from keras.callbacks import ModelCheckpoint, EarlyStopping

import keras.backend as K
from keras.models import Sequential

import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

# Explore the training data.
This Kaggle dataset contains train.csv, train.zip and test.zip. For this project I will not use test.zip, because there is no way to validate results against it. Instead, I will use train.csv and train.zip (extracted to the train folder) and split them into training, validation and test datasets.


In [2]:
train_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0,Image,Id
0,0000e88ab.jpg,w_f48451c
1,0001f9222.jpg,w_c3d896a
2,00029d126.jpg,w_20df2c5
3,00050a15a.jpg,new_whale
4,0005c1ef8.jpg,new_whale


In [3]:
print('Number of rows in train.csv', len(train_df))

Number of rows in train.csv 25361


# Identify the data points

## train.csv
train.csv has 25361 rows, one for each image in train.zip.
The file has two data fields:
* Image: the whale image file name
* Id: the whale Id

Each whale is assigned a unique Id. Unidentified whales are assigned the Id new_whale.
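
Because every unidentified whale shares the single label new_whale, it is worth checking how the labels are distributed before modeling. The sketch below is only an illustration on a handful of made-up rows (not the real train.csv); `toy_ids` and the derived variable names are assumptions introduced here.

```python
from collections import Counter

# Toy labels for illustration only; the real values come from train_df.Id.
toy_ids = ["w_f48451c", "w_c3d896a", "new_whale", "new_whale", "w_f48451c", "new_whale"]

counts = Counter(toy_ids)
# Share of rows that carry the catch-all label.
new_whale_share = counts["new_whale"] / len(toy_ids)
# Ids that appear exactly once give the model a single example to learn from.
singletons = [whale for whale, n in counts.items() if n == 1]

print(counts.most_common(1))   # [('new_whale', 3)]
print(new_whale_share)         # 0.5
print(singletons)              # ['w_c3d896a']
```

On the real data, the same counts can be obtained with train_df.Id.value_counts().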


## train.zip
train.zip contains 25361 image files, which have been extracted to the input/train folder. Each filename corresponds to an entry in the Image column of train.csv.

# Data pre-processing

## Label Encoding

In [4]:
labels = train_df.Id

# Encode labels as integers using sklearn.preprocessing.LabelEncoder
# (the integer codes are converted to one-hot vectors in the next cell)
le = LabelEncoder()
le.fit(labels)
# Number of unique labels.
num_classes = len(labels.value_counts())
print('Number of unique whales {}'.format(len(le.classes_)))


Number of unique whales 5005


## Encode the labels into categorical value

In [0]:
y_transform = np_utils.to_categorical(le.transform(labels), num_classes=num_classes)
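
To make the two encoding steps concrete, here is a minimal NumPy-only sketch on made-up labels. It assumes that sorting the unique labels reproduces the ordering LabelEncoder uses, and it stands in for to_categorical by indexing an identity matrix; `toy_labels` is a hypothetical input.

```python
import numpy as np

toy_labels = np.array(["w_b", "new_whale", "w_a", "w_b"])

# np.unique returns the sorted classes and, with return_inverse=True,
# the integer code of each label.
classes, codes = np.unique(toy_labels, return_inverse=True)

# One-hot encode by picking rows of an identity matrix.
one_hot = np.eye(len(classes))[codes]

print(classes.tolist())   # ['new_whale', 'w_a', 'w_b']
print(codes.tolist())     # [2, 0, 1, 2]
print(one_hot.shape)      # (4, 3)
```

With 5005 classes and 25361 rows, the real y_transform is a 25361 x 5005 array in which each row contains a single 1.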

## Split the data into training, validation & test datasets

In [6]:
X_train, X_tmp, Y_train, Y_tmp = train_test_split(train_df, y_transform, test_size=0.2, random_state=5)

X_val, X_test, Y_val, Y_test   = train_test_split(X_tmp, Y_tmp, test_size=0.5, random_state=5)

print('Training, Validation & testing data size', len(X_train),len(X_val), len(X_test))
gc.collect()

Training, Validation & testing data size 20288 2536 2537


20
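
The sizes printed above follow from how the two successive splits round. The sketch below reproduces that arithmetic, assuming (as I understand scikit-learn's behaviour) that a fractional test_size is rounded up and the remainder goes to the first part; `split_sizes` is a helper made up for this illustration.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Sizes of a two-way split where the held-out part is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_train, n_tmp = split_sizes(25361, 0.2)   # first split: 80% / 20%
n_val, n_test = split_sizes(n_tmp, 0.5)    # second split: halve the 20%

print(n_train, n_val, n_test)   # 20288 2536 2537
```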

In [0]:
image_height=100
image_width=100

In [0]:
def prepare_images(data):
    """Load every image listed in data.Image into one float array."""
    print("Preparing images")
    images = np.zeros((len(data), image_height, image_width, 3))

    for count, fig in enumerate(data.Image):
        # Load the image as RGB and resize it; target_size takes (height, width) only
        img = image.load_img("input/train/" + fig, target_size=(image_height, image_width))
        x = image.img_to_array(img)
        x = preprocess_input(x)
        images[count] = x
        if count % 500 == 0:
            print("Processing image: ", count + 1, ", ", fig)
    print("Finished!")
    return images

# Build a CNN as a baseline model

## Build Model

In [9]:
model = Sequential()

model.add(Conv2D(filters = 16, kernel_size = 3, padding = 'same', activation = 'relu', 
          input_shape = (image_height, image_width, 3))) #RGB image
model.add(MaxPooling2D(pool_size=3))

model.add(Conv2D(filters = 32, kernel_size = 3,  padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size=3))
model.add(Dropout(0.25))

model.add(Conv2D(filters = 64, kernel_size = 3, padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size=3))
model.add(GlobalAveragePooling2D())
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.summary()



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 100, 100, 16)      448       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 33, 33, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 33, 33, 32)        4640      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 11, 11, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 11, 11, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 3, 3, 64)          0         
__________
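
The output shapes and parameter counts in the summary can be checked by hand. This is a small sketch of the arithmetic, assuming MaxPooling2D uses its default 'valid' padding with a stride equal to the pool size; `pooled_size` and `conv_params` are helper names introduced for this example.

```python
def pooled_size(size, pool):
    # 'valid' pooling with stride == pool size drops any incomplete window
    return (size - pool) // pool + 1

def conv_params(kernel, in_channels, filters):
    # weights (kernel x kernel x in_channels x filters) plus one bias per filter
    return kernel * kernel * in_channels * filters + filters

# Spatial size after each of the three pooling layers
sizes = [100]
for _ in range(3):
    sizes.append(pooled_size(sizes[-1], 3))

# Parameters of the three 3x3 convolutions
params = [conv_params(3, c_in, c_out) for c_in, c_out in [(3, 16), (16, 32), (32, 64)]]

print(sizes)    # [100, 33, 11, 3]
print(params)   # [448, 4640, 18496]
```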

## Compile Model

In [0]:
# The optimizer is selected by name, so no optimizer import is needed
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['categorical_accuracy'])
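
For a one-hot target, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. The sketch below uses that to compute the loss of a model that spreads its probability uniformly over all 5005 classes, which gives a rough yardstick for reading the validation losses logged during training; `cross_entropy` is a helper written for this illustration, not a Keras function.

```python
import math

num_classes = 5005

def cross_entropy(prob_of_true_class):
    # With a one-hot target, the loss is -log(p_true).
    return -math.log(prob_of_true_class)

# Loss of a model that assigns every class the same probability
uniform_loss = cross_entropy(1.0 / num_classes)
print(round(uniform_loss, 2))   # 8.52
```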

## Prepare Images

In [12]:
from PIL import ImageFile                            
ImageFile.LOAD_TRUNCATED_IMAGES = True 

x_train_images = prepare_images(X_train)
x_train_images /= 255

print("Shape X-train: ", x_train_images.shape)

x_val_images = prepare_images(X_val)
x_val_images /= 255

print("Shape X-val: ", x_val_images.shape)

x_test_images = prepare_images(X_test)
x_test_images /= 255

print("Shape X-test: ", x_test_images.shape)

Preparing images
Processing image:  1 ,  5e2572252.jpg
Processing image:  501 ,  b728ef1e9.jpg
Processing image:  1001 ,  942ab5de3.jpg
Processing image:  1501 ,  dd4cfa29f.jpg
Processing image:  2001 ,  614f10ee7.jpg
Processing image:  2501 ,  db9667359.jpg
Processing image:  3001 ,  86c9aa515.jpg
Processing image:  3501 ,  7f3aafbd2.jpg
Processing image:  4001 ,  6f0c3deb4.jpg
Processing image:  4501 ,  444b09aca.jpg
Processing image:  5001 ,  f532c9318.jpg
Processing image:  5501 ,  f2d3d0d0f.jpg
Processing image:  6001 ,  6ca37fe7c.jpg
Processing image:  6501 ,  3394e12db.jpg
Processing image:  7001 ,  feddb3aa9.jpg
Processing image:  7501 ,  3a8173905.jpg
Processing image:  8001 ,  16ddf58df.jpg
Processing image:  8501 ,  64b519010.jpg
Processing image:  9001 ,  c2a02f80e.jpg
Processing image:  9501 ,  770cb755e.jpg
Processing image:  10001 ,  803515118.jpg
Processing image:  10501 ,  5e8632b10.jpg
Processing image:  11001 ,  5f37d323c.jpg
Processing image:  11501 ,  204823b38.jpg

## Train the Model

In [0]:
os.makedirs('saved_models', exist_ok=True)

In [52]:
gc.collect()

471

In [15]:
checkpointer = ModelCheckpoint(filepath='saved_models/weight.best.from_scratch.hdf5',
                               verbose=1, save_best_only = True)
model.fit(x_train_images, Y_train, epochs=20, batch_size=500, verbose=1,
                   validation_data=(x_val_images, Y_val), callbacks=[checkpointer])
gc.collect()




Train on 20288 samples, validate on 2536 samples
Epoch 1/20

Epoch 00001: val_loss improved from inf to 6.16271, saving model to saved_models/weight.best.from_scratch.hdf5
Epoch 2/20

Epoch 00002: val_loss improved from 6.16271 to 6.07634, saving model to saved_models/weight.best.from_scratch.hdf5
Epoch 3/20

Epoch 00003: val_loss did not improve from 6.07634
Epoch 4/20

Epoch 00004: val_loss did not improve from 6.07634
Epoch 5/20

Epoch 00005: val_loss did not improve from 6.07634
Epoch 6/20

Epoch 00006: val_loss improved from 6.07634 to 6.06392, saving model to saved_models/weight.best.from_scratch.hdf5
Epoch 7/20

Epoch 00007: val_loss did not improve from 6.06392
Epoch 8/20

Epoch 00008: val_loss did not improve from 6.06392
Epoch 9/20

Epoch 00009: val_loss improved from 6.06392 to 6.06146, saving model to saved_models/weight.best.from_scratch.hdf5
Epoch 10/20

Epoch 00010: val_loss did not improve from 6.06146
Epoch 11/20

Epoch 00011: val_loss did not improve from 6.06146
Epoc

65

In [16]:
model.load_weights('saved_models/weight.best.from_scratch.hdf5')
pred = model.predict(x_test_images, verbose=1)
print(pred.shape)

(2537, 5005)
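
Each row of pred holds one probability per class, so the five most likely classes are the indices of the five largest values. Here is a minimal sketch of the argsort idiom used in the next section, on a made-up row with 6 classes instead of 5005.

```python
import numpy as np

# One row of made-up class probabilities
probs = np.array([0.05, 0.30, 0.10, 0.02, 0.40, 0.13])

# argsort is ascending, so take the last five indices and reverse them
top5 = probs.argsort()[-5:][::-1]

print(top5.tolist())   # [4, 1, 5, 2, 0]
```

These integer indices are what le.inverse_transform turns back into whale Ids.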


## MAP@5 for Base CNN Model

In [33]:
def map5_per_image(label, predictions):
    """Score one image: 1 / rank of the true label within the top 5, else 0."""
    try:
        return 1 / (predictions[:5].index(label) + 1)
    except ValueError:
        return 0.0


def map5_per_set(labels, predictions):
    """Mean of the per-image scores (MAP@5) over the whole set."""
    scores = [map5_per_image(l, p) for l, p in zip(labels, predictions)]
    print(scores)  # per-image scores, printed for inspection
    return np.mean(scores)

# Decode the five most probable classes for each image, best first
predictions = []
for p in pred:
    predictions.append(le.inverse_transform(p.argsort()[-5:][::-1]).tolist())
print('MAP@5 score for Base model = %.5f' %(map5_per_set(X_test.Id.tolist(), predictions )))

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.3333333333333333, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.25, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.3333333333333333, 1.0, 0.3333333333333333, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.25, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0
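
The values in the list above (1.0, 0.333..., 0.25, 0.0) are per-image scores: since each image has exactly one true label, its score is the reciprocal of the rank at which that label appears in the top five guesses, or zero if it is missing. A small worked example on made-up labels; `score_at_5` is a name chosen here and mirrors the logic of map5_per_image.

```python
def score_at_5(label, ranked_guesses):
    # Reciprocal of the 1-based rank of the true label within the top five,
    # or 0.0 if it is not among them.
    top5 = list(ranked_guesses[:5])
    return 1.0 / (top5.index(label) + 1) if label in top5 else 0.0

truth = ["a", "b", "c"]
guesses = [
    ["a", "x", "y", "z", "w"],   # hit at rank 1 -> 1.0
    ["x", "y", "b", "z", "w"],   # hit at rank 3 -> 1/3
    ["x", "y", "z", "w", "v"],   # miss          -> 0.0
]

scores = [score_at_5(t, g) for t, g in zip(truth, guesses)]
map5 = sum(scores) / len(scores)
print(round(map5, 4))   # 0.4444
```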

# Using VGG19 and Transfer learning

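To reduce training time without sacrificing accuracy, let's train a CNN using transfer learning: a pretrained network extracts bottleneck features from the images, and a small classifier is trained on top of those features.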



In [0]:
def get_steps_size(generator, batch_size):
    nb_samples = len(generator.filenames) 
    return int(math.ceil(nb_samples / batch_size))
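
The step count has to be rounded up because the last batch is usually only partly full. Here is a quick sketch of the arithmetic for the three splits, using the batch size of 100 set in the next cell; `steps_for` is a stand-in for get_steps_size that takes the sample count directly.

```python
import math

def steps_for(n_samples, batch_size):
    # The final batch may be partial, so round up.
    return math.ceil(n_samples / batch_size)

# Training, validation and test set sizes from the split above
steps = [steps_for(n, 100) for n in (20288, 2536, 2537)]
print(steps)   # [203, 26, 26]
```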

In [53]:
# Pretrained VGG19 convolutional base (ImageNet weights, no classifier head)
vgg_model = VGG19(include_top=False, weights='imagenet')

train_datagen = image.ImageDataGenerator(
    rescale=1./255)
batch_size=100

generator = train_datagen.flow_from_dataframe(dataframe=X_train,
                                              directory='input/train',
                                              x_col="Image",
                                              y_col="Id", 
                                              classes=le.classes_.tolist(),
                                              target_size=(image_height, image_width),
                                              batch_size=batch_size, shuffle=False, drop_duplicates=False) 


train_bc_features = vgg_model.predict_generator(generator, steps=get_steps_size(generator, batch_size),verbose=1)

test_datagen = image.ImageDataGenerator(rescale=1./255)
generator = test_datagen.flow_from_dataframe(dataframe=X_val,
                                              directory='input/train',
                                              x_col="Image",
                                              y_col="Id", 
                                              classes=le.classes_.tolist(),
                                              target_size=(image_height, image_width),
                                              batch_size=batch_size,
                                              shuffle=False,
                                              drop_duplicates=False)  
val_bc_features = vgg_model.predict_generator(generator, steps=get_steps_size(generator, batch_size),verbose=1)

generator = test_datagen.flow_from_dataframe(dataframe=X_test,
                                              directory='input/train',
                                              x_col="Image",
                                              y_col="Id", 
                                              classes=le.classes_.tolist(),
                                              target_size=(image_height, image_width),
                                              batch_size=batch_size,
                                              shuffle=False,
                                              drop_duplicates=False)  
test_bc_features = vgg_model.predict_generator(generator, steps=get_steps_size(generator, batch_size),verbose=1)

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg19_weights_tf_dim_ordering_tf_kernels_notop.h5
Found 20288 validated image filenames belonging to 5005 classes.
Found 2536 validated image filenames belonging to 5005 classes.
Found 2537 validated image filenames belonging to 5005 classes.


## Build the Model

In [55]:
print(train_bc_features.shape)
model = Sequential()

model.add(Flatten(input_shape = train_bc_features.shape[1:]))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.summary()

(20288, 3, 3, 512)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_5 (Flatten)          (None, 4608)              0         
_________________________________________________________________
dense_10 (Dense)             (None, 512)               2359808   
_________________________________________________________________
dropout_7 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 5005)              2567565   
Total params: 4,927,373
Trainable params: 4,927,373
Non-trainable params: 0
_________________________________________________________________
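
The parameter counts in this summary can be verified directly: a dense layer has one weight per input-unit pair plus one bias per unit. This is a short sketch of that calculation; `dense_params` is a helper introduced for illustration.

```python
def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

flat = 3 * 3 * 512                    # size of the flattened bottleneck features
hidden = dense_params(flat, 512)      # first Dense layer
output = dense_params(512, 5005)      # softmax layer over all classes

print(flat, hidden, output, hidden + output)
# 4608 2359808 2567565 4927373
```

More than half of the parameters sit in the final layer, a direct consequence of having 5005 output classes.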


## Compile the Model

In [0]:
from keras.metrics import top_k_categorical_accuracy,categorical_crossentropy

def top_5_accuracy(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=5)

model.compile(optimizer='rmsprop',  
              loss='categorical_crossentropy', metrics=['accuracy'])

## Train the Model

In [57]:
checkpointer = ModelCheckpoint(filepath='saved_models/bottleneck_fc_model.hdf5',
                               verbose=1, save_best_only = True)

history = model.fit(train_bc_features, Y_train,  
          epochs=40,  
          batch_size=batch_size,  
          validation_data=(val_bc_features, Y_val), callbacks=[checkpointer])  
   


Train on 20288 samples, validate on 2536 samples
Epoch 1/40

Epoch 00001: val_loss improved from inf to 5.86927, saving model to saved_models/bottleneck_fc_model.hdf5
Epoch 2/40

Epoch 00002: val_loss improved from 5.86927 to 5.77023, saving model to saved_models/bottleneck_fc_model.hdf5
Epoch 3/40

Epoch 00003: val_loss did not improve from 5.77023
Epoch 4/40

Epoch 00004: val_loss did not improve from 5.77023
Epoch 5/40

Epoch 00005: val_loss did not improve from 5.77023
Epoch 6/40

Epoch 00006: val_loss did not improve from 5.77023
Epoch 7/40

Epoch 00007: val_loss did not improve from 5.77023
Epoch 8/40

Epoch 00008: val_loss did not improve from 5.77023
Epoch 9/40

Epoch 00009: val_loss did not improve from 5.77023
Epoch 10/40

Epoch 00010: val_loss did not improve from 5.77023
Epoch 11/40

Epoch 00011: val_loss did not improve from 5.77023
Epoch 12/40

Epoch 00012: val_loss did not improve from 5.77023
Epoch 13/40

Epoch 00013: val_loss did not improve from 5.77023
Epoch 14/40

E

In [58]:
(eval_loss, eval_accuracy) = model.evaluate(val_bc_features, Y_val, batch_size=batch_size, verbose=1)

print("[INFO] accuracy: {:.2f}%".format(eval_accuracy * 100))  
print("[INFO] Loss: {}".format(eval_loss))  

[INFO] accuracy: 38.25%
[INFO] Loss: 6.07927381653891


## Test the Model

In [59]:
model.load_weights('saved_models/bottleneck_fc_model.hdf5')
pred = model.predict(test_bc_features, verbose=1)

# Decode the five most probable classes for each image, best first
predictions = []
for p in pred:
    predictions.append(le.inverse_transform(p.argsort()[-5:][::-1]).tolist())
print('MAP@5 score for VGG19 model = %.5f' % (map5_per_set(X_test.Id.tolist(), predictions)))

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,