# Take home test _ Thanisorn Sriudomporn

The data set is provided on https://www.kaggle.com/c/whale-categorization-playground/overview

## 1. Briefly introduction

In this test, the objective is to match the humpback whale with their ID by it tail. The Images and Ids are giving in this data set. The mainly challenge here happened when looking at the data, which will be demonstrated later. There are only few training example for each of 4000+ Ids. In this jupyter notebook, the step by step of training algorithm to reach the objective will be shown and described.

## 2. Import library

First, the basic libraries are imported, such as numpy, pandas, opencv, os. Then, one import keras to train and validate model as a main framework one will working on from now. The sklearn is used here to give the training output (Ytrain) as a integer type. To manage the data file, one choose shutil in the notebook.

In [1]:
import numpy as np
import pandas as pd
import cv2
import os

from keras.models import Model
from keras.layers import Input, Conv2D, Activation, Dense, Lambda, Flatten, Embedding, PReLU, BatchNormalization, GlobalAveragePooling2D, Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.utils import to_categorical
from keras.preprocessing.image import ImageDataGenerator
from keras import backend as K
from keras.utils.vis_utils import plot_model

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

import shutil


Using TensorFlow backend.


## 3. Data manipulating and Pre-processing

Before working with the data set, one randomly pick image to see and get dimensions using Opencv.

In [2]:
# Load Image
img = cv2.imread('./Images/train/train/00d6b82e.jpg')

# Show Image
cv2.imshow('image', img)

# Get the dimension and print out
dimensions= img.shape
print(dimensions)

# wait for the key to continue
cv2.waitKey(0)
cv2.destroyAllWindows()

(391, 682, 3)


Then, one create variables to store the path where one want to split and store datas.

In [3]:
#Directory of Train and Test Images
raw_train_dir = './Images/train/train/'
#Directory of Train, validade, Test Image
train_dir = './Images/original/train/'
validate_dir = './Images/original/validade/'
test_dir = './Images/original/test/'

#Directory for Processed Trainm validade
p_train_dir = './Images/processed/train/'
p_validate_dir = './Images/processed/validade/'
p_test_dir = './Images/processed/test/'

The step which is worth mentioned here is to take a look on the datas to get roughly idea about the procedure one is going to use. One use Pandas to read .csv file as DataFrame type. Then, the Ids are counted and printed out to see what is going on with this data set. As one can see, there is a big difference of image number between the 'new_whale' and some of Ids that have only 1 image. In this step, one realize that the data is not enough to train the accurate model straightforward. Then, one extract the Ids and images into array. After that, the Ids array are turn into one hot encoding and the number of Id are printed out.

In [4]:
train_data_dir = './label/train.csv'

# read .csv as DataFrame
train_datas = pd.read_csv(train_data_dir)

# Count the number of image for each Ids
id_count = (train_datas.groupby('Id').count()).sort_values('Image',ascending = False)
print('Checking the number of datas')
print(id_count)

# Store Ids and Image's names
id_array_value = train_datas['Id'].values
image_array = train_datas['Image'].values

# Convert the Ids to One hot encoding
onehot = pd.get_dummies(id_array_value)
id_onehot = onehot.values

# Get the number of Ids
nclass = id_onehot.shape[1]
print('Total Number of Class : ' + str(nclass))

Checking the number of datas
           Image
Id              
new_whale    810
w_1287fbc     34
w_98baff9     27
w_7554f44     26
w_1eafe46     23
...          ...
w_7e48479      1
w_7e728d8      1
w_7e8305f      1
w_7e841fa      1
w_ffdab7a      1

[4251 rows x 1 columns]
Total Number of Class : 4251


In this step, one use the preprocesser from sklearn to create the Ids from string to integer.

In [5]:
le.fit(id_array_value)
id_array_value_int = le.transform(id_array_value)

One go through all image in train folder. In this project, one split the image in train folder into 3 parts which are training images, validation images ,and testing images. The reason one including testing image in this part because in the actual testing images in test folder are not provided with Ids. Since it is the modelling problem, one need the result to verify the model. The ratio between training, validation, and testing images are set as 70%, 15%, and 15% respectively.

In [6]:
# Get the number of total training images
list = os.listdir(raw_train_dir)
nbr_train = len(list)

# Set the ratios of train/validate/test to 75%/15%/15%
train_ratio = 0.70
val_ratio = 0.15
test_ratio = 0.15

# get the number of validation and testing images
nbr_validate = round(val_ratio*nbr_train)
nbr_test = round(test_ratio*nbr_train)

# Random indexes of validation and testing images
index_random = np.random.choice(nbr_train,nbr_validate+nbr_test)
index_validate = index_random[0:nbr_validate-1]
index_test = index_random[nbr_validate:]
index_validate_list = index_validate.tolist()
index_test_list = index_test.tolist()

# Go through all images in train folder and split some of them into validation folder and test folder using the randomed indexes.
index_temp = 1
for filename in os.listdir(raw_train_dir):
    if index_temp in index_validate_list:
        shutil.copy2(raw_train_dir+filename,validate_dir+filename)
    elif index_temp in index_test_list:
        shutil.copy2(raw_train_dir+filename,test_dir+filename)
    else:
        shutil.copy2(raw_train_dir+filename,train_dir+filename)
    index_temp = index_temp+1
    

In order to use images as training input, one make a function here to resize image into the selected size related to the model selected. After look carefully in those images in train folder, one notice that there are images with explaination text in the bottom of the image, which is not going to be beneficial to the model. One use Opencv to do the image processing in order to get rid of those text, i.e. crop it out. The roughly procedure in crop_text_out function is started with convert image into grayscale and blur it. Then, fill the top 70% of image with black and do the thresholding on filled image. Then one use Morphological Transformations to make the thresholded area a little bit easier to contour and see that if the area one detected in this mask, i.e. the white area around the text, is bigger than 15% of image's area. One crop it out.

In [7]:
def shrink_img(img,sizex,sizey):
    dsize = (sizex,sizey)
    img_shrink = cv2.resize(img,dsize)
    return img_shrink

In [8]:
def crop_text_out(img):
    # Convert image into gray scale and blur it
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_gray = cv2.blur(img_gray, (5,5))
    (h,w) = img_gray.shape
    
    # fill the top 70% of image with black 
    y_to_crop = int(round(0.7*h))
    img_gray = cv2.rectangle(img_gray,(0,0),(w,y_to_crop),(0,0,0),-1)
    
    # Threadhold the filled image to get white
    ret, thresh = cv2.threshold(img_gray, 200, 255, cv2.THRESH_BINARY)
    
    # Morphological Transformations
    rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT,(2,2))
    thresh = cv2.morphologyEx(thresh,cv2.MORPH_CLOSE, rect_kernel)
    
    # Find contours
    contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    
    # crop the image if the detected contour is bigger than 15% of image
    got_one = 0
    for cnt in contours:
        if cv2.contourArea(cnt) >= 0.15 * w * h:
            xt,yt,wt,ht = cv2.boundingRect(cnt)
            got_one = 1
    if got_one:
        img_crop = img[0:yt,0:w,:]
    else:
        img_crop = img

    return img_crop

In this step, one process through training, validation, and testing images with crop and resize function. Then, the images are saved into processed folder.

In [9]:
dirs = [train_dir,validate_dir,test_dir]
target_dirs = [p_train_dir,p_validate_dir,p_test_dir]
sizex = 224 
sizey = 224 
for i in range(0,3):
    dir_temp = dirs[i]
    target_temp = target_dirs[i]
    for filename in os.listdir(dir_temp):
        img = cv2.imread(dir_temp+filename)
        img_crop = crop_text_out(img)
        img_shrink = shrink_img(img_crop,sizex,sizey)
        cv2.imwrite(target_temp+filename,img_shrink)


In this stack_from_dir function, the directory of images in the argument or function. One read the image in the directory and stack it up in one array called Xstack. For Ystack, one use the filename, i.e. image name, to look into the image array to get the index and use that index to get the one hot encoding. The same index is also use to get the Id in the integer form from Id_attay_value_int.

In [10]:
def stack_from_dir(dir):
    Xstack = []
    Ystack = []
    Ystack_value = []
    for filename in os.listdir(dir):
        img = cv2.imread(dir+filename)
        
        # stack the image to Xstack
        Xstack.append(img)
        
        # get one hot encoding
        id_temp = id_onehot[image_array == filename]
        
        # get Id in integer format
        id_value = id_array_value_int[image_array == filename]
        
        # stack those y to Ystack and Ystack_value
        Ystack.append(id_temp.T)
        Ystack_value.append(id_value.T)
    
    Xstack = np.asarray(Xstack)
    Ystack = np.asarray(Ystack)
    Ystack = Ystack[:,:,0]
    Ystack_value = np.asarray(Ystack_value)
    
    return [Xstack,Ystack,Ystack_value];

In [11]:
# stack arrays from 3 directories defined in previous
[Xtrain,Ytrain, Ytrain_value] = stack_from_dir(p_train_dir)
[Xval , Yval,Yval_value] = stack_from_dir(p_validate_dir)
[Xtest , Ytest,Ytest_value] = stack_from_dir(p_test_dir)

# Normalize image vector
Xtrain = Xtrain/255.
Xval = Xval/255.
Xtest = Xtest/255.

# Get the exact number of training, validation and testing image
nTrain = Xtrain.shape[0]
nVal = Xval.shape[0]
nTest = Xtest.shape[0]



# Let print some values
print('X Training shape: ' + str(Xtrain.shape))
print('Y Training shape: ' + str(Ytrain.shape))
print('Number of train images: ' + str(nTrain))
print('X Validate shape: ' + str(Xval.shape))
print('Y Validate shape: ' + str(Yval.shape))
print('Number of validate images: ' + str(nVal))
print('X Test shape: ' + str(Xtest.shape))
print('Y Test shape: ' + str(Ytest.shape))
print('Number of test images: ' + str(nTest))

X Training shape: (7285, 224, 224, 3)
Y Training shape: (7285, 4251)
Number of train images: 7285
X Validate shape: (1367, 224, 224, 3)
Y Validate shape: (1367, 4251)
Number of validate images: 1367
X Test shape: (1198, 224, 224, 3)
Y Test shape: (1198, 4251)
Number of test images: 1198


## 4. Modelling

<p>When it come to modelling time, one have to consider the model to work on. There are difficulties in choosing model since most of them have their own advantages and disadvantages. One way to do this is to research and try many kinds of model. After test and evaluate it a little bit, one might have some idea to go on or change the model.</p>

<p>In this task, I decided to use the pre-trained model without the top to be my base model since pre-trained model come with pre-trained weight which might make my training faster. The base model's parameters can be set as trainable and non-trainable. To make it trainable, one need to train it on powerful resource because of the much more parameters to train.  </p>

<p>I decided to do the multitask learning with 2 inputs and 2 outputs here because I also want to use the center loss in this task. The first input and output is the image and Id (in one hot encoding form). Another is the Ids in integer form and some dummy array in the same size. The reason, why I selected center-loss, is that center-loss is one of the popular method for face recognition. There are another options such as Siamese Neural network or Triplet loss to try out. Anyway, I chose center-loss to perform in this task. The loss of both softmax and center will be considered. One issue here is that the center-loss layer use the output y1 as it input to compute loss. In the future, one can maybe use x instead of y2. This is the experimental work, i.e. it is to trial and error.</p>
The network flow is shown in the image here >> https://drive.google.com/open?id=1SRqULMGk2jeUuf558yBZG-bcIRcuLgQV


In [12]:
from keras.applications import ResNet50
def first_model(input_shape1,input_shape2, nclass, dropout, learning_rate = 0.001):
    
    # Use ResNet50 as a base model to start this model
    base_model = ResNet50(weights = 'imagenet', include_top=False)
    
    # One can choose to set the base model's parameters as trainable or non-trainable.
    # In this case, I set as trainable.
    
#     for layer in base_model.layers[:]:
#         layer.trainable = False
    
    # pass input 1 through base model, fully connected and dropout
    model_input_1 = Input(input_shape1)
    x = base_model(model_input_1)
    x = GlobalAveragePooling2D()(x)
    
    y1 = Dense(1024,activation='relu')(x)
    y1 = Dropout(dropout)(y1)
    y1 = Dense(1024,activation='relu')(y1)
    y1 = Dropout(dropout)(y1)
    
    y1_h = Dense(nclass, activation = 'softmax', name = 'Id')(y1)
    
    
    # Here one start from the second output (Center loss) using the y1 and the input 2
    model_input_2 = Input(input_shape2)
    centers = Embedding(nclass,1024)(model_input_2) 
    # This l2_loss is center-loss
    l2_loss = Lambda(lambda x: K.sum(K.square(x[0]-x[1][:,0]),1,keepdims=True),name='l2_loss')([y1,centers])
    
    # Define mofel
    model = Model(inputs=[model_input_1,model_input_2], outputs= [y1_h,l2_loss])
    
    
    # set the optimizer as Adam optimizer
    optimizer = Adam(learning_rate=learning_rate)
    
    # Complie model
    model.compile(loss=['categorical_crossentropy',lambda y_true,y_pred:y_pred], optimizer=optimizer,loss_weights=[1,0.4],metrics=['accuracy'])
    
    
    return model

In this step one create, i.e. compile, model with training image shape and the Y shape. The dropout is set as 0.2. 

In [13]:
model = first_model(input_shape1 = (224,224,3), input_shape2 = (1,), nclass = nclass, dropout = 0.2)
model.summary()



Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 224, 224, 3)  0                                            
__________________________________________________________________________________________________
resnet50 (Model)                multiple             23587712    input_2[0][0]                    
__________________________________________________________________________________________________
global_average_pooling2d_1 (Glo (None, 2048)         0           resnet50[1][0]                   
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1024)         2098176     global_average_pooling2d_1[0][0] 
____________________________________________________________________________________________

One issue worth mentioned here is the **data augmentation**. One realize that the number of images for each Ids are relatively small. One use data augmentation to get more training image using some technique, such as rotation, flip, or shift. The ImageDataGenerator in the generator that help us generate the augmented data.

In [14]:
generator = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)

The generate_data_generator is use to be an object that generate inputs and outputs to the fitting function since one have 2 inputs and 2 outputs in the model and one of the input is better to do the augmentation before training. The batch size is set as 8 since I run it on my personal laptop and this is the maximum the memory can go for. This function is to work with model.fit_generator(...)

In [15]:
def generate_data_generator(generator, Xtrain, Ytrain):
    # data_gen as the data generator
    data_gen = generator.flow(Xtrain, Ytrain,batch_size = 8)
    while True:
        # get generated data from data_gen
        Xtrain_gen, Ytrain_gen = data_gen.next()
        # covert the one hot encoding Ytrain_gen to be a integer form
        Ytrain_int = np.where(Ytrain_gen == 1)[1]
        # just another dummy here
        Ytrain_dummy = np.random.rand(Ytrain_int.shape[0],1)

        yield [Xtrain_gen, Ytrain_int], [Ytrain_gen,Ytrain_dummy]
    
    

The model is set to early stop when the loss in not decrease to prevent model from being overfit after 5 epochs and the check point is to save the best model found in the training. Steps per epoch are set related to the size as 100. Here one use [Xval,Yval_value], [Yval,Yval_dummy] to validate data.

In [20]:
Yval_dummy = np.random.rand(Xval.shape[0],1)

# es is the esrly stop condition
es = EarlyStopping(monitor='val_Id_loss', mode='min',patience=5)

# the model check point is similar to saving model when meet the condition
mc = ModelCheckpoint('best_model.h5', monitor='val_Id_loss', mode='min', verbose=1)

# Train model
model.fit_generator(generate_data_generator(generator,Xtrain,Ytrain),steps_per_epoch = 100 ,epochs = 20,validation_data=([Xval,Yval_value], [Yval,Yval_dummy]), callbacks=[es,mc])


Epoch 1/20

Epoch 00001: saving model to best_model.h5
Epoch 2/20

Epoch 00002: saving model to best_model.h5
Epoch 3/20

Epoch 00003: saving model to best_model.h5
Epoch 4/20

Epoch 00004: saving model to best_model.h5
Epoch 5/20

Epoch 00005: saving model to best_model.h5
Epoch 6/20

Epoch 00006: saving model to best_model.h5


<keras.callbacks.callbacks.History at 0x1cad8457518>

In this step, one load best weights saved in previous training and evaluate the model using testing data set.

In [21]:
Ytest_dummy = np.random.rand(Xtest.shape[0],1)
model.load_weights('best_model.h5')
preds_1 = model.evaluate([Xtest,Ytest_value], [Ytest,Ytest_dummy], batch_size = 32)



In [22]:
print(model.metrics_names)
print(preds_1)

['loss', 'Id_loss', 'l2_loss_loss', 'Id_accuracy', 'l2_loss_accuracy']
[8.479139447411233, 8.366284370422363, 0.27185383439064026, 0.0826377272605896, 0.0]


Then, the accurracy and loss are printed out in this step.

In [23]:
print()
print("Loss = " + str(preds_1[1]))
print("Test Accuracy = " + str(preds_1[3]))


Loss = 8.366284370422363
Test Accuracy = 0.0826377272605896


## 4. Conclusion

<p>In the conclusion, there are many things to discussed here. First, I would like to start with the data. As one can see, this data set is similar to the face verification problem, which only few data provided and to match with Id. One thing one should keep in mind is that one should do the <strong>pre-processing</strong> before training with these images. In general, I would crop out the non-interesting thing, i.e. detect and crop only the object, using any technique, such as color detection, or even hand-crop it. In some application, one might convert the channel of the image to another channel (HSV, Grayscale, etc.). For this project, I found that the image is quite satisfied after crop the explaination text out. So, I didn't do any detection to select only the whale's tail. Another thing is the <strong>data augmentation</strong> can help with training set when it come to small number of image for each class.</p> 

<p>The next topic here is the model designing. After did some research, I found that there is another model which might work better for this kind of application. <strong>The Siamese Neural Networks</strong> are well-known for <strong>"One Shot Learning"</strong> which have a potential to deliver the better result than the one I trained. I consider this task as "One shot learning" since the number of images for each class is clearly considered small. </p>
    
    
<p>The last issue I would like to mentioned here is the computation resource. As you may know, I am running this on my personal laptop and it is not able to deliver such a high performance. With powerful comtutational resource, I might use the bigger batchsize, train the model longer, or use a bigger network, such as VGG19.</p>

<p>In summary, I would say that the algorithm in this notebook might not considered as a great model which is giving promising result but it is the good start to truly get familiar with this data set. In my opinion, modelling the model is the task one need to iterate the work as fast as possible, i.e. research, set up and train fast, so one can make a change or moving forward faster. This model is going to be a jump start for better version or any other model, which deliver a good result for this project.</p> 