# Predicting dog breed
In the booming dog insurance industry, the very first step everyone need is to give a accurate quote. 
Breed is the first thing we need to give a quote. Can you tell the dog breed simply from a photo? Let's find out!

Looking at our dataset, we have 8351 photos from 133 classes of dogs, is that a lot of photos? Yes!
Is that enough to train a deep learning model that can determine the breed of the dog effectively？  
Sadly, No.  
Training an effective deep learning models usually take millions of photos running on multiple GPUs and take weeks.  
But let's do it anyway!!  
Let's build a model from scratch and see how well it works.

## Convolutional Neural Network
Building a convolutional neural network is easy, but build a good one that has good performance is not.

#### First let's load file names and targets from data

In [1]:
from sklearn.datasets import load_files       
from keras.utils import np_utils
import numpy as np
import time

def load_dataset(path):
    data = load_files(path)
    dog_files = np.array(data['filenames'])
    # One-hot-encoding the targets
    dog_targets = np_utils.to_categorical(np.array(data['target']), len(set(data.target)))
    return dog_files, dog_targets

t = time.time()
data_files, data_targets = load_dataset('../complete/')
num_class = data_targets.shape[1]

print("Time used:", time.time()-t)
print("%i images was detected in %s classes." %(len(data_files),num_class))

Using TensorFlow backend.


Time used: 1.220766544342041
8351 images was detected in 133 classes.


We can see we have altogether 8351 photos from 133 classes of dogs. There are on average 63 photos for each class.

#### Shuffle and Split data into train and test sets.

In [3]:
from sklearn.model_selection import train_test_split
# Split the filenames for train and test
F_train, F_test, y_train, y_test = train_test_split(data_files, data_targets, 
                                                    test_size=0.15, shuffle=True, random_state=2)

Now we have split the data into train and test groups. We use 15% percent data as test group.   
What need to be noticed is we only split the file names. We still need too load file names into images before we train the model.

#### Now, let's build our own CNN and see how it works.

In [3]:
from keras.layers import Conv2D, BatchNormalization, MaxPooling2D, Dropout, Dense, Flatten, GlobalAveragePooling2D
from keras.models import Model, Sequential

def my_cnn(IMG_SIZE):
    model = Sequential()
    
    model.add(Conv2D(32, kernel_size = (3, 3), activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, 3), padding='same'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(BatchNormalization())
    
    model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2,2)))    
    model.add(BatchNormalization())
    
    model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(BatchNormalization())
    
    model.add(Conv2D(96, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(BatchNormalization())
    
    model.add(Conv2D(32, kernel_size=(3,3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))
    
    model.add(GlobalAveragePooling2D())
    model.add(Dense(256, activation='relu'))
    # model.add(Dropout(0.5))
    
    model.add(Dense(num_class, activation = 'softmax'))
    return model

my_model = my_cnn(224)

I start with a relatively simple structure. 5 blocks of convolution with max pooling. A Global average pooling layer after them to reduce overfitting. Batch Normalization is used after every block to prevent overfitting and also improve the result. I only add a Dropout layer after the last batch normalization layer. Dropout layers don't usually go well with batch normalization, but you can use it after the deepest one.

#### Now let's see what the model looks like.

In [4]:
my_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 224, 224, 32)      896       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 112, 112, 32)      0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 112, 112, 32)      128       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 110, 110, 64)      18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 55, 55, 64)        0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 55, 55, 64)        256       
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 53, 53, 64)        36928     
__________

We keep the model relatively simple. With Global Average Pooling, we can keep the number of parameters low. We can see from the model, we only have 182k parameters to tune. Since we have a small data set, this reduce the chance of overfitting.

#### Loading the data
Now we can load filenames into images.  
Since there are only around 8000 images, we can just load them into memory. As our own model doesn't require a specific image size, we can just go with anything.

In [5]:
from keras.preprocessing import image                  
from tqdm import tqdm
from PIL import ImageFile 
ImageFile.LOAD_TRUNCATED_IMAGES = True 

def path_to_tensor(img_path, img_size):
    img = image.load_img(img_path, target_size=(img_size, img_size))
    x = image.img_to_array(img)
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths, img_size):
    list_of_tensors = [path_to_tensor(img_path, img_size) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)

X_train = paths_to_tensor(F_train, 224).astype('float32')/255
X_test = paths_to_tensor(F_test, 224).astype('float32')/255

100%|█████████████████████████████████████████████████████████████████████████████| 7098/7098 [00:48<00:00, 146.19it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 1253/1253 [00:08<00:00, 147.14it/s]


#### Now we can start training the model.

In [6]:
import numpy as np
from keras.callbacks import ModelCheckpoint 
from keras.optimizers import SGD

np.random.seed(2018)
EPOCH = 12

my_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
checkpointer = ModelCheckpoint(filepath='../models/best_CNN.hdf5', 
                               verbose=1, save_best_only=True)

t=time.time()
my_model.fit(X_train, y_train, validation_split=0.15, epochs=EPOCH, batch_size=16, 
             callbacks=[checkpointer], verbose=2)
print('Training takes %i seconds.' % (time.time() - t))

my_model.load_weights('../models/best_CNN.hdf5')
(loss, accuracy) = my_model.evaluate(X_test, y_test, batch_size=10, verbose=2)
print("[INFO] loss={:.4f}, accuracy: {:.4f}%".format(loss,accuracy * 100))

Train on 6033 samples, validate on 1065 samples
Epoch 1/12
 - 20s - loss: 4.6420 - acc: 0.0380 - val_loss: 5.5107 - val_acc: 0.0338

Epoch 00001: val_loss improved from inf to 5.51069, saving model to ../models/best_CNN.hdf5
Epoch 2/12
 - 18s - loss: 4.1542 - acc: 0.0744 - val_loss: 4.2626 - val_acc: 0.0601

Epoch 00002: val_loss improved from 5.51069 to 4.26264, saving model to ../models/best_CNN.hdf5
Epoch 3/12
 - 17s - loss: 3.8413 - acc: 0.1062 - val_loss: 4.1141 - val_acc: 0.0667

Epoch 00003: val_loss improved from 4.26264 to 4.11414, saving model to ../models/best_CNN.hdf5
Epoch 4/12
 - 18s - loss: 3.5954 - acc: 0.1358 - val_loss: 4.3513 - val_acc: 0.0939

Epoch 00004: val_loss did not improve from 4.11414
Epoch 5/12
 - 18s - loss: 3.4038 - acc: 0.1672 - val_loss: 4.6115 - val_acc: 0.0629

Epoch 00005: val_loss did not improve from 4.11414
Epoch 6/12
 - 18s - loss: 3.2304 - acc: 0.1898 - val_loss: 3.8461 - val_acc: 0.1239

Epoch 00006: val_loss improved from 4.11414 to 3.84614, 

Running the model on a GPU doesn't take much time.  
The result is actually not bad. I thinks it's way better than what I could do. But definitely not enough for a real-world application.  
I'm sure we can improve it more by more tuning on the model. But it won't get to the point where it's really useful. 

## Transfer learning

When we don't have a lot of data to train our model effectively, but we still want the job done, we can take advantage of the existing models and weights others have already trained. VGG19, Inception, Xception, ResNet are all very good models and people have trained huge datasets on them.  
  
  
There are several approaches when working with pre-trained models depending on the size of data and similarity between your data and the data that trained the model.

1. New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.

2. New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.

3. New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier form the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.

4. New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.

In our case, our images are similar to the "imagenet", and our data set is relatively small. The best way is to use method 1.

### Use CNN as feature extractor
When extracting features, we can use multiple trained networks as feature extractors, and then combine the features to train a classifier.  

In [44]:
import h5py
from keras.layers import Input, Lambda
from keras.applications import inception_v3
from keras.applications import xception
from keras.applications import resnet50

def feature_extractor(MODEL, img_size, pre_process=None, save_name=None):
    input_tensor = Input((img_size, img_size, 3))
    base_model = MODEL(input_tensor=input_tensor, weights='imagenet', include_top=False)
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    model = Model(base_model.input, x)
    
    X_train = paths_to_tensor(F_train, img_size).astype('float32')
    X_test = paths_to_tensor(F_test, img_size).astype('float32')
    
    if pre_process:
        X_train = pre_process(X_train)
        X_test = pre_process(X_test)
    
    Fea_train = model.predict(X_train, batch_size=32)
    Fea_test= model.predict(X_test, batch_size=32)
    with h5py.File('../models/feature_%s.h5'%save_name, 'w') as h:
        h.create_dataset("train", data=Fea_train)
        h.create_dataset("test", data=Fea_test)
    
feature_extractor(inception_v3.InceptionV3, 299, inception_v3.preprocess_input, 'inception')
feature_extractor(xception.Xception, 299, xception.preprocess_input, 'xception')
feature_extractor(resnet50.ResNet50, 244, resnet50.preprocess_input, 'resnet50')

100%|█████████████████████████████████████████████████████████████████████████████| 7098/7098 [00:48<00:00, 145.18it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 1253/1253 [00:08<00:00, 148.01it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 7098/7098 [00:49<00:00, 143.36it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 1253/1253 [00:08<00:00, 142.93it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 7098/7098 [00:48<00:00, 147.86it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 1253/1253 [00:08<00:00, 150.29it/s]


We save the features into different files, so we don't need to extract them every time.

Now with the feature extracted, we can load them and build a linear classifier.

In [45]:
import os
Fea_train = []
Fea_test = []

files = ['feature_inception.h5','feature_resnet50.h5','feature_xception.h5']
filenames = [os.path.join('../models/', file) for file in files]
for filename in filenames:
    with h5py.File(filename,'r') as h:
        Fea_train.append(np.array(h['train']))
        Fea_test.append(np.array(h['test']))
Fea_train = np.concatenate(Fea_train, axis=1)
Fea_test = np.concatenate(Fea_test, axis=1)

In [61]:
def second_model():
    input_tensor = Input(Fea_train.shape[1:])
    x = Dropout(0.5)(input_tensor)
    x = Dense(num_class, activation='sigmoid')(x)
    model =Model(input_tensor, x)  
    return model

second_model = second_model()
second_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

checkpointer = ModelCheckpoint(filepath='../models/best_feature.hdf5',
                               verbose=1, save_best_only=True)
second_model.fit(Fea_train, y_train, batch_size=12, epochs=10, validation_split=0.15,callbacks=[checkpointer], verbose=2)
second_model.load_weights('../models/best_feature.hdf5')

Train on 6033 samples, validate on 1065 samples
Epoch 1/10
 - 8s - loss: 0.0223 - acc: 0.9941 - val_loss: 0.0112 - val_acc: 0.9967

Epoch 00001: val_loss improved from inf to 0.01115, saving model to ../models/best_feature.hdf5
Epoch 2/10
 - 3s - loss: 0.0086 - acc: 0.9974 - val_loss: 0.0098 - val_acc: 0.9973

Epoch 00002: val_loss improved from 0.01115 to 0.00979, saving model to ../models/best_feature.hdf5
Epoch 3/10
 - 3s - loss: 0.0069 - acc: 0.9980 - val_loss: 0.0095 - val_acc: 0.9975

Epoch 00003: val_loss improved from 0.00979 to 0.00952, saving model to ../models/best_feature.hdf5
Epoch 4/10
 - 3s - loss: 0.0060 - acc: 0.9983 - val_loss: 0.0085 - val_acc: 0.9978

Epoch 00004: val_loss improved from 0.00952 to 0.00854, saving model to ../models/best_feature.hdf5
Epoch 5/10
 - 3s - loss: 0.0054 - acc: 0.9985 - val_loss: 0.0086 - val_acc: 0.9977

Epoch 00005: val_loss did not improve from 0.00854
Epoch 6/10
 - 3s - loss: 0.0051 - acc: 0.9986 - val_loss: 0.0083 - val_acc: 0.9980

E

Making predictions on test data set.

In [62]:
y_pred = second_model.predict(Fea_test)
pred = [np.argmax(prediction) for prediction in y_pred]

In [63]:
from sklearn.metrics import accuracy_score

pred = [np.argmax(prediction) for prediction in y_pred]
true = [np.argmax(true_label) for true_label in y_test]

print('Test accuracy: %.4f%%' % (accuracy_score(true, pred) * 100))


Test accuracy: 90.6624%


To further improve the result, there are a few tricks we can use.
1. Expand the data set
We can find more data set with dog breed tags and combine them to have more data to train.
2. Data augmentation
Data augmentation is very commonly used in CNN problems, the basic idea is to modify the original data, such as rotate, shift, flip or expand. This expand the data set itself without external data.  

But for our case, I don't think it's required. As a customer, uploading a photo is not necessarily easier then choosing the breed from a drop off list. And we don't have a effective model predicting mixed breed dogs, which is a lot more difficult to identify.




Instead, we should focus more on how to make an accurate quote base on the information they provide.   
I will share my thought on that in another document "Dog insurance quote estimation.docx"