Before starting, download the training and test data from Kaggle:

```
kg download -u `user_name` -p `password` -c dogs-vs-cats-redux-kernels-edition
```

Before downloading, however, you must register a Kaggle account and accept the competition rules:
https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition

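Todo:
1. Create the sample and validation (valid) directories
2. Arrange the Kaggle files in the directory structure Keras expects
3. Load the VGG16 model, fine-tune it, and retrain it on dogs & cats
4. Predict on the Kaggle test data
5. Validate the results
6. Submit the results to Kaggle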

## Creating the sample and validation directories

In [None]:
%pwd
%matplotlib inline

In [None]:
import os, sys
current_dir = os.getcwd()
ROOT_PATH = current_dir
DATA_HOME_PATH = current_dir + '/data/redux'
DATA_HOME_PATH

In [None]:
import numpy as np
from glob import glob
from shutil import copyfile

In [None]:
%cd $DATA_HOME_PATH
%mkdir valid
%mkdir results
%mkdir sample
%mkdir -p sample/valid
%mkdir -p sample/train
%mkdir -p sample/test/unknown
%mkdir -p sample/results
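
The plain `%mkdir` magics above report an error if a directory already exists, for example when the cell is re-run. A minimal sketch of an idempotent alternative using only the standard library is shown below. The helper name `make_dirs` is hypothetical, and the demo runs in a temporary directory rather than in `DATA_HOME_PATH`.

```python
import os
import tempfile

# Same layout as the cell above, relative to the data home directory.
SUBDIRS = [
    "valid",
    "results",
    "sample/valid",
    "sample/train",
    "sample/test/unknown",
    "sample/results",
]

def make_dirs(base, subdirs=SUBDIRS):
    """Create every subdirectory under `base`; existing ones are left alone."""
    created = []
    for sub in subdirs:
        full = os.path.join(base, *sub.split("/"))
        os.makedirs(full, exist_ok=True)  # no error if it already exists
        created.append(full)
    return created

# Demo in a throwaway directory (an assumption for illustration only).
demo_base = tempfile.mkdtemp()
paths = make_dirs(demo_base)
paths_again = make_dirs(demo_base)  # calling it twice is safe
print(len(paths), "directories ready")
```

In the notebook, `make_dirs(DATA_HOME_PATH)` would play the role of the `%mkdir` lines.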

In [None]:
# Prepare the validation data

%cd $DATA_HOME_PATH/train
all_training_files = glob("*.jpg")

# Shuffle the file list
shuffles = np.random.permutation(all_training_files)
# Move 2000 of the files out of train/ to use as validation data
# (slicing avoids an IndexError if fewer than 2000 files are present)
for name in shuffles[:2000]:
    os.rename(name, DATA_HOME_PATH + '/valid/' + name)
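
The split above can also be written as a reusable function with a fixed random seed, so the same files land in the validation set on every run. This is only a sketch under assumptions: `take_random_files` and its parameters are hypothetical names, and the demo uses empty placeholder files in a temporary directory.

```python
import os
import shutil
import tempfile
import numpy as np

def take_random_files(src_dir, dst_dir, n, move=False, seed=0):
    """Copy (or move) up to `n` randomly chosen .jpg files from src_dir to dst_dir."""
    names = sorted(f for f in os.listdir(src_dir) if f.endswith(".jpg"))
    rng = np.random.RandomState(seed)      # fixed seed gives a reproducible choice
    chosen = [str(c) for c in rng.permutation(names)[:n]]
    os.makedirs(dst_dir, exist_ok=True)
    transfer = shutil.move if move else shutil.copyfile
    for name in chosen:
        transfer(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen

# Demo with 10 empty placeholder files.
root = tempfile.mkdtemp()
train_dir = os.path.join(root, "train")
valid_dir = os.path.join(root, "valid")
os.makedirs(train_dir)
for i in range(10):
    open(os.path.join(train_dir, "cat.%d.jpg" % i), "w").close()

moved = take_random_files(train_dir, valid_dir, n=3, move=True)
print(len(os.listdir(train_dir)), len(os.listdir(valid_dir)))
```

With `move=True` the function mirrors the validation split (files leave `train/`); with `move=False` it mirrors the sample cells below, which copy instead.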

In [None]:
# To save development time, we create a relatively small Sample directory
# and switch to the full data once the code works.
# Prepare the sample training data
%cd $DATA_HOME_PATH/train
all_training_files = glob("*.jpg")

# Shuffle the file list
shuffles = np.random.permutation(all_training_files)
# Copy 200 of the files as the sample training data
for name in shuffles[:200]:
    copyfile(name, DATA_HOME_PATH + '/sample/train/' + name)

In [None]:
# Prepare the sample validation data
%cd $DATA_HOME_PATH/valid

all_valid_files = glob("*.jpg")
shuffles = np.random.permutation(all_valid_files)
# Copy 50 of the files as the sample validation data
for name in shuffles[:50]:
    copyfile(name, DATA_HOME_PATH + '/sample/valid/' + name)

In [None]:
# Prepare the sample test data
%cd $DATA_HOME_PATH/test
all_test_files = glob("*.jpg")
shuffles = np.random.permutation(all_test_files)
# Copy 100 of the files as the sample test data
for name in shuffles[:100]:
    copyfile(name, DATA_HOME_PATH + "/sample/test/unknown/" + name)

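## Arranging the Kaggle files in the directory structure Keras expects

Keras expects one subdirectory per class, named after the class. The files downloaded from Kaggle instead encode the class in the file name, for example cat.3111.jpg. In this step we therefore create cats and dogs subdirectories and move the images into the matching one.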

In [None]:
%cd $DATA_HOME_PATH/train
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/

%cd $DATA_HOME_PATH/valid
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/

%cd $DATA_HOME_PATH/sample/train
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/

%cd $DATA_HOME_PATH/sample/valid
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/

In [None]:
%cd $DATA_HOME_PATH/test
%mkdir unknown
%mv *.jpg unknown/
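
The same sorting can be done in plain Python, which avoids repeating the `%mkdir`/`%mv` lines for every directory. The sketch below is illustrative only: `sort_into_class_dirs` is a hypothetical helper, and the demo works on empty placeholder files in a temporary directory.

```python
import os
import shutil
import tempfile

def sort_into_class_dirs(directory, prefix_to_class):
    """Move files named '<prefix>.<id>.jpg' into '<directory>/<class>/'."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        src = os.path.join(directory, name)
        if not os.path.isfile(src) or not name.endswith(".jpg"):
            continue
        cls = prefix_to_class.get(name.split(".")[0])
        if cls is None:
            continue  # unknown prefix: leave the file in place
        cls_dir = os.path.join(directory, cls)
        os.makedirs(cls_dir, exist_ok=True)
        shutil.move(src, os.path.join(cls_dir, name))
        counts[cls] = counts.get(cls, 0) + 1
    return counts

# Demo: two cats and one dog.
demo = tempfile.mkdtemp()
for name in ["cat.0.jpg", "cat.1.jpg", "dog.0.jpg"]:
    open(os.path.join(demo, name), "w").close()

counts = sort_into_class_dirs(demo, {"cat": "cats", "dog": "dogs"})
print(counts)
```

In the notebook this would be called once for each of `train`, `valid`, `sample/train`, and `sample/valid`.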

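## Loading the pre-trained VGG16 model, fine-tuning it, and retraining it on dogs & cats

VGG stands for Visual Geometry Group, the University of Oxford group that designed the network. It commonly comes in two versions, with 16 and 19 layers. Trained on ImageNet, it recognizes 1000 image classes and is a very powerful CNN architecture. Pre-trained models can be downloaded online, which saves the time otherwise spent collecting data and training from scratch, so the model can be applied directly.

The VGG16 used here is based on the fast.ai implementation. It differs from the version on the fast.ai GitHub in that it targets Python 3 and Keras 2.0.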

In [None]:
%cd $ROOT_PATH

from vgg16 import Vgg16

# `path` already ends with '/', so the subdirectories are appended without a leading slash
path = DATA_HOME_PATH + '/sample/'
valid_path = path + 'valid/'
train_path = path + 'train/'
test_path = path + 'test/'
result_path = path + 'results/'

In [None]:
# Initialize the vgg object. The first initialization downloads the pre-trained Vgg16 weights into ~/.keras/models/
vgg = Vgg16()

In [None]:
batch_size = 4
epoch_num = 2

The original Vgg16 model recognizes 1000 classes, but here we have only two, so we fine-tune the model in Keras to map it onto these two classes.

get_batches uses the Keras API - [Image Preprocessing](https://keras.io/preprocessing/image/) to read images in batches from the given directory, normalize them, and resize each image to 224x224.
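
To make "normalize" concrete: in the original fast.ai `vgg16.py`, normalization means subtracting the per-channel ImageNet mean and reordering the channels from RGB to BGR. Whether the port used here does exactly the same is an assumption, so the NumPy sketch below only illustrates the idea; `vgg_preprocess` is a hypothetical name and the batch is synthetic, channels-last data.

```python
import numpy as np

# Per-channel ImageNet means (R, G, B) used by the original VGG models.
VGG_MEAN_RGB = np.array([123.68, 116.779, 103.939], dtype=np.float32)

def vgg_preprocess(images):
    """images: float array of shape (N, 224, 224, 3), RGB, values in 0..255."""
    centered = images - VGG_MEAN_RGB   # broadcasts over the channel axis
    return centered[..., ::-1]         # reverse the last axis: RGB -> BGR

# Synthetic batch of two uniformly gray 224x224 images.
batch = np.full((2, 224, 224, 3), 128.0, dtype=np.float32)
out = vgg_preprocess(batch)
print(out.shape)
```

The spatial size stays 224x224; only the pixel values and the channel order change.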

In [None]:
batches = vgg.get_batches(train_path, batch_size=batch_size)
vgg.finetune(batches)

Next, we run a few epochs to retrain the model.

In [None]:
# Evaluate each epoch on the validation set rather than on the training set
val_batches = vgg.get_batches(valid_path, batch_size=batch_size)
vgg.fit(batches, val_batches, batch_size, nb_epoch=epoch_num)
latest_weights_filename = 'ft0.h5' 
vgg.model.save_weights(result_path + latest_weights_filename)

## Testing

In [None]:
batches, probs = vgg.test(test_path, batch_size=batch_size)

In [None]:
batches.filenames[:5]

In [None]:
from utils import *

In [None]:
np.set_printoptions(suppress=True, precision=4)
print(probs[:5])
preds = probs[:,0]
labels = np.round(1-preds)
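
The two lines above turn probabilities into class labels. Keras assigns class indices alphabetically by directory name, so column 0 of `probs` is the probability of `cats` and column 1 that of `dogs`. A small sketch with made-up probabilities (not real model output) shows the mapping:

```python
import numpy as np

# Made-up softmax outputs: one row per image, columns = [P(cat), P(dog)].
demo_probs = np.array([
    [0.9, 0.1],   # confident cat
    [0.2, 0.8],   # confident dog
    [0.4, 0.6],   # leaning dog
])

demo_preds = demo_probs[:, 0]            # P(cat) for every image
demo_labels = np.round(1 - demo_preds)   # 0 = cat, 1 = dog
print(demo_labels)
```

An equivalent formulation is `np.argmax(demo_probs, axis=1)`, which also extends to more than two classes.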

In [None]:
imgs = []
titles = []

# Pair each of the first five test images with its predicted label
for f, label in zip(batches.filenames[:5], labels[:5]):
    imgs.append(Image.open(test_path + f))
    titles.append("cat" if label == 0 else "dog")
plots(imgs, titles=titles)

## Validating the results

In [None]:
batches, probs = vgg.test(valid_path, batch_size=batch_size)

In [None]:
filenames = batches.filenames
expected_labels = batches.classes
preds = probs[:, 0]
labels = np.round(1-preds)
print(labels)
print(expected_labels)

In [None]:
def plot_idx(data_path, indexes, filenames, labels, size=5):
    imgs = []
    titles = []
    for c in permutation(indexes)[:size]:
        img = Image.open(data_path + "/" + filenames[c])
        imgs.append(img)
        titles.append(labels[c])
    plots(imgs, titles=titles)

In [None]:
# Show correctly classified images
correct = np.where(labels == expected_labels)[0]
print(correct)

plot_idx(valid_path, correct, filenames, labels)

In [None]:
# Show misclassified images
incorrect = np.where(labels != expected_labels)[0]
print(incorrect)

plot_idx(valid_path, incorrect, filenames, labels)

In [None]:
# Show images correctly classified as cats
correct_cats = np.where((labels == expected_labels) & (labels == 0))[0]

plot_idx(valid_path, correct_cats, filenames, labels)

In [None]:
# Show images correctly classified as dogs
correct_dogs = np.where((labels == expected_labels) & (labels == 1))[0]

plot_idx(valid_path, correct_dogs, filenames, labels)

In [None]:
# Show images wrongly predicted as cats (they are actually dogs)
incorrect_cats = np.where((labels != expected_labels) & (labels == 0))[0]

plot_idx(valid_path, incorrect_cats, filenames, labels)

In [None]:
# Show images wrongly predicted as dogs (they are actually cats)
incorrect_dogs = np.where((labels != expected_labels) & (labels == 1))[0]

plot_idx(valid_path, incorrect_dogs, filenames, labels)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(expected_labels, labels)
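
To read the matrix: `confusion_matrix(y_true, y_pred)` puts the true class on the rows and the predicted class on the columns. The sketch below rebuilds the same counts by hand with NumPy on made-up labels; `confusion_counts` and the demo arrays are hypothetical and are not part of the notebook's data.

```python
import numpy as np

def confusion_counts(expected, predicted, n_classes=2):
    """Rows are true classes, columns are predicted classes."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(expected, predicted):
        counts[int(t), int(p)] += 1
    return counts

# Made-up labels: 0 = cat, 1 = dog.
demo_expected = np.array([0, 0, 1, 1, 1])
demo_predicted = np.array([0, 1, 1, 1, 0])

cm_demo = confusion_counts(demo_expected, demo_predicted)
accuracy = np.trace(cm_demo) / cm_demo.sum()   # diagonal = correct predictions
print(cm_demo)
print(accuracy)
```

Here `cm_demo[0, 1]` counts cats predicted as dogs and `cm_demo[1, 0]` counts dogs predicted as cats.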

In [None]:
# `batches` is the validation iterator returned by vgg.test(valid_path, ...) above
plot_confusion_matrix(cm, batches.class_indices)
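
The last item of the Todo list, submitting to Kaggle, needs a CSV file with an `id,label` header, where `label` is the predicted probability that the image is a dog. The following is a rough sketch rather than the notebook's own code: `build_submission` is a hypothetical helper, the clipping value is an assumed choice to soften the log-loss penalty for confident mistakes, and the inputs are made-up stand-ins for `batches.filenames` and `probs` from `vgg.test`.

```python
import csv
import io
import os
import numpy as np

def build_submission(filenames, probs, clip=0.02):
    """Return submission CSV text; column 1 of `probs` is taken as P(dog)."""
    ids = [int(os.path.splitext(os.path.basename(f))[0]) for f in filenames]
    dog_probs = np.clip(probs[:, 1], clip, 1 - clip)   # clip value is an assumption
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "label"])
    for image_id, p in sorted(zip(ids, dog_probs.tolist())):
        writer.writerow([image_id, "%.4f" % p])
    return buf.getvalue()

# Made-up stand-ins for the test iterator's file names and the model output.
demo_filenames = ["unknown/2.jpg", "unknown/1.jpg"]
demo_probs = np.array([[0.999, 0.001], [0.3, 0.7]])

csv_text = build_submission(demo_filenames, demo_probs)
print(csv_text)
```

Writing `csv_text` to a file in `result_path` would give a file ready to upload on the competition page.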