http://ankivil.com/kaggle-first-steps-with-julia-chars74k-first-place-using-convolutional-neural-networks/

https://github.com/erhwenkuo/deep-learning-with-keras-notebooks/blob/master/2.0-first-steps-with-julia.ipynb

http://florianmuellerklein.github.io/cnn_streetview/

# Introduction

In this article, I will describe how to design a Convolutional Neural Network (CNN) with Keras to score over 0.86 accuracy in the Kaggle competition First Steps With Julia. I will explain precisely how to get to this result, from data to submission. All the Python code is, of course, included. This work is inspired by Florian Muellerklein’s Using deep learning to read street signs.

The goal of the Kaggle competition First Steps With Julia is to classify images of characters taken from natural images. These images come from a subset of the Chars74k data set. This competition normally serves as a tutorial on how to use the Julia language, but a CNN is the tool of choice to tackle this kind of problem.


In [3]:
import os
os.listdir()

['.ipynb_checkpoints',
 'source-code-files',
 'test',
 'testResized',
 'train',
 'trainResized',
 'Untitled.ipynb']

# Data Preprocessing: Image Color

Almost all images in the train and test sets are color images. The first step in the preprocessing is to convert all images to grayscale. It simplifies the data fed to the network and makes it easier to generalize, a blue letter being equivalent to a red letter. This preprocessing should have almost no negative impact on the final accuracy because most texts have high contrast with their background.
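
As a minimal sketch of this step, the conversion can be written as a weighted sum of the color channels. The weights below are the common ITU-R BT.601 luma coefficients, which I use here as an assumption; the image-loading call used later in this notebook may weight the channels slightly differently. The name `to_grayscale` is hypothetical.

```python
import numpy as np

# Assumed ITU-R BT.601 luma weights for the R, G and B channels.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_grayscale(image):
    """Collapse an (H, W, 3) RGB array into an (H, W) grayscale array."""
    return image[..., :3] @ LUMA_WEIGHTS

# A pure blue pixel and a pure red pixel each reduce to a single intensity,
# so the network no longer receives the color of the letter.
pixels = np.array([[[0, 0, 255], [255, 0, 0]]], dtype=np.float64)
gray = to_grayscale(pixels)
print(gray.shape)  # (1, 2)
```

The result has one value per pixel instead of three, which is what allows a single-channel input layer later on.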

# Data Preprocessing: Image Resizing

As the images have different shapes and sizes, we have to normalize them for the model. There are two main questions for this normalization: which size do we choose, and do we keep the aspect ratio?

Initially, I thought keeping the aspect ratio would be better because it would not distort the image arbitrarily. Distortion could also lead to confusion between O and 0 (capital o and zero). However, after some tests, it seems that the results are better without keeping the aspect ratio. Maybe my filling strategy (see the code below) is not the best one.

Concerning the image size, 16×16 images allow very fast training but don’t give the best results. These small images are perfect to rapidly test ideas. Using 32×32 images makes the training quite fast and gives good accuracy. Finally, using 64×64 images makes the training quite slow and marginally improves the results compared to 32×32 images. I chose to use 32×32 images because it is the best trade-off between speed and accuracy.
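
To make the aspect-ratio option concrete, here is a small sketch of the arithmetic involved: scale the largest side to the target size, then center the result and pad the rest. It follows the same floor and integer-division steps as the preprocessing code further down, but the function name `fit_keeping_ratio` and its return layout are hypothetical.

```python
import math

def fit_keeping_ratio(height, width, target_rows=32, target_cols=32):
    """Return (new_rows, new_cols, pad_top, pad_left) for a centered fit."""
    max_size = max(height, width)
    new_rows = math.floor(target_rows * height / max_size)
    new_cols = math.floor(target_cols * width / max_size)
    pad_top = (target_rows - new_rows) // 2
    pad_left = (target_cols - new_cols) // 2
    return new_rows, new_cols, pad_top, pad_left

# An 89x71 image (height x width) fitted into a 32x32 frame.
print(fit_keeping_ratio(89, 71))  # (32, 25, 0, 3)
```

In this example the height fills the frame and the 7 spare columns must be filled with some background value, which is where the filling strategy comes into play.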

In [16]:
import csv
#fn = open('trainLabels.csv', 'r')
#train_label = [dict(i) for i in csv.DictReader(fn)]
#for i in csv.reader(fn):
#    print(i)
#fn.close()
#import pandas as pd
#pd.DataFrame(train_label)

# Data Preprocessing: Label Conversion

We also have to convert the labels from characters to one-hot vectors. This is mandatory to feed the labels information to the network. This is a two-step procedure. First, we have to find a way to convert characters to consecutive integers and back. Second, we have to convert each integer to a one-hot vector.

In [48]:
def label2int(ch):
    asciiVal = ord(ch)
    if(asciiVal<=57): #0-9
        asciiVal-=48
    elif(asciiVal<=90): #A-Z
        asciiVal-=55
    else: #a-z
        asciiVal-=61
    return asciiVal
    
def int2label(i):
    if(i<=9): #0-9
        i+=48
    elif(i<=35): #A-Z
        i+=55
    else: #a-z
        i+=61
    return chr(i)
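
The same mapping can also be sketched with a lookup string instead of ASCII arithmetic, which makes the class order explicit. This is an alternative illustration, not the code used in the rest of the notebook; the names `ALPHABET`, `char_to_index` and `index_to_char` are hypothetical.

```python
import string

# Class order assumed here: digits, then uppercase, then lowercase (62 classes).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def char_to_index(ch):
    """Map a character to its class index (0-61)."""
    return ALPHABET.index(ch)

def index_to_char(i):
    """Map a class index (0-61) back to its character."""
    return ALPHABET[i]

# Round trip over every class, plus a few spot checks.
assert all(index_to_char(char_to_index(c)) == c for c in ALPHABET)
print(char_to_index("0"), char_to_index("A"), char_to_index("a"))  # 0 10 36
```

These indices match the offsets used above (48, 55 and 61 subtracted from the ASCII code).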

# Code for processing data

In [112]:
path = "."
os.path.exists( path + "/trainResized" )
if not os.path.exists( path + "/trainResized" ):
    os.makedirs( path + "/trainResized" )
if not os.path.exists( path + "/testResized" ):
    os.makedirs( path + "/testResized" )

In [113]:
import glob
import numpy as np
import pandas as pd
from skimage.transform import resize
from skimage.io import imread, imsave

#trainFiles = glob.glob( path + "/train/*" )
#for i, nameFile in enumerate(trainFiles):

#    image = imread( nameFile )
#    imageResized = resize( image, (20,20) )
#    newName = "/".join( nameFile.split("/")[:-1] ) + "Resized/" + nameFile.split("/")[-1]
#    print("/".join( nameFile.split("/")[:-1] ) + 'Resized/' + nameFile.split("/")[-1])
#    imsave ( newName, imageResized )
#    if i == 1:
#        print(image.shape) # (89, 71, 3)
#        print(imageResized.shape) # (20, 20, 3)

#testFiles = glob.glob( path + "/test/*" )
#for i, nameFile in enumerate(testFiles):
#    image = imread( nameFile )
#    imageResized = resize( image, (20,20) )	
#    newName = "/".join( nameFile.split("/")[:-1] ) + "Resized/" + nameFile.split("/")[-1]
#    imsave ( newName, imageResized )

In [114]:
import os
import glob
import pandas as pd
import math

import numpy as np
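# Note: scipy.misc.imread, imsave and imresize were deprecated in SciPy 1.0 and
# removed in later releases, so this cell needs an older SciPy with Pillow installed.
from scipy.misc import imread, imsave, imresize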
from natsort import natsorted

# Path of data files
path = "."

# Input image dimensions
img_rows, img_cols = 32, 32

# Keep or not the initial image aspect ratio
keepRatio = False

# Create the directories if needed
if not os.path.exists( path + "/trainResized"):
    os.makedirs(path + "/trainResized")
if not os.path.exists( path + "/testResized"):
    os.makedirs(path + "/testResized")
    
    
### Images preprocessing ###

for setType in ["train", "test"]:
    # We have to make sure files are sorted according to labels, even if they don't have leading zeros
    files = natsorted(glob.glob(path + "/"+setType+"/*"))
    
    data = np.zeros((len(files), img_rows, img_cols)) #will add the channel dimension later
    
    for i, filepath in enumerate(files):
        image = imread(filepath, True) #True: flatten to grayscale
        if keepRatio:
            # Find the largest dimension (height or width)
            maxSize = max(image.shape[0], image.shape[1])
            
            # Size of the resized image, keeping aspect ratio
            imageWidth = math.floor(img_rows*image.shape[0]/maxSize)
            imageHeigh = math.floor(img_cols*image.shape[1]/maxSize)
            
            # Compute deltas to center image (should be 0 for the largest dimension)
            dRows = (img_rows-imageWidth)//2
            dCols = (img_cols-imageHeigh)//2
                        
            imageResized = np.zeros((img_rows, img_cols))
            imageResized[dRows:dRows+imageWidth, dCols:dCols+imageHeigh] = imresize(image, (imageWidth, imageHeigh))
            
            # Fill the empty image with the median value of the border pixels
            # This value should be close to the background color
            val = np.median(np.append(imageResized[dRows,:],
                                      (imageResized[dRows+imageWidth-1,:],
                                      imageResized[:,dCols],
                                      imageResized[:,dCols+imageHeigh-1])))
                                      
            # If rows were left blank
            if(dRows>0):
                imageResized[0:dRows,:].fill(val)
                imageResized[dRows+imageWidth:,:].fill(val)
                
            # If columns were left blank
            if(dCols>0):
                imageResized[:,0:dCols].fill(val)
                imageResized[:,dCols+imageHeigh:].fill(val)
        else:
            imageResized = imresize(image, (img_rows, img_cols))
        
        # Add the resized image to the dataset
        data[i] = imageResized
        
        #Save image (mostly for visualization)
        filename = filepath.split("/")[-1]
        filenameDotSplit = filename.split(".")
        newFilename = str(int(filenameDotSplit[0])).zfill(5) + "." + filenameDotSplit[-1].lower()  # Add leading zeros
        newName = "/".join(filepath.split("/")[:-1] ) + 'Resized' + "/" + newFilename
        imsave(newName, imageResized)
        
    # Add channel/filter dimension
    data = data[:,:,:, np.newaxis] 
    
    # Makes values floats between 0 and 1 (gives better results for neural nets)
    data = data.astype('float32')
    data /= 255
    
    # Save the data as numpy file for faster loading
    np.save(path+"/"+setType+ 'ResizedData' +".npy", data)



# Loading the Resized Images as Network Input

In [154]:
# Load data from the resized images
for i_type in ['train', 'test']:
    files = natsorted(glob.glob('./' + i_type + 'Resized/*'))
    data = np.zeros((len(files), img_rows, img_cols))

    for i, i_path in enumerate(files):
        data[i] = imread(i_path, True)
    data = data[:, :, :, np.newaxis]
    data = data.astype('float32')
    data /= 255
    np.save(path+"/"+i_type+ 'ResizedData' +".npy", data)

In [155]:
### Labels preprocessing ###

# Load labels
y_train = pd.read_csv(path+"/trainLabels.csv").values[:,1] #Keep only label

# Convert labels to one-hot vectors
Y_train = np.zeros((y_train.shape[0], len(np.unique(y_train))))

for i in range(y_train.shape[0]):
    Y_train[i][label2int(y_train[i])] = 1 # One-hot

# Save the preprocessed labels to a numpy file for faster loading
np.save(path+"/"+"labelsPreproc.npy", Y_train)
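
For comparison, the one-hot step can also be sketched without an explicit loop by indexing an identity matrix. This is only an illustration under the assumption that the labels have already been converted to integer class indices; the function name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return a (len(indices), num_classes) array with a single 1 per row."""
    return np.eye(num_classes, dtype=np.float64)[np.asarray(indices)]

# Hypothetical class indices for four labels.
example = one_hot([0, 10, 36, 61], num_classes=62)
print(example.shape)                    # (4, 62)
print(example.argmax(axis=1).tolist())  # [0, 10, 36, 61]
```

Passing the class count explicitly, rather than deriving it from the labels present, keeps the width at 62 even if some class were missing from the data.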

# Data Augmentation

Instead of using the training data as it is, we can apply some augmentations to artificially increase the size of the training set with “new” images. Augmentations are random transformations applied to the initial data to produce a modified version of it. These transformations can be a zoom, a rotation, etc. or a combination of all these.

https://keras.io/preprocessing/image/#imagedatagenerator


# Using ImageDataGenerator



The ImageDataGenerator constructor takes several parameters that define the augmentations we want to use. I will only go through the parameters useful for our case; see the documentation if you need other modifications to your images:

**featurewise_center**, **featurewise_std_normalization** and **zca_whitening** are not used, as they don’t increase the performance of the network. If you want to test these options, be sure to compute the relevant quantities with fit and apply these modifications to your test set with standardize.

**rotation_range**: Best results for values around 20.

**width_shift_range**: Best results for values around 0.15.

**height_shift_range**: Best results for values around 0.15.

**shear_range**: Best results for values around 0.4.

**zoom_range**: Best results for values around 0.3.

**channel_shift_range**: Best results for values around 0.1.

Of course, I didn’t test all the combinations, so there may well be other values that increase the final accuracy. Be careful though: too much augmentation (high parameter values) will make the learning slow or even impossible.

In the original article, I also extended the ImageDataGenerator so that it can randomly invert the pixel values; that implementation is omitted from this notebook for simplicity. The parameters are:

**channel_flip**: Best set to True.

**channel_flip_max**: Should be set to 1., as we normalized the data between 0 and 1.
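
As a rough idea of what random inversion does, here is a standalone numpy sketch. It is not the ImageDataGenerator extension itself, only an illustration under the assumption that each image is inverted as `max_value - x` with some probability; the function name `random_invert` and its `probability` and `rng` parameters are hypothetical.

```python
import numpy as np

def random_invert(batch, max_value=1.0, probability=0.5, rng=None):
    """Invert each image in the batch (x -> max_value - x) with the given probability."""
    rng = np.random.default_rng() if rng is None else rng
    flip = rng.random(batch.shape[0]) < probability
    out = batch.copy()
    out[flip] = max_value - out[flip]
    return out

# Four hypothetical 32x32 single-channel images with values in [0, 1].
rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 1)).astype("float32")
augmented = random_invert(images, rng=rng)
```

Inversion turns dark text on a light background into light text on a dark background, so the network sees both polarities of each character.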




# Learning

To train the model, I used categorical cross-entropy as the loss function, with a softmax activation on the last layer.
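
To show what these two pieces compute, here is a small numpy sketch of softmax followed by categorical cross-entropy. Keras handles this internally; the function names, the `eps` clipping value and the example numbers below are my own assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

# Two hypothetical samples over three classes, with one-hot targets.
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
probs = softmax(logits)
loss = categorical_cross_entropy(y_true, probs)
```

With one-hot targets, the loss reduces to the negative log of the probability assigned to the correct class, averaged over the samples.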

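# Algorithm

For this model I chose AdaMax and AdaDelta as optimizers instead of the classic stochastic gradient descent (SGD) algorithm. I found that AdaMax gives better results than AdaDelta on this problem. However, for complex networks with many filters and large fully connected layers, AdaMax converges poorly during training, or does not converge at all. I therefore split the training into two stages. In the first stage, I pre-train for 20 epochs with AdaDelta to help the convolutional network converge quickly. In the second stage, I use AdaMax for many more epochs and finer adjustments to obtain a better model. If the size of the network is divided by 2, this strategy is not needed.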

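# Batch Size

While keeping the number of epochs constant, I tried varying the batch size. A large batch makes the algorithm run faster, but the results are worse. This is probably because, for the same amount of data, a larger batch means fewer updates of the model weights. In any case, the best results in this example were obtained with a batch size of 128.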
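
The relation between batch size and the number of weight updates can be checked with a short calculation. The figure 4712 is the training-set size printed later in this notebook; the function name `updates_per_epoch` is hypothetical.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)

# 4712 is the number of training images after the train/validation split.
for batch_size in (32, 128, 512):
    print(batch_size, updates_per_epoch(4712, batch_size))
# 32 148
# 128 37
# 512 10
```

Going from a batch size of 128 to 512 cuts the updates per epoch from 37 to 10, which is consistent with the explanation above.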

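# Layer Initialization

If the network is not properly initialized, the optimization algorithm may fail to find a good optimum. I found that initializing with he_normal makes learning easier. In Keras, you only need to pass the kernel_initializer='he_normal' parameter to each layer.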
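
For intuition, He normal initialization draws weights with a standard deviation of sqrt(2 / fan_in), where fan_in is the number of inputs to the layer. The sketch below uses a plain normal distribution, whereas Keras draws from a truncated normal, so treat it as an approximation; the function name and the layer sizes are hypothetical.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng=None):
    """Draw a (fan_in, fan_out) weight matrix with std = sqrt(2 / fan_in)."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

# Hypothetical dense layer with 4096 inputs and 256 outputs.
weights = he_normal(fan_in=4096, fan_out=256, rng=np.random.default_rng(0))
print(round(float(np.sqrt(2.0 / 4096)), 4))  # 0.0221
```

The more inputs a layer has, the smaller its initial weights, which keeps the scale of the activations roughly stable through ReLU layers.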

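# Learning Rate Decay

Gradually decreasing the learning rate during training is usually a good idea. It lets the algorithm fine-tune the parameters and get closer to a local minimum. However, I found that with the AdaMax optimizer the results are better without learning rate decay, so we do not have to worry about it here.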

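# Number of Epochs

With a batch size of 128 and no learning rate decay, I tested between 200 and 500 epochs. Even at epoch 500, the network did not seem to overfit, which I attribute to Dropout. I found the results with 500 epochs slightly better than with 300. The final model uses 500 epochs, but if you are running on a CPU, 300 epochs should be enough.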

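# Cross-Validation

To evaluate the quality of different models and the influence of the hyperparameters, I used Monte Carlo cross-validation: I randomly set aside 1/4 of the initial data for validation and used the remaining 3/4 for training. I also used stratified splitting, which ensures that about 1/4 of the images of each class end up in the validation set. This leads to more stable validation scores.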
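
The idea of a stratified random split, repeated several times, can be sketched in plain numpy as follows. The code section of this notebook relies on scikit-learn's train_test_split instead; the function name `stratified_split`, the toy labels and the number of repetitions here are assumptions for illustration.

```python
import numpy as np

def stratified_split(labels, val_fraction=0.25, rng=None):
    """Return (train_idx, val_idx) with about val_fraction of each class in val_idx."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut off the validation share.
        members = rng.permutation(np.flatnonzero(labels == cls))
        n_val = int(round(val_fraction * len(members)))
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return np.array(train_idx, dtype=int), np.array(val_idx, dtype=int)

# Hypothetical labels: three classes with 40 samples each.
labels = np.repeat([0, 1, 2], 40)
rng = np.random.default_rng(0)
# Monte Carlo cross-validation: repeat the random split several times.
splits = [stratified_split(labels, rng=rng) for _ in range(3)]
```

Each repetition gives a different random split with the same class proportions, and averaging the validation scores over the repetitions gives a steadier estimate than a single split.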

# Code

In [118]:
import numpy as np
import os
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [156]:
# Network parameters
batch_size = 128 # Batch size
nb_classes = 62  # 62 classes in total: A-Z, a-z, 0-9
nb_epoch = 500   # Train for 500 epochs

# Input image dimensions
# Size of the images fed to the first layer (32 x 32 pixels)
img_height, img_width = 32, 32

In [157]:
# Path of the data files
path = "."
# Load the preprocessed training data and labels
X_train_all = np.load(path+"/trainResizedData.npy")
Y_train_all = np.load(path+"/labelsPreproc.npy")
# Split the data into a training set and a validation set (stratified by class)
X_train, X_val, Y_train, Y_val = train_test_split(X_train_all, Y_train_all, test_size=0.25, stratify=np.argmax(Y_train_all, axis=1))

In [158]:
# Check the shapes of the training images and labels
print(X_train.shape)
print(Y_train.shape)

(4712, 32, 32, 1)
(4712, 62)


# Data Augmentation Settings

In [159]:
datagen = ImageDataGenerator(
    rotation_range = 20,
    width_shift_range = 0.15,
    height_shift_range = 0.15,
    shear_range = 0.4,
    zoom_range = 0.3,                    
    channel_shift_range = 0.1)

# Build CNN

In [160]:
### CNN model architecture ###
model = Sequential()

# 128 filters, each of size 3x3

model.add(Convolution2D(128,(3, 3), padding='same', kernel_initializer='he_normal', activation='relu', 
                        input_shape=(img_height, img_width, 1)))

model.add(Convolution2D(128,(3, 3), padding='same', kernel_initializer='he_normal', activation='relu'))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(256,(3, 3), padding='same', kernel_initializer='he_normal', activation='relu'))
model.add(Convolution2D(256,(3, 3), padding='same', kernel_initializer='he_normal', activation='relu'))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(512,(3, 3), padding='same', kernel_initializer='he_normal', activation='relu'))
model.add(Convolution2D(512,(3, 3), padding='same', kernel_initializer='he_normal', activation='relu'))
model.add(Convolution2D(512,(3, 3), padding='same', kernel_initializer='he_normal', activation='relu'))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(4096, kernel_initializer='he_normal', activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(4096, kernel_initializer='he_normal', activation='relu'))
model.add(Dropout(0.5))

# Output layer: one node per class (nb_classes), with a softmax activation.
model.add(Dense(nb_classes, kernel_initializer='he_normal', activation='softmax')) 

# Show the full model architecture
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_8 (Conv2D)            (None, 32, 32, 128)       1280      
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 32, 32, 128)       147584    
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 16, 16, 128)       0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 16, 16, 256)       295168    
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 16, 16, 256)       590080    
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 8, 8, 256)         0         
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 8, 8, 512)         1180160   
__________

# Training Setting

In [None]:
# First, we use AdaDelta to train our model.
model.compile(loss='categorical_crossentropy', 
              optimizer='adadelta',  
              metrics=["accuracy"])

# We take epochs = 20.
model.fit(X_train, Y_train, batch_size=batch_size,
                    epochs=20, 
                    validation_data=(X_val, Y_val),
                    verbose=1)
# Second, we continue training with AdaMax.
model.compile(loss='categorical_crossentropy', 
              optimizer='adamax',  
              metrics=["accuracy"])
# Save the weights each time the validation accuracy improves.
saveBestModel = ModelCheckpoint("best.kerasModelWeights", monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=True)

# In this second stage, we train on augmented images generated by ImageDataGenerator.
history = model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size),
                    steps_per_epoch=int(np.ceil(len(X_train) / batch_size)),  # must be a whole number of batches
                    epochs=nb_epoch, 
                    validation_data=(X_val, Y_val),
                    callbacks=[saveBestModel],
                    verbose=1)

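### Prediction ###

# Load the weights with the best validation accuracy seen during training
model.load_weights("best.kerasModelWeights")

# Load the Kaggle test set (saved as testResizedData.npy by the preprocessing step above)
X_test = np.load(path+"/testResizedData.npy")

# Predict the character classes
Y_test_pred = model.predict_classes(X_test)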

In [None]:
# Convert class indices back to characters
vInt2label = np.vectorize(int2label)
Y_test_pred = vInt2label(Y_test_pred) 

# Save the predictions to disk
np.savetxt(path+"/jular_pred" + ".csv", np.c_[range(6284,len(Y_test_pred)+6284),Y_test_pred], delimiter=',', header = 'ID,Class', comments = '', fmt='%s')



In [None]:
# Plot the training and validation curves (in particular, to check for overfitting)
import matplotlib.pyplot as plt

# Extract the metrics recorded at each epoch
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

# Number of epochs that were run
epochs = range(len(acc))

# Plot the training and validation accuracy curves
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

# Plot the training and validation loss curves
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()



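Looking at the training and validation accuracy plot, the validation accuracy stops improving after about 50 to 60 epochs, while the training accuracy keeps rising. A prediction accuracy of 83% already ranks around the top 10 in the Kaggle competition, but if you want to improve the results further, possible directions are:
- Add more character images.
- Tune the image augmentation (for example, add the channel flip mentioned in the original article; this notebook omits that implementation for simplicity).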