# Digit Recognizer (Kaggle), 0.991 Accuracy with Keras
<hr>
In this tutorial we are going to use __*Convolutional Neural Networks*__ to classify images from the __*MNIST*__ dataset.

- You can find the competition [here](https://www.kaggle.com/c/digit-recognizer/data)
- Blog post [here](https://thelastdev.com/2018/07/09/digit-recognizer-kaggle-0-991-accuracy-with-keras/)

In [5]:
# Load libraries
%pylab inline

import keras
from keras.models import Sequential
from keras.utils import np_utils
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from keras.layers import Conv2D, MaxPooling2D
from keras import regularizers
from keras.callbacks import LearningRateScheduler, ModelCheckpoint, ReduceLROnPlateau
import numpy as np
from sklearn.model_selection import train_test_split

import csv
from tqdm import tqdm
import numpy.random

Populating the interactive namespace from numpy and matplotlib


## Open the dataset
After downloading the dataset, we are going to do the following:

1. Open the file and load the data
2. Format the data and get the labels
3. Check for NaN values
4. Split the dataset into training and validation sets
5. Normalize the data

In [6]:
def open_train_data(path):
    
    train = [] 
    
    with open(path, 'r') as f:
        reader = csv.reader(f)
        lines = list(reader)
        for line in tqdm(lines[1:]):
            label = line[0]
            
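            # Pixel values arrive as strings; convert them to float32
            image = np.array(line[1:], dtype='float32')
            
            # Reshape the 784 pixels to 28x28x1 (grayscale)
            image = np.reshape(image, (28, 28, 1))
            train.append([image, label])
    
    # Each row pairs an image array with its label, so store it as an object array
    return np.array(train, dtype=object)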

In [7]:
def split_train_test(train):
    
    np.random.shuffle(train)
    
    features = [x[0] for x in train]
    labels = [x[1] for x in train]
    
    # Split the dataset into training and validation sets
    x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.025, random_state=42)
    
    # One-hot Encoding
    y_train = np_utils.to_categorical(y_train, 10)
    y_test = np_utils.to_categorical(y_test, 10)
    
    return (np.array(x_train), y_train), (np.array(x_test), y_test)
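The one-hot encoding step above can be sketched without Keras to show what the encoded labels look like. This is a minimal, dependency-free illustration; the helper name `one_hot` is hypothetical and is not part of the notebook or of Keras.

```python
def one_hot(labels, num_classes=10):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        index = int(label)  # labels read from the CSV arrive as strings
        if not 0 <= index < num_classes:
            raise ValueError("label %r is outside 0..%d" % (label, num_classes - 1))
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows


encoded = one_hot(["3", "0", "9"])
print(encoded[0])  # the digit 3 becomes a 1.0 in position 3
```

Each label turns into a 10-element row, which is why `y_train` and `y_test` end up with a second dimension of 10.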
    

In [9]:
# Load the data, run only once
#train = open_train_data('dataset/train.csv')
#np.save('train.npy', train)

100%|██████████| 42000/42000 [00:11<00:00, 3694.66it/s]


In [10]:
# If you have already run open_train_data and saved train.npy, load it here
# allow_pickle=True is required because the array stores Python objects
train = np.load('train.npy', allow_pickle=True)

In [12]:
# Check for missing values
# Each row is [image, label]; the NaN test must run on the image array itself
for idx, (image, label) in enumerate(train):
    if np.isnan(image).any():
        print('Found NaN value in image %d' % idx)
        break

In [13]:
(x_train, y_train), (x_test, y_test) = split_train_test(train)
x_train = x_train / 255.0
x_test = x_test / 255.0

In [14]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((40950, 28, 28, 1), (40950, 10), (1050, 28, 28, 1), (1050, 10))
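The shapes above follow directly from the split fraction. As a quick check of the arithmetic, assuming the 42000 rows reported by the progress bar and the `test_size=0.025` used in `split_train_test`:

```python
total_rows = 42000      # rows in train.csv, per the progress bar output
test_fraction = 0.025   # test_size passed to train_test_split

val_rows = round(total_rows * test_fraction)
train_rows = total_rows - val_rows
print(train_rows, val_rows)  # 40950 1050

# Dividing by 255.0 maps the 8-bit pixel range [0, 255] onto [0, 1]
scaled = [pixel / 255.0 for pixel in (0, 128, 255)]
print(scaled[0], scaled[-1])
```

With only 1050 validation images, each misclassified image moves the validation accuracy by roughly 0.1 percentage points, which is worth keeping in mind when reading the training log.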

## Create the model

In [15]:
# Create the model
model = Sequential()
model.add(Conv2D(32, (2, 2), padding='same',
                 input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(Conv2D(32, (2, 2)))
model.add(Activation('relu'))
# A 1x1 pool with the default stride leaves the feature map size unchanged
model.add(MaxPooling2D(pool_size=(1, 1)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (2, 2), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(128, (2, 2), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(256, (2, 2), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))
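To see how the feature maps shrink through the network, the layer stack above can be traced by hand. The sketch below assumes the Keras defaults that the model relies on: stride 1 and `'valid'` padding for `Conv2D` unless `padding='same'` is given, and a pooling stride equal to the pool size. The helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel, padding):
    # Stride 1: 'same' keeps the size, 'valid' shrinks it by kernel - 1
    return size if padding == "same" else size - kernel + 1


def pool_out(size, pool):
    # Stride equals the pool size; any leftover rows/columns are dropped
    return (size - pool) // pool + 1


# (kind, filters, padding) for convolutions, (kind, pool size) for pooling
layers = [
    ("conv", 32, "same"),
    ("conv", 32, "valid"),
    ("pool", 1),
    ("conv", 64, "same"),
    ("pool", 2),
    ("conv", 128, "same"),
    ("pool", 2),
    ("conv", 256, "same"),
    ("pool", 2),
]

size, channels = 28, 1  # MNIST images are 28x28 with one channel
shapes = []
for layer in layers:
    if layer[0] == "conv":
        _, channels, padding = layer
        size = conv_out(size, 2, padding)  # every convolution uses a 2x2 kernel
    else:
        size = pool_out(size, layer[1])
    shapes.append((size, size, channels))

flattened = size * size * channels
print(shapes[-1], flattened)  # (3, 3, 256) 2304
```

Under these assumptions, the `Flatten` layer hands 2304 values to the first `Dense(512)` layer.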

## Compile the model

In [16]:
# Compile the model
batch_size = 64

opt_rms = keras.optimizers.RMSprop(lr=0.0001, rho=0.9, epsilon=1e-08, decay=0.0)

# opt_rms = keras.optimizers.Adam(lr=0.001, decay=1e-6)
model.compile(loss='categorical_crossentropy', 
              optimizer=opt_rms, 
              metrics=['accuracy'])
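For intuition about the optimizer settings, here is a minimal sketch of a single RMSprop update on one scalar parameter, using the same `lr`, `rho`, and `epsilon` values as above. It follows the commonly published form of the rule; the exact Keras implementation may differ in details (for example, where `epsilon` is added), so treat it as an illustration rather than a reference. The function name `rmsprop_step` is hypothetical.

```python
import math


def rmsprop_step(param, grad, avg_sq, lr=0.0001, rho=0.9, epsilon=1e-08):
    # Running average of squared gradients
    avg_sq = rho * avg_sq + (1.0 - rho) * grad * grad
    # Scale the step by the root of that average
    param = param - lr * grad / (math.sqrt(avg_sq) + epsilon)
    return param, avg_sq


param, avg_sq = 1.0, 0.0
param, avg_sq = rmsprop_step(param, grad=2.0, avg_sq=avg_sq)
print(param, avg_sq)
```

Because the gradient is divided by the root of its own running average, the step size depends much less on the raw gradient magnitude than in plain gradient descent.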

## Train the model

In [18]:
from time import time
epochs = 100

tbCallBack = keras.callbacks.TensorBoard(log_dir='./Graph/{}'.format(time()), histogram_freq=0, write_graph=True, write_images=True)
checkpoint = ModelCheckpoint('model-{epoch:03d}.h5', verbose=1, monitor='val_acc', save_best_only=True, mode='auto')
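# Note: this callback is defined but not passed to model.fit below,
# so the learning rate stays fixed for this run
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', 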
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)

model.fit(x_train, y_train, 
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test), 
          callbacks=[tbCallBack, checkpoint])

Train on 40950 samples, validate on 1050 samples
Epoch 1/100

Epoch 00001: val_acc improved from -inf to 0.87333, saving model to model-001.h5
Epoch 2/100

Epoch 00002: val_acc improved from 0.87333 to 0.94667, saving model to model-002.h5
Epoch 3/100

Epoch 00003: val_acc improved from 0.94667 to 0.95714, saving model to model-003.h5
Epoch 4/100

Epoch 00004: val_acc improved from 0.95714 to 0.96571, saving model to model-004.h5
Epoch 5/100

Epoch 00005: val_acc did not improve from 0.96571
Epoch 6/100

Epoch 00006: val_acc improved from 0.96571 to 0.96667, saving model to model-006.h5
Epoch 7/100

Epoch 00007: val_acc improved from 0.96667 to 0.97333, saving model to model-007.h5
Epoch 8/100

Epoch 00008: val_acc did not improve from 0.97333
Epoch 9/100

Epoch 00009: val_acc improved from 0.97333 to 0.97429, saving model to model-009.h5
Epoch 10/100

Epoch 00010: val_acc improved from 0.97429 to 0.97905, saving model to model-010.h5
Epoch 11/100

...


Epoch 00041: val_acc did not improve from 0.98762
Epoch 42/100

Epoch 00042: val_acc improved from 0.98762 to 0.99048, saving model to model-042.h5
Epoch 43/100

Epoch 00043: val_acc did not improve from 0.99048
Epoch 44/100

Epoch 00044: val_acc did not improve from 0.99048
Epoch 45/100

Epoch 00045: val_acc did not improve from 0.99048
Epoch 46/100

Epoch 00046: val_acc did not improve from 0.99048
Epoch 47/100

Epoch 00047: val_acc did not improve from 0.99048
Epoch 48/100

Epoch 00048: val_acc did not improve from 0.99048
Epoch 49/100

Epoch 00049: val_acc did not improve from 0.99048
Epoch 50/100

Epoch 00050: val_acc did not improve from 0.99048
Epoch 51/100

Epoch 00051: val_acc did not improve from 0.99048
Epoch 52/100

Epoch 00052: val_acc did not improve from 0.99048
Epoch 53/100

Epoch 00053: val_acc did not improve from 0.99048
Epoch 54/100

Epoch 00054: val_acc did not improve from 0.99048
Epoch 55/100

Epoch 00055: val_acc did not improve from 0.99048
Epoch 56/100

...


Epoch 00084: val_acc improved from 0.99048 to 0.99238, saving model to model-084.h5
Epoch 85/100

Epoch 00085: val_acc did not improve from 0.99238
Epoch 86/100

Epoch 00086: val_acc did not improve from 0.99238
Epoch 87/100

Epoch 00087: val_acc did not improve from 0.99238
Epoch 88/100

Epoch 00088: val_acc did not improve from 0.99238
Epoch 89/100

Epoch 00089: val_acc did not improve from 0.99238
Epoch 90/100

Epoch 00090: val_acc did not improve from 0.99238
Epoch 91/100

Epoch 00091: val_acc did not improve from 0.99238
Epoch 92/100

Epoch 00092: val_acc did not improve from 0.99238
Epoch 93/100

Epoch 00093: val_acc did not improve from 0.99238
Epoch 94/100

Epoch 00094: val_acc did not improve from 0.99238
Epoch 95/100

Epoch 00095: val_acc did not improve from 0.99238
Epoch 96/100

Epoch 00096: val_acc did not improve from 0.99238
Epoch 97/100

Epoch 00097: val_acc did not improve from 0.99238
Epoch 98/100

Epoch 00098: val_acc did not improve from 0.99238
Epoch 99/100

...

<keras.callbacks.History at 0x7ff64cece278>
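The `ReduceLROnPlateau` settings defined before `model.fit` (`patience=3`, `factor=0.5`, `min_lr=0.00001`) can be illustrated with a small simulation. This is a simplified sketch of the idea, not the Keras implementation: it ignores the callback's other options such as `cooldown`, and the function name `simulate_reduce_on_plateau` and the accuracy values are made up for illustration.

```python
def simulate_reduce_on_plateau(val_accs, lr=0.0001, patience=3,
                               factor=0.5, min_lr=0.00001):
    """Return the learning rate in effect after each epoch."""
    best = float("-inf")
    wait = 0
    history = []
    for acc in val_accs:
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Halve the rate, but never go below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation accuracies: a plateau of three epochs at 0.96
accs = [0.95, 0.96, 0.96, 0.96, 0.96, 0.97]
lrs = simulate_reduce_on_plateau(accs)
print(lrs)
```

In this sketch the rate is halved once the accuracy has failed to improve for three consecutive epochs, and repeated plateaus eventually pin it at `min_lr`.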

## Make predictions

In [19]:
model.load_weights('model-084.h5')
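The checkpoint file name comes from the last "improved" line in the training log. If the log is long, the best checkpoint can be picked out programmatically. The sketch below parses a few lines copied from the output above; the regular expression is an assumption based on the message format shown there.

```python
import re

log_lines = [
    "Epoch 00042: val_acc improved from 0.98762 to 0.99048, saving model to model-042.h5",
    "Epoch 00043: val_acc did not improve from 0.99048",
    "Epoch 00084: val_acc improved from 0.99048 to 0.99238, saving model to model-084.h5",
    "Epoch 00085: val_acc did not improve from 0.99238",
]

pattern = re.compile(r"improved from \S+ to (\d+\.\d+), saving model to (\S+)")

best_acc, best_file = None, None
for line in log_lines:
    match = pattern.search(line)
    if match is None:
        continue  # "did not improve" lines carry no checkpoint
    acc = float(match.group(1))
    if best_acc is None or acc > best_acc:
        best_acc, best_file = acc, match.group(2)

print(best_file, best_acc)  # model-084.h5 0.99238
```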

In [20]:
# Load the test data
def open_test_data(path):
    
    test = [] 
    
    with open(path, 'r') as f:
        reader = csv.reader(f)
        lines = list(reader)
        image_number = 1
        for line in tqdm(lines[1:]):
            
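            # Pixel values arrive as strings; convert them to float32
            image = np.array(line, dtype='float32')
            image = np.reshape(image, (28, 28, 1))
            # Pair each image with its 1-based row number, used later as ImageId
            test.append([image, image_number])
            image_number += 1
    
    # Rows mix an image array with an integer, so store them as an object array
    return np.array(test, dtype=object)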

In [21]:
# test_data = open_test_data('dataset/test.csv')

100%|██████████| 28000/28000 [00:08<00:00, 3292.77it/s]


In [22]:
# np.save('test.npy', test_data)

In [23]:
# allow_pickle=True is required because the array stores Python objects
test_data = np.load('test.npy', allow_pickle=True)

In [None]:
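with open('submission.csv', 'w') as f:
    f.write('ImageId,Label\n')
    for data in tqdm(test_data):
        # Scale pixels like the training data and add a batch axis: (28, 28, 1) -> (1, 28, 28, 1)
        arr = np.expand_dims(data[0] / 255.0, axis=0)
        number = model.predict(arr)
        
        # The predicted digit is the index of the highest softmax probability
        label = np.argmax(number)
        f.write(str(data[1]) + ',' + str(label) + '\n')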

 97%|█████████▋| 27248/28000 [01:07<00:01, 403.25it/s]
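The submission format can be checked without running the model. The sketch below writes the same `ImageId,Label` layout to an in-memory buffer from made-up probability rows; `write_submission`, `argmax`, and `fake_probs` are hypothetical names, and the probabilities stand in for the output of `model.predict`.

```python
import io


def argmax(values):
    # Index of the largest value, i.e. the predicted digit
    return max(range(len(values)), key=lambda i: values[i])


def write_submission(out, probabilities):
    out.write("ImageId,Label\n")
    # ImageId starts at 1, matching the counter used when loading the test data
    for image_id, probs in enumerate(probabilities, start=1):
        out.write("%d,%d\n" % (image_id, argmax(probs)))


# Made-up softmax outputs peaking at digits 2, 0 and 9
fake_probs = [[0.9 if digit == peak else 0.1 / 9 for digit in range(10)]
              for peak in (2, 0, 9)]

buffer = io.StringIO()
write_submission(buffer, fake_probs)
submission_text = buffer.getvalue()
print(submission_text)
```

Passing a whole batch of images to `model.predict` in one call and taking the arg max per row would produce the same labels with far fewer calls than predicting one image at a time.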