# Neural Network Modeling on the EMNIST dataset

The goal is to build a good neural network model for the EMNIST dataset, which contains 28x28 grayscale images of the digits 0-9 and the letters a-z and A-Z. I will train and test my neural network on two versions of the dataset.

## Description of the dataset

The balanced EMNIST dataset contains 112,800 training samples and 18,800 test samples. The byclass EMNIST dataset contains 697,932 training samples and 116,323 test samples. 

The balanced dataset contains an equal proportion of every class in both the training and the testing data, but it has only 47 classes. This is because some of the letter classes are merged together. For example, 'o' and 'O' are very hard to tell apart.

The byclass dataset contains 62 classes, with every possible character in a class of its own. Since this dataset has about six times as much data as the balanced dataset, we might end up choosing it over the balanced dataset.
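The sizes quoted above can be checked with a little arithmetic. The sketch below uses only the sample counts from this section and assumes the balanced split is exactly balanced (the same number of samples in each of the 47 classes); the variable names are mine.

```python
# Sample counts quoted in the dataset description
balanced_train, balanced_test = 112_800, 18_800
byclass_train, byclass_test = 697_932, 116_323
balanced_classes = 47

# If every class has the same number of samples, the counts divide evenly
train_per_class = balanced_train // balanced_classes
test_per_class = balanced_test // balanced_classes
print(train_per_class, test_per_class)

# Size of the byclass training set relative to the balanced one
train_ratio = byclass_train / balanced_train
print(round(train_ratio, 2))
```

Under that assumption each class has 2,400 training and 400 test images, and the byclass training set is a little over six times the size of the balanced one.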

## Loading required libraries

I will use the Keras library to build neural network models.

In [1]:
import pandas as pd
import numpy as np
import keras
from keras.models import Sequential, save_model, load_model
from keras.layers import (Dense, Activation, Dropout, Flatten,
                          Conv2D, MaxPooling2D, LSTM)
from keras.utils import to_categorical
from keras.optimizers import SGD
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix

Using Theano backend.


## Reading the data

Reading the balanced dataset.

In [2]:
train_data = pd.read_csv("../data/emnist-balanced-train.csv", header = None)
test_data = pd.read_csv("../data/emnist-balanced-test.csv", header = None)
train_data.head()  # not shown: only the last expression in a cell is displayed
test_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,784
0,41,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,39,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,26,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,44,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
print(train_data.shape)
print(test_data.shape)

(112800, 785)
(18800, 785)


Separating the response from the predictor variables in the training and testing data.

In [4]:
train_y = train_data[0]  # column 0 holds the class label
train_y.head()
test_y = test_data[0]
test_y.head()

0    41
1    39
2     9
3    26
4    44
Name: 0, dtype: int64

In [5]:
train_X = train_data.iloc[:, 1:]  # columns 1-784 hold the pixel values
train_X.head()
test_X = test_data.iloc[:, 1:]
test_X.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,775,776,777,778,779,780,781,782,783,784
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
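To make the layout of the CSV concrete, here is a minimal sketch on a small synthetic frame rather than the real files. The names `demo`, `demo_y`, `demo_X` and the random pixel values are made up for illustration. It splits the label column from the 784 pixel columns in the same way as the cells above and reshapes one row back into a 28x28 array. The orientation of the reshaped image depends on how the CSV was flattened, which is not checked here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the EMNIST CSV: 3 rows, 1 label column + 784 pixel columns
rng = np.random.default_rng(0)
labels = np.array([41, 39, 9])
pixels = rng.integers(0, 256, size=(3, 784))
demo = pd.DataFrame(np.column_stack([labels, pixels]))

demo_y = demo[0]            # response: column 0
demo_X = demo.iloc[:, 1:]   # predictors: columns 1..784

# Each row of predictors is one flattened 28x28 image
first_image = demo_X.iloc[0].to_numpy().reshape(28, 28)
print(demo_y.tolist(), demo_X.shape, first_image.shape)
```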


## Training and testing a simple neural network model

I start with a very simple neural network model with one hidden layer, then train and test it on the data. Its performance sets a baseline to improve upon with more complex neural nets later.

In [6]:
print(train_X.shape)

# use full dataset
Xtr = train_X
Ytr = train_y
print(Xtr.shape, Ytr.shape)

(112800, 784)
(112800, 784) (112800,)


### Building the structure of the neural network

Defining the structure of the neural net: the size of the input layer, a single hidden layer and its size, the optimizer, the metric, and the loss function.

In [7]:
# For a single-input model with 47 classes (categorical classification):
num_classes = 47 # number of classes present in the data
inp_dim = train_X.shape[1]
model_simple = Sequential()
model_simple.add(Dense(32, activation='relu', input_dim=inp_dim)) # First hidden layer
model_simple.add(Dense(num_classes, activation='softmax')) # output layer
model_simple.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Convert the response to one-hot encoding
one_hot_labels = to_categorical(Ytr, num_classes=num_classes)
one_hot_labels

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
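For intuition, one-hot encoding can be reproduced in plain NumPy by indexing the rows of an identity matrix. This is only a sketch of the idea with three hypothetical labels, not a description of how `to_categorical` is implemented.

```python
import numpy as np

num_classes = 47
demo_labels = np.array([41, 39, 9])   # hypothetical labels in the range 0..46

# Row i of the identity matrix is the one-hot vector for class i
one_hot_demo = np.eye(num_classes)[demo_labels]
print(one_hot_demo.shape)
print(one_hot_demo.argmax(axis=1))
```

Each row has a single 1 in the column given by its label, which is the format the `categorical_crossentropy` loss expects.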

### Training the model

Fit the model on the training data.

In [8]:
# Train the model, iterating on the data in batches of 256 samples
epochs = 10
batch_size = 256
model_simple.fit(Xtr.values, one_hot_labels, epochs=epochs, batch_size=batch_size, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff53a509470>
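As a rough guide to how much work this training run does, the number of weight updates follows from the sample count, batch size and number of epochs used above. The sketch assumes one update per batch, with a smaller final batch when the sample count is not a multiple of the batch size.

```python
import math

n_samples, batch_size, epochs = 112_800, 256, 10

steps_per_epoch = math.ceil(n_samples / batch_size)
last_batch = n_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch, total_updates)
```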

### Testing the model performance on training data

Testing the performance of the model on the training data and computing the training accuracy from a confusion matrix. To predict the class with the neural net, we can use the model's predict_classes method, which returns the index of the largest output for each sample.

In [10]:
# Accuracy on the training data
# predict_classes takes the maximum output as the predicted class label
pred = model_simple.predict_classes(Xtr.values)

# Build the confusion matrix. pred is passed first, so rows are predicted labels
# and columns are true labels; the accuracy does not depend on this order.
cmat = confusion_matrix(pred, Ytr.values)
print(cmat)
print("accuracy on training data =", cmat.diagonal().sum()/cmat.sum())

[[   0    0    0 ...    0    0    0]
 [   1 2311   64 ...   41   72  808]
 [   0    0    0 ...    0    0    0]
 ...
 [   0    0    0 ...    0    0    0]
 [   0    1    1 ...    0    0    0]
 [   0    0    0 ...    0    0    0]]
accuracy on training data = 0.15843971631205675
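The two steps in this cell, picking the class with the largest output and reading the accuracy off the diagonal of the confusion matrix, can be sketched in plain NumPy. The probabilities and labels below are made up. The `argmax` form is also useful because `predict_classes` exists only in older Keras releases and was removed in later ones.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
true_labels = np.array([0, 1, 2, 2])

# Predicted class = index of the largest output
pred_labels = probs.argmax(axis=1)

# Confusion matrix with rows = predicted, columns = true (the layout used in the cell above)
n_classes = probs.shape[1]
cmat_demo = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cmat_demo, (pred_labels, true_labels), 1)

# Correct predictions sit on the diagonal
accuracy = cmat_demo.diagonal().sum() / cmat_demo.sum()
print(pred_labels, accuracy)
```

Here three of the four samples are classified correctly, so the accuracy is 0.75.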


The accuracy on the training data for the simple neural network model is 15.84%. We should be able to do much better with a more complex neural network model and, later, with convolutional neural network models.
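One preprocessing step this notebook does not apply is rescaling the inputs: the pixel columns are passed to the network as raw intensities. Dividing by 255 to map them into the range 0 to 1 is a common step for image data. Whether it would change the accuracy reported here has not been tested. The sketch below assumes 8-bit intensities (0-255) and uses random data as a hypothetical stand-in for `Xtr.values`.

```python
import numpy as np

# Hypothetical stand-in for Xtr.values: 4 flattened 28x28 images with raw 0-255 intensities
rng = np.random.default_rng(0)
raw_pixels = rng.integers(0, 256, size=(4, 784)).astype("float32")

# Rescale so that every input lies between 0 and 1
scaled_pixels = raw_pixels / 255.0
print(scaled_pixels.min(), scaled_pixels.max())
```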

### Saving and reloading the model

Keras can save a trained neural network model to disk, so that once trained, the same model can be used again for prediction without any need to refit. If we get more data later, we can even continue training the saved model on the new data, using the current values of the weights as starting points. Here I save the model to disk as an HDF5 file so that the web application can reload it at prediction time.

In [12]:
model_simple.save('../models/model_simple.h5')  # creates an HDF5 file 'model_simple.h5'

# returns a compiled model identical to the previous one
model_simple2 = load_model('../models/model_simple.h5')

## Adding more nodes in the hidden layer and including a dropout rate

Increasing the number of nodes in the hidden layer makes the model more complex; this section checks the performance of the new model. Since the hidden layer now has 1024 nodes instead of the previous 32, I have also included a dropout rate of 0.2 to reduce overfitting. A dropout rate of 0.2 means that, during training, each hidden node's output is set to zero with probability 0.2 on every pass of the gradient descent, so roughly 20% of the hidden nodes are removed at random each time. No nodes are dropped at prediction time. This model also uses the Adam optimizer instead of RMSprop.
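The effect of dropout on a layer's outputs can be sketched in a few lines of NumPy. This is an illustration of the usual "inverted dropout" scheme, where the surviving activations are scaled up by 1/(1 - rate) so that their expected value is unchanged. As far as I know, Keras's `Dropout` layer behaves this way during training, but the function below is my own sketch, not the library's code.

```python
import numpy as np

def dropout_forward(activations, rate, rng):
    """Inverted dropout for one training pass (illustrative helper)."""
    keep_mask = rng.random(activations.shape) >= rate   # keep each unit with probability 1 - rate
    return activations * keep_mask / (1.0 - rate)       # rescale the survivors

rng = np.random.default_rng(0)
hidden = np.ones((1, 1024))                 # pretend outputs of the 1024 hidden nodes
dropped = dropout_forward(hidden, rate=0.2, rng=rng)

fraction_zeroed = np.mean(dropped == 0)
print(fraction_zeroed)   # close to 0.2
print(dropped.mean())    # close to 1.0, the mean of the original activations
```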

In [13]:
# For a single-input model with 47 classes (categorical classification):
num_classes = 47 # number of classes present in the data
inp_dim = train_X.shape[1]
model_dropout = Sequential()
model_dropout.add(Dense(1024, activation='relu', input_dim=inp_dim)) # First hidden layer
model_dropout.add(Dropout(0.2))
model_dropout.add(Dense(num_classes, activation='softmax')) # output layer
model_dropout.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Convert the response to one-hot encoding
one_hot_labels = to_categorical(Ytr, num_classes=num_classes)
one_hot_labels

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
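Going from 32 to 1024 hidden nodes changes the size of the model considerably. A quick count of trainable parameters for both architectures, assuming standard fully connected layers with one bias per unit (the helper `dense_params` is mine):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

n_pixels, n_classes = 784, 47

# 784 -> 32 -> 47
simple_total = dense_params(n_pixels, 32) + dense_params(32, n_classes)
# 784 -> 1024 -> 47; the Dropout layer adds no parameters
wide_total = dense_params(n_pixels, 1024) + dense_params(1024, n_classes)
print(simple_total, wide_total)
```

The wider model has 852,015 parameters against 26,671 for the baseline, which is the reason for adding dropout.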

In [15]:
model_dropout.fit(Xtr.values, one_hot_labels, # Train the model using the training set...
          batch_size=batch_size, epochs=epochs,
          verbose=1, validation_split=0.1) # ...holding out 10% of the data for validation



Train on 101520 samples, validate on 11280 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff4f4451320>
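The sample counts in the training log follow directly from `validation_split=0.1`. Keras takes the validation samples from the end of the arrays, before any shuffling, so the split is a simple index calculation. The sketch below reproduces the arithmetic; the variable names are mine.

```python
n_samples = 112_800
validation_split = 0.1

# The last fraction of the samples is held out for validation
split_at = int(n_samples * (1.0 - validation_split))
n_train, n_val = split_at, n_samples - split_at
print(n_train, n_val)
```

This matches the "Train on 101520 samples, validate on 11280 samples" line in the output.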

In [20]:
# Accuracy on the training data
# predict_classes takes the maximum output as the predicted class label
pred_dropout = model_dropout.predict_classes(Xtr.values)

# Build the confusion matrix. pred_dropout is passed first, so rows are predicted
# labels and columns are true labels; the accuracy does not depend on this order.
cmat = confusion_matrix(pred_dropout, Ytr.values)
print(cmat)
print("accuracy on training data =", cmat.diagonal().sum()/cmat.sum())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
accuracy on training data = 0.02127659574468085


As we can see, both simple neural networks, as trained here, perform very poorly on the training set: 15.84% for the 32-node model and 2.13% for the 1024-node model with dropout. The second figure is exactly 1/47, which is no better than always predicting a single class on this balanced dataset. Since the training accuracy itself is very low, we should not expect the performance on the test set to be good either. Therefore, to increase the accuracy of the model on the training as well as the test data, we move on to convolutional neural networks in the next notebook.
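To put the 2.13% figure in context, it helps to compare it with a constant-prediction baseline. The sketch below builds a hypothetical label vector with 2,400 samples in each of the 47 classes (112,800 / 47, assuming exact balance) and scores a "model" that always predicts class 0.

```python
import numpy as np

num_classes = 47
per_class = 2400   # 112,800 training samples / 47 classes, assuming exact balance

# Hypothetical balanced label vector and a predictor that always outputs class 0
true_labels = np.repeat(np.arange(num_classes), per_class)
constant_pred = np.zeros_like(true_labels)

constant_accuracy = np.mean(constant_pred == true_labels)
print(constant_accuracy, 1 / num_classes)
```

The baseline accuracy is 1/47, the same value as the training accuracy reported for the dropout model.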