###*Digit Recognizer*

In this notebook we will try to solve the well-known digit recognition problem. It is based on the MNIST handwritten digits dataset. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. (The Kaggle version loaded below has 42,000 labeled training rows.)

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.  
Let's start!

We start by reading the problem description on Kaggle.

###Competition Description

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.
In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.

###Practice Skills

    1. Computer vision fundamentals including simple neural networks
    2. Classification methods such as SVM and K-nearest neighbors

From here we get an idea of which classifiers to try. We will start with an SVM, compare it with a simple ANN, and later develop a convolutional neural network.

Make sure you can import all the required libraries before we begin.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score,KFold
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense,Dropout,Activation,Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils

from keras import regularizers

Using TensorFlow backend.


**Load Train and Test data**
============================

In [2]:
# 'float64' is the explicit name; the 'float_' alias is not available in NumPy 2.0 and later
data = pd.read_csv("../input/train.csv").astype('float64')
test = pd.read_csv("../input/test.csv").astype('float64')
test = test.values
print(data.keys())
print(data.shape)

Index(['label', 'pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5',
       'pixel6', 'pixel7', 'pixel8',
       ...
       'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779',
       'pixel780', 'pixel781', 'pixel782', 'pixel783'],
      dtype='object', length=785)
(42000, 785)


Separating the labels and features in the training data, then splitting off part of it for validation

In [3]:
labels = data['label']
data.drop('label',axis=1,inplace=True)
features=np.array(data)
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.25,
                                                                               random_state=42)

**Preprocessing the digit images**
==================================

**Feature Scaling**
-------------------------------------

Image data consists of pixel intensities that are not distributed uniformly around the origin, so we use feature scaling to normalize the data for better estimation.
Here we divide by the largest training pixel value, so that all values lie in [0, 1], and then subtract the training mean to center the data around zero.

In [4]:
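# Scale by the largest training pixel value so everything lies in [0, 1]
scale = np.max(features_train)
features_train /= scale
features_test /= scale
test /= scale
# Center with the training mean (the original used np.std here by mistake)
mean = np.mean(features_train)
features_train -= mean
features_test -= mean
test -= mean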
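
The cell above rescales every pixel with one global scale and one global shift. If per-feature zero mean and unit variance is what we want, one option is scikit-learn's StandardScaler. The sketch below uses small random arrays (toy_train and toy_valid are made-up stand-ins, not the Kaggle data) and fits the scaler on the training split only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for pixel data (hypothetical, 16 "pixels" per row)
rng = np.random.default_rng(0)
toy_train = rng.integers(0, 256, size=(200, 16)).astype("float64")
toy_valid = rng.integers(0, 256, size=(50, 16)).astype("float64")

# Learn the per-column mean and standard deviation from the training split only
scaler = StandardScaler().fit(toy_train)
toy_train_std = scaler.transform(toy_train)
toy_valid_std = scaler.transform(toy_valid)  # reuses the training statistics

print(toy_train_std.mean(axis=0).round(6))
print(toy_train_std.std(axis=0).round(6))
```

The validation split is transformed with the training statistics, so its columns will be close to, but not exactly, zero mean and unit variance.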

Now we can use a support vector machine classifier to predict the classes.
First, though, it helps to reduce the number of features with principal component analysis (PCA).

In [5]:

pca = PCA(n_components=2).fit(features_train)
reduced_features_train = pca.transform(features_train)
reduced_features_test = pca.transform(features_test)
reduced_test = pca.transform(test)
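
Two components is a very aggressive reduction, so it is worth checking how much variance survives. The sketch below is an illustration on the small 8x8 digits set bundled with scikit-learn (load_digits), not on the Kaggle data, so the numbers will differ from what this notebook would give.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small 8x8 digits set that ships with scikit-learn (stand-in for MNIST)
X_small, _ = load_digits(return_X_y=True)
X_small = X_small / 16.0  # pixel values in this set range from 0 to 16

# Fraction of the total variance kept by two components
pca_2 = PCA(n_components=2).fit(X_small)
kept_2 = pca_2.explained_variance_ratio_.sum()

# A float in (0, 1) asks PCA for enough components to reach that variance share
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_small)

print(kept_2, pca_95.n_components_)
```

Running the same check on features_train would show whether n_components=2 is throwing away too much information for the SVM.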

Now we can fit an SVM on these reduced features. To validate it we will use K-fold cross-validation.

In [6]:
clf = SVC()
clf.fit(reduced_features_train, labels_train)
# Validate with 4-fold cross-validation on the same PCA-reduced training features
print(cross_val_score(clf, reduced_features_train, labels_train, cv=4))
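
Note that passing an integer cv together with a classifier makes scikit-learn use stratified folds. To use plain KFold, as imported at the top, we can pass a KFold object instead. A minimal sketch on the bundled 8x8 digits set (a stand-in for the Kaggle data):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Small bundled digits set used as a stand-in for the Kaggle data
X_small, y_small = load_digits(return_X_y=True)
X_small = X_small / 16.0

# Shuffle before splitting so each fold sees a mix of all digits
folds = KFold(n_splits=4, shuffle=True, random_state=42)
fold_scores = cross_val_score(SVC(), X_small, y_small, cv=folds)

print(fold_scores, fold_scores.mean())
```

cross_val_score clones and refits the estimator for every fold, so the classifier does not need to be fitted beforehand.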

A good result, but we can do more:


1. We could use GridSearchCV to tune the hyperparameters for better performance.
2. We could use SelectKBest to find the most informative features and train our classifier on them. This helps the classifier avoid fitting noise.
3. We could use another classifier.

In [None]:
pipeline = Pipeline(steps=[("clf", SVC())])
# Using GridSearchCV on the SVM. Parameters of a pipeline step are
# addressed as <step name>__<parameter>, hence the "clf__" prefix.
param_grid = [
  {'clf__C': [1, 10, 100, 1000], 'clf__kernel': ['linear']},
  {'clf__C': [1, 10, 100, 1000], 'clf__gamma': [0.001, 0.0001], 'clf__kernel': ['rbf']},
 ]
# rbf is the default kernel, so we try the linear kernel as well
clf = GridSearchCV(pipeline, param_grid)
clf.fit(reduced_features_train, labels_train)
print(clf.best_score_)
print(clf.best_estimator_)

This does not significantly improve performance. Let's try selecting the most informative features using their SelectKBest scores.

In [None]:
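# SelectKBest defaults to the ANOVA F-test (f_classif); k='all' scores every pixel
k_best = SelectKBest(k='all')
k_best.fit(features_train, labels_train)
scores = k_best.scores_

# Constant pixels typically score NaN; NaN > 0 is False, so only valid scores are kept
valid_scores = np.array([x for x in scores if x > 0])
min_score = valid_scores.min()
print(min_score)

# Keep the indices of the features that score above the smallest valid score
features_list = [i for i in range(len(scores)) if scores[i] > min_score]
print(features_list)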
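
An alternative to thresholding the scores by hand is to let SelectKBest pick a fixed number of features and read the chosen columns from get_support(). This is a sketch on the bundled 8x8 digits set; k=20 is an arbitrary value chosen for illustration, not a tuned one.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, f_classif

X_small, y_small = load_digits(return_X_y=True)

# Drop constant pixels first, since the F-test cannot score them
varying = X_small.std(axis=0) > 0
X_var = X_small[:, varying]

# Keep the 20 highest-scoring pixels (k=20 is an arbitrary illustrative choice)
selector = SelectKBest(score_func=f_classif, k=20).fit(X_var, y_small)
kept_idx = np.flatnonzero(selector.get_support())
X_best = selector.transform(X_var)

print(kept_idx)
print(X_best.shape)
```

The same fitted selector can then transform the validation and test features so that all splits keep identical columns.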

Using those features you can make a prediction, but we found that it does not improve the score. Next we will use a simple MLP to model the system. First we have to one-hot encode the labels.

*One Hot encoding of labels.*
-----------------------------

A one-hot vector is a vector which is 0 in most dimensions and 1 in a single dimension. In this case, the digit n will be represented as a vector which is 1 at index n (counting from 0).

For example, 3 would be [0,0,0,1,0,0,0,0,0,0].
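
As a quick sanity check of the encoding, here is a small NumPy-only sketch (independent of Keras) that builds one-hot rows by indexing an identity matrix and then recovers the digits with argmax. The digits array is made up for illustration.

```python
import numpy as np

# A few example labels (made up for illustration)
digits = np.array([3, 0, 9])

# Row n of the 10x10 identity matrix is the one-hot vector for digit n
one_hot = np.eye(10)[digits]
print(one_hot[0])

# argmax along each row turns one-hot vectors back into digits
recovered = one_hot.argmax(axis=1)
print(recovered)
```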

In [None]:

labels_train = np_utils.to_categorical(labels_train)
labels_test = np_utils.to_categorical(labels_test)

**Designing Neural Network Architecture**
=========================================

In [None]:
# fix random seed for reproducibility
seed = 43
np.random.seed(seed)
#Set input dimension
input_dim = features_train.shape[1]

*Sequential Model*
--------------

Let's create a simple model with the Keras Sequential API.

1.  In the first layer of the model we have to define the input dimensions of our data. For the dense network below this is simply the number of pixels per image (input_dim). Image-shaped input uses the (rows, columns, color channel) format.
 (In Theano the color channel comes first, but in TensorFlow it comes last, so make sure you use the correct format for your Keras backend.)

2. Flatten transforms its input into a 1D array. Our rows are already flat here, so it is only needed later in the convolutional model.

3. Dense is a fully connected layer, which means every neuron in the previous layer is connected to every neuron in the fully connected layer.
 In the last layer we have to specify the output dimensions/classes of the model.
 Here it's 10, since we have to output 10 different digit labels.

4. Regularization can be used to improve performance. It is important, however, to choose a suitable hyperparameter for the norm penalty.
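
To make "fully connected" concrete, we can count the weights such a network has. The sketch below assumes the layer sizes used in the next cell (784 input pixels, then 512, 128 and 10 units); dense_params is a helper name introduced here for illustration.

```python
def dense_params(n_in, n_out):
    # One weight per input/output pair, plus one bias per output unit
    return n_in * n_out + n_out

# 28*28 = 784 input pixels, followed by the three Dense layers
layer_sizes = [784, 512, 128, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

Almost all of the parameters sit in the first layer, which is one reason the L2 penalty and dropout are applied there.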

In [None]:
nb_classes = 10  # ten digit classes, 0 through 9

model = Sequential()
model.add(Dense(512, input_dim=input_dim,kernel_regularizer=regularizers.l2(0.0001)))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(128,kernel_regularizer=regularizers.l2(0.0001)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

***Compile network***
-------------------

Before the network is ready for training we have to add the following:

 1.  A loss function: to measure how good the network is
    
 2.  An optimizer: to update network as it sees more data and reduce loss
    value
    
 3.  Metrics: to monitor performance of network
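
To show what the loss and the metric actually measure, here is a NumPy-only sketch of categorical cross-entropy and accuracy on two made-up predictions. The function names and the toy arrays are ours for illustration; Keras computes these internally.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-12):
    # Mean over samples of -log(probability assigned to the true class)
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

def accuracy(y_true, y_prob):
    # Fraction of rows where the most probable class is the true class
    return float(np.mean(y_true.argmax(axis=1) == y_prob.argmax(axis=1)))

# Two samples, three classes: true classes are 0 and 2
y_true = np.eye(3)[[0, 2]]
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3]])

loss = categorical_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)
```

The first sample is classified correctly and the second is not, so the accuracy is one half, while the loss also reflects how confident each prediction was.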

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

In [None]:
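print("Training...")
model.fit(features_train, labels_train, validation_split=0.1, verbose=1)
# evaluate returns [loss, accuracy]; batch_size was undefined, so the default is used
print(model.evaluate(features_test, labels_test, verbose=0))
print("Generating test predictions...")
# argmax over the softmax outputs gives the predicted digit for each row
preds = np.argmax(model.predict(test, verbose=0), axis=1)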

We get a pretty good result with this, almost 97 percent!
That's really good.


But we can still do better if we use a model specifically designed to deal with image data.
Next we will use a convolutional neural network because they are better equipped to deal with image features.

First we will reshape the features into image dimensions. Currently they have the shape (n_samples, image_size * image_size); we will reshape them to (n_samples, image_size, image_size, depth).

In [None]:

features_train = features_train.reshape(features_train.shape[0],28,28,1)
features_test = features_test.reshape(features_test.shape[0],28,28,1)

Next we will define our model.


We will use max pooling to downsample the feature maps and dropout for regularization. The dropout probability is another hyperparameter that can be useful to tune.

In [None]:
# define our model
model = Sequential()
# declare the input layer; with the TensorFlow backend the depth (channel) axis comes last
model.add(Convolution2D(32,(3,3),activation='relu',input_shape=(28,28,1)))
print(model.output_shape)

model.add(Convolution2D(32,(3,3),activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
# convert to 1D with Flatten
model.add(Flatten())
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10,activation='softmax'))
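
It can help to trace how the image size changes through these layers. The sketch below does the arithmetic by hand, assuming the Keras defaults used above (no padding, stride 1 for the convolutions, and a pooling stride equal to the pool size). conv_out is a helper name introduced here.

```python
def conv_out(size, kernel, stride=1):
    # Output side length of a convolution without padding ('valid')
    return (size - kernel) // stride + 1

side = 28
side = conv_out(side, 3)   # after the first 3x3 convolution
side = conv_out(side, 3)   # after the second 3x3 convolution
side = side // 2           # after 2x2 max pooling

# Flatten turns (side, side, 32 filters) into one long vector
flat_units = side * side * 32
print(side, flat_units)
```

This flattened length is the input size of the Dense(128) layer, which is where most of this model's weights live.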

In [None]:
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

model.fit(features_train,labels_train,validation_split=0.1)
# score holds [loss, accuracy] on the held-out validation split
score = model.evaluate(features_test,labels_test,verbose=0)
print(score)

In [None]:
test = test.reshape(test.shape[0],28,28,1)

# argmax over the softmax outputs gives the predicted digit for each image
pred = np.argmax(model.predict(test,verbose=0),axis=1)

pred = pd.DataFrame(pred)
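
To turn predictions into a submission file, a sketch like the one below could be used. It assumes the competition expects two columns named ImageId (starting at 1) and Label, which is not stated in this notebook, so check the sample submission first. toy_pred is a made-up stand-in for the real predictions, and the CSV is written to an in-memory buffer rather than to disk.

```python
import io
import numpy as np
import pandas as pd

# Stand-in for the predicted digits (hypothetical values)
toy_pred = np.array([2, 0, 9, 4])

# Assumed submission layout: ImageId counts from 1, Label is the predicted digit
submission = pd.DataFrame({
    "ImageId": np.arange(1, len(toy_pred) + 1),
    "Label": toy_pred,
})

# Write to an in-memory buffer; pass a file path to to_csv to save it instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```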

More to come. Please upvote if you find it useful.

You can increase the number of epochs (the epochs argument of model.fit) on your local machine to get better results.