# Problem
The purpose of this notebook is to show how important data preprocessing is. Here we do no preprocessing at all, so each model receives the vectorised raw data.


The main task is to classify grayscale images of handwritten digits (28 pixels by 28 pixels), into their 10 
categories (0 to 9). The dataset we will use is the MNIST dataset, a classic dataset in the machine learning community, which has been 
around for almost as long as the field itself and has been very intensively studied. It's a set of 60,000 training images, plus 10,000 test 
images, assembled by the National Institute of Standards and Technology (the NIST in MNIST) in the 1980s. You can think of "solving" MNIST 
as the "Hello World" of deep learning -- it's what you do to verify that your algorithms are working as expected. As you become a machine 
learning practitioner, you will see MNIST come up over and over again, in scientific papers, blog posts, and so on.

In [1]:
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Data preprocessing

Raw data

In [2]:
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Using TensorFlow backend.


In [3]:
all_images = [train_images[i] for i in range(0,train_images.shape[0])] + [test_images[i] for i in range(0,test_images.shape[0])]

all_labels = list(train_labels) + list(test_labels)

print("Shape of all images: {}\nLen of all labels: {}".format(len(all_images), len(all_labels)))

Shape of all images: 70000
Len of all labels: 70000


In [4]:
from sklearn.model_selection import train_test_split
from collections import Counter

train_images, test_images, train_labels, test_labels = train_test_split(all_images, all_labels, train_size=0.014, random_state=42, stratify=all_labels)

print("Training size: {}\nTest size: {}".format(len(train_images), len(test_images)))
print("Class counter\nIn train data: {}\n In test data: {}".format(Counter(train_labels), Counter(test_labels)))

Training size: 980
Test size: 69020
Class counter
In train data: Counter({1: 110, 7: 102, 3: 100, 2: 98, 0: 97, 9: 97, 6: 96, 4: 96, 8: 96, 5: 88})
 In test data: Counter({1: 7767, 7: 7191, 3: 7041, 2: 6892, 9: 6861, 0: 6806, 6: 6780, 8: 6729, 4: 6728, 5: 6225})
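The class counts printed above can be checked by hand: with `stratify=all_labels`, each digit should contribute roughly `train_size` times its total count to the training set. The sketch below only redoes that arithmetic on the numbers copied from the output; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the output of the split above.
train_counts = {1: 110, 7: 102, 3: 100, 2: 98, 0: 97,
                9: 97, 6: 96, 4: 96, 8: 96, 5: 88}
test_counts = {1: 7767, 7: 7191, 3: 7041, 2: 6892, 9: 6861,
               0: 6806, 6: 6780, 8: 6729, 4: 6728, 5: 6225}

train_fraction = 0.014  # the train_size passed to train_test_split

deviations = {}
for digit in sorted(train_counts):
    class_total = train_counts[digit] + test_counts[digit]
    expected = train_fraction * class_total  # ideal stratified share
    deviations[digit] = train_counts[digit] - expected
    print(f"digit {digit}: total={class_total}, "
          f"expected train={expected:.2f}, actual train={train_counts[digit]}")

# Largest gap between the actual and the ideal per-class training count.
max_deviation = max(abs(d) for d in deviations.values())
print("max deviation:", round(max_deviation, 2))
```

Every class ends up within one sample of its ideal share, which is what a stratified split is expected to deliver.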




Vectorised raw data

In [5]:
train_raw = np.array(train_images).reshape(len(train_images), 28 * 28)
test_raw = np.array(test_images).reshape(len(test_images), 28 * 28)
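To make the vectorisation step concrete, here is a minimal sketch on a toy array (2 "images" of 3 by 3 instead of 28 by 28; the toy sizes are an assumption for readability). It shows that `reshape` simply lays each image out row by row into a single feature vector.

```python
import numpy as np

# Two toy 3x3 "images" filled with consecutive integers.
images = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# Same call pattern as in the notebook, with 3 * 3 in place of 28 * 28.
flat = images.reshape(len(images), 3 * 3)

print(images[0])
print(flat[0])  # the first image, rows concatenated left to right
```

No pixel value changes in this step; only the shape does, so the models still see unscaled intensities.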

# Model SVM

In [8]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

Training the model on vectorised raw data

In [12]:
from sklearn import svm

parameters = {'kernel':('linear', 'rbf'), 'C': np.linspace(start = 0.001, stop = 2, num = 100)}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, scoring='accuracy', cv=StratifiedKFold())
clf.fit(train_raw, train_labels)
params = clf.best_params_
my_svc = clf.best_estimator_
print("Chosen params: ", params)


pred_labels = my_svc.predict(test_raw)

Chosen params:  {'C': 0.001, 'kernel': 'linear'}


In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print("Accuracy for raw images: {}".\
      format(accuracy_score(test_labels, pred_labels)))

Accuracy for raw images: 0.880672268907563


In [14]:
cm_raw = confusion_matrix(test_labels, pred_labels)

print("Confusion matrix for raw images:\n{}".\
      format(cm_raw))

Confusion matrix for raw images:
[[6417    2   72   19   12  187   62   12   16    7]
 [   0 7599   39   43    6   21   12   10   16   21]
 [ 134   82 5851  158   66   25  226  103  216   31]
 [ 109   51  115 6139    4  322   46   53  132   70]
 [  26   39   83   10 5974   31   68   38    9  450]
 [ 157   99   28  463   54 5142  139   30   73   40]
 [ 120   57  182   24   68   81 6229    5   11    3]
 [  31  135  100  101  138   28    3 6409   16  230]
 [  72  268   64  328   48  484   79   55 5202  129]
 [  54   40   40  100  329   69    2  319   86 5822]]


In [15]:
cr = classification_report(test_labels, pred_labels)

print("Classification report: \n{}".\
     format(cr))


Classification report: 
             precision    recall  f1-score   support

          0       0.90      0.94      0.92      6806
          1       0.91      0.98      0.94      7767
          2       0.89      0.85      0.87      6892
          3       0.83      0.87      0.85      7041
          4       0.89      0.89      0.89      6728
          5       0.80      0.83      0.82      6225
          6       0.91      0.92      0.91      6780
          7       0.91      0.89      0.90      7191
          8       0.90      0.77      0.83      6729
          9       0.86      0.85      0.85      6861

avg / total       0.88      0.88      0.88     69020




*   **PRECISION** = TP / (TP+FP)
*   **RECALL** = TP / (TP+FN)
*   **F1 score** = 2*PRECISION*RECALL/(PRECISION+RECALL)
*   **ACCURACY** = SUM OF DIAGONAL ELEMENTS/SUM OF ALL ELEMENTS
*   **MACRO AVG OF PRECISION** = SUM OF PRECISIONS/NUMBER OF CLASSES
*   **WEIGHTED AVG OF PRECISION** = SUM OVER CLASSES OF PRECISION(CLASS)*WEIGHT(CLASS), where **WEIGHT(CLASS)** = CLASS SUPPORT/ALL ELEMENTS
*   **MICRO AVG OF PRECISION** = SUM(TP(CLASS))/SUM(TP(CLASS)+FP(CLASS))
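The definitions above can be checked directly against the SVM confusion matrix printed earlier. The following is a minimal NumPy sketch, assuming the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# SVM confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = np.array([
    [6417,    2,   72,   19,   12,  187,   62,   12,   16,    7],
    [   0, 7599,   39,   43,    6,   21,   12,   10,   16,   21],
    [ 134,   82, 5851,  158,   66,   25,  226,  103,  216,   31],
    [ 109,   51,  115, 6139,    4,  322,   46,   53,  132,   70],
    [  26,   39,   83,   10, 5974,   31,   68,   38,    9,  450],
    [ 157,   99,   28,  463,   54, 5142,  139,   30,   73,   40],
    [ 120,   57,  182,   24,   68,   81, 6229,    5,   11,    3],
    [  31,  135,  100,  101,  138,   28,    3, 6409,   16,  230],
    [  72,  268,   64,  328,   48,  484,   79,   55, 5202,  129],
    [  54,   40,   40,  100,  329,   69,    2,  319,   86, 5822],
])

tp = np.diag(cm)               # correctly classified samples per class
fp = cm.sum(axis=0) - tp       # predicted as the class, but wrong
fn = cm.sum(axis=1) - tp       # belong to the class, but missed
support = cm.sum(axis=1)       # number of true samples per class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_precision = precision.mean()
weighted_precision = np.sum(precision * support / cm.sum())
micro_precision = tp.sum() / (tp.sum() + fp.sum())

print("accuracy:", accuracy)
print("precision per class:", np.round(precision, 2))
print("recall per class:", np.round(recall, 2))
```

For single-label multiclass problems such as this one, the micro-averaged precision equals the accuracy, because every misclassified sample is counted exactly once as a false positive.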
   



# Model RandomForest

Training the model on vectorised raw data

In [16]:
from sklearn.ensemble import RandomForestClassifier

parameters = {'n_estimators':[i for i in range(80, 130)] , 'max_depth': [None, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

# raw (vectorised) data
rf = RandomForestClassifier(class_weight='balanced')
clf_d = GridSearchCV(rf, parameters, scoring='accuracy', cv=StratifiedKFold())
clf_d.fit(train_raw, train_labels)
params = clf_d.best_params_
my_rf = clf_d.best_estimator_
print("Raw data:\nChosen params: ", params)

pred_labels = my_rf.predict(test_raw)

Raw data:
Chosen params:  {'max_depth': 9, 'n_estimators': 101}


In [17]:
print("Accuracy for raw images: {}".\
      format(accuracy_score(test_labels, pred_labels)))

Accuracy for raw images: 0.8924804404520429


In [18]:
cm_raw = confusion_matrix(test_labels, pred_labels)
 
print("Confusion matrix for raw images:\n{}".\
      format(cm_raw))

Confusion matrix for raw images:
[[6595    4   14   11    5   31   78   11   52    5]
 [   0 7565   47   28   10   25   23   20   35   14]
 [ 119   63 6007  140  102   24  126  156  119   36]
 [  92   81  149 6144   16  237   34   60  114  114]
 [  33   22   38    0 5923   13  103   25   49  522]
 [ 128  103   24  512   65 5047  165   47   45   89]
 [ 119   55   90    4  113   54 6311   15   19    0]
 [  22  147  149   12  110    5    1 6437   61  247]
 [  53  224  127  311   53  144   96   30 5503  188]
 [  59   35   56  108  234   45   12  197   48 6067]]


In [19]:
cr = classification_report(test_labels, pred_labels)

print("Classification report: \n{}".\
     format(cr))

Classification report: 
             precision    recall  f1-score   support

          0       0.91      0.97      0.94      6806
          1       0.91      0.97      0.94      7767
          2       0.90      0.87      0.88      6892
          3       0.85      0.87      0.86      7041
          4       0.89      0.88      0.89      6728
          5       0.90      0.81      0.85      6225
          6       0.91      0.93      0.92      6780
          7       0.92      0.90      0.91      7191
          8       0.91      0.82      0.86      6729
          9       0.83      0.88      0.86      6861

avg / total       0.89      0.89      0.89     69020



# Model Neural Network

In [29]:
from keras import models
from keras import layers

network_raw = models.Sequential()
network_raw.add(layers.Dense(512, activation='relu', input_shape=(784,)))
network_raw.add(layers.Dense(10, activation='softmax'))

In [30]:
network_raw.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

Label encoding (one-hot)

In [32]:
from keras.utils import to_categorical

encoded_train_labels = to_categorical(train_labels)
encoded_test_labels = to_categorical(test_labels)

encoded_test_labels

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]], dtype=float32)
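For readers who want to see what the encoding does without Keras, here is a small NumPy sketch of the same idea: each integer label becomes a row with a single 1 at the label's index. The sample labels are made up for illustration, and the identity-matrix indexing is an equivalent construction, not the Keras implementation.

```python
import numpy as np

labels = np.array([0, 9, 3, 1])  # made-up digit labels
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]

# Taking the argmax of each row recovers the original integer labels.
decoded = one_hot.argmax(axis=1)

print(one_hot)
print(decoded)
```

The same `argmax` step is what turns the network's softmax output back into predicted digits.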

Training our model

In [39]:
network_raw.fit(train_raw, encoded_train_labels, epochs=300, batch_size=128)

Epoch 1/300
Epoch 2/300
...

<keras.callbacks.callbacks.History at 0x156ea47b860>

In [41]:
pred_probabilities = network_raw.predict(test_raw)

pred_labels = np.argmax(pred_probabilities,-1)

In [42]:
print("Accuracy score for raw data: {}".\
     format(accuracy_score(test_labels, pred_labels)))

Accuracy score for raw data: 0.10360764995653433


In [43]:
cm_raw = confusion_matrix(test_labels, pred_labels)

print("Confusion matrix for raw images:\n{}".\
      format(cm_raw))

Confusion matrix for raw images:
[[   0    0 6796    0    0   10    0    0    0    0]
 [   0    1 6584    0    0 1182    0    0    0    0]
 [   0    0 6879    0    0   13    0    0    0    0]
 [   0    0 6974    0    0   67    0    0    0    0]
 [   0    0 6368    0    0  360    0    0    0    0]
 [   0    0 5959    0    0  266    0    0    0    0]
 [   0    0 6632    0    0  148    0    0    0    0]
 [   0    0 6488    0    0  698    0    5    0    0]
 [   0    0 6668    0    0   61    0    0    0    0]
 [   0    0 5893    0    0  968    0    0    0    0]]
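The column sums of this matrix show how the network's predictions are distributed over the classes. The sketch below rebuilds the matrix from its non-zero entries (copied from the output above) and computes the share of predictions per class; the variable names are illustrative.

```python
import numpy as np

# Neural-network confusion matrix, rebuilt from its non-zero entries
# (rows = true class, columns = predicted class).
cm = np.zeros((10, 10), dtype=int)
cm[:, 2] = [6796, 6584, 6879, 6974, 6368, 5959, 6632, 6488, 6668, 5893]
cm[:, 5] = [10, 1182, 13, 67, 360, 266, 148, 698, 61, 968]
cm[1, 1] = 1
cm[7, 7] = 5

predicted_counts = cm.sum(axis=0)            # how often each class was predicted
predicted_share = predicted_counts / cm.sum()
accuracy = np.trace(cm) / cm.sum()

print("predicted counts:", predicted_counts)
print("share of class 2:", round(float(predicted_share[2]), 3))
print("accuracy:", accuracy)
```

Almost all test images are assigned to class 2, so the accuracy is close to the roughly 10% that a constant prediction would give on ten near-balanced classes.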


In [45]:
cr = classification_report(test_labels, pred_labels)

print("Classification report for raw data: \n{}".\
     format(cr))

Classification report for raw data: 
             precision    recall  f1-score   support

          0       0.00      0.00      0.00      6806
          1       1.00      0.00      0.00      7767
          2       0.11      1.00      0.19      6892
          3       0.00      0.00      0.00      7041
          4       0.00      0.00      0.00      6728
          5       0.07      0.04      0.05      6225
          6       0.00      0.00      0.00      6780
          7       1.00      0.00      0.00      7191
          8       0.00      0.00      0.00      6729
          9       0.00      0.00      0.00      6861

avg / total       0.23      0.10      0.02     69020





# Conclusion

On raw data, RandomForest gave the best result (accuracy 0.89), with SVC close behind (0.88). The neural network failed completely (accuracy 0.10), assigning almost every image to class 2.
The conclusion is that training standard (not deep learning) algorithms on raw data is not the best idea. Some methods cope reasonably well, but in general it is better to preprocess the data.
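As an illustration of what a first preprocessing step could look like, the sketch below rescales 8-bit pixel intensities from the 0-255 range to [0, 1]. This step is not applied anywhere in this notebook, the helper name `scale_pixels` is hypothetical, and the input is random synthetic data rather than MNIST, so it says nothing about how much the results above would change.

```python
import numpy as np

def scale_pixels(vectors):
    """Convert 8-bit pixel vectors to float32 values in [0, 1]."""
    return np.asarray(vectors, dtype="float32") / 255.0

# Synthetic stand-in for vectorised images: 4 samples of 784 pixels each.
rng = np.random.default_rng(0)
fake_raw = rng.integers(0, 256, size=(4, 784), dtype=np.uint8)

scaled = scale_pixels(fake_raw)
print(scaled.dtype, scaled.min(), scaled.max())
```

Models trained by gradient descent are often sensitive to the scale of their inputs, so a step like this is a natural first thing to try before moving on to heavier preprocessing.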