# MNIST Classification

**Recep Inanc, BSc**

1. Introduction  
2. Data Exploration  
    2.1 Load Data  
    2.2 Check for null values  
    2.3 Understand data
3. Data Preprocessing  
    3.1 Feature Scaling / Normalization  
    3.2 Label Encoding  
4. Build Models  
    4.1  SVM  
    4.2 KNeighbors  
    4.3 Random Forest  
    4.4 Neural Network
5. Evaluate Models  
    5.1 Cross Validation   
6. Hyperparameter Tuning  
7. Predict and Submit  
    7.1 Confusion Matrix  
    7.2 Precision, Recall and F1 Scores  
    7.3 Predict and Submit Results

# 1. Introduction

Hello everyone! I started this kernel right after I finished reading on **Classification**, and since they say "MNIST is the `hello world` of classification", I jumped into this competition to have some hands on experience on that.

This kernel consists of *7 main parts*, and 5th and 6th are a bit interchangeble. I will try to build 4 different models to classify MNIST images, SVM, KNN, Random Decision Forest and a Neural Network.

So let's get to work!

PS: I had to comment out many pieces of this notebook since I was not able to `Commit&Run` that way. I once run whole notebook on my local, so I copied the outputs for the parts that I have commented out. Feel free to remove the comments and run the code. 

# 2. Data Exploration

This is the part where I get to know the data, how is it formatted, what properties it has etc.

### 2.1 Load Data

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

%matplotlib inline 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [6]:

test = pd.read_csv('test.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'test.csv'

In [5]:
train.info()

NameError: name 'train' is not defined

In [None]:
test.info()

Test set is missing one column, and that is the `label` column, since images are in the form of `28x28` we have `784` feature columns for each image.

### 2.2 Check for null values 

In [None]:
train.isnull().any().describe()

In [None]:
test.isnull().any().describe()

It seems like we do not have any missing values. Perfect!

### 2.3 Split the Data

As I learned; we should always put our test set aside when we are exploring dataset, to prevent our brain to mislead us. Since we are trying to create a solution that generalizes and not memorizes, it is important to modify our data by looking at only to train set and not the test set. Test set should only be used for final evaluation.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train_split = train.drop(['label'], axis=1).copy()
y_train_split = train['label'].copy()

X_train, X_validation, y_train, y_validation = train_test_split(X_train_split, y_train_split, test_size=0.1, random_state=42)

del X_train_split, y_train_split

print("Training Features:", X_train.shape)
print("Training Labels:", y_train.shape)
print("Validation Features:", X_validation.shape)
print("Validation Labels:", y_validation.shape)
print("Test Features:", test.shape)

### 2.4 Understand Data

In [None]:
X_train_explore = X_train.copy()
y_train_explore = y_train.copy()

del X_train, y_train

In [None]:
y_train_explore.value_counts().describe()

In [None]:
sns.set()
sns.countplot(x="label", data=y_train_explore.to_frame())

It looks like only 5 is little less than 4000 and the rest is almost evenly distributed.  


We can move on.

In [None]:
sample_digit = X_train_explore.iloc[2000] # a random instance
sample_digit_image = sample_digit.values.reshape(28, 28) # reshape it from (784,) to (28,28)
plt.imshow(sample_digit_image, # plot it as an image
           cmap = matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()

As you can see our data is in this given format.

# 3. Data Preprocessing

### 3.1 Feature Scaling / Normalization

Working with numerical data that is in between `0-1` is more effective for most of the machine learning algortihms than `0-255`.  
We can easily scale our features to `0-1` range by dividing to `max` value (255).

We could use `MinMaxScaler` from `sklearn.preprocessing` but since the formula for that is `(x-min)/(max-min)` and our `min` is 0, we could directly calculate `x/max` and that is `x/255`.  

This is going to give the same result. So let's do it!

**PS: Do not forget to scale test and validation examples before prediction**

In [None]:
X_train_scaled = X_train_explore.copy()
X_train_scaled = X_train_scaled / 255.0

X_train_scaled.head()

In [None]:
X_train = X_train_scaled.copy()
y_train = y_train_explore.copy()

del X_train_explore, X_train_scaled, y_train_explore

# 4. Build Model

We are going to build the base models first, then we are going to try to  `fine-tune` them.

## 4.1 SVM

We are going to create the SVM model.  
We are going to call `fit()` method with training data.

SVM's SVC uses `One-versus-Rest/All (OvA/OvR)` by default, meaning that system trains 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score.

So to building 10 different classifiers going to take some time.

In [None]:
"""
from sklearn.svm import SVC # Support Vector Classification
"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

svc_clf = SVC(gamma='auto', random_state=42, verbose=True)
svc_clf.fit(X_train, y_train)

"""

## 4.2 KNeighbors

We are going to create a K-Nearest Neighbor Classifier.  
We are going to `fit()` the data to the model.  

KNNs asks for a parameter `n_neighbors` which tells how many neighbor points should it check around it, and classify itself according to the ones that are closest to it.

In [None]:
"""
from sklearn.neighbors import KNeighborsClassifier
"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

kn_clf = KNeighborsClassifier()
kn_clf.fit(X_train, y_train)

"""

## 4.3 Random Forest

We are going to build the Random Forest classifier.
We are going to call `fit()` to train it.  

Random Forest is an ensemble machine learning algorithm, it trains many trees under the hood and the picks the one that performs the best. Random Forest has 2-3 parameters that we are going to tune and the better we tune it the better results we get.

In [None]:
"""
from sklearn.ensemble import RandomForestClassifier
"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

rf_clf = RandomForestClassifier(random_state=42, verbose=True)
rf_clf.fit(X_train, y_train)

"""

## 4.4 Neural Network Classifier

We are going to create the MLP classifier.
We are going to `fit()` the training data.

Multi-layer perceptron is the one that requires most modification. Of course default values are already set in its **\_\_init\_\_** method but it is better if we customize it according to our needs. Of course there are no strict rules these parameters but we are going to try to do our best.


In [None]:
"""
from sklearn.neural_network import MLPClassifier
"""

In [None]:
# Important parameters
# hidden layer size
# activation function
# alpha -> learning rate
# random_state -> set to get remove randomness effect for different runs
# momentum
# max_iter

"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.


mlp_clf = MLPClassifier(random_state=42)
mlp_clf.fit(X_train, y_train)

"""

# 5. Evaluate Models

In [None]:
"""
from sklearn.metrics import accuracy_score
"""

In [None]:
"""
X_validation_scaled = X_validation.copy()
X_validation_scaled = X_validation_scaled / 255.0
"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

svc_prediction = svc_clf.predict(X_validation_scaled)
print("SVC Accuracy:", accuracy_score(y_true=y_validation ,y_pred=svc_prediction))

"""

SVC Accuracy: 0.934047619047619

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

knn_prediction = kn_clf.predict(X_validation_scaled)
print("KNN Accuracy:", accuracy_score(y_true=y_validation ,y_pred=knn_prediction))

"""

KNN Accuracy: 0.9654761904761905

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

rf_prediction = rf_clf.predict(X_validation_scaled)
print("Random Forest Accuracy:", accuracy_score(y_true=y_validation ,y_pred=rf_prediction))

"""

Random Forest Accuracy: 0.9419047619047619

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

mlp_prediction = mlp_clf.predict(X_validation_scaled)
print("MLP Accuracy:", accuracy_score(y_true=y_validation ,y_pred=mlp_prediction))

"""

MLP Accuracy: 0.9745238095238096

These are the accuracy results for the models in their base forms, I mean without any tuning, **RandomForest** and **Neural Network (MLP)** performed well.

Thanks to [archaeocharlie](https://www.kaggle.com/archaeocharlie) I realised a different type of modification to apply and I changed the scale from grayscale to only black and white.

I am going to apply this and check the results for all other models.

### Transform Images to Black and White

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

X_validation_bw = X_validation.copy()
X_train_bw = X_train.copy()

X_validation_bw[X_validation_bw > 0] = 1
X_train_bw[X_train_bw > 0] = 1

"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

svc_bw_clf = SVC(gamma='auto', random_state=42, verbose=True)
svc_bw_clf.fit(X_train_bw, y_train)

"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

kn_bw_clf = KNeighborsClassifier()
kn_bw_clf.fit(X_train_bw, y_train)

"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

rf_bw_clf = RandomForestClassifier(random_state=42, verbose=True)
rf_bw_clf.fit(X_train_bw, y_train)

"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

mlp_bw_clf = MLPClassifier(random_state=42)
mlp_bw_clf.fit(X_train_bw, y_train)

"""

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

svc_bw_prediction = svc_clf.predict(X_validation_bw)
print("SVC BW Accuracy:", accuracy_score(y_true=y_validation ,y_pred=svc_bw_prediction))

"""

SVC BW Accuracy: 0.9226190476190477

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

knn_bw_prediction = kn_clf.predict(X_validation_bw)
print("KNN Accuracy:", accuracy_score(y_true=y_validation ,y_pred=knn_bw_prediction))

"""

KNN Accuracy: 0.9630952380952381

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

rf_bw_prediction = rf_clf.predict(X_validation_bw)
print("Random Forest Accuracy:", accuracy_score(y_true=y_validation ,y_pred=rf_bw_prediction))

"""

Random Forest Accuracy: 0.9166666666666666

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

mlp_bw_prediction = mlp_clf.predict(X_validation_bw)
print("MLP Accuracy:", accuracy_score(y_true=y_validation ,y_pred=mlp_bw_prediction))

"""

MLP Accuracy: 0.9595238095238096

Converting to BW kinda overwrite Min-Max scaling effect. So these are totally different transformation, and clearly feature scaling is a better approach.

## 5.1 Cross Validation

In [None]:
"""
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print("")
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
"""

### SVM

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

svm_scores = cross_val_score(svc_clf, X_train, y_train, scoring="neg_mean_squared_error", cv=10, verbose=10)
svm_rmse_scores = np.sqrt(-svm_scores)

print("SVM Scores\n")
display_scores(svm_rmse_scores)
"""

### KNN

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

kn_scores = cross_val_score(kn_clf, X_train, y_train, scoring="neg_mean_squared_error", cv=10, verbose=10)
kn_rmse_scores = np.sqrt(-kn_scores)

print("KNeighbor Scores\n")
display_scores(kn_rmse_scores)

"""

### Random Forest

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

rf_scores = cross_val_score(rf_clf, X_train, y_train, scoring="neg_mean_squared_error", cv=10, verbose=10)
rf_rmse_scores = np.sqrt(-rf_scores)

print("Random Forest Scores\n")
display_scores(rf_rmse_scores)

"""

### MLP 

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

mlp_scores = cross_val_score(mlp_clf, X_train, y_train, scoring="neg_mean_squared_error", cv=10, verbose=10)
mlp_rmse_scores = np.sqrt(-mlp_scores)

print("Neural Network Scores\n")
display_scores(mlp_rmse_scores)
"""

Scores: [0.63807008 0.75385403 0.70411055 0.6401844  0.81735991 0.72976497
 0.69986193 0.68349178 0.72595641 0.77671352]

Mean: 0.7169367594666385

Standard deviation: 0.05380741584858098

## 6. Hyperparameter Tuning

I am a big fan of GridSearch! You create a set of parameter combinations and you run your model with each of them and get the best parameter combination for that model. So let's do it!

### GridSearch

In [None]:
"""
from sklearn.model_selection import GridSearchCV
"""

### Neural Network

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

nn_parameter_grid = [
    {'hidden_layer_sizes': [(100, ), (200, ), (300, )],
     'solver': ['sgd', 'adam'],
     'learning_rate_init':[0.0001, 0.001]
    }
]

nn_grid_clf = MLPClassifier(random_state=42, verbose=True)
nn_grid_search = GridSearchCV(nn_grid_clf,
                              nn_parameter_grid,
                              cv=3,
                              scoring='neg_mean_squared_error',
                              verbose=3)
nn_grid_search.fit(X_train, y_train)

"""

In [None]:
"""

cvres = nn_grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

"""

1.3016777939893878 {'hidden_layer_sizes': (100,), 'learning_rate_init': 0.0001, 'solver': 'sgd'}  
0.7858874268106391 {'hidden_layer_sizes': (100,), 'learning_rate_init': 0.0001, 'solver': 'adam'}  
0.9632946111841942 {'hidden_layer_sizes': (100,), 'learning_rate_init': 0.001, 'solver': 'sgd'}  
0.7544751846368469 {'hidden_layer_sizes': (100,), 'learning_rate_init': 0.001, 'solver': 'adam'}  
1.2809553313236808 {'hidden_layer_sizes': (200,), 'learning_rate_init': 0.0001, 'solver': 'sgd'}  
0.7334956530313408 {'hidden_layer_sizes': (200,), 'learning_rate_init': 0.0001, 'solver': 'adam'  }  
0.8997648134800114 {'hidden_layer_sizes': (200,), 'learning_rate_init': 0.001, 'solver': 'sgd'}  
0.712455326734661 {'hidden_layer_sizes': (200,), 'learning_rate_init': 0.001, 'solver': 'adam'}  
1.2637811745483698 {'hidden_layer_sizes': (300,), 'learning_rate_init': 0.0001, 'solver': 'sgd'}  
0.7171187204774808 {'hidden_layer_sizes': (300,), 'learning_rate_init': 0.0001, 'solver': 'adam'}  
0.8738408800221489 {'hidden_layer_sizes': (300,), 'learning_rate_init': 0.001, 'solver': 'sgd'}  
0.7087510439696313 {'hidden_layer_sizes': (300,), 'learning_rate_init': 0.001, 'solver': 'adam'}

In [None]:
"""

nn_grid_search.best_params_

"""

{'hidden_layer_sizes': (300,), 'learning_rate_init': 0.001, 'solver': 'adam'}

### Random Forest

In [None]:
"""

I am commenting out this section since it is taking too much time, but feel free to uncomment and run it.

rf_parameter_grid = [
    {
        'n_estimators': [60, 100, 200, 500],
        'max_features': [12, 30, 100, 300, 'auto']
    }
]

rf_grid_clf = RandomForestClassifier(random_state=42, verbose=True)
rf_grid_search = GridSearchCV(rf_grid_clf,
                              rf_parameter_grid,
                              cv=None,
                              scoring='neg_mean_squared_error',
                              verbose=2)
rf_grid_search.fit(X_train, y_train)

"""

In [None]:
"""

cvres = rf_grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
    
"""

0.8918650572947011 {'max_features': 12, 'n_estimators': 60}  
0.8688312407295935 {'max_features': 12, 'n_estimators': 100}  
0.8511274812964209 {'max_features': 12, 'n_estimators': 200}  
0.8394221998670452 {'max_features': 12, 'n_estimators': 500}  
0.8549575316659775 {'max_features': 30, 'n_estimators': 60}  
0.8316331863350311 {'max_features': 30, 'n_estimators': 100}  
0.8305668364684866 {'max_features': 30, 'n_estimators': 200}  
0.8183413344580241 {'max_features': 30, 'n_estimators': 500}  
0.8433210991019234 {'max_features': 100, 'n_estimators': 60}  
0.8353309390761112 {'max_features': 100, 'n_estimators': 100}  
0.8305827622023737 {'max_features': 100, 'n_estimators': 200}  
0.8200529083460241 {'max_features': 100, 'n_estimators': 500}  
0.8734775114237132 {'max_features': 300, 'n_estimators': 60}  
0.8598326402923824 {'max_features': 300, 'n_estimators': 100}  
0.8533469740973235 {'max_features': 300, 'n_estimators': 200}  
0.8530524091158835 {'max_features': 300, 'n_estimators': 500}  
0.85441585602472 {'max_features': 'auto', 'n_estimators': 60}  
0.838002790716521 {'max_features': 'auto', 'n_estimators': 100}  
0.8238668665243273 {'max_features': 'auto', 'n_estimators': 200}  
0.8244767460407266 {'max_features': 'auto', 'n_estimators': 500}  

In [None]:
"""
rf_grid_search.best_params_
"""

{'max_features': 30, 'n_estimators': 500}

# 7. Predict and Submit

## 7.1 Confusion Matrix

In [None]:
nn_tuned_clf = MLPClassifier(hidden_layer_sizes=(300,),
                            learning_rate_init=0.001,
                            solver='adam',
                            random_state=42,
                            verbose=True)
nn_tuned_clf.fit(X_train, y_train)

In [None]:
nn_tuned_pred = nn_tuned_clf.predict(X_validation_scaled)

## 7.2 Precision, Recall and F1 Scores

In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, precision_recall_fscore_support

In [None]:
nn_precisions, nn_recalls, nn_f_beta_scores, nn_support = precision_recall_fscore_support(y_validation, nn_tuned_pred)
print("Precision of each class:", nn_precisions, "\n")
print("Recall of each class:", nn_recalls, "\n")
print("F Scores of each class:", nn_f_beta_scores, "\n")
print("Support of each class:", nn_support, "\n")

See [Sklearn Documentation for precision_recall_fscore_support()](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support)

The **precision** is the ratio `tp / (tp + fp)` where tp is the number of true positives and fp the number of false positives. _The precision is intuitively the ability of the classifier **not to label as positive a sample that is negative**._

The **recall** is the ratio `tp / (tp + fn)` where tp is the number of true positives and fn the number of false negatives. _The recall is intuitively the ability of the classifier **to find all the positive samples**._

The **F-beta score** can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its `best value at 1 and worst score at 0`.

The F-beta score weights recall more than precision by a factor of beta. `beta == 1.0` means `recall and precision are equally important.`

In [None]:
f1_score(y_validation, nn_tuned_pred, average="micro") # This is the average

In [None]:
nn_conf_matrix = confusion_matrix(y_validation, nn_tuned_pred)

In [None]:
plt.matshow(nn_conf_matrix, cmap=plt.cm.gray)
plt.show()

Currently we can not talk about this graph so much, since these are the actual values and depending on the frequency of each digit this result may mislead us. It is better to look at the **error rates** and **not the actual number** of errors (mis-classified).

Because, if we had 1000 5s and 9999 1s, having 999 false classifications (errors) in 5s, and 1000 in 1s would be seen as "We had less error in 5s". But if we have used **rates** it would be clearer.

In [None]:
nn_row_sums = nn_conf_matrix.sum(axis=1, keepdims=True)
nn_norm_conf_mx = nn_conf_matrix / nn_row_sums

In [None]:
np.fill_diagonal(nn_norm_conf_mx, 0) # to keep only the errors we fill diagonal with 0s, since diagonal shows the ones that are correctly classified.
plt.matshow(nn_norm_conf_mx, cmap=plt.cm.gray)
plt.show()

The brighter squares represents higher values, meaning higher error rates.  

We can see that most of the **3s are mis-classified as 5**, some of the **4s are mis-classified as 9s**, and some of **7s are mis-classified as 2**. To solve these issues we may add some new features:

For example the main difference betweeen 5 and 3 is the postiion of the line in between, in 3s it closer to the middle, in 5s it is a bit higher.  

For 4 and 9 we can check to see if there are complete circles, which indicates that is a 9.  

For 7 and 2, the additional line in the bottom is what makes the most of the difference.

To get a better understanding of the error we may examine the mis-classified examples to what might be the reason for the error.

**PS: Increasing the total number of training samples for each digit would result in a increase in performance. We can also preprocess the images to make sure that they are not rotated and they fit well in the matrix.**

## 7.3 Predict and Submit Results

We are going to build the final classifier again, using the whole training set (including the validation set this time). Predict result for test set and submit our results.

### Get the dataset again and normalize it

In [None]:
"""

final_train = train.copy()

final_X_train = final_train.drop(['label'], axis=1).copy()
final_y_train = final_train['label'].copy()

del final_train

final_X_train_scaled = final_X_train.copy()
final_X_train_scaled = final_X_train_scaled / 255.0
"""


### Train The Best Model (MLP in this case)

In [None]:
"""
from sklearn.neural_network import MLPClassifier
"""

In [None]:
"""

final_nn_clf = MLPClassifier(hidden_layer_sizes=(300,),
                            learning_rate_init=0.001,
                            solver='adam',
                            random_state=42)
final_nn_clf.fit(final_X_train_scaled, final_y_train)
"""


### Predict on Test Set

In [None]:
"""
final_prediction = final_nn_clf.predict(test)
"""

### Submit Results

In [None]:
"""

submission = pd.DataFrame({"ImageId": list(range(1,len(final_prediction)+1)),
                          "Label": final_prediction})
"""

In [None]:
"""
submission.to_csv("cnn_mnist_submission.csv", index=False)
"""