1. Summary
2. Import the relevant libraries
3. Loading the MNIST data
4. Exploratory data analysis
5. Preprocess the data set

   5.1 Cleaning the data set
   
   5.2 Separate Features and Labels
   
6. Plotting the data set
7. Data Splitting Process
8. Training 

   8.1 Training a binary classifier

         # STOCHASTIC GRADIENT DESCENT
         # RANDOM FOREST ALGORITHM
         # Comparing with a dumb classifier
         
9. Performance Measures

   9.1 Cross Validation
   
   9.2  Confusion Matrix
   
   9.3  Precision 
   
   9.4 Recall 
   
   9.5  F1 
   
   9.6 Precision/Recall Trade-off
   
10. The Test set

# 1. Summary

   
The goal of this notebook is to analyze a classification model on the MNIST data set. We detect a single digit from MNIST using binary classifiers. (A classifier is a machine learning algorithm that determines the class to which the input data belongs, based on a set of features.) We then evaluate several performance measures and choose the model that performs best.

# 2. Import the relevant libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# Data Splitting Process

from sklearn.model_selection import train_test_split

# Training Process

from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

# Performance Measures 

from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 3. Loading the MNIST data

In [None]:
mnist_train = pd.read_csv("../input/digit-recognizer/train.csv")
mnist_test  = pd.read_csv("../input/digit-recognizer/test.csv")

In [None]:

#Take copies of the master dataframes

train = mnist_train.copy()
test = mnist_test.copy()

# 4. Exploratory data analysis

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.head()

In [None]:
train.tail()

In [None]:
test.head()

In [None]:
test.tail()

In [None]:
train.describe()

In [None]:
print(train.keys())

In [None]:
print(test.keys())

# 5. Preprocess the data set

### 5.1 Cleaning the data set

In [None]:
train.isnull().any().any()

The result is False, which means the data set is already clean: it has no missing values.

### 5.2 Separate Features and Labels

In [None]:
X, y = train.drop(labels = ["label"],axis = 1).to_numpy(), train["label"]
X.shape

In [None]:
X.shape

In [None]:
y.shape

# 6. Plotting the data set

The feature vector X[20] holds the pixel data for an image of the digit 8: 784 pixels = 28 × 28.
The label y[20] is 8.

In [None]:
some_digit = X[20]
some_digit_show = plt.imshow(X[20].reshape(28,28), cmap=mpl.cm.binary)
y[20]
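
As a minimal sketch of the reshape step, the cell below uses a synthetic 784-element vector (not an MNIST image) to show how a flat row maps back to a 28 × 28 grid. The name `flat_pixels` is a hypothetical stand-in for one row of `X`, and the sketch assumes the pixels are stored row by row.

```python
import numpy as np

# Hypothetical stand-in for one row of X: 784 values stored row by row.
flat_pixels = np.arange(784)

# reshape does not change the data, only how it is indexed.
image = flat_pixels.reshape(28, 28)

# Pixel (row, col) of the image is element 28 * row + col of the flat vector.
row, col = 2, 5
print(image.shape)
print(image[row, col] == flat_pixels[28 * row + col])
```

This is why `plt.imshow` needs the reshaped array: it expects a 2D grid, while each row of the CSV stores the image as one long vector.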

In [None]:
y = y.astype(np.uint8)

# 7. Data Splitting Process

## 7.1 Splitting Train and Test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# 8. Training Process

## 8.1 Training a binary classifier

We train our models to detect a single digit, 8, so the target is True for every 8 and False for every other digit.

In [None]:
y_train_8 = (y_train == 8)
y_test_8 = (y_test == 8)

#### STOCHASTIC GRADIENT DESCENT

In [None]:
sgd_clf = SGDClassifier(max_iter=1000,random_state = 42)
sgd_clf.fit(X_train, y_train_8)

In [None]:
sgd_clf.predict([some_digit])

#### RANDOM FOREST ALGORITHM

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train_8)

In [None]:
rf_clf.predict([some_digit])

So we can conclude that the prediction for some_digit (X[20]) is True: it is detected as an 8.

# 9. Performance Measures

After building our machine learning models, we need to evaluate the performance of both (SGD and Random Forest). There are many performance measures; in this notebook we use cross-validation, the confusion matrix, precision, recall, the F1 score, and the precision/recall curve, and then analyze which model performs better.

### 9.1 Cross Validation

#### Stochastic Gradient Descent

To evaluate a classifier we can use cross-validation. Note, however, that accuracy is generally not the preferred performance measure for classifiers, especially when some classes are much more frequent than others.

In [None]:
cv_score_sgd = cross_val_score(sgd_clf, X_train, y_train_8, cv = 3, scoring = "accuracy")

In [None]:
cv_score_sgd = np.mean(cv_score_sgd)
cv_score_sgd
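
To make the idea of `cv = 3` concrete, here is a rough sketch of how a data set can be cut into folds, where each fold is used once for validation while the others are used for training. The helper `k_fold_indices` is hypothetical and uses plain contiguous folds; Scikit-Learn's own splitting may differ (for classifiers it stratifies by class by default).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "| train on", len(train_idx), "samples")
```

Every sample is validated exactly once, and the reported cross-validation score is the mean of the per-fold scores.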

#### Random Forest

In [None]:
cv_score_rf = cross_val_score(rf_clf, X_train, y_train_8, cv= 3, scoring = "accuracy")

In [None]:
cv_score_rf = np.mean(cv_score_rf)
cv_score_rf

#### Comparing with a dumb classifier

An accuracy of around 92% seems good, but to put it in context we create a dumb "Never8Classifier" by extending Scikit-Learn's BaseEstimator. It classifies every image as "not 8".

In [None]:
class Never8Classifier(BaseEstimator):
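    def fit(self, X, y=None):
        # Nothing to learn; return self to follow the Scikit-Learn estimator convention
        return self
    def predict(self, X):
        # Always predict "not 8": one False per input row
        return np.zeros(len(X), dtype=bool)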

never_8_clf = Never8Classifier()

cross_val_score(never_8_clf, X_train, y_train_8, cv=3, scoring="accuracy")

Only about 10% of the images are 8s, so if we always guess that an image is not an 8, we are right about 90% of the time.
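
The arithmetic behind this baseline can be checked with a small sketch. The labels below are synthetic (10 positives out of 100, chosen to mimic a roughly 10% positive rate), not the MNIST labels.

```python
# Synthetic labels: 10 positives ("8") out of 100 samples.
y_true = [True] * 10 + [False] * 90
y_pred = [False] * len(y_true)  # always predict "not 8"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
positive_rate = sum(y_true) / len(y_true)

print(accuracy)       # 0.9
print(positive_rate)  # 0.1
```

The accuracy of an always-negative classifier is simply one minus the positive rate, which is why accuracy alone says little on a skewed data set.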

## 9.2 Confusion Matrix

#### Stochastic Gradient Descent

A good way to measure the performance of a classifier is to look at the confusion matrix. The confusion matrix summarizes the number of correct and incorrect predictions as counts, broken down by class.

To calculate the confusion matrix we need a set of predictions that can be compared to the actual targets.

Rather than touching the test set, we can use Scikit-Learn's cross_val_predict() function, which returns the prediction made for each training instance by a model that did not see that instance during training.

In [None]:
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_8, cv= 3)


confusion_matrix(y_train_8, y_train_pred)

Each row in the confusion matrix represents an actual class, and each column represents a predicted class.

The first row covers the actual non-8s: 32,684 were correctly classified as non-8s (True Negatives) and 1,456 were wrongly classified as 8s (False Positives).
The second row covers the actual 8s: 1,226 were wrongly classified as non-8s (False Negatives) and 2,434 were correctly classified as 8s (True Positives).
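
The layout (rows are actual classes, columns are predicted classes) can be illustrated with a tiny hand-made example. The helper `confusion_counts` is hypothetical and only sketches the counting for the binary case; it is not how Scikit-Learn implements `confusion_matrix`.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for boolean labels and predictions."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and predicted:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

# Made-up labels and predictions for eight samples.
y_true = [False, False, False, True, True, False, True, False]
y_pred = [False, True, False, True, False, False, True, False]

matrix = confusion_counts(y_true, y_pred)
print(matrix)  # [[4, 1], [1, 2]]
```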

## 9.3 Precision

The confusion matrix gives a lot of information, but sometimes we prefer a more concise metric. One such metric is the accuracy of the positive predictions, called the PRECISION of the classifier.

Precision is the ratio of correctly predicted positive observations to the total number of predicted positive observations: TP / (TP + FP).

In [None]:
precision_score(y_train_8, y_train_pred)

## 9.4 Recall

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class: TP / (TP + FN). Recall is also called sensitivity or the true positive rate (TPR).

In [None]:
recall_score(y_train_8, y_train_pred)

The 8-detector no longer looks as good as its accuracy suggested: when it claims an image represents an 8, it is correct only 62.5% of the time. Moreover, it detects only 66.5% of the 8s.
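
As a quick check, both figures can be recomputed by hand from the four counts reported in the confusion matrix section. The variable names below are ours; the counts are the ones quoted in this notebook.

```python
# Counts quoted in the confusion matrix section of this notebook.
tn, fp, fn, tp = 32684, 1456, 1226, 2434

precision = tp / (tp + fp)                   # share of predicted 8s that are real 8s
recall = tp / (tp + fn)                      # share of real 8s that were detected
accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct

print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```

The accuracy stays high because the many true negatives dominate the total, while precision and recall look only at the rare positive class.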

## 9.5 F1 Score

The F1 score combines precision and recall into a single metric: it is the harmonic mean of precision and recall, so it is high only when both are high.

In [None]:
Score = f1_score(y_train_8, y_train_pred)
print(Score)
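
The harmonic mean can also be worked out by hand. This sketch reuses the counts quoted in the confusion matrix section and shows that the harmonic-mean form and the count-based form of F1 agree; the variable names are ours.

```python
# Counts quoted in the confusion matrix section of this notebook.
fp, fn, tp = 1456, 1226, 2434

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1_harmonic = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts.
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f1_harmonic, 4), round(f1_counts, 4))
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the smaller of the two values, so a classifier cannot get a high F1 by maximizing only precision or only recall.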

## 9.6 Precision/Recall Trade-off

We can plot precision and recall against the decision threshold by using the decision scores. Scikit-Learn does not let us set the threshold used by predict() directly, but decision_function() returns a score for each instance, and we can apply any threshold we like to those scores to decide whether an instance is classified as 8 or not 8.

#### Stochastic Gradient Descent

In [None]:
y_scores= cross_val_predict(sgd_clf, X_train, y_train_8, cv=3, method="decision_function")
print(y_scores)

In [None]:
precisions, recalls, thresholds = precision_recall_curve(y_train_8,y_scores)

# here we use matplotlib to plot recall and precision as functions of the thresholds

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center left")
    plt.ylim([0, 1])
    plt.title('Precision and recall versus the decision threshold')

    
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

In [None]:
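# Note: this overwrites the cross-validated y_scores with the score of a single digit
y_scores = sgd_clf.decision_function([X_train[0]])
print("Score for 1st digit: {0}".format(y_scores[0]))
# y_train_8 is a pandas Series whose index was shuffled by train_test_split, so use positional access
print("Was this digit a real 8? {0}".format(y_train_8.iloc[0]))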

digit_image = X_train[0].reshape(28,28)
plt.imshow(digit_image, cmap= matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.title("Digit image")
plt.show()

Here we set the threshold to a very low value, -200000, and classify the digit as an 8 if its score exceeds it.

In [None]:
threshold = -200000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
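
To see the trade-off itself, here is a small sketch on made-up scores and labels (not the SGD scores from this notebook). The helper `precision_recall_at` is hypothetical; returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall_at(scores, labels, threshold):
    """Classify score > threshold as positive and return (precision, recall)."""
    preds = [s > threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention for no positive predictions
    recall = tp / (tp + fn)
    return precision, recall

# Made-up decision scores and true labels for eight samples.
scores = [-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0]
labels = [False, False, True, False, True, False, True, True]

for threshold in (-2.0, 0.0, 2.0):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(threshold, round(precision, 3), round(recall, 3))
```

On this toy data, raising the threshold increases precision and lowers recall, and lowering it does the opposite. That is the trade-off the plots in this section visualize.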

Another way to select a good threshold is to plot precision directly against recall.

In [None]:
def print_recalls_precision(recalls, precisions, title):
    plt.figure(figsize=(8,6))
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.title("Precision vs Recall plot - {0}".format(title), fontsize=16)
    plt.axis([0,1,0,1])
    plt.show()
print_recalls_precision(recalls, precisions, "stochastic gradient descent")

Let's use RandomForestClassifier and compare it with SGDClassifier

#### Random Forest

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_probas_forest = cross_val_predict(rf_clf, X_train, y_train_8, cv= 3, method= "predict_proba")
y_scores_forest = y_probas_forest[:,1]

# y_probas_forest contains 2 columns, one per class. Each row's sum of probabilities is equal to 1

precisions_forest, recalls_forest, thresholds = precision_recall_curve(y_train_8,y_scores_forest)
print_recalls_precision(recalls_forest, precisions_forest, "Random Forest Classifier")

The graph shows that the Random Forest classifier performs clearly better than the SGD classifier.

Next, we plot the same graph for the dumb classifier, so that we can compare all three classifiers: the dumb classifier, the Random Forest classifier, and the SGD classifier.

#### Dumb classifier

In [None]:
never_8_predictions = cross_val_predict(never_8_clf, X_train, y_train_8, cv=3)

precisions_dumb, recalls_dumb, thresholds = precision_recall_curve(y_train_8, never_8_predictions)

print_recalls_precision(recalls_dumb, precisions_dumb, "dumb classifier")

In [None]:
plt.figure(figsize=(8,6))
# Recall goes on the x-axis and precision on the y-axis, matching the axis labels
plt.plot(recalls_forest, precisions_forest, "-r", label="Random Forest")
plt.plot(recalls, precisions, "-g", label="stochastic gradient descent")
plt.plot(recalls_dumb, precisions_dumb, "-b", label="dumb classifier")
# A random guess has a precision equal to the fraction of positives at every recall level
positive_rate = y_train_8.mean()
plt.plot([0, 1], [positive_rate, positive_rate], "k--", label="Random guess")

plt.xlabel("Recall", fontsize=16)
plt.ylabel("Precision", fontsize=16)

plt.title("Precision vs Recall - model comparison", fontsize=16)
plt.axis([0,1,0,1])
plt.legend(loc="center left")
plt.show()

In [None]:
print("F1 score for dumb classifier: {0}".format(f1_score(y_train_8, never_8_predictions)))
print("F1 score for SGD classifier: {0}".format(f1_score(y_train_8, y_train_pred)))
print("F1 score for Random Forest: {0}".format(f1_score(y_train_8, y_scores_forest > 0.5)))

We can conclude that the Random Forest classifier performs better than the other classifiers.

# 10. The test set

In [None]:
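# Predict on the Kaggle test set (28,000 images) so the length matches the ImageId range;
# the predictions are 1 for "8" and 0 for "not 8"
predictions_sgd = sgd_clf.predict(test.to_numpy()).astype(int)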

In [None]:
Label = pd.Series(predictions_sgd,name = 'Label')
ImageId = pd.Series(range(1,28001),name = 'ImageId')
submission = pd.concat([ImageId,Label],axis = 1)
submission.to_csv('submission.csv',index = False)

In [None]:
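# rf_clf was re-created (unfitted) for cross-validation, so fit a fresh model before predicting
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train_8)

# Predict on the Kaggle test set (28,000 images) so the length matches the ImageId range
predictions_forest = clf.predict(test.to_numpy()).astype(int)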

Label = pd.Series(predictions_forest,name = 'Label')
ImageId = pd.Series(range(1,28001),name = 'ImageId')
submission = pd.concat([ImageId,Label],axis = 1)
submission.to_csv('submission_forest.csv',index = False)