**DS 301: Applied Data Modeling and Predictive Analysis**

**Lecture 5 – Binary Classification and Performance Measures**

# Binary Classifier with MNIST Dataset

Nok Wongpiromsarn, 31 August 2020

**Credit:** A large portion of the code has been taken from Chapter 3 of Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*.

## Get, Visualize, and Prepare the Data for Machine Learning

**Load the mnist handwritten digit dataset**

In [None]:
from sklearn.datasets import fetch_openml

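# as_frame=False returns NumPy arrays; newer scikit-learn versions may
# otherwise return pandas objects, which the indexing below does not assume.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)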

In [None]:
print(mnist.keys())
print(mnist.DESCR)

In [None]:
x, y = mnist["data"], mnist["target"]
print(x.shape)
print(y.shape)

**Plot the data**

In [None]:
from common_plots import DigitPlotter

num_plots = 5
plotter = DigitPlotter((28, 28))
plotter.is_binary_cm = True
plotter.plot_multiple(x[0:num_plots], y[0:num_plots])

**Convert the target (y) from string to 8-bit unsigned integer (uint8, range 0 to 255)**

In [None]:
import numpy as np

print(type(y[0]))
y = y.astype(np.uint8)
print(type(y[0]))
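
The conversion matters because the binary target built later compares labels against the integer 5. A minimal sketch on a few made-up string labels (not actual MNIST rows) shows the effect of `astype(np.uint8)`:

```python
import numpy as np

# Toy labels stored as strings, mimicking how the targets arrive.
# The values are illustrative and not taken from the dataset.
labels_str = np.array(["5", "0", "4", "5"], dtype=object)

# Convert each string to an 8-bit unsigned integer.
labels = labels_str.astype(np.uint8)

# With integer labels, comparing against 5 gives a boolean mask.
is_five = (labels == 5)
print(labels.dtype, is_five)
```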

**Split the data into training and testing sets**

In [None]:
x_train, x_test, y_train, y_test = x[:60000], x[60000:], y[:60000], y[60000:]

## Train a model

**Use linear classifiers with stochastic gradient descent (SGD) training**

In [None]:
# Get the binary target (5 or not 5)
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

from sklearn.linear_model import SGDClassifier

# Some hyperparameters, such as max_iter and tol, will have a different
# default value in future versions of Scikit-Learn.
# To be future-proof, we explicitly specify these hyperparameters.
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(x_train, y_train_5)
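
To make the idea of a linear classifier concrete, the sketch below scores one instance as a weighted sum of its features plus a bias and compares the score to a threshold. The weights, bias, and feature values are hypothetical and chosen by hand; the sketch does not show how SGD learns them.

```python
# Hypothetical parameters of a linear classifier with three features.
weights = [0.5, -1.0, 2.0]
bias = -0.25

def decision_score(features):
    """Weighted sum of the features plus the bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict(features, threshold=0.0):
    """Predict the positive class when the score exceeds the threshold."""
    return decision_score(features) > threshold

sample = [1.0, 0.5, 0.25]  # made-up feature values
score = decision_score(sample)
print("score:", score)
print("positive at threshold 0.0:", predict(sample))
print("positive at threshold 0.5:", predict(sample, threshold=0.5))
```

Raising the threshold turns the same score into a negative prediction, which is the mechanism behind the precision/recall trade-off.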

## Performance Measures

**1. Training Accuracy**

In [None]:
from sklearn import metrics
y_train_pred = sgd_clf.predict(x_train)
print(metrics.accuracy_score(y_train_5, y_train_pred))

**2. Cross-Validation Accuracy**

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, x_train, y_train_5, cv=3, scoring="accuracy")
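
High accuracy is less impressive than it looks when the positive class is rare: only roughly one image in ten is a 5. As a rough illustration on made-up labels (not MNIST results), a "classifier" that never predicts the positive class still scores 90% accuracy:

```python
# Toy labels with one positive in ten, similar in spirit to "5 vs. not 5".
# These values are invented for illustration.
y_true = [True] + [False] * 9

# A baseline that always predicts the negative class.
y_pred_never = [False] * len(y_true)

# Accuracy is the fraction of predictions that match the labels.
correct = sum(t == p for t, p in zip(y_true, y_pred_never))
baseline_accuracy = correct / len(y_true)
print("baseline accuracy:", baseline_accuracy)
```

This is why the remaining measures look at the two kinds of error separately.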

**3. Confusion Matrix**

In [None]:
# Get the predictions made on each test fold
# With cross_val_predict, the data is split according to the cv parameter.
# Each sample belongs to exactly one test set, and its prediction is computed 
# with an estimator fitted on the corresponding training set.
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, x_train, y_train_5, cv=3)

# Compute the confusion matrix by passing the target (y_train_5)
# and the prediction (y_train_pred)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
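
For a binary problem, `confusion_matrix` puts the actual classes in rows and the predicted classes in columns, so the layout is `[[TN, FP], [FN, TP]]`. The sketch below counts the four cells by hand in plain Python on a small made-up example:

```python
# Made-up labels and predictions for eight instances.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]

pairs = list(zip(y_true, y_pred))
tn = sum((not t) and (not p) for t, p in pairs)  # true negatives
fp = sum((not t) and p for t, p in pairs)        # false positives
fn = sum(t and (not p) for t, p in pairs)        # false negatives
tp = sum(t and p for t, p in pairs)              # true positives

# Rows: actual negative, actual positive. Columns: predicted negative, predicted positive.
matrix = [[tn, fp], [fn, tp]]
print(matrix)
```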

**4. Precision, Recall and F1 Score**

In [None]:
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_train_5, y_train_pred)
recall = recall_score(y_train_5, y_train_pred)
print("precision: ", precision)
print("recall:    ", recall)

In [None]:
from sklearn.metrics import f1_score
f1 = f1_score(y_train_5, y_train_pred)
print("f1 score: ", f1)
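
The three scores follow directly from the confusion matrix counts: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. A short sketch with hypothetical counts:

```python
# Hypothetical confusion matrix counts (not results from the model above).
tp, fp, fn = 3, 1, 2

# Precision: of the instances predicted positive, how many are truly positive.
precision_manual = tp / (tp + fp)

# Recall: of the truly positive instances, how many were detected.
recall_manual = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("precision:", precision_manual)
print("recall:   ", recall_manual)
print("f1 score: ", f1_manual)
```

Because the harmonic mean is pulled toward the smaller value, F1 is high only when both precision and recall are high.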

**Precision/Recall Trade-off**

In [None]:
# Use precision_recall_curve to compute precision and recall 
# for all possible thresholds
from sklearn.metrics import precision_recall_curve
y_scores = cross_val_predict(sgd_clf, x_train, y_train_5, cv=3, 
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

In [None]:
# Plot precision and recall versus the decision threshold
# The lengths of precisions and recalls are each 1 more than that of thresholds.
# The last element of precisions is always 1 and the last element of recalls is always 0.
import matplotlib.pyplot as plt
plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
plt.grid()
plt.legend()
plt.xlabel("Threshold")
plt.show()

In [None]:
# Plot precision versus recall
plt.plot(recalls, precisions)
plt.grid()
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()

In [None]:
# Get the lowest threshold that gives at least 90% precision
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]

y_train_pred_90 = (y_scores >= threshold_90_precision)
precision = precision_score(y_train_5, y_train_pred_90)
recall = recall_score(y_train_5, y_train_pred_90)
print("precision: ", precision)
print("recall:    ", recall)
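
The same threshold search can be sketched by brute force on a handful of made-up scores. This is only an illustration of the idea, not how `precision_recall_curve` is implemented; the labels, scores, and the 75% target are assumptions chosen to keep the arithmetic easy to follow.

```python
# Made-up labels and decision scores for eight instances.
y_true = [False, False, True, False, True, True, False, True]
scores = [-3.0, -1.5, -1.0, 0.5, 1.0, 2.0, 2.5, 4.0]

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pred = [s >= threshold for s in scores]
    tp = sum(t and p for t, p in zip(y_true, pred))
    fp = sum((not t) and p for t, p in zip(y_true, pred))
    fn = sum(t and (not p) for t, p in zip(y_true, pred))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall

target_precision = 0.75  # illustrative target
chosen_threshold = None
# Try candidate thresholds from lowest to highest and stop at the first match.
for candidate in sorted(scores):
    p, r = precision_recall_at(candidate)
    if p >= target_precision:
        chosen_threshold = candidate
        break

chosen_precision, chosen_recall = precision_recall_at(chosen_threshold)
print("threshold:", chosen_threshold)
print("precision:", chosen_precision)
print("recall:   ", chosen_recall)
```

At the lowest candidate threshold every instance is predicted positive, so recall is 1.0 but precision is only 0.5; raising the threshold trades recall for precision.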

**5. The ROC Curve**

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

In [None]:
# Plot tpr versus fpr
import matplotlib.pyplot as plt
plt.plot(fpr, tpr, linewidth=2, label="SGDClassifier")
plt.plot([0, 1], [0, 1], 'k--', label="Purely random classifier") # Dashed diagonal
plt.grid()
plt.legend()
plt.xlabel("False Positive Rate (1-Specificity)")
plt.ylabel("True Positive Rate (Recall)")
plt.show()

In [None]:
# Measure the area under the curve (AUC)
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
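
One way to read the AUC is as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below computes it that way on made-up scores; it is a hand calculation for intuition, not the routine `roc_auc_score` uses internally.

```python
# Made-up labels and decision scores for eight instances.
y_true = [False, False, True, False, True, True, False, True]
scores = [-3.0, -1.5, -1.0, 0.5, 1.0, 2.0, 2.5, 4.0]

pos_scores = [s for s, t in zip(scores, y_true) if t]
neg_scores = [s for s, t in zip(scores, y_true) if not t]

# Count positive/negative pairs where the positive is ranked higher.
wins = 0.0
for ps in pos_scores:
    for ns in neg_scores:
        if ps > ns:
            wins += 1.0
        elif ps == ns:
            wins += 0.5  # a tie counts as half a win

auc = wins / (len(pos_scores) * len(neg_scores))
print("AUC:", auc)
```

A perfect ranking gives 1.0, and a purely random one gives about 0.5, which matches the dashed diagonal in the ROC plot.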