In [1]:
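import numpy as np
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays rather than a pandas DataFrame/Series.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist['data'], mnist['target']
y = y.astype(np.uint8)  # convert the labels to small integers

# MNIST is pre-split: the first 60,000 images are the training set, the rest the test set.
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]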


In [2]:
# Start by training a binary classifier for 5 / not-5.
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

In [3]:
from sklearn.linear_model import SGDClassifier
# SGDClassifier defaults to hinge loss with an L2 penalty, i.e. a linear SVM trained
# by stochastic gradient descent. Fix random_state so SGD gives reproducible results.
# First pass: just train the classifier on all the training data. We'll do CV in the next step.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

In [5]:
# Run 3-fold CV on the SGD-trained linear SVM, scoring by accuracy.
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')

array([0.95035, 0.96035, 0.9604 ])

Note how cross-validation scores are high with the binary classification problem. Compare this, though,
to a dummy classifier that always outputs not-5:

In [6]:
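from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    # Baseline that ignores its input and always predicts "not 5".
    def fit(self, X, y=None):
        return self  # nothing to learn; return self per the estimator convention
    def predict(self, X):
        # 1-D array of False, one prediction per row of X
        return np.zeros(len(X), dtype=bool)

never5_cls = Never5Classifier()
cross_val_score(never5_cls, X_train, y_train_5, cv=3, scoring='accuracy')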

array([0.91125, 0.90855, 0.90915])

The dummy classifier also scores about 91%, so the linear model's 95-96% is only a modest improvement over never predicting 5. The point
here is that accuracy can be a misleading measure, especially on a skewed dataset. The full ten-digit dataset is roughly balanced, with 5
showing up about 10% of the time, but the 5 / not-5 target we built from it is heavily skewed: about 90% of the labels are not-5.
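To make the skew concrete, here is a minimal sketch on toy labels rather than the real MNIST targets. The `digits` list is a made-up stand-in in which every digit appears equally often; it only illustrates why a balanced ten-class problem turns into a 10/90 binary one, and why "always predict not-5" then scores about 90%.

```python
# Toy stand-in for the digit labels (an assumption, not the MNIST data):
# each class 0-9 appears exactly 100 times.
digits = [i % 10 for i in range(1000)]

# Binary target, built the same way as y_train_5.
is_five = [d == 5 for d in digits]
positive_rate = sum(is_five) / len(is_five)

# A classifier that always answers "not 5".
always_negative = [False] * len(is_five)
accuracy = sum(p == t for p, t in zip(always_negative, is_five)) / len(is_five)

print("fraction of 5s:           ", positive_rate)
print("always-not-5 accuracy:    ", accuracy)
```

The accuracy of the constant predictor is just one minus the positive rate, so any classifier on this kind of target has to be judged against that floor, not against 50%.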

Instead, you can use cross_val_predict, which runs k-fold CV and returns, for every training instance, the prediction made by a model that never saw that instance during training. Those out-of-fold predictions can then be used to build a confusion matrix.
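As a rough sketch of what that means, the loop below rebuilds out-of-fold predictions by hand on a small synthetic dataset and compares them with cross_val_predict. The names `X_demo`, `y_demo`, and `manual_pred` are illustrative, and LogisticRegression is swapped in for the SGD classifier only to keep the toy example small and deterministic. It assumes the default behaviour that an integer `cv` with a classifier means unshuffled stratified folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Small synthetic binary problem (a stand-in, not MNIST).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Manual version: each instance is predicted by a model trained on the other folds.
manual_pred = np.empty_like(y_demo)
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_demo, y_demo):
    fold_clf = clone(clf).fit(X_demo[train_idx], y_demo[train_idx])
    manual_pred[test_idx] = fold_clf.predict(X_demo[test_idx])

# Library version.
library_pred = cross_val_predict(clf, X_demo, y_demo, cv=3)

print("same predictions:", np.array_equal(manual_pred, library_pred))
```

The result has one prediction per training instance, which is why it can be compared directly against the full label vector.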

In [7]:
from sklearn.model_selection import cross_val_predict
y_train_predict = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

In [8]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_predict)

array([[53892,   687],
       [ 1891,  3530]])

How to read the confusion matrix:
Rows correspond to actual class, columns correspond to predicted class.
          predicted 0   predicted 1
actual 0  TrueNeg       FalsePos
actual 1  FalseNeg      TruePos

Meaning with this predictor, we have:
TrueNeg: 53892, FalsePos: 687
FalseNeg: 1891, TruePos: 3530

Note this means a perfect predictor would have zeros outside the diagonal.
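As a sanity check, the four cells can be tied back to the accuracy numbers from earlier. This sketch copies the counts from the matrix above by hand instead of recomputing them, so it runs without the dataset.

```python
# Cell counts copied from the confusion matrix above.
tn, fp = 53892, 687    # actual 0 row
fn, tp = 1891, 3530    # actual 1 row

total = tn + fp + fn + tp

# Accuracy counts everything on the diagonal as correct.
accuracy = (tn + tp) / total

# An always-not-5 predictor gets every actual negative right and nothing else.
always_negative_accuracy = (tn + fp) / total

print("total instances:        ", total)
print("SGD accuracy:           ", round(accuracy, 5))
print("always-not-5 accuracy:  ", round(always_negative_accuracy, 5))
```

Both values line up with the mean of the three fold scores reported earlier for sgd_clf and for Never5Classifier, which shows how much of the headline accuracy comes from the true negatives alone.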

Precision and recall are more informative than accuracy here. Precision is TruePos / (TruePos + FalsePos); recall is TruePos / (TruePos + FalseNeg):

In [10]:
from sklearn.metrics import precision_score, recall_score
print(precision_score(y_train_5, y_train_predict))
print(recall_score(y_train_5, y_train_predict))

0.8370879772350012
0.6511713705958311


This indicates that about 84% of the 1 predictions are correct (precision), and about 65% of the true 1s are predicted as such (recall). The F1 score combines precision and recall into one metric: their harmonic mean.

In [11]:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_predict)

0.7325171197343846
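For reference, all three metrics can be recomputed by hand from the confusion matrix counts. This is a minimal sketch with the counts typed in from the output above; it should agree with the sklearn results to within floating-point rounding.

```python
# Counts copied from the confusion matrix output.
tn, fp, fn, tp = 53892, 687, 1891, 3530

# Precision: of everything predicted as 5, how much really was a 5.
precision = tp / (tp + fp)

# Recall: of all the real 5s, how many were found.
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
```

Because the harmonic mean is pulled toward the smaller of the two values, F1 only comes out high when precision and recall are both high.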