In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Introduction 
This is the digits dataset, the so-called "Hello World" of Machine Learning. Most of the code here is from Aurelien Geron's book *Hands-on Machine Learning*, but I've added some for my own learning. The dataset is different from the one in his book: it comes from sklearn's load_digits() function and has far fewer samples and features than the one Geron uses.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

`digits` is an sklearn Bunch object that contains both the data and metadata. The **data** and **target** attributes are both numpy arrays. The Bunch object extends the conventional python dictionary, so it has keys and values.
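
As a small illustration of the dictionary behaviour, the sketch below reads the same arrays by key and by attribute. The names `same_data` and `same_target` are only introduced here for the check.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# A Bunch is a dict subclass, so key access and attribute access
# return the very same objects.
same_data = digits["data"] is digits.data
same_target = digits["target"] is digits.target
print(same_data, same_target)

# Dictionary-style iteration works as well.
for key in digits.keys():
    print(key, type(digits[key]).__name__)
```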

In [None]:
print(digits.DESCR)

In [None]:
digits.keys()

Data has 1797 samples with 64 features that represent pixel intensities between 0 and 16.
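
A quick way to confirm the shape and the value range is to ask the array directly. This is only a sanity-check sketch; the variable names are made up for the example.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Rows are samples, columns are the flattened pixel features.
n_samples, n_features = digits.data.shape

# Smallest and largest pixel intensity across the whole dataset.
lowest, highest = digits.data.min(), digits.data.max()
print(n_samples, n_features, lowest, highest)
```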

In [None]:
digits.data.shape

In [None]:
len(digits.data[2])

Target consists of the 1797 labels, for whichever digit the sample represents.

In [None]:
digits.target.shape

In [None]:
# and they're in order:
for n in range(14):
    print(digits.target[n])

### Sample images
The Bunch object also contains an `images` attribute that stores the images as 8x8 arrays of pixel intensities.
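
The 64 features in `data` are these 8x8 images flattened row by row. A minimal check, with `flattened` and `matches` as illustrative names:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten every 8x8 image into a row of 64 values.
flattened = digits.images.reshape(len(digits.images), -1)

# Compare the flattened images with the feature matrix.
matches = np.array_equal(flattened, digits.data)
print(flattened.shape, matches)
```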

In [None]:
digits.images.shape

In [None]:
import matplotlib.pyplot as plt

def showDigitImage(n):
    # draw sample n as an 8x8 image with the axis ticks hidden
    axes = plt.subplot()
    axes.imshow(digits.images[n], cmap='binary')
    axes.set_xticks([])
    axes.set_yticks([])

In [None]:
showDigitImage(35)

In [None]:
digits.target[35]

## Prep

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=13)

In [None]:
X_train.shape

# Modelling
### First: A Binary Classifier

This entire subsection is almost identical to the section in Geron's book. 

We're going to train a simple binary classifier to try and identify the 7's. A binary classifier is one that decides whether a sample is or is not the target class. Geron uses the [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html). First we have to turn the target vectors into boolean vectors that mark which labels are 7's.

In [None]:
y_train_7 = (y_train == 7)
y_test_7 = (y_test == 7)

In [None]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(random_state=10) # fix the seed for reproducible results, since the SGD classifier relies on randomness
sgd.fit(X_train, y_train_7)

And we can check if it correctly identifies our 5 from above as not being a 7:

In [None]:
sgd.predict([digits.data[35]])

And whether it correctly predicts a 7, since we know the first few samples are in order:

In [None]:
sgd.predict([digits.data[7]])

### Measuring the Binary Classifier with Cross-Validation
Cross-validation splits the training set into subsets called *folds* and trains and evaluates the model once per fold. With *n* folds the model is trained *n* times; each time it is trained on *n-1* folds and evaluated on the one fold that was held out.
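
To make the mechanics concrete, here is a minimal sketch of such a loop written by hand. It assumes stratified folds without shuffling (`StratifiedKFold`) and reuses the split and `random_state` values from this notebook; `fold_scores` is just an illustrative name.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=13)
y_train_7 = (y_train == 7)

base_model = SGDClassifier(random_state=10)
skfolds = StratifiedKFold(n_splits=5)  # assumption: stratified, unshuffled folds

fold_scores = []
for train_idx, val_idx in skfolds.split(X_train, y_train_7):
    model = clone(base_model)                        # fresh, unfitted copy per fold
    model.fit(X_train[train_idx], y_train_7[train_idx])
    preds = model.predict(X_train[val_idx])          # evaluate on the held-out fold
    fold_scores.append(np.mean(preds == y_train_7[val_idx]))

print(fold_scores)
```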

In [None]:
from sklearn.model_selection import cross_val_score

# give it 5 folds
scores = cross_val_score(sgd, X_train, y_train_7, cv=5, scoring="accuracy")
scores

In [None]:
print(f'Average over 5 folds: {100*scores.mean():.2f}%')

In [None]:
print(f'Accuracy Standard Deviation: {100*scores.std():.2f}%')

98.61% looks very high, but only about one sample in ten is a 7, so accuracy alone can be misleading. For comparison, we can design a dumb classifier that extends sklearn's BaseEstimator and guesses not-a-7 every time, and see that it still scores well:

In [None]:
from sklearn.base import BaseEstimator

class Never7Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self
    
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)  # 1-D array of False: never predicts a 7

In [None]:
never7 = Never7Classifier()
cross_val_score(never7, X_train, y_train_7, cv=5, scoring="accuracy")

We can get predictions out of cross-validation as well. Instead of returning evaluation scores, `cross_val_predict` returns the prediction made for each sample while that sample was in the held-out fold.

In [None]:
from sklearn.model_selection import cross_val_predict

In [None]:
y_train_pred = cross_val_predict(sgd, X_train, y_train_7, cv=5)

**Confusion Matrix** for our binary classifier:

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_7, y_train_pred)

* 1269 true negatives: correctly classified as not-a-7
* 15 false positives: incorrectly classified as 7
* 5 false negatives: 7's that were incorrectly classified as not-a-7
* 148 true positives: 7's correctly classified as 7
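
These four counts are enough to recover the accuracy by hand, and to see what a classifier that always answers not-a-7 would score on the same samples. The sketch below is plain arithmetic on the counts listed above; the variable names are illustrative.

```python
# Counts from the confusion matrix above.
tn, fp, fn, tp = 1269, 15, 5, 148
total = tn + fp + fn + tp

# Accuracy: share of all samples that were classified correctly.
accuracy = (tn + tp) / total

# Always guessing not-a-7 gets every actual negative right
# and every actual 7 wrong.
baseline_accuracy = (tn + fp) / total

print(f"accuracy: {100 * accuracy:.2f}%")
print(f"always-not-7 baseline: {100 * baseline_accuracy:.2f}%")
```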

### Precision, Recall and the tradeoff
Precision, *p*: accuracy of the positive predictions. $ p = \frac{TP}{TP+FP}$ 

Recall, *r*: ratio of positives that are correctly detected. $ r = \frac{TP}{TP+FN}$ 
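
As a worked example, the two formulas can be applied to the confusion-matrix counts reported earlier (15 false positives, 5 false negatives, 148 true positives). The F1 score is included as the harmonic mean of the two, which is the standard definition; the variable names are illustrative.

```python
# Counts taken from the confusion matrix of the binary classifier.
fp, fn, tp = 15, 5, 148

precision = tp / (tp + fp)   # how many predicted 7's really are 7's
recall = tp / (tp + fn)      # how many actual 7's were found

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```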

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
precision_score(y_train_7, y_train_pred)

In [None]:
recall_score(y_train_7, y_train_pred)

In [None]:
f1_score(y_train_7, y_train_pred)

There is an unavoidable trade-off between precision and recall. The SGD classifier makes its decisions according to a *decision function*. Each instance gets a score from this function and is declared a 7 or not-a-7 depending on whether that score is above or below some threshold. We cannot set this threshold directly, but we can call the decision_function method to get the scores and then make predictions with any threshold we like:

In [None]:
aDigit = digits.data[7]

In [None]:
y_scores = sgd.decision_function([aDigit])
y_scores

In [None]:
threshold = 0
y_predicted_aDigit = (y_scores > threshold)
y_predicted_aDigit

In [None]:
threshold = 5550
y_predicted_aDigit = (y_scores > threshold)
y_predicted_aDigit

So raising the threshold decreases the recall, because the classifier misses more true 7's. To choose a threshold that balances precision and recall, we can use cross_val_predict again, but have it return decision scores rather than predictions, and then plot precision and recall against the threshold:

In [None]:
y_scores = cross_val_predict(sgd, X_train, y_train_7, cv=5, method="decision_function")
y_scores

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_7, y_scores)

In [None]:
def plot_precision_recall_threshold(precisions, recalls, thresholds):
    # precisions and recalls have one more entry than thresholds, so drop the last
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center right")
    plt.grid(True)

plot_precision_recall_threshold(precisions, recalls, thresholds)
plt.show()
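
The trade-off can also be reproduced by hand on a toy example. The scores and labels below are made up for illustration and have nothing to do with the digits data; `precision_recall_at` is a hypothetical helper, not an sklearn function.

```python
import numpy as np

# Hypothetical decision scores and true labels, for illustration only.
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
labels = np.array([False, False, True, False, True, True])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for scores > threshold."""
    predicted = scores > threshold
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    fn = np.sum(~predicted & labels)
    # With no positive predictions, precision is taken to be 1 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for t in (0.0, 1.5, 3.0):
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On this toy data a higher threshold never lowers precision and steadily lowers recall; on real data precision can dip locally as the threshold rises, while recall can only fall or stay flat.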

### The ROC Curve
The receiver operating characteristic (ROC) curve is another common tool for evaluating a binary classifier. It plots the true positive rate (i.e., the recall) against the false positive rate, which is the ratio of negative instances that are incorrectly classified as positive.
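
A ROC curve is often summarised by the area under it (AUC), where 1.0 is a perfect ranking and 0.5 is what random guessing gives. The sketch below uses four made-up samples rather than the digits scores, and `roc_auc_score` is an addition that the rest of this notebook does not use.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: here 3 of the 4 positive/negative pairs
# are ranked correctly, which gives 0.75.
auc = roc_auc_score(y_true, y_score)
print(fpr, tpr, auc)
```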

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_7, y_scores)

In [None]:
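def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal: a purely random classifier
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Recall)')
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()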

## Other Models
Now we're going to work on the entire dataset and see if we can train a model that can identify all of the digits. First, with a k-nearest neighbors classifier:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn=KNeighborsClassifier()
knn.fit(X_train, y_train)

In [None]:
predicted = knn.predict(X_test)
test = y_test

In [None]:
# collect the (predicted, expected) pairs the knn classifier got wrong
wrong = [(p,e) for (p,e) in zip(predicted, test) if p != e]
wrong

## Evaluation
Examining the estimator's score and the confusion matrix. 

For classification models, the score method returns the mean accuracy on the given data.

In [None]:
print(f'Score: {100*knn.score(X_test, y_test):.2f} %')

The confusion matrix tallies every prediction by true class (rows) and predicted class (columns), so it shows the correct and incorrect predictions for each class.
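
A tiny three-class example makes the layout easier to read. The labels below are invented for illustration: rows are true classes, columns are predicted classes, and the diagonal holds the correct predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, for illustration only.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

toy_cm = confusion_matrix(y_true, y_pred)
print(toy_cm)

# Correct predictions sit on the diagonal.
n_correct = int(toy_cm.trace())
print(n_correct)
```

Here the single off-diagonal entry in row 1, column 2 is the one sample of class 1 that was predicted as class 2.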

In [None]:
cm = confusion_matrix(test, predicted)

In [None]:
cm

In [None]:
# confusion matrices look real nice as seaborn heatmaps
import seaborn as sns

In [None]:
# wrap the confusion matrix in a pandas DataFrame so the heatmap axes are labelled with the digits
cm_df = pd.DataFrame(cm, index=range(10), columns=range(10))

axes = sns.heatmap(cm_df, annot=True, cmap="nipy_spectral_r")


The classification report produces a table of classification metrics.

In [None]:
from sklearn.metrics import classification_report
names = [str(digit) for digit in digits.target_names]

print(classification_report(test, predicted, target_names=names))

In [None]:
scores = cross_val_score(knn, X_train, y_train, cv=5)
scores

In [None]:
print(f'Average over 5 folds: {100*scores.mean():.2f}%')

In [None]:
print(f'Accuracy Standard Deviation: {100*scores.std():.2f}%')

### HyperParameter Tuning: Finding the best *k*
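
One way to search for *k* is scikit-learn's `GridSearchCV`, which cross-validates every candidate value and keeps the best one. The sketch below assumes the same train/test split as earlier and the same odd values of *k* that the manual loop in the next cell tries; `param_grid`, `search` and `best_k` are names introduced only for this example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=13)

# Candidate values: the odd k from 1 to 19.
param_grid = {"n_neighbors": list(range(1, 20, 2))}

# 5-fold cross-validation for every candidate, scored by accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
print(f"best k = {best_k}, mean CV accuracy = {100 * search.best_score_:.2f}%")
```

The manual loop has the advantage of printing and plotting every score, which makes it easier to see how flat the accuracy is across neighbouring values of *k*.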


In [None]:
def analyzeK(X_train, y_train):
    # lists to plot the results
    kvalues = []
    av_scores = []
    
    # loop over the odd k values from 1 to 19
    for k in range(1, 20, 2):
        knn =  KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X_train, y_train, cv=5)
        print(f'{k}: mean accuracy = {100*scores.mean():.2f}% -- standard deviation = {100*scores.std():.2f}%')
        kvalues.append(k)
        av_scores.append(100*scores.mean())
        
    plt.figure(figsize=(15,10))
    kplot = sns.barplot(x=kvalues, y=av_scores)
    
    kplot.set(ylim=(96,100))
    

In [None]:
analyzeK(X_train, y_train)

to be continued....