# Chapter 3 – Classification

_This notebook contains sample code from chapter 3.  It has been modified for CSC4505._

# Setup

This project requires Python 3.7 or above:

In [None]:
import sys

assert sys.version_info >= (3, 7)

It also requires Scikit-Learn ≥ 1.0.1:

In [None]:
from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

Just like in the previous chapter, let's define the default font sizes to make the figures prettier:

In [None]:
import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

And let's create the `images/classification` folder (if it doesn't already exist), and define the `save_fig()` function, which is used throughout this notebook to save the figures in high resolution:

In [None]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "classification"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# MNIST

In [None]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)

In [None]:
# extra code – it's a bit too long
print(mnist.DESCR)

In [None]:
X, y = mnist.data, mnist.target
X

In [None]:
X.shape

In [None]:
28 * 28

In [None]:
y

In [None]:
y.shape

In [None]:
import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

some_digit = X[0]
plot_digit(some_digit)
plt.show()

In [None]:
# How can I check what digit this is according to this dataset?





That digit did not look high quality.  Let's check more of them to see how good/bad these digit images are.

In [None]:
plt.figure(figsize=(9, 9))
for idx, image_data in enumerate(X[:100]):
    plt.subplot(10, 10, idx + 1)
    plot_digit(image_data)
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()

These look pretty good. Now we split the data into training and test sets using the standard split for this dataset: the first 60,000 images for training and the remaining 10,000 for testing.

NOTE: The dataset comes to us having already been shuffled.

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# Training a Binary Classifier

Which is an easier task: identifying which digit an image shows or identifying if the image shows a 5?



The first task is "multiclass classification"; the second is "binary classification". Some ML algorithms are designed for only one of these tasks, while others can handle either task in a similar way.

In [None]:
# Let's transform this problem into a binary classification task
y_train_5 = (y_train == '5')  # True for all 5s, False for all other digits
y_test_5 = (y_test == '5')

#### Why did we not modify the x values at all?

Now let's train a binary classifier.  

Note: We will discuss stochastic gradient descent in the next notebook.

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

In [None]:
sgd_clf.predict([some_digit])

In [None]:
# Make predictions for the first 100 digits



# Compare to the correct labels for these digits





---

# Performance Measures

## Measuring Accuracy Using Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

In [None]:
# If you want to change anything in the cross validation process, modify this code

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)  # add shuffle=True if the dataset is not
                                       # already shuffled
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))

### Is this a good classifier?

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)  # add shuffle=True if the dataset is not
                                       # already shuffled
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    
    # Modify this part of the code to test a much simpler classifier that will come close in performance
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
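
Accuracy can look impressive when one class is rare. The sketch below is a toy illustration, not MNIST: the label array and the "always predict False" baseline are both made up for this example.

```python
import numpy as np

# Toy labels (an assumption for illustration): 10 positives out of 100
y_true = np.zeros(100, dtype=bool)
y_true[:10] = True

# Hypothetical baseline that never predicts the positive class
y_pred = np.zeros_like(y_true)

# Fraction of predictions that match the labels
accuracy = (y_pred == y_true).mean()
print(accuracy)
```

The baseline never finds a single positive example, yet its accuracy equals the share of negatives in the data. Keep this in mind when judging the accuracy scores above.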

## Confusion Matrix

<img src="https://miro.medium.com/v2/resize:fit:1110/1*kH4S_ronPD0R4aL05fhwrA.png">

In [None]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_train_5, y_train_pred)
cm

In [None]:
y_train_perfect_predictions = y_train_5  # pretend we reached perfection
confusion_matrix(y_train_5, y_train_perfect_predictions)
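
To see where the four cells of a binary confusion matrix come from, here is a minimal sketch that counts them by hand with NumPy. The labels and predictions are toy values chosen for illustration.

```python
import numpy as np

# Toy labels and predictions (illustrative values, not from MNIST)
y_true = np.array([False, False, False, True, True, True, True, False])
y_pred = np.array([False, True, False, True, False, True, True, False])

tn = np.sum(~y_true & ~y_pred)  # true negatives
fp = np.sum(~y_true & y_pred)   # false positives
fn = np.sum(y_true & ~y_pred)   # false negatives
tp = np.sum(y_true & y_pred)    # true positives

# Same layout as confusion_matrix: rows = actual class, columns = predicted class
cm_toy = np.array([[tn, fp],
                   [fn, tp]])
print(cm_toy)
```

With this layout, `cm_toy[1, 1]` holds the true positives and `cm_toy[0, 1]` the false positives, which is the indexing used in the precision and recall cells.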

## Precision and Recall

In [None]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # == 3530 / (687 + 3530)

In [None]:
# extra code – this cell also computes the precision: TP / (FP + TP)
cm[1, 1] / (cm[0, 1] + cm[1, 1])

#### How could I design a system that gets high precision?




In [None]:
recall_score(y_train_5, y_train_pred)  # == 3530 / (1891 + 3530)

In [None]:
# extra code – this cell also computes the recall: TP / (FN + TP)
cm[1, 1] / (cm[1, 0] + cm[1, 1])

#### How could I design a system that gets high recall?

In [None]:
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)

In [None]:
# extra code – this cell also computes the f1 score
cm[1, 1] / (cm[1, 1] + (cm[1, 0] + cm[0, 1]) / 2)
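
The three metrics can also be computed directly from the counts. This sketch uses the counts quoted in the comments of the cells above (your own run may give slightly different numbers) and checks that the harmonic-mean form of F1 agrees with the form used in the "extra code" cell.

```python
# Counts taken from the comments in the cells above; your run may differ
tp, fp, fn = 3530, 687, 1891

precision = tp / (tp + fp)  # how often a "True" prediction is right
recall = tp / (tp + fn)     # how many actual "True" examples we found

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form: TP / (TP + (FN + FP) / 2)
f1_alt = tp / (tp + (fn + fp) / 2)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because it is a harmonic mean, F1 is only high when both precision and recall are high.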

## Precision/Recall Trade-off

Sometimes we value one over the other. When might we value precision (being right when we predict "True")? When might we value recall (missing few of the actual "True" examples)?

Internally, the model assigns each example a score and predicts "True" when that score exceeds a threshold. By raising or lowering the threshold, we can make it predict "True" less or more often.

In [None]:
y_scores = sgd_clf.decision_function([some_digit])
y_scores

In [None]:
threshold = 0
y_some_digit_pred = (y_scores > threshold)

In [None]:
y_some_digit_pred

In [None]:
# extra code – just shows that y_scores > 0 produces the same result as
#              calling predict()
y_scores > 0

In [None]:
threshold = 3000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
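
The same effect can be seen on a handful of examples. The sketch below uses made-up scores and labels (not classifier output) and a hypothetical helper, `precision_recall_at`, to compare two thresholds.

```python
import numpy as np

# Made-up scores and labels, for illustration only
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
labels = np.array([False, False, True, False, True, True])

def precision_recall_at(threshold):
    """Precision and recall when predicting True for scores above threshold."""
    pred = scores > threshold
    tp = np.sum(pred & labels)
    precision = tp / pred.sum()
    recall = tp / labels.sum()
    return precision, recall

for t in (0.0, 1.5):
    print(t, precision_recall_at(t))
```

On this toy data, raising the threshold removes a false positive but also drops a true positive, so precision goes up while recall goes down. Recall can never increase when the threshold is raised; precision usually rises but is not guaranteed to.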

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

In [None]:
plt.figure(figsize=(8, 4))  # extra code – it's not needed, just formatting
plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")


idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-50000, 50000, 0, 1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="center right")

plt.show()

In [None]:
import matplotlib.patches as patches  # extra code – for the curved arrow

plt.figure(figsize=(6, 5))  # extra code – not needed, just formatting

plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve")


plt.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
plt.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
plt.plot([recalls[idx]], [precisions[idx]], "ko",
         label="Point at threshold 3,000")
plt.gca().add_patch(patches.FancyArrowPatch(
    (0.79, 0.60), (0.61, 0.78),
    connectionstyle="arc3,rad=.2",
    arrowstyle="Simple, tail_width=1.5, head_width=8, head_length=10",
    color="#444444"))
plt.text(0.56, 0.62, "Higher\nthreshold", color="#333333")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.axis([0, 1, 0, 1])
plt.grid()
plt.legend(loc="lower left")

plt.show()

In [None]:
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
threshold_for_90_precision

In [None]:
y_train_pred_90 = (y_scores >= threshold_for_90_precision)

In [None]:
precision_score(y_train_5, y_train_pred_90)

In [None]:
recall_at_90_precision = recall_score(y_train_5, y_train_pred_90)
recall_at_90_precision

<img src="https://miro.medium.com/v2/resize:fit:1400/1*8M8A63NsQnK87ELChDkefg.png">

---

## The ROC Curve

The receiver operating characteristic (ROC) curve plots the true positive rate (recall) against the false positive rate.

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
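
Each point on the ROC curve corresponds to one threshold. As a rough sketch of what that means, the toy example below (made-up scores and labels, with a hypothetical helper `roc_point`) computes the two rates by hand for a few thresholds.

```python
import numpy as np

# Made-up scores and labels, for illustration only
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
labels = np.array([False, False, True, False, True, True])

def roc_point(threshold):
    """Return (FPR, TPR) when predicting True for scores >= threshold."""
    pred = scores >= threshold
    tpr = np.sum(pred & labels) / labels.sum()      # true positive rate (recall)
    fpr = np.sum(pred & ~labels) / (~labels).sum()  # false positive rate (fall-out)
    return fpr, tpr

for t in (3.0, 1.0, -5.0):
    print(t, roc_point(t))
```

A very low threshold predicts "True" for everything and lands in the top-right corner of the plot; a very high threshold moves toward the bottom-left corner.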

In [None]:
idx_for_threshold_at_90 = (thresholds <= threshold_for_90_precision).argmax()
tpr_90, fpr_90 = tpr[idx_for_threshold_at_90], fpr[idx_for_threshold_at_90]

plt.figure(figsize=(6, 5))  # extra code – not needed, just formatting
plt.plot(fpr, tpr, linewidth=2, label="ROC curve")
plt.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
plt.plot([fpr_90], [tpr_90], "ko", label="Threshold for 90% precision")


plt.gca().add_patch(patches.FancyArrowPatch(
    (0.20, 0.89), (0.07, 0.70),
    connectionstyle="arc3,rad=.4",
    arrowstyle="Simple, tail_width=1.5, head_width=8, head_length=10",
    color="#444444"))
plt.text(0.12, 0.71, "Higher\nthreshold", color="#333333")
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.grid()
plt.axis([0, 1, 0, 1])
plt.legend(loc="lower right", fontsize=13)

plt.show()

We often report the area under the ROC curve as an indicator of the quality of the system.

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)
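
One way to read the area under the ROC curve: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this reading on made-up scores by counting pairs directly; it illustrates the idea and is not how scikit-learn computes the value.

```python
import numpy as np

# Made-up scores and labels, for illustration only
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
labels = np.array([False, False, True, False, True, True])

pos = scores[labels]
neg = scores[~labels]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]

# Fraction of pairs ranked correctly; ties count as half
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
print(auc)
```

A score of 1.0 means every positive outranks every negative, while 0.5 is what a random ranking would give.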

Let's say another group trained a different model and we want to compare ours to theirs.  Which looks better when comparing PR curves?

<img src="https://media.licdn.com/dms/image/v2/D4D12AQFPp1KLNQqPgw/article-cover_image-shrink_600_2000/article-cover_image-shrink_600_2000/0/1711907103441?e=2147483647&v=beta&t=H75OGttfjj68X8vqlrDZ1HNwuAmtzha4sDl_K36Zd0A">

---

# Multiclass Classification

Now what if we want to identify which digit an image shows, rather than just whether it is a "5"?  How could we turn our binary classifier (or many of them) into a multiclass classifier?

**Warning:** the following cells may take a few minutes each to run:

In [None]:
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train[:2000], y_train[:2000])
sgd_clf.predict([some_digit])

In [None]:
sgd_clf.decision_function([some_digit]).round()

If you train on the entire dataset, you get the following output instead:

\>>> sgd_clf = SGDClassifier(random_state=42)<br>
\>>> sgd_clf.fit(X_train, y_train)<br>
\>>> sgd_clf.predict([some_digit])<br>
array(['3'], dtype='<U1')


\>>> sgd_clf.decision_function([some_digit]).round()<br>
array([[-31893., -34420., -9531., 1824., -22320., -1386., -26189.,
 -16148., -4604., -12051.]])

#### How many classifiers did it train?  What digits scored the highest?
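
For a multiclass target, `SGDClassifier` trains one binary classifier per class (one-versus-the-rest) and predicts the class with the highest score. As a small sketch, the code below takes the scores quoted above and ranks the classes. The `classes` array is an assumption standing in for `sgd_clf.classes_`.

```python
import numpy as np

# Scores copied from the output quoted above (one score per class)
scores = np.array([[-31893., -34420., -9531., 1824., -22320.,
                    -1386., -26189., -16148., -4604., -12051.]])

# Assumed to match sgd_clf.classes_ (the digits '0' to '9' in order)
classes = np.array([str(d) for d in range(10)])

best = scores.argmax(axis=1)
print(classes[best])  # predicted class: the one with the highest score

# Classes sorted from highest to lowest score
ranked = classes[np.argsort(scores[0])[::-1]]
print(ranked[:3])
```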

In [None]:
from sklearn.multiclass import OneVsOneClassifier
ovo_sgd = OneVsOneClassifier(SGDClassifier())
cross_val_score(ovo_sgd, X_train[:2000], y_train[:2000], cv=3, scoring="accuracy")

The code above is training classifiers for 0 vs 1, 0 vs 2, 0 vs 3, ..., 1 vs 2, 1 vs 3, ...<br>
#### How many classifiers does it train?
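
One way to check your answer is to enumerate the pairs. This is a counting sketch only; it does not inspect the fitted `OneVsOneClassifier`.

```python
from itertools import combinations

digits = [str(d) for d in range(10)]

# One-versus-one needs one binary classifier per unordered pair of classes
pairs = list(combinations(digits, 2))

print(len(pairs))  # N * (N - 1) / 2 with N = 10
print(pairs[:3])
```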

In [None]:
cross_val_score(sgd_clf, X_train[:2000], y_train[:2000], cv=3, scoring="accuracy")

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype("float64"))
cross_val_score(sgd_clf, X_train_scaled[:2000], y_train[:2000], cv=3, scoring="accuracy")

In [None]:
cross_val_score(ovo_sgd, X_train_scaled[:2000], y_train[:2000], cv=3, scoring="accuracy")
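
`StandardScaler` shifts each feature to zero mean and rescales it to unit variance. The sketch below reproduces that idea with plain NumPy on a tiny made-up matrix. The first column is constant, like the always-blank border pixels in MNIST, so its scale is set to 1 to avoid dividing by zero.

```python
import numpy as np

# Tiny made-up feature matrix: 3 samples, 3 features
X_toy = np.array([[0.0, 10.0, 0.0],
                  [0.0, 20.0, 4.0],
                  [0.0, 30.0, 8.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero

X_toy_scaled = (X_toy - mean) / std
print(X_toy_scaled)
```

After scaling, the features share a comparable range, which tends to help gradient-based models such as `SGDClassifier`.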

# Error Analysis

**Warning:** the following cell will take a few minutes to run:

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled[:2000], y_train[:2000], cv=3)
plt.rc('font', size=9)  # extra code – make the text smaller
ConfusionMatrixDisplay.from_predictions(y_train[:2000], y_train_pred[:2000])
plt.show()

In [None]:
plt.rc('font', size=10)  # extra code
ConfusionMatrixDisplay.from_predictions(y_train[:2000], y_train_pred[:2000],
                                        normalize="true", values_format=".0%")
plt.show()

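Let's focus on what we got wrong. Weighting each sample by whether it was misclassified gives correct predictions a weight of zero, so only the errors remain in the matrix: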

In [None]:
sample_weight = (y_train_pred[:2000] != y_train[:2000])
plt.rc('font', size=10)  # extra code
ConfusionMatrixDisplay.from_predictions(y_train[:2000], y_train_pred[:2000],
                                        sample_weight=sample_weight,
                                        normalize="true", values_format=".0%")
plt.show()
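
To make the two transformations concrete, here is a minimal NumPy sketch on a made-up 3-class confusion matrix. It normalizes by row, then zeroes out the correct predictions before normalizing again.

```python
import numpy as np

# Toy 3-class confusion matrix (rows = true class, columns = predicted class)
cm_toy = np.array([[8, 1, 1],
                   [2, 6, 2],
                   [0, 5, 5]])

# Like normalize="true": divide each row by its total so rows sum to 1
row_norm = cm_toy / cm_toy.sum(axis=1, keepdims=True)

# Giving correct predictions zero weight empties the diagonal
errors = cm_toy.copy()
np.fill_diagonal(errors, 0)
error_row_norm = errors / errors.sum(axis=1, keepdims=True)

print(row_norm)
print(error_row_norm)
```

In the error-only matrix, each row shows how that class's mistakes are distributed across the other classes, which makes the most common confusions stand out.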

Let's combine these plots into a couple of side-by-side figures:

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))
plt.rc('font', size=9)
ConfusionMatrixDisplay.from_predictions(y_train[:2000], y_train_pred[:2000], ax=axs[0])
axs[0].set_title("Confusion matrix")
plt.rc('font', size=10)
ConfusionMatrixDisplay.from_predictions(y_train[:2000], y_train_pred[:2000], ax=axs[1],
                                        normalize="true", values_format=".0%")
axs[1].set_title("CM normalized by row")

plt.show()

In [None]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))
plt.rc('font', size=10)
ConfusionMatrixDisplay.from_predictions(y_train[:2000], y_train_pred[:2000], ax=axs[0],
                                        sample_weight=sample_weight,
                                        normalize="true", values_format=".0%")
axs[0].set_title("Errors normalized by row")
ConfusionMatrixDisplay.from_predictions(y_train[:2000], y_train_pred[:2000], ax=axs[1],
                                        sample_weight=sample_weight,
                                        normalize="pred", values_format=".0%")
axs[1].set_title("Errors normalized by column")

plt.show()
plt.rc('font', size=14)  # make fonts great again

In [None]:
cl_a, cl_b = '3', '5'
X_aa = X_train[:2000][(y_train[:2000] == cl_a) & (y_train_pred[:2000] == cl_a)]
X_ab = X_train[:2000][(y_train[:2000] == cl_a) & (y_train_pred[:2000] == cl_b)]
X_ba = X_train[:2000][(y_train[:2000] == cl_b) & (y_train_pred[:2000] == cl_a)]
X_bb = X_train[:2000][(y_train[:2000] == cl_b) & (y_train_pred[:2000] == cl_b)]

In [None]:
size = 5
pad = 0.2
plt.figure(figsize=(size, size))
for images, (label_col, label_row) in [(X_ba, (0, 0)), (X_bb, (1, 0)),
                                       (X_aa, (0, 1)), (X_ab, (1, 1))]:
    for idx, image_data in enumerate(images[:size*size]):
        x = idx % size + label_col * (size + pad)
        y = idx // size + label_row * (size + pad)
        plt.imshow(image_data.reshape(28, 28), cmap="binary",
                   extent=(x, x + 1, y, y + 1))
plt.xticks([size / 2, size + pad + size / 2], [str(cl_a), str(cl_b)])
plt.yticks([size / 2, size + pad + size / 2], [str(cl_b), str(cl_a)])
plt.plot([size + pad / 2, size + pad / 2], [0, 2 * size + pad], "k:")
plt.plot([0, 2 * size + pad], [size + pad / 2, size + pad / 2], "k:")
plt.axis([0, 2 * size + pad, 0, 2 * size + pad])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()

# Multilabel Classification

Sometimes we want more than one label per example. Here, each digit gets two labels: whether it is large (7 or above) and whether it is odd.

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= '7') # larger than 6
y_train_odd = (y_train.astype('int8') % 2 == 1) # odd
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train[:2000], y_multilabel[:2000])

In [None]:
knn_clf.predict([some_digit])

In [None]:
y_train_knn_pred = cross_val_predict(knn_clf, X_train[:2000], y_multilabel[:2000], cv=3)
f1_score(y_multilabel[:2000], y_train_knn_pred[:2000], average="macro")

In [None]:
# extra code – shows that we get a negligible performance improvement when we
#              set average="weighted" because the classes are already pretty
#              well balanced.
f1_score(y_multilabel[:2000], y_train_knn_pred[:2000], average="weighted")
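
The difference between the two averaging modes is easiest to see when the labels are not balanced. The sketch below uses hypothetical per-label F1 scores and supports (the number of positive examples for each label), not values from this notebook.

```python
import numpy as np

# Hypothetical per-label F1 scores and supports, for illustration only
f1_per_label = np.array([0.90, 0.60])
support = np.array([300, 100])

macro = f1_per_label.mean()                           # every label counts equally
weighted = np.average(f1_per_label, weights=support)  # labels weighted by support

print(macro, weighted)
```

When the supports are nearly equal, the two averages nearly coincide.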

In [None]:
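# Could we use our existing model type for multilabel?
# SGDClassifier expects a 1D target, so fitting it on the two-column
# y_multilabel is expected to raise a ValueError.
sgd_multi_clf = SGDClassifier()
sgd_multi_clf.fit(X_train[:2000], y_multilabel[:2000])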

In [None]:
# Instead, we could make separate classifiers for each of the two tasks.  
# Write an SGD classifier for each task (y_train_large and y_train_odd) and get their accuracy.
# For speed, use only the first 2000 examples as above.








A classifier chain trains one classifier per label, in sequence: each classifier receives the original features plus the predictions of the classifiers before it in the chain.

In [None]:
from sklearn.multioutput import ClassifierChain

chain_clf = ClassifierChain(SGDClassifier(), cv=3, random_state=42)
chain_clf.fit(X_train[:2000], y_multilabel[:2000])

In [None]:
chain_clf.predict([some_digit])
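
The core idea of a chain is feature augmentation. The sketch below shows only that step, using made-up features and a stand-in array in place of a real first classifier's predictions. It is not the `ClassifierChain` implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 5 samples with 3 features each
X_toy = rng.random((5, 3))

# Stand-in for the predictions of the first classifier in the chain
first_label_pred = np.array([1, 0, 1, 1, 0])

# The second classifier sees the original features plus the first prediction
X_second = np.c_[X_toy, first_label_pred]

print(X_toy.shape, X_second.shape)
```

The extra column lets the second classifier exploit any dependence between the two labels.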

# Multioutput Classification

Multioutput classification generalizes multilabel classification: there is more than one output, and each output can take more than two values. For this example, we will try to predict the pixel value (0 to 255) for each of the 784 pixels in the original image.

In [None]:
np.random.seed(42)  # to make this code example reproducible
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

In [None]:
plt.subplot(121); plot_digit(X_test_mod[0])
plt.subplot(122); plot_digit(y_test_mod[0])
plt.show()

Given a noisy image, predict the pixel values for the clean image.

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])
plot_digit(clean_digit)
plt.show()
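
To build intuition for why nearest neighbors can denoise, here is a simplified sketch with a single neighbor and tiny made-up "images" of 6 pixels. `KNeighborsClassifier` uses 5 neighbors by default and votes per pixel, so this is a simplification, and `denoise_1nn` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in "images": 4 clean patterns of 6 pixels each (made up)
clean = np.array([[0, 0, 255, 255, 0, 0],
                  [255, 0, 0, 0, 0, 255],
                  [0, 255, 0, 0, 255, 0],
                  [255, 255, 255, 0, 0, 0]])

# Noisy training inputs, built the same way as X_train_mod above
noisy_train = clean + rng.integers(0, 100, clean.shape)

def denoise_1nn(noisy_image):
    """Return the clean target of the closest noisy training image."""
    distances = np.linalg.norm(noisy_train - noisy_image, axis=1)
    return clean[distances.argmin()]

# A fresh noisy copy of the third pattern
query = clean[2] + rng.integers(0, 100, clean.shape[1])
print(denoise_1nn(query))
```

The model never removes noise directly. It returns clean targets from training examples whose noisy versions look similar to the input.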

If we run it again on its prior output, will it get even better?  Make a prediction, then try it out!

In [None]:
# Write your code here


