# Classification

In [None]:
# load MNIST dataset
from sklearn.datasets import fetch_openml

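# as_frame=False requests NumPy arrays; recent scikit-learn versions may otherwise
# return a pandas DataFrame, which would break the X[0] indexing used below
mnist = fetch_openml('mnist_784', version=1, as_frame=False)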
mnist.keys()

- **A DESCR key** describing the dataset
- **A data key** containing an array with one row per instance and one column per feature
- **A target key** containing an array with the labels

In [None]:
# take a look at dataset
X, y = mnist["data"], mnist["target"]
print(X.shape, y.shape)

There are **70,000 images**, and each image has **784 features** because each image is **28x28 pixels**. Each feature simply represents one pixel's intensity, from 0 (white) to 255 (black).

In [None]:
# display one digit from dataset
import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()

In [None]:
import numpy as np

# let's take a look at the label
print(y[0])

# cast the label into integer type
y = y.astype(np.uint8)

In [None]:
# split dataset into training set and testing set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# the training set is already shuffled, so the cross-validation folds should be similar to one another

## Training a Binary Classifier

In [None]:
# We simplify the problem to a binary classification task:
# the classifier will distinguish between just two classes, 5 and not-5
y_train_5 = (y_train==5) # true for all 5s, false for the other digits
y_test_5 = (y_test==5)

In [None]:
# we will use Stochastic Gradient Descent (SGD)
# SGD is capable of handling very large datasets efficiently

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

In [None]:
# let's use the SGD model to predict some digit
sgd_clf.predict([some_digit])

# the result should be True, since the digit is 5

## Performance Measures

### Measuring Accuracy Using Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

The accuracy looks good at 96%. But let's look at a very dumb classifier that puts every single image in the "not-5" class.

In [None]:
from sklearn.base import BaseEstimator

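class Never5Classifier(BaseEstimator):
    # baseline that ignores its input and always predicts "not-5"
    def fit(self, X, y=None):
        return self  # nothing to learn; returning self follows the scikit-learn convention
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)  # one False per instance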

In [None]:
# let's find out the accuracy
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

The result is over 90% accuracy because only about 10% of the images are 5s: if we always guess that an image is not a 5, we will be right about 90% of the time.

This condition is called a **skewed dataset** (some classes are much more frequent than others), and it is why accuracy alone can be a misleading performance measure.
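The effect can be reproduced without any model. The sketch below uses made-up labels (one positive in every ten instances, an assumption chosen to mimic the 5 versus not-5 split) and a predictor that always answers "negative".

```python
# Toy labels, assumed for illustration: 1 positive in every 10 instances.
labels = [i % 10 == 0 for i in range(1000)]

# A "classifier" that always predicts the negative class.
predictions = [False] * len(labels)

correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall tells a different story: none of the positives are found.
true_positives = sum(p and t for p, t in zip(predictions, labels))
recall = true_positives / sum(labels)

print(accuracy, recall)
```

Accuracy is high even though the predictor never finds a single positive instance, which is what the other metrics in this section are designed to expose.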

### Confusion Matrix
We need a set of predictions to compare with the actual targets.

In [None]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

# cross_val_predict performs k-fold cross-validation
# and it returns the predictions made on each test fold

Now let's calculate the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)

Each **row** in a confusion matrix represents an **actual class**, while each **column** represents a **predicted class.**

- **53892:** were correctly classified as non-5s **(true negatives)**
- **687:** were wrongly classified as 5s **(false positives)**
- **1891:** were wrongly classified as non-5s **(false negatives)**
- **3530:** were correctly classified as 5s **(true positives)**

A perfect classifier would only have **true positives** and **true negatives.**

In [None]:
# let's pretend we reached perfection
y_train_perfect_predictions = y_train_5
confusion_matrix(y_train_5, y_train_perfect_predictions)

\begin{equation*}
\text{Precision} = \frac{TP}{TP + FP}
\end{equation*}

\begin{equation*}
\text{Recall} = \text{Sensitivity} = \text{True Positive Rate} = \frac{TP}{TP + FN}
\end{equation*}

- **TP:** true positive
- **FP:** false positive
- **FN:** false negative

### Precision and Recall

In [None]:
from sklearn.metrics import precision_score, recall_score
pre = precision_score(y_train_5, y_train_pred)
rec = recall_score(y_train_5, y_train_pred)

print(pre)
print(rec)

It is convenient to combine precision and recall into a single metric called the *F1 score*, which is the harmonic mean of precision and recall.

\begin{equation*}
F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \frac{TP}{TP + \frac{FN + FP}{2}}
\end{equation*}
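The F1 score can be written as a harmonic mean (with 2 in the numerator), as a product over a sum, or directly from the counts. A small sketch, reusing the counts from the confusion matrix above, confirms that the three forms give the same number.

```python
import math

# Counts copied from the confusion matrix discussed earlier.
tp, fp, fn = 3530, 687, 1891

precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_product = 2 * precision * recall / (precision + recall)
f1_counts = tp / (tp + (fn + fp) / 2)

print(f1_harmonic, f1_product, f1_counts)
print(math.isclose(f1_harmonic, f1_product), math.isclose(f1_product, f1_counts))
```

Because it is a harmonic mean, the F1 score is only high when precision and recall are both high.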

In [None]:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)

In some contexts we mostly care about precision, and in other contexts we really care about recall.

For example,
- If we train a classifier to detect videos that are safe for kids, we would probably prefer a classifier that rejects many good videos **(low recall)** but keeps only safe ones **(high precision)**, rather than a classifier that has a much higher recall but lets a few really bad videos show up.
- On the other hand, suppose we train a classifier to detect shoplifters in surveillance images: it is probably fine if the classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).

We can't have it both ways: increasing precision reduces recall, and vice versa. This is called the **precision/recall tradeoff**.

### Precision/Recall Tradeoff

In [None]:
# scikit-learn gives access to the decision scores that it uses to make predictions
# decision_function() returns a score for each instance; we can then make predictions from those scores using any threshold
y_scores = sgd_clf.decision_function([some_digit])
print(y_scores)

threshold = 0
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

# it will return the same result as the predict() method -> True

In [None]:
# let's raise the threshold
threshold = 8000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

Raising the threshold decreases recall. The image actually represents a 5, and the classifier detects it when the threshold is 0, but it misses it when the threshold is increased to 8000.
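The same effect can be seen on a handful of made-up scores. The sketch below is not the notebook's data: the labels, scores, and thresholds are assumptions chosen so the arithmetic is easy to follow, and `precision_recall` is a hypothetical helper written in plain Python.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when predicting positive for score > threshold."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and t for p, t in zip(predicted, labels))
    fp = sum(p and not t for p, t in zip(predicted, labels))
    fn = sum(t and not p for p, t in zip(predicted, labels))
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Toy data: instances sorted by score, with positives mostly on the high side.
labels = [False, False, True, False, True, False, True, True, True, True]
scores = [-4.0, -3.0, -2.0, -1.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0]

thresholds = [-5.0, 0.0, 1.5, 4.5]
results = [precision_recall(labels, scores, t) for t in thresholds]
for t, (p, r) in zip(thresholds, results):
    print(f"threshold={t:>5}: precision={p:.2f} recall={r:.2f}")
```

Each time the threshold moves up, fewer instances are predicted positive, so recall can only stay the same or drop.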

Now, how do we decide which threshold to use?
We first need the scores of all instances in the training set, which we can get from the cross_val_predict() function by asking it to return decision scores instead of predictions.

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

Now with these scores, we can compute precision and recall for all possible thresholds using the *precision_recall_curve()* function.

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

In [None]:
# plot precision and recall
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.legend(loc="center right", fontsize=16)
    plt.xlabel("Threshold", fontsize=16)
    plt.grid(True)
    plt.axis([-50000, 50000, 0, 1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

The precision curve is bumpier than the recall curve because precision may sometimes go down when we raise the threshold (although in general it goes up). Recall, on the other hand, can only go down when the threshold is increased, which explains why its curve looks smooth.

In [None]:
# the right precision/recall trade-off depends on the project

# let's try with 90% precision
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
threshold_90_precision

In [None]:
# make predictions
y_train_pred_90 = (y_scores >= threshold_90_precision)
y_train_pred_90

Let's check precision and recall

In [None]:
print(precision_score(y_train_5, y_train_pred_90))
print(recall_score(y_train_5, y_train_pred_90))

If someone says "let's reach 99% precision", you should ask, "at what recall?"
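The threshold lookup used above relies on `np.argmax` returning the index of the first `True` in a boolean array. The sketch below shows that behavior on made-up precision values (an assumption for illustration, not the notebook's results), along with one caveat worth guarding against.

```python
import numpy as np

# Made-up values, for illustration only. As with precision_recall_curve,
# precisions has one more entry than thresholds.
precisions = np.array([0.50, 0.75, 0.92, 0.88, 0.95, 1.00])
thresholds = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# argmax on a boolean array returns the index of the first True.
first_idx = np.argmax(precisions >= 0.90)
chosen_threshold = thresholds[first_idx]
print(first_idx, chosen_threshold)

# Caveat: if no entry satisfies the condition, argmax still returns 0,
# so it is safer to check that the target is reachable first.
no_match_idx = np.argmax(precisions >= 2.0)
target_reachable = bool((precisions >= 2.0).any())
print(no_match_idx, target_reachable)
```

Note that the lookup picks the lowest threshold that reaches the target, which keeps recall as high as possible for that precision.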

### The ROC Curve
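The *receiver operating characteristic (ROC)* curve is similar to the precision/recall curve, but instead of plotting precision versus recall, it plots the *true positive rate (recall)* against the *false positive rate*.

The false positive rate is equal to 1 - *specificity* (the true negative rate), so the ROC curve plots *sensitivity* (recall) versus 1 - *specificity*.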
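To make these rates concrete, here is a small sketch that computes one ROC point from the confusion-matrix counts quoted earlier in this notebook. The variable names are illustrative.

```python
# Counts copied from the confusion matrix discussed earlier.
tn, fp, fn, tp = 53892, 687, 1891, 3530

tpr = tp / (tp + fn)          # true positive rate (recall, sensitivity)
specificity = tn / (tn + fp)  # true negative rate
fpr = fp / (fp + tn)          # false positive rate

print(f"TPR = {tpr:.4f}")
print(f"FPR = {fpr:.4f}")
print(f"1 - specificity = {1 - specificity:.4f}")
```

Each threshold yields one such (FPR, TPR) pair, and the ROC curve is the set of these points over all thresholds.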

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

In [None]:
# let's plot roc curve
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1], [0,1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16)
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)
    plt.grid(True)
    
plot_roc_curve(fpr, tpr)
plt.show()

# There is a tradeoff: the higher the TPR (recall), the more false positives (FPR) the classifier produces
# The dotted line represents the ROC curve of a purely random classifier
# A good classifier stays as far away from that line as possible (toward the top-left corner)

One way to compare classifiers is to measure the *area under the curve* (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.
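The area can be approximated with the trapezoidal rule over the (FPR, TPR) points. The sketch below uses a hypothetical `trapezoid_auc` helper and two idealized curves, the diagonal of a random classifier and the corner-hugging curve of a perfect one, rather than the notebook's actual ROC points.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a piecewise-linear curve given by (fpr, tpr) points."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area

# Idealized curves, assumed for illustration.
random_auc = trapezoid_auc([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])   # the diagonal
perfect_auc = trapezoid_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # through the top-left corner

print(random_auc, perfect_auc)
```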

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)

How do we decide whether to use the ROC curve or the precision/recall (PR) curve?
- Prefer the PR curve whenever the positive class is rare or when we care more about false positives than false negatives.
- Use the ROC curve otherwise.

In [None]:
# Let's train a RandomForestClassifier and compare its ROC curve and ROC AUC score to the SGDClassifier
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

In [None]:
# to plot a ROC curve, we need scores
y_scores_forest = y_probas_forest[:, 1] # proba of positive class
fpr_forest, tpr_forest, threshold_forest = roc_curve(y_train_5, y_scores_forest)

In [None]:
# plot the ROC curve
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

As you can see, the *RandomForestClassifier* ROC curve comes closer to the top-left corner than the *SGDClassifier* curve, so it is the better classifier on this task.

In [None]:
# let's calculate the roc auc score
print(roc_auc_score(y_train_5, y_scores_forest))

# precision and recall
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
print(precision_score(y_train_5, y_train_pred_forest))
print(recall_score(y_train_5, y_train_pred_forest))

## Multiclass Classification

- SGD classifiers, Random Forest classifiers, and naive Bayes classifiers are capable of handling multiple classes natively.
- In their basic form, Logistic Regression and Support Vector Machine classifiers are strictly binary classifiers.

There are various strategies to perform multiclass classification using multiple binary classifiers.
- **OvR (one-versus-the-rest), also called OvA (one-versus-all):** to classify the digit images into 10 classes (from 0 to 9), train 10 binary classifiers, one for each digit. To classify an image, get the decision score from each classifier for that image and select the class whose classifier outputs the highest score.
- **OvO (one-versus-one):** train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, and so on. If there are N classes, we need to train *N x (N - 1)/2* classifiers. The advantage is that each classifier only needs to be trained on the part of the training set containing the two classes that it must distinguish.

For most binary classification algorithms, OvA is preferred.
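For the ten MNIST digits, the number of binary classifiers each strategy needs can be counted directly. This is a minimal sketch that enumerates the pairs with the standard library.

```python
from itertools import combinations

classes = list(range(10))  # the digits 0 to 9

# OvO trains one classifier per unordered pair of classes.
ovo_pairs = list(combinations(classes, 2))
n_ovo = len(ovo_pairs)

# OvR (OvA) trains one classifier per class.
n_ovr = len(classes)

print(n_ovo, n_ovr)
print(ovo_pairs[:3])
```

OvO needs many more classifiers, but each one is trained on only the two classes it compares.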

In [None]:
# fortunately, scikit-learn automatically runs OvR or OvO, depending on the algorithm
from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
svm_clf.predict([some_digit])

In [None]:
# under the hood, scikit-learn used the OvO strategy here:
# it trained 45 binary classifiers (one per pair of digits), got their decision scores for the image,
# and selected the class that won the most duels

# let's show the decision scores with decision_function
some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores

In [None]:
# The highest score:
print(np.argmax(some_digit_scores))

print(svm_clf.classes_)
print(svm_clf.classes_[5])

We can force scikit-learn to use OvO or OvR with the *OneVsOneClassifier* or *OneVsRestClassifier* classes.

In [None]:
from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC())
ovr_clf.fit(X_train, y_train)
print(ovr_clf.predict([some_digit]))

print(len(ovr_clf.estimators_))

In [None]:
# training with SGDClassifier or RandomForestClassifier is just as easy
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])

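With SGDClassifier we did not need an explicit OvR or OvO wrapper, because the estimator handles multiclass targets itself. According to the scikit-learn documentation, it does so by combining one binary classifier per class in a one-versus-all scheme, which is why decision_function() returns one score per class.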

In [None]:
# let's look at the score that the SGD classifier assigned
sgd_clf.decision_function([some_digit])

Class #5 has a score of 2412.5

In [None]:
# evaluate SGDClassifier
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

In [None]:
# we can improve the accuracy by scaling the inputs
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

## Error Analysis

In [None]:
# let's take a look at the confusion matrix
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

In [None]:
# plot the confusion matrix
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

We need to divide each value in the confusion matrix by the number of images in the corresponding class, so that we compare error rates instead of absolute numbers of errors.
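A tiny made-up example shows why this matters. In the hypothetical two-class matrix below, the abundant class has more errors in absolute terms, but the rare class has a much higher error rate.

```python
# Hypothetical confusion matrix: rows are actual classes, columns are predictions.
conf = [
    [90, 10],  # class 0: 100 images, 10 misclassified
    [5, 5],    # class 1: 10 images, 5 misclassified
]

# Divide every row by its total so each entry becomes a per-class rate.
rates = [[count / sum(row) for count in row] for row in conf]

for actual, row in enumerate(rates):
    print(actual, [round(rate, 2) for rate in row])
```

Without the normalization, a plot of raw counts would make the abundant class look worse simply because it has more images.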

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx/row_sums

In [None]:
# now let's fill the diagonal with zeros to keep only the errors
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

**Remember that rows represent actual classes, while columns represent predicted classes.**

Analyzing individual errors can also be a good way to gain insights on what your classifier is doing and why it is failing.

In [None]:
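# hypothetical helper (plot_digits is not defined earlier in this notebook):
# tiles 28x28 digit images into a single grid image
def plot_digits(instances, images_per_row=10):
    instances = np.asarray(instances)
    n_rows = (len(instances) - 1) // images_per_row + 1
    # pad with blank images so the last row of the grid is full
    n_empty = n_rows * images_per_row - len(instances)
    padded = np.concatenate([instances, np.zeros((n_empty, 784))], axis=0)
    grid = padded.reshape(n_rows, images_per_row, 28, 28)
    plt.imshow(grid.transpose(0, 2, 1, 3).reshape(n_rows * 28, images_per_row * 28), cmap="binary")
    plt.axis("off")

# let's plot examples of 3s and 5s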
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()

The two 5x5 blocks on the left show digits classified as 3s, and the two 5x5 blocks on the right show digits classified as 5s. The digits that the classifier gets wrong are in the top-right block (3s classified as 5s) and the bottom-left block (5s classified as 3s).

The reason for these errors is that we used a simple SGDClassifier, which is a linear model. All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class. Since 3s and 5s differ only by a few pixels, this model will easily confuse them.
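The scoring rule of a linear model can be sketched in a few lines. The example below is a simplification with made-up numbers: a 4-pixel "image", three classes, and a hypothetical `class_scores` helper. It is not the fitted SGDClassifier, only the weighted-sum idea.

```python
def class_scores(weights, biases, pixels):
    """One score per class: weighted sum of pixel intensities plus a bias."""
    return [
        sum(w * x for w, x in zip(class_weights, pixels)) + b
        for class_weights, b in zip(weights, biases)
    ]

# Made-up parameters: one row of per-pixel weights for each of 3 classes.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
biases = [0.0, 0.0, 0.0]
pixels = [10, 200, 180, 5]  # a made-up 4-pixel image

scores = class_scores(weights, biases, pixels)
predicted_class = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted_class)
```

Because each class score depends on individual pixels independently, two classes whose images differ in only a few pixels end up with similar scores.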

One way to reduce the 3/5 confusion would be to preprocess the images to ensure that they are well centered and not too rotated. This will probably help reduce other errors as well.

## Multilabel Classification

In some cases, we want our classifier to output multiple classes for each instance.

Example:
- There is a classifier that has been trained to recognize three faces, Alice, Bob, and Charlie.
- When it is shown a picture of Alice and Charlie, it should output (1, 0, 1) meaning "Alice yes, Bob no, Charlie yes".
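The output format in this example is one binary flag per known person. Below is a minimal sketch of that encoding, where `encode` is a hypothetical helper and not part of any face-recognition library.

```python
known_people = ["Alice", "Bob", "Charlie"]

def encode(people_in_picture):
    """Return one 0/1 flag per known person, in a fixed order."""
    return tuple(int(name in people_in_picture) for name in known_people)

label = encode({"Alice", "Charlie"})
print(label)
```

A multilabel classifier predicts all of these flags at once for each instance.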

In [None]:
# example
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

# the first indicates whether or not the digit is large (7, 8, 9)
# the second indicates whether or not it is odd

In [None]:
# now we can make a prediction
knn_clf.predict([some_digit])

# the result says that the digit 5 is not large (False) and is odd (True)

In [None]:
# one way to evaluate a multilabel classifier: compute the F1 score for each label, then average them (average="macro")
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")

## Multioutput Classification

It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).

In [None]:
# let's create training and test sets by adding noise to the pixel intensities
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
# the targets are the original, clean images
y_train_mod = X_train
y_test_mod = X_test

In [None]:
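# hypothetical helper (plot_digit is not defined earlier in this notebook): draws one 28x28 digit
def plot_digit(data):
    plt.imshow(np.asarray(data).reshape(28, 28), cmap="binary")
    plt.axis("off")

# let's take a peek: noisy input on the left, clean target on the right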
some_index = 0
plt.subplot(121); plot_digit(X_test_mod[some_index])
plt.subplot(122); plot_digit(y_test_mod[some_index])
plt.show()

In [None]:
# let's train the classifier and make it clean the noisy image
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)

## Exercises