<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/ml/blob/main/mod2/cmte.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/ml/blob/main/mod2/cmte.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>


__Classification__

_homl3 ch3_

- MNIST - a dataset of handwritten digits
- Building a digit recognizer
- Model evaluation
  - Measuring Accuracy Using Cross-Validation
  - Confusion Matrices
  - Precision and Recall
  - The Precision/Recall Trade-off
  - The ROC Curve
- Multiclass Classification
  - Error Analysis
- Multilabel Classification
- Multioutput Classification

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, matplotlib as mpl
import sklearn as skl, sklearn.datasets as skds

[MNIST - a dataset of handwritten digits](https://en.wikipedia.org/wiki/MNIST_database)
---
- Modified National Institute of Standards and Technology database (MNIST)
- a large database of handwritten digits used by image processing systems
- contains 70,000 grayscale images
  - 60,000 for training and 10,000 for testing
- each image is normalized to fit into a 28x28 pixel bounding box and anti-aliased

In [None]:
# fetch the dataset from https://www.openml.org/
mnist = skds.fetch_openml('mnist_784', as_frame=False)

# the returned object is of type sklearn.utils.Bunch
# this is a dictionary whose keys can also be accessed as attributes
mnist.keys()

In [None]:
# the description of the dataset
print(mnist.DESCR)

In [None]:
X,y = mnist.data, mnist.target
X.shape, y.shape

In [None]:
plt.imshow(X[1000].reshape((28,28))), y[1000]

In [None]:
fig, axs=plt.subplots(10,10,figsize=(9,9), layout='constrained')
for idx, dimg in enumerate(X[60_000:60_100]):
  axs[idx//10, idx%10].imshow(dimg.reshape((28,28)), cmap='binary')
  axs[idx//10, idx%10].axis("off")


In [None]:
# The dataset is already shuffled and split into a training set and a test set
X_train, y_train = X[:60_000], y[:60_000]
X_test, y_test = X[60_000:], y[60_000:]
# 👍 Rule of thumb for data splitting: 80% for training, 20% for testing

Building a digit recognizer
---
- Let's start by recognizing a single digit, such as
  - `0` or `non-0`, `8` or `non-8`
  - which is a binary classification task
- can be implemented with many of scikit-learn's classifiers, e.g.
  - the stochastic gradient descent (SGD, or stochastic GD) classifier
  - implemented in scikit-learn's `SGDClassifier` class
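
To make the idea behind a stochastic gradient descent classifier concrete, here is a minimal, dependency-free sketch of a linear binary classifier trained with SGD on the hinge loss (the default loss of `SGDClassifier`). It is not scikit-learn's implementation: the constant learning rate, the toy 2-D points, and the helper names `train_sgd_hinge` and `predict` are assumptions made for illustration.

```python
import random

def train_sgd_hinge(X, y, epochs=50, lr=0.1, alpha=0.0001, seed=50):
    """Fit weights w and bias b of a linear classifier; labels y must be -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # "stochastic": visit the samples in random order
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # L2 regularization slightly shrinks the weights at every step
            w = [wj * (1 - lr * alpha) for wj in w]
            # the hinge loss has a non-zero gradient only when the margin is below 1
            if y[i] * score < 1:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    """Predict the positive class if the decision score is above the threshold 0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b > 0

# toy, linearly separable data (hypothetical): +1 plays the role of "8", -1 of "non-8"
X_toy = [[2, 2], [3, 1], [2, 3], [-2, -2], [-1, -3], [-3, -1]]
y_toy = [1, 1, 1, -1, -1, -1]

w, b = train_sgd_hinge(X_toy, y_toy)
print([predict(w, b, x) for x in X_toy])  # [True, True, True, False, False, False]
```

Each update looks at a single training instance, which is what lets SGD scale to large training sets.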

In [None]:
# create and train a binary classifier to recognize 8
from sklearn.linear_model import SGDClassifier
clfSgd = SGDClassifier(random_state=50)
y_train_8 = (y_train == '8')
clfSgd.fit(X_train, y_train_8)

In [None]:
# recognize 8 from test images using this classifier
res = clfSgd.predict(X[60_000:60_100])
res.reshape((10,10))

In [None]:
fig1, axs1=plt.subplots(10,10,figsize=(9,9), layout='constrained')
for idx, dimg in enumerate(X[60_000:60_100]):
  ax = axs1[idx//10, idx%10]
  # images predicted as 8 use the default colormap; all others are drawn in black and white
  ax.imshow(dimg.reshape((28,28)), cmap=None if res[idx] else 'binary')
  ax.axis("off")

[Model evaluation](https://scikit-learn.org/stable/model_selection.html)
---
- many metrics are available for model evaluation, such as
  - confusion matrix
  - accuracy, precision, recall, f1 score, etc.
- which metrics are preferred depends on the requirements

Measuring Accuracy Using Cross-Validation
---
- k-fold cross-validation
  - split the training set into k folds
  - train the model k times
  - hold out a different fold each time for evaluation
  - implemented with `cross_val_score` in scikit-learn

$\displaystyle accuracy=\frac{\text{number of correct predictions}}{\text{number of all predictions}}$
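
The k-fold procedure and the accuracy formula can be sketched without any library. The helper names `k_fold_indices` and `accuracy`, the ten toy labels, and the "model" that simply predicts the majority class of its training folds are all assumptions for illustration; scikit-learn's `cross_val_score` does the real work in the notebook.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the number of all predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# toy labels (hypothetical): True marks the single positive instance
labels = [False, False, False, True, False, False, False, False, False, False]

scores = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    # "train": remember the most frequent label of the training folds
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    # "evaluate": predict that label for every instance of the held-out fold
    y_pred = [majority] * len(test_idx)
    scores.append(accuracy([labels[i] for i in test_idx], y_pred))

print(scores)  # [1.0, 0.5, 1.0, 1.0, 1.0]
```

Every instance is used for evaluation exactly once, and never by a model that saw it during training.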


In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(clfSgd, X_train, y_train_8, cv=5, scoring='accuracy') # cv: number of folds (default is 5)

In [None]:
# The accuracies look good for all folds.
# However, this is mostly due to class imbalance: only about 10% of the images are 8s,
# so always predicting not-8 is already right about 90% of the time
1-len(y_train_8[y_train_8 == True])/len(y_train_8)

In [None]:
# DummyClassifier's default strategy ('prior') always predicts the most frequent class (not-8),
# which already achieves high accuracy on imbalanced data,
# so accuracy is NOT a useful metric when the classes are highly imbalanced
from sklearn.dummy import DummyClassifier
clfDummy = DummyClassifier()
clfDummy.fit(X_train, y_train_8)
cross_val_score(clfDummy, X_train, y_train_8, scoring='accuracy')

In [None]:
# an implementation of cross-validation

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skFolder = StratifiedKFold(n_splits=5, shuffle=True)
for trainIndex, testIndex in skFolder.split(X_train, y_train_8):
  cloneClf = clone(clfSgd)
  X_trainFold = X_train[trainIndex]
  y_trainFold = y_train_8[trainIndex]
  X_testFold = X_train[testIndex]
  y_testFold = y_train_8[testIndex]

  cloneClf.fit(X_trainFold, y_trainFold)
  yPred = cloneClf.predict(X_testFold)
  nCorrect = sum(yPred == y_testFold)
  print(nCorrect/len(yPred), end=" ")

[Confusion Matrices](https://en.wikipedia.org/wiki/Confusion_matrix)
---
- visualize the performance of a classifier
- show the number of correct classifications and misclassifications for each class

| Actual\Prediction | non-`8` | `8` |
|:---:|:---:|:---:|
| non-`8` | True negative (TN) | False positive (FP)<br>or type I error |
| `8` | False negative (FN)<br>or type II error | True positive (TP) |
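
The four cells of the table can be counted by hand. The sketch below uses made-up labels and a hypothetical helper `confusion_counts`; it returns the counts in the same layout as the table (rows are actual classes, columns are predicted classes).

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for boolean labels and predictions."""
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    return [[tn, fp], [fn, tp]]

# toy data (hypothetical): True means "8", False means "non-8"
y_true = [False, False, False, False, False, False, True, True, True, True]
y_pred = [False, False, False, False, True,  True,  True, True, True, False]

print(confusion_counts(y_true, y_pred))  # [[4, 2], [1, 3]]
```

A perfect classifier would have non-zero counts only on the main diagonal (TN and TP).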

In [None]:
# generate the confusion matrix on training data
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(clfSgd, X_train, y_train_8)
cm = confusion_matrix(y_train_8, y_train_pred)

In [None]:
cm

Precision, recall and F1 score
---

- the precision of the classifier is the accuracy of the positive predictions


$\displaystyle precision=\frac{TP}{TP+FP}$

- precision alone can be misleading, e.g. when a classifier
  - makes negative predictions almost all the time
  - makes only one positive prediction, on the instance it is most sure about
  - then, if that prediction is correct, precision = 1/1 = 100%
- so, precision is usually used along with *recall*, also called *sensitivity* or the *true positive rate (TPR)*
  - i.e. the ratio of positive instances that are correctly detected by the classifier

$\displaystyle recall=\frac{TP}{TP+FN}$

- Now, accuracy can be calculated as

$\displaystyle accuracy = \frac{TP+TN}{P+N}=\frac{TP+TN}{TP+TN+FP+FN}$
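
As a worked example of the three formulas, the sketch below plugs in made-up counts (TN=4, FP=2, FN=1, TP=3, chosen only for illustration) and then the misleading case of a classifier that makes a single, correct positive prediction.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# hypothetical counts, not the notebook's results
tn, fp, fn, tp = 4, 2, 1, 3
print(precision(tp, fp))         # 3/(3+2) = 0.6
print(recall(tp, fn))            # 3/(3+1) = 0.75
print(accuracy(tp, tn, fp, fn))  # (3+4)/10 = 0.7

# a classifier that makes one positive prediction and gets it right:
# perfect precision, but it misses 3 of the 4 positive instances
tn, fp, fn, tp = 6, 0, 3, 1
print(precision(tp, fp))         # 1/(1+0) = 1.0
print(recall(tp, fn))            # 1/(1+3) = 0.25
```

The second case shows why precision is read together with recall.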

In [None]:
# calculate precision and recall
from sklearn.metrics import precision_score, recall_score, accuracy_score
tn,fp,fn,tp = cm.flatten()
print(f'precision=TP/(TP+FP)={tp}/({tp}+{fp})={tp/(tp+fp)}={precision_score(y_train_8,y_train_pred)}')
print(f'recall=TP/(TP+FN)={tp}/({tp}+{fn})={tp/(tp+fn)}={recall_score(y_train_8,y_train_pred)}')
print(f'accuracy=(TP+TN)/(TP+TN+FP+FN)=({tp}+{tn})/({tp}+{tn}+{fp}+{fn})={cm.trace()/cm.sum()}={accuracy_score(y_train_8,y_train_pred)}')

- when the classifier claims an image is an `8`, it is correct only 60.33% of the time (precision)
  - and it detects only 56.79% of the `8`s (recall)
- precision and recall can be combined into a single metric $F_1$ score
  - the *harmonic mean* of precision and recall
  - gives more weight to low values
  - $F_1$ is high when both precision and recall are high

$\displaystyle F_1=\frac{2}{\frac{1}{precision}+\frac{1}{recall}}=\frac{2TP}{2TP+FN+FP}$
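
A short numeric check, using made-up counts, that the two forms of the $F_1$ formula agree and that the harmonic mean is pulled toward the lower of the two values. The helper names are hypothetical.

```python
import math

def f1_from_rates(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 / (1 / precision + 1 / recall)

def f1_from_counts(tp, fp, fn):
    return 2 * tp / (2 * tp + fn + fp)

# hypothetical counts: TP=3, FP=2, FN=1, so precision=0.6 and recall=0.75
tp, fp, fn = 3, 2, 1
p, r = tp / (tp + fp), tp / (tp + fn)
print(round(f1_from_rates(p, r), 3), round(f1_from_counts(tp, fp, fn), 3))  # 0.667 0.667
assert math.isclose(f1_from_rates(p, r), f1_from_counts(tp, fp, fn))

# precision=1.0 with recall=0.25: the arithmetic mean looks decent, F1 does not
print((1.0 + 0.25) / 2)          # 0.625
print(f1_from_rates(1.0, 0.25))  # 0.4
```

Note that `f1_from_rates` is undefined when precision or recall is 0, while the count-based form still works as long as there is at least one TP, FP, or FN.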

In [None]:
# calculate F1 score
from sklearn.metrics import f1_score
print(f'f1 = 2TP/(2TP+FN+FP)=2*{tp}/(2*{tp}+{fn}+{fp})={2*tp/(2*tp+fn+fp)}={f1_score(y_train_8, y_train_pred)}')

The Precision/Recall Trade-off
---
- the *precision/recall trade-off* is that increasing precision reduces recall, and vice versa
- the SGDClassifier makes its classification decisions in two steps
  - computes a score based on a decision function
  - compares the score with a threshold
    - classifies the instance as positive if score > threshold
    - else negative
  - the default threshold used by the SGDClassifier is 0
  - raising the threshold can only decrease recall, and usually increases precision
- an appropriate threshold, i.e. the desired precision/recall trade-off, can be chosen from
  - the curves of precision vs. threshold and recall vs. threshold
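
The two-step decision (score, then threshold) and its effect on precision and recall can be sketched with a handful of made-up decision scores. The scores, the labels, and the helper `precision_recall_at` are assumptions for illustration; in the notebook the scores come from the classifier's decision function.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for score > threshold."""
    preds = [s > threshold for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum(t and not p for p, t in zip(preds, labels))
    # convention chosen here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)  # assumes at least one positive label
    return precision, recall

# hypothetical decision scores and true labels (True means "8")
scores = [-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0]
labels = [False, False, True, False, True, True, True, True]

for threshold in (-1, 0, 1):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:>2}: precision={p:.3f}, recall={r:.3f}")
```

On this toy data, raising the threshold from -1 to 1 lifts precision from about 0.83 to 1.0 while recall drops from 1.0 to 0.6.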

In [None]:
# calculate all precisions and recalls vs. thresholds
y_scores = cross_val_predict(clfSgd, X_train, y_train_8, method='decision_function')

from sklearn.metrics import precision_recall_curve
precisions, recalls,thresholds = precision_recall_curve(y_train_8, y_scores)

In [None]:
# draw the curves of precision vs. threshold and recall vs. threshold
threshold = 3000
fig2, ax2 = plt.subplots(figsize=(6,3),layout='constrained')
ax2.plot(thresholds,precisions[:-1], 'b--', label='Precision', linewidth=2)
ax2.plot(thresholds,recalls[:-1],'g-', label='Recall', linewidth=2)
ax2.vlines(threshold,0, 1.0, 'r', "dotted", label='threshold')

idx = (thresholds >= threshold).argmax() # first index ≥ threshold
ax2.plot(thresholds[idx], precisions[idx], 'bo')
ax2.plot(thresholds[idx], recalls[idx], 'go')
ax2.grid(True)
ax2.axis([-50000,25000,-0.01,1.01])
ax2.set_xlabel("Threshold")
ax2.legend(loc='center right')

The ROC Curve
---

- Multiclass Classification
  - Error Analysis
- Multilabel Classification
- Multioutput Classification

# References
- [Model selection and evaluation in scikit-learn](https://scikit-learn.org/stable/model_selection.html)