<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/ml/blob/main/mod2/cmte.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/ml/blob/main/mod2/cmte.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>


__Classification__

_homl3 ch3_

- MNIST - a dataset of handwritten digits
- Building a digit recognizer
- Model evaluation
  - Measuring Accuracy Using Cross-Validation
  - Confusion Matrices
  - Precision and Recall
  - The Precision/Recall Trade-off
  - The ROC Curve
- Multiclass Classification
  - Error Analysis
- Multilabel Classification
- Multioutput Classification

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, matplotlib as mpl
import sklearn as skl, sklearn.datasets as skds

[MNIST - a dataset of handwritten digits](https://en.wikipedia.org/wiki/MNIST_database)
---
- Modified National Institute of Standards and Technology database (MNIST)
- a large database of handwritten digits used by image processing systems
- contains 70,000 grayscale images
  - 60,000 for training and 10,000 for testing
- each image is normalized to fit into a 28x28 pixel bounding box and anti-aliased

In [None]:
# fetch the dataset from https://www.openml.org/
mnist = skds.fetch_openml('mnist_784', as_frame=False)

# the returned object is of type sklearn.utils.Bunch
# this is a dictionary whose keys can also be accessed as attributes
mnist.keys()

In [None]:
# the description of the dataset
print(mnist.DESCR)

In [None]:
X,y = mnist.data, mnist.target
X.shape, y.shape

In [None]:
plt.imshow(X[1000].reshape((28,28))), y[1000]

In [None]:
fig, axs=plt.subplots(10,10,figsize=(9,9), layout='constrained')
for idx, dimg in enumerate(X[60_000:60_100]):
  axs[idx//10, idx%10].imshow(dimg.reshape((28,28)), cmap='binary')
  axs[idx//10, idx%10].axis("off")


In [None]:
# The dataset is already shuffled and split into a training set and a test set
X_train, y_train = X[:60_000], y[:60_000]
X_test, y_test = X[60_000:], y[60_000:]
# 👍 Rule of thumb for data splitting: 80% for training, 20% for testing
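
MNIST ships with a predefined split, but for a dataset without one, the 80/20 rule of thumb can be applied with scikit-learn's `train_test_split`. The following is a minimal sketch on hypothetical toy data (`X_toy` and `y_toy` are made-up names, not part of this notebook); `stratify` is an optional extra that keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 4 features and balanced binary labels
rng = np.random.default_rng(0)
X_toy = rng.random((100, 4))
y_toy = np.array([0, 1] * 50)

# test_size=0.2 holds out 20% of the samples for testing;
# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

print(X_tr.shape, X_te.shape)
```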

Building a digit recognizer
---
- Let's start by recognizing a single digit, such as
  - `0` or `non-0`, `8` or `non-8`
  - which is a binary classification task
- can be implemented with many of scikit-learn's classifiers, e.g.
  - the stochastic gradient descent (SGD, or stochastic GD) classifier
  - implemented in scikit-learn's `SGDClassifier` class

In [None]:
# create and train a binary classifier to recognize 8
from sklearn.linear_model import SGDClassifier
clfSgd = SGDClassifier(random_state=50)
y_train_8 = (y_train == '8')  # the labels are strings, so compare with '8'; True for every 8
clfSgd.fit(X_train, y_train_8)

In [None]:
# recognize 8 from test images using this classifier
res = clfSgd.predict(X[60_000:60_100])
res.reshape((10,10))

In [None]:
fig1, axs1=plt.subplots(10,10,figsize=(9,9), layout='constrained')
for idx, dimg in enumerate(X[60_000:60_100]):
  ax = axs1[idx//10, idx%10]
  # digits predicted as 8 use the default colormap; all others are drawn in black and white
  ax.imshow(dimg.reshape((28,28)), cmap=None if res[idx] else 'binary')
  ax.axis("off")

Measuring Accuracy Using Cross-Validation
---
- k-fold cross-validation
  - split the training set into k folds
  - train the model k times
  - hold out a different fold each time for evaluation
  - implemented with `cross_val_score` in scikit-learn
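
To see how the folds are formed, here is a small sketch that splits ten hypothetical sample indices with scikit-learn's `KFold`. The variable names are illustrative only. Each sample is held out exactly once, and the remaining samples form the training part of that round.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10)  # stand-in for 10 training samples
kf = KFold(n_splits=5)   # no shuffling, so the folds are contiguous blocks

held_out = []
for train_idx, test_idx in kf.split(indices):
    # train on train_idx, evaluate on the held-out test_idx
    print("train:", train_idx, "held out:", test_idx)
    held_out.extend(test_idx.tolist())
```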


In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(clfSgd, X_train, y_train_8, cv=5, scoring='accuracy') # cv is the number of folds (default 5)

In [None]:
# The accuracies look good for all folds.
# However, this is mostly due to class imbalance in the labels:
# always predicting non-8 is already right about 90% of the time
1-len(y_train_8[y_train_8 == True])/len(y_train_8)

In [None]:
# baseline: with its default strategy ('prior'), DummyClassifier always predicts the most frequent class (non-8)
from sklearn.dummy import DummyClassifier
clfDummy = DummyClassifier()
clfDummy.fit(X_train, y_train_8)
cross_val_score(clfDummy, X_train, y_train_8, scoring='accuracy')

In [None]:
# a manual implementation of stratified k-fold cross-validation

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skFolder = StratifiedKFold(n_splits=5, shuffle=True)
for trainIndex, testIndex in skFolder.split(X_train, y_train_8):
  cloneClf = clone(clfSgd)
  X_trainFold = X_train[trainIndex]
  y_trainFold = y_train_8[trainIndex]
  X_testFold = X_train[testIndex]
  y_testFold = y_train_8[testIndex]

  cloneClf.fit(X_trainFold, y_trainFold)
  yPred = cloneClf.predict(X_testFold)
  nCorrect = sum(yPred == y_testFold)
  print(nCorrect/len(yPred), end=" ")

[Confusion Matrices](https://en.wikipedia.org/wiki/Confusion_matrix)
---
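- visualize the performance of a classification algorithm
- show how many instances of each class were classified correctly or misclassified
- TP, FN, FP, and TN stand for true positives, false negatives, false positives, and true negatives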

| Actual\Prediction | `8` | non-`8` |
|:---:|:---:|:---:|
| `8` | TP | FN |
| non-`8` | FP | TN |
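
As a minimal sketch, the four cells can be computed with scikit-learn's `confusion_matrix`. The labels below are hypothetical (1 means "is an 8", 0 means "is not an 8"). By default the function orders the classes by sorted label, which would give `[[TN, FP], [FN, TP]]`; passing `labels=[1, 0]` reorders the result to match the table above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 means "is an 8", 0 means "is not an 8"
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# rows are actual classes, columns are predicted classes;
# labels=[1, 0] puts the positive class first:
# [[TP, FN],
#  [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm
print(cm)
```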
