# Supervised Learning: Classification of Handwritten Digits

In this section we'll apply scikit-learn to the classification of handwritten
digits.  This will go a bit beyond the iris classification we saw before: we'll
discuss some of the metrics which can be used in evaluating the effectiveness
of a classification model.

Adapted from Gaël Varoquaux's course.

In [1]:
from sklearn.datasets import load_digits
digits = load_digits()
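
Before plotting, it can help to look at the shapes of the arrays in the dataset. The following is a minimal sketch, assuming the standard digits dataset bundled with scikit-learn; the names `n_samples` and `n_features` are ours, introduced only for illustration.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# `data` holds one flattened row of 64 values per image, `images` holds the
# same pixels arranged as 8x8 arrays, and `target` holds the digit labels.
n_samples, n_features = digits.data.shape
print(digits.data.shape)    # (n_samples, 64)
print(digits.images.shape)  # (n_samples, 8, 8)
print(digits.target.shape)  # (n_samples,)

# Each row of `data` is the corresponding image flattened row by row.
print((digits.data[0] == digits.images[0].ravel()).all())
```

The estimators used later expect the flattened `data` array, while `images` is convenient for plotting.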

We'll re-use some of our code from before to visualize the data and remind us what
we're looking at:

In [2]:
%matplotlib inline
from matplotlib import pyplot as plt

In [None]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

## Visualizing the Data

To get a feel for the data, a good first step for many problems is to
visualize it using a *dimensionality reduction* technique.  We'll start with the
most straightforward one, Principal Component Analysis (PCA).

PCA seeks orthogonal linear combinations of the features which show the greatest
variance, and as such, can help give you a good idea of the structure of the
data set.  You can use `PCA` with its default solver, or pass `svd_solver="randomized"` (the replacement for the former `RandomizedPCA` class), which is faster for large `N`.
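
To make the idea of "orthogonal linear combinations with the greatest variance" concrete, here is a minimal NumPy sketch of PCA computed through a singular value decomposition. The toy data and all variable names are assumptions made for illustration; this is not how scikit-learn is implemented internally, only the same idea written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (made up): 200 samples, 5 features, scaled so that some
# directions carry much more variance than others.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# 1. Center each feature.
X_centered = X - X.mean(axis=0)

# 2. The right singular vectors of the centered data are the principal axes,
#    ordered by decreasing singular value, hence by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                      # top two axes, shape (2, 5)

# 3. Project the data onto those axes.
proj_manual = X_centered @ components.T  # shape (200, 2)

# The axes are orthonormal, and the first projected coordinate
# has the largest variance.
print(np.allclose(components @ components.T, np.eye(2)))
print(proj_manual.var(axis=0))
```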

Create the projection `proj` of the features:

In [4]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, svd_solver="randomized")
proj = pca.fit_transform(digits.data)

Make a plot with `plt.scatter(X1, X2, c=y)` and `plt.colorbar();`

In [None]:
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target)
plt.colorbar();

**Question: Given these projections of the data, which numbers do you think
a classifier might have trouble distinguishing?**
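
One rough way to back up a visual impression is to measure how close the classes sit to each other in the projection. The sketch below compares the class centroids in the 2D PCA space. This is a crude proxy that we introduce for illustration: it ignores the spread of each class and everything lost by keeping only two components, so treat the result as a hint rather than an answer.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
proj = PCA(n_components=2).fit_transform(digits.data)

# Mean position of each digit class in the 2D projection.
centroids = np.array([proj[digits.target == d].mean(axis=0) for d in range(10)])

# Pairwise Euclidean distances between the class centroids.
diff = centroids[:, None, :] - centroids[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# List the three closest pairs of distinct digits.
pairs = [(dist[i, j], i, j) for i in range(10) for j in range(i + 1, 10)]
for d, i, j in sorted(pairs)[:3]:
    print(f"digits {i} and {j}: centroid distance {d:.1f}")
```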

## Logistic Regression Classification

Use a logistic regression classifier.

In [6]:
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

In [7]:
# split the data into training and validation sets

# train the model

# use the model to predict the labels of the test data
predicted = #TODO
expected = y_test
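
One possible way to complete the cell above is sketched here. The default split proportions, `random_state=0`, and `max_iter=5000` are our assumptions, chosen for reproducibility and to give the solver room to converge; other choices are equally valid.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()

# split the data into training and validation sets
# (random_state=0 is arbitrary, used only to make the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# train the model
# (max_iter is raised as a precaution: the default may stop before the
# solver has converged on unscaled pixel values)
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# use the model to predict the labels of the test data
predicted = clf.predict(X_test)
expected = y_test
```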

**Question**: why did we split the data into training and validation sets?

Let's plot the digits again with the predicted labels to get an idea of
how well the classification is working:

Correct predictions are shown in green and incorrect ones in red (use the `predicted` variable).

In [None]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,
              interpolation='nearest')
    
    # label the image with the target value
    if predicted[i] == expected[i]:
        ax.text(0, 7, str(predicted[i]), color='green')
    else:
        ax.text(0, 7, str(predicted[i]), color='red')

## Quantitative Measurement of Performance

We'd like to measure the performance of our estimator without having to resort
to plotting examples.  A simple approach is to count the number of
matches:

In [None]:
matches = (predicted == expected)
print(matches.sum())
print(len(matches))

In [None]:
matches.sum() / float(len(matches))
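
The ratio of matches is the accuracy. As a small check on toy labels (made up for illustration), the sketch below shows that the manual computation agrees with `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, invented for this example.
expected_toy = np.array([0, 1, 2, 2, 1, 0])
predicted_toy = np.array([0, 1, 2, 1, 1, 2])

# Fraction of positions where prediction and truth agree.
manual = (predicted_toy == expected_toy).mean()
library = accuracy_score(expected_toy, predicted_toy)
print(manual, library)
```

Four of the six toy predictions match, so both values are 4/6.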

### More Metrics

The ratio above is the accuracy: the fraction of test-set predictions that match the
true labels (the exact counts depend on how the data was split).  But there are other
more sophisticated metrics that can be used to judge the performance of a classifier:
several are available in the ``sklearn.metrics`` submodule.

One of the most useful metrics is the ``classification_report``, which combines several
measures and prints a table with the results:

Use `pandas.DataFrame` to display the report as a simple table.

In [None]:
from sklearn import metrics
from pandas import DataFrame
DataFrame(metrics.classification_report(expected, predicted, output_dict=True)).T
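
The report lists precision, recall, and F1 score for each class. To show what these numbers mean, here is a minimal sketch that computes them by hand for one class on toy labels. The helper `per_class_scores` and the toy arrays are hypothetical, written for this example; the helper assumes the class appears at least once in both arrays, otherwise it divides by zero.

```python
import numpy as np

# Toy labels, invented for this example.
expected_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2])
predicted_toy = np.array([0, 1, 1, 1, 2, 2, 2, 0])

def per_class_scores(expected, predicted, label):
    """Precision, recall and F1 for a single class (hypothetical helper)."""
    tp = np.sum((predicted == label) & (expected == label))  # true positives
    fp = np.sum((predicted == label) & (expected != label))  # false positives
    fn = np.sum((predicted != label) & (expected == label))  # false negatives
    precision = tp / (tp + fp)  # of the samples predicted as `label`, how many are right
    recall = tp / (tp + fn)     # of the true `label` samples, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

precision_1, recall_1, f1_1 = per_class_scores(expected_toy, predicted_toy, 1)
print(precision_1, recall_1, f1_1)
```

For class 1 in this toy example there are two true positives, one false positive, and one false negative, so all three scores equal 2/3.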

## Confusion Matrix

Another enlightening metric for this sort of multi-class classification
is a *confusion matrix*: it helps us visualize which labels are
being interchanged in the classification errors:

In [None]:
DataFrame(metrics.confusion_matrix(expected, predicted))
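
When reading a confusion matrix, recall that scikit-learn puts the true labels on the rows and the predicted labels on the columns. The sketch below, on toy labels made up for illustration, zeroes the diagonal so that only the errors remain and then lists each confused pair:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, invented for this example.
expected_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2])
predicted_toy = np.array([0, 1, 1, 1, 2, 2, 2, 0])

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(expected_toy, predicted_toy)

# Zero the diagonal (correct predictions) so that only errors remain.
errors = cm.copy()
np.fill_diagonal(errors, 0)

# List every (true, predicted) pair that occurs among the errors.
for t, p in zip(*np.nonzero(errors)):
    print(f"true {t} predicted as {p}: {errors[t, p]} time(s)")
```

The same approach can be applied to `expected` and `predicted` from the digits model to see which digits are mistaken for which.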

What stands out about our model's errors? Is this expected?

# Testing on the Training Set

Let's check whether the model performs better on the data it was trained on.
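
A minimal sketch of this comparison, assuming the same kind of split and logistic regression model as earlier (`random_state=0` and `max_iter=5000` are our choices, not part of the exercise):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# `score` returns the accuracy on the data it is given.
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"training accuracy: {train_accuracy:.3f}")
print(f"test accuracy:     {test_accuracy:.3f}")
```

A score measured on the training data tends to be optimistic, because the model has already seen those samples. This is why the held-out set is the one to trust when estimating how the model will behave on new data.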

# Comparison with Another Model: Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
# TODO
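
One possible way to carry out the comparison is sketched below: both models are trained on the same split and scored on the same held-out data. The dictionary names `models` and `scores`, as well as `random_state=0` and `max_iter=5000`, are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# Train each model on the same training set and evaluate on the same test set,
# so that the two accuracies are directly comparable.
models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "GaussianNB": GaussianNB(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy {scores[name]:.3f}")
```

Accuracy alone hides where each model goes wrong; applying `classification_report` and `confusion_matrix` to each model's predictions gives a more detailed comparison.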