![examples from the digits dataset](images/digits.png)

The digits dataset contains 1797 8x8 images of handwritten digits ranging from 0 to 9.

It is a classic dataset for testing supervised and unsupervised algorithms.

It can be found in scikit-learn's `datasets` module, using the following code:

```python
from sklearn import datasets
digits = datasets.load_digits()
```
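
As a quick check of the numbers quoted above, the returned object can be inspected as follows. This is only a small sketch; `data`, `images` and `target` are the attributes exposed by scikit-learn's `load_digits`.

```python
from sklearn import datasets

digits = datasets.load_digits()

# Flattened pixels: one row per image, 64 columns (8 x 8).
print(digits.data.shape)    # (1797, 64)
# The same pixels kept as 8 x 8 images.
print(digits.images.shape)  # (1797, 8, 8)
# True digit (0 to 9) of the first few images.
print(digits.target[:10])
```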

For convenience, we have already packaged it and transformed the raw pixel values using PCA and [UMAP](https://umap-learn.readthedocs.io/en/latest/).

PCA is not always the best choice for visualization, but its components are appropriate input for clustering or classification.

UMAP, on the other hand, is great for visualization, but it is not appropriate as input for clustering.

In [None]:
import pandas as pd

df_pca = pd.read_csv( "data/digits.PCA20.csv" ,  index_col=0 )
df_umap = pd.read_csv( "data/digits.UMAP.csv" ,  index_col=0 )

In [None]:
import plotly.express as px
px.scatter( x = df_pca.PC0 , y = df_pca.PC1  , color=df_pca.labels.astype(str) )

In [None]:
px.scatter( x = df_umap.UMAP0 , y = df_umap.UMAP1  , color=df_umap.labels.astype(str) )

In [None]:
## the actual label of the original numbers can be found in the labels column:
labels = df_pca.labels


In [None]:
## this is how you can perform a K-means clustering on the first 20 PCA axes

from kmeans import Kmeans

data = df_pca.loc[ : , [f"PC{i}" for i in range(20)] ].to_numpy()
cluster_assigment = Kmeans( data , k = 10 ) # here K is 10 -> we produce 10 clusters

cluster_assigment ## array containing the cluster assignment of each point

> NB: the clusters produced by the K-means algorithm are labelled arbitrarily from 0 to 9, with no expectation that they correspond to the actual label of the digit (e.g., the 3s could end up in cluster 0).
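
The `kmeans` module imported above is a local file that is not shown here. As a rough idea of what such a function might do, here is a minimal sketch of Lloyd's algorithm with NumPy. The name `kmeans_sketch`, its `n_iter` and `seed` parameters, and the random initialisation are assumptions made for illustration; they are not taken from the actual module.

```python
import numpy as np

def kmeans_sketch(data, k, n_iter=100, seed=0):
    """Lloyd's algorithm: return the cluster index (0..k-1) of each row of data."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct points drawn from the data.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    assignment = np.full(len(data), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignment = dist.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break  # assignments are stable: converged
        assignment = new_assignment
        # Move each centroid to the mean of its points (empty clusters stay put).
        for j in range(k):
            members = data[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignment

# Toy data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.1, size=(50, 2)), rng.normal(5, 0.1, size=(50, 2))])
toy_assignment = kmeans_sketch(toy, k=2)
```

On the toy data each blob ends up in its own cluster, but which blob gets index 0 depends on the initialisation, which is exactly the arbitrary labelling mentioned in the note.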

In [None]:
# we can check how the true labels correspond to the created clusters with a confusion matrix:
pd.crosstab( labels , cluster_assigment )
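
One possible way to read such a table is to give each cluster the true label that is most frequent inside it. The sketch below does this on toy data; `true_labels`, `clusters`, `majority_label` and `purity` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Toy stand-ins for `labels` and `cluster_assigment`: 6 points, 2 digits, 2 clusters.
true_labels = pd.Series([3, 3, 3, 7, 7, 7], name="labels")
clusters = pd.Series([1, 1, 0, 0, 0, 0], name="cluster")

# Rows are true labels, columns are cluster ids.
table = pd.crosstab(true_labels, clusters)

# For each cluster (column), the true label with the highest count.
majority_label = table.idxmax(axis=0)

# Fraction of points whose cluster's majority label matches their true label.
purity = (clusters.map(majority_label) == true_labels).mean()
print(majority_label.to_dict(), purity)
```

Here cluster 0 is mostly made of 7s and cluster 1 of 3s, so the cluster ids carry no meaning by themselves.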

## Goal: 

 * make an interactive visualization which displays the UMAP data, the original labels, and the clusters proposed by K-means on the PCA data

extra stuff:
 * add a slider or dropdown to change the value of K (you will have to pre-compute the K-means for several values of K)
 * add a button to switch between UMAP view and PCA view

## Proposed corrections

Preamble: utility function to get colors

In [3]:
# %load -r -23 solutions/solution_kmeans_plotly

Basic version:

In [None]:
# %load -r 24-50 solutions/solution_kmeans_plotly

Version with a slider:

In [None]:
# %load -r 52-109 solutions/solution_kmeans_plotly

Version with a slider (controlling K) and a dropdown menu (switching between UMAP and PCA):

In [None]:
# %load -r 110- solutions/solution_kmeans_plotly