# Clustering: Digits dataset

Scikit-learn includes a copy of the test set of the UCI Optical Recognition of Handwritten Digits Data Set.
The data set may be used for classification, as the true class information is available.

We are, however, not going to use the target information here, except for informally comparing the unsupervised learning (clustering) results to the ground truth.

## Load and inspect the data

In [None]:
# the usual imports
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape

In [None]:
# print() renders the line breaks in the description instead of showing raw "\n"
print(digits.DESCR)

In [None]:
digits.images[0]

In [None]:
plt.gray()
for i in range(5):
    plt.matshow(digits.images[i])

In [None]:
X = digits.data

# y contains the true classes, but we are not going to use them for learning
y = digits.target

## Zeros and Ones

First, let's look at 0s and 1s only, which should look rather different overall ;-)

In [None]:
# subset of data set containing 0s and 1s only
X01 = X[np.logical_or(y == 0, y == 1)]
X01

In [None]:
# subset of true class data containing 0s and 1s only
y01 = y[np.logical_or(y == 0, y == 1)]
y01

Before clustering, we will perform dimensionality reduction using PCA.

Looking at how much variance each principal component explains, how many components would you choose to proceed with?

In [None]:
from sklearn import decomposition
### fill in missing code
pca = 


pca.explained_variance_
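
One possible way to complete the cell above, offered as a sketch rather than the intended solution: fit PCA with all components on the 0/1 subset and look at the explained variance as a fraction of the total. The names `mask01`, `ratios` and `cumulative` are introduced here for illustration only.

```python
import numpy as np
from sklearn import decomposition
from sklearn.datasets import load_digits

digits = load_digits()
mask01 = np.isin(digits.target, [0, 1])   # rows showing a 0 or a 1
X01 = digits.data[mask01]

# Keep all components (the default) so the full variance profile is visible
pca = decomposition.PCA().fit(X01)

# Share of the total variance per component, and the running total
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(ratios[:5])
print(cumulative[:5])
```

The running total makes it easier to pick a cut-off, for example the smallest number of components whose cumulative share exceeds a threshold you are comfortable with.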

We want to plot the clusters in 2 dimensions, so (fully aware we are not going to make the "best" choice) we perform PCA with 2 components and proceed with the transformed data:

In [None]:
X01_reduced = decomposition.PCA(n_components=2).fit_transform(X01)
X01_reduced.shape

Now perform k-means clustering on the transformed data.

We know we have 2 different digits, so we tell the algorithm we want 2 clusters:

In [None]:
from sklearn import cluster
### fill in missing code
kmeans01 = 

kmeans01.cluster_centers_
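
One possible way to complete the cell above, again only a sketch: fit `cluster.KMeans` with two clusters on the 2-component data. The values `n_init=10` and `random_state=0` are assumptions chosen to make the run repeatable; they are not part of the exercise.

```python
import numpy as np
from sklearn import cluster, decomposition
from sklearn.datasets import load_digits

digits = load_digits()
mask01 = np.isin(digits.target, [0, 1])
X01_reduced = decomposition.PCA(n_components=2).fit_transform(digits.data[mask01])

# Two clusters, because the subset contains two different digits.
# n_init and random_state are assumptions made for reproducibility.
kmeans01 = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(X01_reduced)

# One row per cluster, one column per principal component
print(kmeans01.cluster_centers_)
```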

Display cluster membership:

In [None]:
print('cluster membership: {}\n'.format(kmeans01.labels_))

As we have the true classes, we can compare:

In [None]:
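# k-means numbers its clusters arbitrarily, so cluster 0 need not correspond to digit 0.
# If this produces (almost) all False, the labels are swapped: re-run from PCA above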
y01 == kmeans01.labels_
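
Because the cluster numbering is arbitrary, an element-wise comparison can look completely wrong even when the clustering is perfect. The sketch below shows two ways around this on a small made-up example: a hypothetical helper `two_cluster_agreement` (only valid for two classes coded as 0 and 1) and scikit-learn's `metrics.adjusted_rand_score`, which ignores label numbering by design.

```python
import numpy as np
from sklearn import metrics

def two_cluster_agreement(y_true, labels):
    """Fraction of matching labels, allowing for swapped 0/1 cluster numbering."""
    y_true = np.asarray(y_true)
    labels = np.asarray(labels)
    direct = np.mean(y_true == labels)
    # With two clusters, swapping the numbering turns every match into a mismatch
    return max(direct, 1.0 - direct)

# Toy data (made up): numbering swapped, and the last point misassigned
y_true = np.array([0, 0, 0, 1, 1, 1])
labels = np.array([1, 1, 1, 0, 0, 1])

agreement = two_cluster_agreement(y_true, labels)
ari = metrics.adjusted_rand_score(y_true, labels)
print(agreement, ari)
```

In the notebook, the same calls could be applied to `y01` and `kmeans01.labels_`.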

Now plot the clusters in 2d.

In terms of the 2 principal components, which digit is more homogeneous, 0 or 1?

In [None]:
cluster_1 = X01_reduced[kmeans01.labels_ == 0]
cluster_2 = X01_reduced[kmeans01.labels_ == 1]

In [None]:
plt.figure()
plt.title('k means clustering, k = 2')
plt.plot(cluster_1[:,0], cluster_1[:,1], 'bo')
plt.plot(cluster_2[:,0], cluster_2[:,1], 'gv')
plt.show()

## Sevens and Ones

Now, do the same with digits 1 and 7.

How well does the clustering separate the digits?

In [None]:
# subset of data set containing 7s and 1s only
### fill in missing code
X17 = 
X17

In [None]:
# subset of true class data containing 7s and 1s only
### fill in missing code
y17 =
y17

In [None]:
# reduce to 2 components
### fill in missing code
X17_reduced = 

In [None]:
# cluster the reduced data
### fill in missing code
kmeans17 = 
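
One possible way to complete the four cells above, as a sketch that mirrors the 0-versus-1 steps: select the rows for digits 1 and 7, reduce to 2 components, then cluster. The name `mask17` and the `n_init` and `random_state` values are assumptions made for illustration.

```python
import numpy as np
from sklearn import cluster, decomposition
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Boolean mask selecting the rows that show a 1 or a 7
mask17 = np.logical_or(y == 1, y == 7)
X17 = X[mask17]
y17 = y[mask17]

# Reduce to 2 components, then cluster the reduced data
X17_reduced = decomposition.PCA(n_components=2).fit_transform(X17)
kmeans17 = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(X17_reduced)

# Cluster sizes give a first hint of how balanced the split is
print(X17.shape, X17_reduced.shape, np.bincount(kmeans17.labels_))
```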

In [None]:
# inspect cluster memberships
print('cluster membership: {}\n'.format(kmeans17.labels_))

In [None]:
y17_classes = np.where(y17 == 7, 1, 0)
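# Cluster numbering is arbitrary, so cluster 1 need not correspond to digit 7.
# If this produces (almost) all False, the labels are swapped: re-run from PCA above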
y17_classes == kmeans17.labels_

In [None]:
cluster_1 = X17_reduced[kmeans17.labels_ == 0]
cluster_2 = X17_reduced[kmeans17.labels_ == 1]
plt.figure()
plt.title('k means clustering, k = 2')
plt.plot(cluster_1[:,0], cluster_1[:,1], 'bo')
plt.plot(cluster_2[:,0], cluster_2[:,1], 'gv')
plt.show()

Finally, look at some evaluation metrics. Do they make sense?

In [None]:
# within cluster sum of squares
kmeans01.inertia_, kmeans17.inertia_
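
To make the meaning of `inertia_` concrete, the sketch below recomputes the within-cluster sum of squares by hand on synthetic data (two made-up Gaussian blobs, not the digits) and compares it with the value reported by `KMeans`. All names and parameter values here are assumptions for illustration.

```python
import numpy as np
from sklearn import cluster

# Synthetic data: two well-separated blobs of 50 points each
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

km = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up each point's own cluster center, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
wcss = np.sum((points - assigned_centers) ** 2)
print(wcss, km.inertia_)
```

Since the sum grows with the number of points and with the scale of the data, comparing inertia across two different subsets says little on its own.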

In [None]:
# Silhouette score (per sample, then averaged over all samples)
# score = (b - a) / max(a, b)
#    a: The mean distance between a sample and all other points in the same cluster.
#    b: The mean distance between a sample and all points in the nearest other cluster.

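from sklearn import metrics
# Note: the scores are computed on the original 64-dimensional data,
# although the clusters were found on the 2 principal components
print('Silhouette score, 0 vs 1: {}'.format(metrics.silhouette_score(X01, kmeans01.labels_)))
print('Silhouette score, 1 vs 7: {}'.format(metrics.silhouette_score(X17, kmeans17.labels_)))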
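
The silhouette formula in the comment above can be checked by hand for a single sample. The sketch below uses four made-up points in two clusters and compares the manual value with `metrics.silhouette_samples`; with only two clusters, the "nearest other cluster" is simply the other one.

```python
import numpy as np
from sklearn import metrics

# Toy data (made up): two tight pairs, far apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

dists = metrics.pairwise_distances(points)  # Euclidean by default
i = 0                                       # sample to inspect

same = labels == labels[i]
same[i] = False                             # exclude the sample itself
a = dists[i, same].mean()                   # mean distance within its own cluster
b = dists[i, labels != labels[i]].mean()    # mean distance to the other cluster

s_manual = (b - a) / max(a, b)
s_sklearn = metrics.silhouette_samples(points, labels)[i]
print(s_manual, s_sklearn)
```

`metrics.silhouette_score` is the mean of these per-sample values.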