# Cluster Validation
In this lesson, we analyze the `Beer` dataset and the `UserTopTracks` dataset to begin systematically evaluating the performance of a clustering algorithm. We focus on applying these metrics to k-means, but all the methods presented can be applied to other clustering algorithms.

In [None]:
# beer dataset
import pandas as pd
url = '../data/beer.txt'
beer = pd.read_csv(url, sep=' ')
beer.head()

In [None]:
# define X
X = beer.drop('name', axis=1)

In [None]:
# K-means with 3 clusters
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3, random_state=1)
km.fit(X)

In [None]:
# Review the cluster labels
km.labels_

In [None]:
# save the cluster labels and sort by cluster
beer['cluster'] = km.labels_
beer.sort_values('cluster')

### Visualizing the Cluster Centers
The cluster centers are available from scikit-learn's `KMeans` implementation through the `cluster_centers_` attribute. Let's see if we can begin to understand what the clusters seem to be based on, and why.

In [None]:
# review the cluster centers
km.cluster_centers_

In [None]:
# calculate the mean of each numeric feature for each cluster
# (drop the text column 'name' first, since it cannot be averaged)
centers = beer.drop('name', axis=1).groupby('cluster').mean()
centers.head()
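As a quick check on what the cluster centers represent, here is a minimal sketch on a small synthetic dataset. The `toy` DataFrame and its values are made up for illustration and are not the beer data. It suggests that, once k-means has converged, each row of `cluster_centers_` is the mean of the points assigned to that cluster, which is why the `groupby` above reproduces the centers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# hypothetical toy data: two well-separated groups in two features
toy = pd.DataFrame({
    'calories': [100, 102, 98, 150, 152, 148],
    'alcohol': [4.0, 4.2, 3.8, 5.0, 5.2, 4.8],
})

toy_km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(toy)

# per-cluster feature means, computed directly from the assigned labels
toy_means = toy.groupby(toy_km.labels_).mean()

# compare the fitted centers with those means (rows ordered by cluster id)
centers_match = np.allclose(toy_km.cluster_centers_, toy_means.to_numpy())
print(centers_match)
```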

In [None]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14

In [None]:
# create a "colors" array for plotting
import numpy as np
colors = np.array(['red', 'green', 'blue', 'yellow'])

In [None]:
# scatter plot of calories versus alcohol, colored by cluster (0=red, 1=green, 2=blue)
plt.scatter(beer.calories, beer.alcohol, c=colors[list(beer.cluster)], s=50)

# cluster centers, marked by "+"
plt.scatter(centers.calories, centers.alcohol, linewidths=3, marker='+', s=300, c='black')

# add labels
plt.xlabel('calories')
plt.ylabel('alcohol')

In [None]:
# scatter plot matrix of new cluster assignments (0=red, 1=green, 2=blue)
pd.plotting.scatter_matrix(X, c=colors[list(beer.cluster)], figsize=(10,10), s=100)

## Challenge
_5 minutes_

What do the clusters seem to be based on and why?

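The silhouette coefficient, `SC`, defined under Internal Validation below, takes values between -1 and 1.

In general, we want separation (the distance between clusters) to be high and the cohesion sum (the distance within a cluster) to be low. This corresponds to a value of `SC` close to +1.

A negative silhouette coefficient means that a point is, on average, closer to the points of a neighboring cluster than to the points of its own cluster, which suggests overlapping clusters or a misassigned point.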
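To make the sign of the coefficient concrete, here is a small sketch using six hand-picked points (hypothetical values, chosen only for illustration). With sensible labels every point should score close to +1; when we deliberately assign the third point to the far cluster, its own score should turn negative.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# two tight groups, far apart (made-up coordinates)
pts = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10]], dtype=float)

good_labels = np.array([0, 0, 0, 1, 1, 1])
bad_labels = np.array([0, 0, 1, 1, 1, 1])  # third point deliberately misassigned

# one silhouette value per point
good_sc = silhouette_samples(pts, good_labels)
bad_sc = silhouette_samples(pts, bad_labels)

print(good_sc.round(2))
print(bad_sc.round(2))
```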

## Internal Validation
In general, k-means will converge to a solution and return a partition of k clusters, even if no natural clusters exist in the data.
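A minimal sketch of this point, assuming uniformly random data as a stand-in for "no natural clusters" (the array names and the blob centers are made up for illustration): k-means still returns three clusters on the noise, and only a validation metric such as the silhouette coefficient hints that the partition is weaker than one found on clearly separated blobs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# structureless data: points spread uniformly over the unit square
rng = np.random.default_rng(0)
noise = rng.uniform(size=(300, 2))

# clearly clustered data, for comparison (centers chosen by hand)
blobs, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=0.5, random_state=0)

noise_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(noise)
blob_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(blobs)

# k-means reports 3 clusters either way
n_found = len(np.unique(noise_labels))

noise_sc = silhouette_score(noise, noise_labels)
blob_sc = silhouette_score(blobs, blob_labels)
print(n_found, round(noise_sc, 2), round(blob_sc, 2))
```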

We will look at two validation metrics useful for partitional clustering, __cohesion__ and __separation__. 

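__Cohesion__ measures clustering effectiveness within a cluster, as the sum of the distances $d$ from each point $x$ in cluster $C_i$ to the cluster centroid $c_i$:
$$ \hat{C}(C_i) = \sum_{x \in C_i} d(x, c_i)$$

__Separation__ measures clustering effectiveness between clusters, as the distance between their centroids: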

$$\hat{S}(C_i, C_j) = d(c_i, c_j)$$
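The two formulas above can be sketched directly with NumPy. This is an illustration on synthetic blobs, assuming Euclidean distance for $d$; the variable names are our own. Note that scikit-learn's `inertia_` is a related quantity that sums *squared* distances, so it is not identical to the cohesion defined here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with three hand-picked centers
pts, _ = make_blobs(n_samples=150, centers=[[0, 0], [8, 8], [-8, 8]],
                    cluster_std=1.0, random_state=0)
fit = KMeans(n_clusters=3, n_init=10, random_state=1).fit(pts)

# cohesion: for each cluster, sum of distances from its points to its centroid
cohesion = np.array([
    np.linalg.norm(pts[fit.labels_ == i] - fit.cluster_centers_[i], axis=1).sum()
    for i in range(3)
])

# separation: distance between every pair of centroids (3 x 3 matrix)
diffs = fit.cluster_centers_[:, None, :] - fit.cluster_centers_[None, :, :]
separation = np.linalg.norm(diffs, axis=2)

# the squared-distance analogue of cohesion, summed over clusters
squared_total = sum(
    ((pts[fit.labels_ == i] - fit.cluster_centers_[i]) ** 2).sum()
    for i in range(3)
)

print(cohesion.round(1))
print(separation.round(1))
```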

One useful measure that combines the ideas of cohesion and separation is the __silhouette coefficient__. For point $x_i$, this is given by:
$$SC_i = \frac{b_i-a_i}{\max(a_i, b_i)}$$

such that:
* $a_i$ = average distance from $x_i$ to the other points in its own cluster
* $b_{ij}$ = average distance from $x_i$ to the points in another cluster $C_j$
* $b_i = \min_j(b_{ij})$

The silhouette coefficient for the cluster $C_j$ is given by the average silhouette coefficient across all $m_j$ points in $C_j$:
$$SC(C_j) = \frac{1}{m_j}\sum_{x_i \in C_j} SC_i$$

The overall silhouette coefficient is given by the average silhouette coefficient across all $k$ clusters:
$$SC_{total} = \frac{1}{k}\sum_{j=1}^{k} SC(C_j)$$

*Note:* This gives a summary measure of the overall clustering quality.

*Application:* Determining the best number of clusters for your dataset.
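To see how the definitions fit together, here is a sketch that computes the per-point values by hand and compares them with scikit-learn's `silhouette_samples`. The helper `silhouette_by_hand` is our own illustrative function, not part of any library, and it assumes Euclidean distance and a score of 0 for single-point clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

data, _ = make_blobs(n_samples=60, centers=[[0, 0], [6, 6], [-6, 6]],
                     cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(data)
dist = pairwise_distances(data)  # Euclidean distance between every pair of points


def silhouette_by_hand(dist, labels):
    """Per-point silhouette values from a full distance matrix."""
    cluster_ids = np.unique(labels)
    sc = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        n_same = same.sum()
        if n_same == 1:
            continue  # single-point cluster: leave the score at 0
        # a_i: mean distance to the other points in the same cluster
        # (the point's zero distance to itself is in the sum, hence n_same - 1)
        a = dist[i, same].sum() / (n_same - 1)
        # b_i: smallest mean distance to the points of any other cluster
        b = min(dist[i, labels == c].mean() for c in cluster_ids if c != own)
        sc[i] = (b - a) / max(a, b)
    return sc


manual = silhouette_by_hand(dist, labels)
agree = np.allclose(manual, silhouette_samples(data, labels))

# average within each cluster, then across clusters
per_cluster = np.array([manual[labels == c].mean() for c in np.unique(labels)])
sc_total = per_cluster.mean()

# scikit-learn's summary score, for comparison
sc_sklearn = silhouette_score(data, labels)
print(agree, round(sc_total, 3), round(sc_sklearn, 3))
```

One design detail worth knowing: `silhouette_score` reports the mean over all samples, which can differ slightly from the per-cluster average above when cluster sizes differ.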

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

##### Silhouette Coefficient
The cell below computes the silhouette coefficient for the 3-cluster k-means solution fit earlier. Values closer to +1 indicate better-separated clusters.

In [None]:
from sklearn import metrics
metrics.silhouette_score(X, km.labels_)

In [None]:
# calculate SC for K=2 through K=19
k_range = range(2, 20)
scores = []
for k in k_range:
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(X)
    scores.append(metrics.silhouette_score(X, km.labels_))

In [None]:
# plot the results
plt.plot(k_range, scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.grid(True)
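Rather than reading the best value off the plot, we can also pick it programmatically. The sketch below repeats the loop on synthetic blobs with four hand-picked centers (an assumption made so the example is self-contained; it does not use the beer data) and takes the `k` with the highest silhouette coefficient, which on this data should recover the four blobs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# four tight, well-separated blobs (centers chosen by hand)
blobs, _ = make_blobs(n_samples=200,
                      centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                      cluster_std=0.5, random_state=0)

candidate_ks = list(range(2, 8))
sc_by_k = []
for k in candidate_ks:
    k_labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(blobs)
    sc_by_k.append(silhouette_score(blobs, k_labels))

# the k whose partition has the highest silhouette coefficient
best_k = candidate_ks[int(np.argmax(sc_by_k))]
print(best_k)
```

On real data the curve is rarely this clean, so it is worth checking the chosen `k` against the scatter plots as well.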

## Exercise
_20-30 minutes_

Cluster the `UserTopTracks` dataset with k-means and perform silhouette analysis to pick the optimal number of partitions, $k$.

Visualize the results of the clustering, and describe the patterns in music preference.

In [None]:
url = '../data/UserTopTracks.csv'
tracks = pd.read_csv(url, encoding='latin1')
tracks.head()