# Clustering

One of the most common use cases in unsupervised machine learning is the identification of clusters: discrete groups of samples that are more closely related to other samples in the same cluster than to samples outside it. Once we abstract our data into a general *d*-dimensional space of *N* samples, we can start to apply our intuition about how to determine cluster membership.

### Learning motivation / points to consider

- You may be wondering which clustering algorithm is best for finding "natural subgroups" in your data.
- The answer depends on the nature of the data.
- Is anything known about the underlying structure? 
- Are you looking for a specific number of clusters? 
- Unfortunately, this means you need various algorithms in your toolbox, ready to deploy as circumstances dictate (k-means is not the solution to everything...).

In [None]:
import os
import sys
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

### K-Means clustering
This is among the most common clustering algorithms. 

<img src="assets/KMeans_animation.gif" style="float:left"/>

Assuming we know *a priori* the number of clusters (k), the algorithm starts by placing k coordinates (centroids, c) in the feature space. First, every sample is assigned to its closest centroid. Once the samples are assigned, each centroid is moved to the mean of all the samples belonging to it. These two steps are repeated until convergence, that is, until the centroids stop moving.
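
The two alternating steps can be sketched in plain NumPy. This is only an illustrative version under stated assumptions: the function name `kmeans_sketch`, the random choice of initial centroids among the samples, and the toy data are all hypothetical, and it is not how scikit-learn implements `KMeans`.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative k-means loop; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct samples picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every sample.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    # Final assignment, consistent with the returned centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Hypothetical toy data: two well-separated groups of 50 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids_demo, labels_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

The cluster indices themselves are arbitrary: which group ends up as cluster 0 depends on the initial centroids.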

In [None]:
from sklearn.datasets import make_blobs # generate dataset
from sklearn.cluster import KMeans   # clustering algorithm

In [None]:
# create 200 points in 4 clusters
X, y = make_blobs(n_samples=200, centers=4, random_state=42, cluster_std=1.5)

In [None]:
def scatter(X, y=None, ax=plt):
    ax.scatter(X[:,0], X[:,1], c=y)
scatter(X,y)

### The scikit-learn workflow
- initialize model
- fit ("train") model
- predict using model

In [None]:
# initialize
model = KMeans(4, random_state=0)

In [None]:
type(X)

In [None]:
# train
model.fit(X)

In [None]:
# predict
y = model.predict(X)
y

The model here is itself a Python *object* and thus has attributes, such as the centroid locations:

In [None]:
centroids = model.cluster_centers_
centroids

In [None]:
# Make a function for common plot formatting
def format_plot(ax, title):
    ax.xaxis.set_major_formatter(plt.NullFormatter())
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel('feature 1', color='black')
    ax.set_ylabel('feature 2', color='black')
    ax.set_title(title, color='black')

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], s=50, color='gray')

# format the plot
format_plot(ax, 'Simulated Input Data')

plt.show()
# fig.savefig('assets/k-means-clustering-1.png')

#### Exercise 1. Plot the data with color-coded cluster labels and star-shaped cluster centroids

In [None]:
# example of a solution
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], s=50, c=y, cmap='viridis')
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=400, c=range(4), cmap='viridis', edgecolors='red')

# format the plot
format_plot(ax, 'Unsupervised learning of cluster labels with the star-shaped cluster centroids')

plt.show()

### Another example: specifying the centers and standard deviations to generate a more complicated clustering task

In [None]:
blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3],
     ])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

In [None]:
X2, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=42)

def plot_clusters(X, y=None):
    if y is None:  # `y == None` is elementwise (and ambiguous) for arrays
        plt.scatter(X[:, 0], X[:, 1], c='k', s=1)
    else:
        plt.scatter(X[:, 0], X[:, 1], c=y, s=1)        
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=90)
    
plt.figure(figsize=(9, 6))
plot_clusters(X2)
plt.title('Feature space')
plt.show()

Notice that we have three compact clusters on the left; however, they are separated only along the $x_2$ feature. There are also two clusters with higher variability.

In [None]:
K = 5
model2 = KMeans(n_clusters=K, n_init=1, random_state=42)
model2.fit(X2)
y2 = model2.predict(X2)


### Visualizing the decision boundaries -  _Voronoi_ diagrams
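
The regions drawn in this section are Voronoi cells because a fitted k-means model assigns any point to its nearest centroid. The sketch below checks this by hand on hypothetical toy data (`X_toy`, `km`, and `queries` are illustrative names, not part of the notebook), comparing a manual nearest-centroid lookup against `predict`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data, generated only for this check.
X_toy, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Arbitrary query points spread over the same region of feature space.
rng = np.random.default_rng(0)
queries = rng.uniform(X_toy.min(axis=0), X_toy.max(axis=0), size=(500, 2))

# Nearest centroid by Euclidean distance, computed manually.
dists = np.linalg.norm(
    queries[:, None, :] - km.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)

# Fraction of query points where the manual lookup agrees with predict().
agreement = np.mean(nearest == km.predict(queries))
print(f"agreement with predict(): {agreement:.3f}")
```

Each boundary line in the plot is therefore the set of points equidistant from two neighbouring centroids.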

In [None]:
# don't worry about this code, we provide it to make a more visual representation of cluster assignments

def plot_data(X):
    plt.plot(X[:, 0], X[:, 1], 'k.', markersize=2)

def plot_centroids(centroids, weights=None, circle_color='w', cross_color='k'):
    if weights is not None:
        centroids = centroids[weights > weights.max() / 10]
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='o', s=30, linewidths=8,
                color=circle_color, zorder=10, alpha=0.9)
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=50, linewidths=50,
                color=cross_color, zorder=11, alpha=1)

def plot_decision_boundaries(clusterer, X, resolution=1000, show_centroids=True,
                             show_xlabels=True, show_ylabels=True):
    mins = X.min(axis=0) - 0.1
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                cmap="Pastel2")
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                linewidths=1, colors='k')
    plot_data(X)
    if show_centroids:
        plot_centroids(clusterer.cluster_centers_)

    if show_xlabels:
        plt.xlabel("$x_1$", fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=90)
    else:
        plt.tick_params(labelleft=False)

In [None]:
plt.figure(figsize=(9, 5))
plot_decision_boundaries(model2, X2)

plt.show()

Did you obtain a good result?
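
One way to approach this question numerically, offered here only as a sketch, is to refit the model with several different seeds and compare the inertia (the sum of squared distances from each sample to its closest centroid, exposed by scikit-learn as `inertia_`). The seeds and the names `X_demo` and `inertias` are arbitrary choices for illustration; the blob parameters are copied from the cells above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same blob parameters as in the cells above.
blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8],
                         [-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X_demo, _ = make_blobs(n_samples=2000, centers=blob_centers,
                       cluster_std=blob_std, random_state=42)

# With n_init=1 each fit uses a single initialization, so different
# seeds may converge to different solutions.
inertias = {}
for seed in range(5):
    km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X_demo)
    inertias[seed] = km.inertia_

for seed, value in inertias.items():
    print(f"seed={seed}: inertia={value:.1f}")
```

If the printed values differ noticeably between seeds, some runs have settled in a poorer local optimum, and the Voronoi plot for those runs will split or merge clusters in ways that do not match the generated blobs.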