# INFO204 Lab 6 - Clustering

In this lab we look at how $k$-means and hierarchical clustering work, using them for exploratory data analysis to gain preliminary insight into a data set. We will also use Elbow Analysis to choose a suitable value for $k$ when clustering data.

For code examples, refer to the relevant scikit-learn documentation and to the material from previous lectures and labs.

## Part 1. Preparation

In [None]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs
from sklearn.preprocessing import scale, StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy

%matplotlib inline
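# The 'seaborn-white' style was renamed 'seaborn-v0_8-white' in Matplotlib 3.6;
# use whichever name this installation provides
style_name = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style_name)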

In [None]:
# Generate data
np.random.seed(2)
X = np.random.standard_normal((50,2))
X[:25,0] = X[:25,0]+3
X[:25,1] = X[:25,1]-4

## Part 2. $k$-means clustering of two data sets
**Task 1:** Use the "KMeans" class to cluster data set X with these options: n_clusters=2, n_init=20, and init='random'.
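
As a reference for the API, here is a minimal sketch on a separate toy array rather than on X. The names `demo_points` and `km_demo` are made up for illustration, and `random_state=0` is an assumption added only so the run is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two tight groups of three points each
demo_points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                        [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])

# init='random' picks the starting centres from the data at random;
# n_init=20 repeats the fit 20 times and keeps the lowest-inertia run
km_demo = KMeans(n_clusters=2, n_init=20, init='random', random_state=0)
km_demo.fit(demo_points)

print(km_demo.labels_)           # cluster index assigned to each row
print(km_demo.cluster_centers_)  # one centre per cluster, shape (2, 2)
print(km_demo.inertia_)          # within-cluster sum of squared distances
```

The fitted attributes `labels_` and `cluster_centers_` are the ones needed for the plots later in this part.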

In [None]:
# to complete ...
np.random.seed(4)
# km1 = KMeans(...
# km1.fit(...

**Task 2:** Use "KMeans" again to cluster data set X, this time with these options: n_clusters=3, n_init=20, and init='random'.

In [None]:
# to complete ...
np.random.seed(4)
# km2 = KMeans(...
# km2.fit(...

**Task 3:** Now, produce scatterplots of the data and cluster centres of km1 and km2 by completing the code below:

1) Specify s=40, c=km1.labels_ (c=km2.labels_ for the second plot), and cmap=plt.cm.prism for the scatterplot of the data

2) Specify marker='+', c='k', and linewidth=2 for the scatterplot of the cluster centres
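
The two plotting steps above can be sketched on toy data as follows. This is an illustration only: `demo_points`, `km_demo`, the marker size `s=100`, and the fixed seeds are assumptions, not part of the lab's data set.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical toy data, not the lab's X
rng = np.random.default_rng(0)
demo_points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                         rng.normal(4, 0.5, (20, 2))])
km_demo = KMeans(n_clusters=2, n_init=20, init='random', random_state=0).fit(demo_points)

fig, ax = plt.subplots(figsize=(6, 4))
# Data points, coloured by their assigned cluster label
ax.scatter(demo_points[:, 0], demo_points[:, 1], s=40,
           c=km_demo.labels_, cmap=plt.cm.prism)
# Cluster centres drawn as black plus signs on top of the data
centres = km_demo.cluster_centers_
ax.scatter(centres[:, 0], centres[:, 1], marker='+', s=100, c='k', linewidth=2)
ax.set_title('Toy example with $k$=2')
```

For the lab itself, the same two `scatter` calls would be made once on `ax1` with `km1` and once on `ax2` with `km2`.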

In [None]:
# to complete
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,5))

ax1.set_title('$k$-Means Clustering Results with K=2')

ax2.set_title('$k$-Means Clustering Results with K=3')

**Comment** on which value of $k$ seems more appropriate for clustering the data, based on your analysis of the scatterplots.
-  

## Part 3. Hierarchical clustering of the two data sets

Examine the code below and run it to perform hierarchical clustering of data set X with three different linkage methods.
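
A dendrogram shows the full merge tree, but it does not by itself assign cluster labels. One way to obtain labels is to cut the tree, sketched below on toy data. The array `demo_points` and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical toy data: two well-separated groups of five points
rng = np.random.default_rng(1)
demo_points = np.vstack([rng.normal(0, 0.3, (5, 2)),
                         rng.normal(6, 0.3, (5, 2))])

# Each row of the linkage matrix records one merge:
# [cluster index a, cluster index b, merge distance, size of new cluster]
Z = hierarchy.linkage(demo_points, method='complete')
print(Z.shape)  # (n_samples - 1, 4)

# Cut the tree so that at most two flat clusters remain
flat_labels = hierarchy.fcluster(Z, t=2, criterion='maxclust')
print(flat_labels)
```

Changing `method` to `'average'` or `'ward'` gives a quick way to compare how each linkage groups the same points.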

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(3,1, figsize=(15,18))

# Draw one dendrogram per linkage method (complete, average, Ward)
for linkage, ax in zip([hierarchy.complete(X), hierarchy.average(X), hierarchy.ward(X)],
                       [ax1, ax2, ax3]):
    hierarchy.dendrogram(linkage, ax=ax, color_threshold=0)

ax1.set_title('Complete Linkage')
ax2.set_title('Average Linkage')
ax3.set_title('Ward Linkage');

**Comment** on any differences you observe between the linkage methods, both in how the samples are linked and in the number of clusters each method suggests:
-  

## Part 4. Finding the optimal value for $k$ using Elbow Analysis

**Task 5:** Alter the code below to experiment with values of n_samples from 250 to 1000 in increments of 250 in order to find the optimal value for $k$ for each sample set size by viewing the output of the Elbow Analysis.
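
One possible way to organise this experiment is to wrap the elbow computation in a function and call it once per sample size. The sketch below assumes a hypothetical helper `elbow_curve`, a fixed `seed` for repeatability, and `n_init=10`; none of these appear in the lab cell, which is unseeded.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def elbow_curve(n_samples, max_k=10, seed=0):
    """Return KMeans inertia for k = 1..max_k on standardised blob data."""
    # seed is an assumption added for repeatability
    X_demo, _ = make_blobs(n_samples=n_samples, n_features=2, centers=4,
                           cluster_std=1, center_box=(-10.0, 10.0),
                           shuffle=True, random_state=seed)
    X_demo = StandardScaler().fit_transform(X_demo)
    return [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_demo).inertia_
            for k in range(1, max_k + 1)]

# One curve per sample size: 250, 500, 750, 1000
curves = {n: elbow_curve(n) for n in range(250, 1001, 250)}
for n, inertias in curves.items():
    print(n, [round(v, 1) for v in inertias[:6]])
```

Plotting each list in `curves` against $k$ on the same axes makes it easier to compare where the elbow falls for each sample size.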

In [None]:
# Generate the sample data with make_blobs

n_samples=250

X, y = make_blobs(n_samples,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True)

plt.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c='b', edgecolor='k')
plt.title("The visualisation of %d observations." % (n_samples))
plt.xlabel("Feature space for the 1st feature")
plt.ylabel("Feature space for the 2nd feature")

scaler = StandardScaler()
X_scaled = scaler.fit_transform( X )
cluster_range = range( 1, 15 )
cluster_errors = []

for num_clusters in cluster_range:
    clusters = KMeans( num_clusters )
    clusters.fit( X_scaled )
    cluster_errors.append( clusters.inertia_ )
    
plt.figure(figsize=(12,6))
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
print(clusters_df[0:10])
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
plt.xlabel("Number of clusters")
plt.ylabel("Cluster errors (within-cluster sum of squares)")

1) **Comment** on your observations of how the shape of the elbow is affected as the number of samples increases and their distribution changes.

2) **Can** you make any general comments about how effective Elbow Analysis is for determining the value of $k$?

-  