# COMP 455

# Lab 3: Clustering



In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans    # for k-means clustering
from scipy.cluster import hierarchy   # for hierarchical clustering

from sklearn.preprocessing import scale

%matplotlib inline
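# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available.
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)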

# Clustering

We will look first at k-means clustering and then at hierarchical clustering.

### K-Means Clustering

We will read in some sample data. This dataset was randomly generated by drawing values from a Normal distribution and then adjusting them. There are two variables (X1 and X2) and 50 observations/instances.

We also standardize the data (each column rescaled to zero mean and unit variance) so that both variables contribute comparably to the Euclidean distances that k-means relies on.

In [None]:
df = pd.read_csv('sample-data.csv')
df = pd.DataFrame(scale(df), index=df.index, columns=df.columns)
df.head()
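
To see what `scale` does, here is a minimal sketch on a small made-up array (the `toy` values are hypothetical, not the lab data). After scaling, each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy data: two columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Subtract each column's mean and divide by its (population) standard deviation.
toy_scaled = scale(toy)

col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means)
print(col_stds)
```

Without this step, the second column would dominate any distance computed between rows.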

Let's first plot the data before we cluster it. Does it seem to cluster into groups? If so, how many?

In [None]:
plt.figure()
plt.scatter(df['X1'], df['X2'])
plt.show()

#### First we will try K-means with K = 2

In [None]:
km1 = KMeans(n_clusters=2, n_init=20)
km1.fit(df)

In [None]:
km1.labels_
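
Each entry of `labels_` is the index of the cluster center closest to that observation. The sketch below checks this on synthetic stand-in data (the `pts` array and the seed values are assumptions for illustration, not part of the lab dataset).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(-3, 1, size=(20, 2)),
                 rng.normal(3, 1, size=(20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(pts)

# Distance from every point to every center: shape (n_points, n_clusters).
dists = np.linalg.norm(pts[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

# The nearest center for each point should match the fitted labels.
nearest = dists.argmin(axis=1)
print((nearest == km.labels_).all())
```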

#### Now we will try K = 3

In [None]:
np.random.seed(4)
km2 = KMeans(n_clusters=3, n_init=20)
km2.fit(df)

In [None]:
pd.Series(km2.labels_).value_counts()

In [None]:
km2.cluster_centers_

In [None]:
km2.labels_

In [None]:
# Sum of squared distances of samples to their closest cluster center.
km2.inertia_
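
The `inertia_` attribute is the within-cluster sum of squared distances. As a rough check, we can recompute it by hand from `labels_` and `cluster_centers_`. This sketch uses synthetic stand-in data (`pts` and the seeds are assumptions), not the lab data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: two blobs of 25 points each.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, size=(25, 2)),
                 rng.normal(5, 1, size=(25, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

# Look up the center assigned to each point, then sum the squared distances.
own_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((pts - own_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

Lower inertia means tighter clusters, but it always decreases as K grows, so it cannot be used on its own to compare fits with different K.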

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,5))

ax1.scatter(df['X1'], df['X2'], s=40, c=km1.labels_, cmap=plt.cm.prism) 
ax1.set_title('K-Means Clustering Results with K=2')
ax1.scatter(km1.cluster_centers_[:,0], km1.cluster_centers_[:,1], marker='+', s=100, c='k', linewidth=2)

ax2.scatter(df['X1'], df['X2'], s=40, c=km2.labels_, cmap=plt.cm.prism) 
ax2.set_title('K-Means Clustering Results with K=3')
ax2.scatter(km2.cluster_centers_[:,0], km2.cluster_centers_[:,1], marker='+', s=100, c='k', linewidth=2);

### Hierarchical Clustering

For hierarchical clustering, we will use scipy instead of sklearn.

We will try three types of linkage: complete, average, and single.

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(3,1, figsize=(15,18))

linkages = [hierarchy.complete(df), hierarchy.average(df), hierarchy.single(df)]

# Draw one dendrogram per linkage; color_threshold=0 draws every link in a single color.
for Z, ax in zip(linkages, [ax1, ax2, ax3]):
    hierarchy.dendrogram(Z, ax=ax, color_threshold=0)

ax1.set_title('Complete Linkage')
ax2.set_title('Average Linkage')
ax3.set_title('Single Linkage');
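
The three linkages differ only in how they measure the distance between two clusters: complete uses the largest pairwise distance, single the smallest, and average the mean. A minimal sketch on three made-up points (the `pts` values are hypothetical) makes the difference concrete.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical toy data: three points on a line, as a (3, 1) observation matrix.
pts = np.array([[0.0], [1.0], [5.0]])

heights = {}
for method in ['complete', 'average', 'single']:
    Z = hierarchy.linkage(pts, method=method)
    # Column 2 of the last row is the distance at which the final merge happens.
    heights[method] = float(Z[-1, 2])

print(heights)
```

The points 0 and 1 merge first in every case. The final merge with the point at 5 then happens at height 5 (complete), 4.5 (average), or 4 (single), which is why the same data can produce dendrograms of different heights.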

We can get cluster assignments by cutting the tree. A strength of hierarchical clustering is that we can choose different numbers of clusters from the tree. Here we show cluster assignments with two clusters and with three clusters. 

In [None]:
cuts = hierarchy.cut_tree(hierarchy.complete(df), n_clusters=[2,3])
print(cuts)
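
The output of `cut_tree` has one row per observation and one column per requested number of clusters. The sketch below shows this on five made-up points (the `pts` values are hypothetical); it calls `hierarchy.linkage(..., method='complete')`, which should build the same tree as `hierarchy.complete`.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical toy data: two tight pairs and one far-away point.
pts = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

Z = hierarchy.linkage(pts, method='complete')

# Column 0 holds the 2-cluster assignment, column 1 the 3-cluster assignment.
cuts = hierarchy.cut_tree(Z, n_clusters=[2, 3])
print(cuts)

n_labels_two = len(np.unique(cuts[:, 0]))
n_labels_three = len(np.unique(cuts[:, 1]))
print(n_labels_two, n_labels_three)
```

With two clusters, the point at 20 sits alone and the other four are grouped together. With three clusters, the two tight pairs are split apart.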

# Lab Assignment

## Clustering Car Seat Sales Data

You will now carry out k-means clustering on a dataset about the sales of car seats. Carry out the following steps. 
* Use just the features 'Sales' and 'Income' from the data in Carseats.csv (this step is done for you).
* Scale the features. 
* Create a scatterplot showing 'Sales' vs. 'Income.'
* Do K-Means clustering with k=2.
* Do K-Means clustering with k=3.
* Create two new scatterplots showing the clusterings with k=2 and k=3.

In [None]:
df = pd.read_csv("Carseats.csv")
print(df.head())

X = df[['Sales', 'Income']]
X.head()


In [21]:
# your code here

# Scale only the two numeric features in X; df also holds string columns (e.g. 'Yes').
X_scaled = pd.DataFrame(scale(X), index=X.index, columns=X.columns)

# Scatterplot of the scaled features before clustering.
plt.figure()
plt.scatter(X_scaled['Sales'], X_scaled['Income'])
plt.xlabel('Sales (scaled)')
plt.ylabel('Income (scaled)')
plt.show()

# K-means with k=2.
km1 = KMeans(n_clusters=2, n_init=20)
km1.fit(X_scaled)

# K-means with k=3.
np.random.seed(4)
km2 = KMeans(n_clusters=3, n_init=20)
km2.fit(X_scaled)
print(pd.Series(km2.labels_).value_counts())  # cluster sizes
print(km2.cluster_centers_)
print(km2.inertia_)  # sum of squared distances to the closest center

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,5))

ax1.scatter(X_scaled['Sales'], X_scaled['Income'], s=40, c=km1.labels_, cmap=plt.cm.prism)
ax1.set_title('K-Means Clustering Results with K=2')
ax1.scatter(km1.cluster_centers_[:,0], km1.cluster_centers_[:,1], marker='+', s=100, c='k', linewidth=2)

ax2.scatter(X_scaled['Sales'], X_scaled['Income'], s=40, c=km2.labels_, cmap=plt.cm.prism)
ax2.set_title('K-Means Clustering Results with K=3')
ax2.scatter(km2.cluster_centers_[:,0], km2.cluster_centers_[:,1], marker='+', s=100, c='k', linewidth=2);

__Submit your completed notebook via Blackboard.__

Extra (not required): If you finish the k-means clustering early, try hierarchical clustering on the same dataset.