# Chapter 15: Cluster Analysis

## Instructions for Assignment 7

Run each of the code blocks below. Before running a block, add your initials followed by the last two digits of your ID as a comment on its first line.

Rename every variable by appending your initials and the last two digits of your ID. For example, XX becomes XXJD48 for a student named Jane Doe whose ID ends in 48. Apply this change to all variables in all code blocks.

Then write a short comment explaining each code block. You may draw on the comments in the textbook. Comments can be very brief for short, self-explanatory blocks.

Make sure that no other student's initials and ID digits appear in your submission. Otherwise, I will report your case to the Academic Integrity Office, issue a warning, and lower your letter grade by one level.

Save this file as "ADTA 5230.501 Module 7 Chapter 15 Hands On LAST NAME First Name Last Two Digits of Your ID", for example "ADTA 5230.501 Module 7 Chapter 15 Hands On DOE Jane 48".

Submit the ipynb file with all Python code blocks run, along with a PDF file saved after you have added all your comments and run all code blocks.

Notes: 
1. The code blocks below originate from the textbook.

2. Refer to the textbook's notes for further explanation of this code.


## Import required packages

Make sure the `dmba` package is installed.

In [None]:
%pip install dmba

In [None]:
from pathlib import Path

import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import pairwise
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import parallel_coordinates

import dmba

%matplotlib inline

## Table 15.2
Load the data, set the `Company` column as the row index (which also removes it from the columns), and convert all columns to `float`.

In [None]:
utilities_df = dmba.load_data('Utilities.csv')
utilities_df.set_index('Company', inplace=True)

# while not required, the conversion of integer data to float will avoid a warning when 
# applying the scale function
utilities_df = utilities_df.apply(lambda x: x.astype('float64'))
utilities_df.head()

Compute the Euclidean distance matrix (to compute other distance measures, change the value of the `metric` argument).

In [None]:
d = pairwise.pairwise_distances(utilities_df, metric='euclidean')
pd.DataFrame(d, columns=utilities_df.index, index=utilities_df.index).head(5)
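
As a small illustration of switching the distance measure, the sketch below applies the same `pairwise_distances` call to a made-up three-record data set (`toy_df` is hypothetical and not part of `Utilities.csv`), once with `metric='euclidean'` and once with `metric='manhattan'`.

```python
import pandas as pd
from sklearn.metrics import pairwise

# Hypothetical toy data: three records, two variables (not from Utilities.csv)
toy_df = pd.DataFrame({'x': [0.0, 3.0, 0.0], 'y': [0.0, 4.0, 1.0]},
                      index=['A', 'B', 'C'])

# Same function call; only the value of `metric` changes
d_euclidean = pairwise.pairwise_distances(toy_df, metric='euclidean')
d_manhattan = pairwise.pairwise_distances(toy_df, metric='manhattan')

print(pd.DataFrame(d_euclidean, index=toy_df.index, columns=toy_df.index))
print(pd.DataFrame(d_manhattan, index=toy_df.index, columns=toy_df.index))
```

For records A and B the Euclidean distance is 5 (a 3-4-5 triangle), while the Manhattan distance is 3 + 4 = 7.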

## Table 15.4
Here are two ways to normalize the input variables. By default, pandas computes the sample standard deviation, whereas scikit-learn uses the population standard deviation, so the normalized data from the two methods differ slightly. We use the pandas approach because it is equivalent to R's `scale` function.

In [None]:
# Option 1: scikit-learn uses the population standard deviation
utilities_df_norm = utilities_df.apply(preprocessing.scale, axis=0)

# Option 2: pandas uses the sample standard deviation
# (this assignment overwrites option 1 and is the version used below)
utilities_df_norm = (utilities_df - utilities_df.mean())/utilities_df.std()

# compute normalized distance based on Sales and Fuel Cost
d_norm = pairwise.pairwise_distances(utilities_df_norm[['Sales', 'Fuel_Cost']], 
                                     metric='euclidean')
pd.DataFrame(d_norm, columns=utilities_df.index, index=utilities_df.index).head(5)
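
To see how much the two normalization approaches differ, the following sketch standardizes a short made-up series (`toy` is a hypothetical example, not taken from the data set) both ways. Under these assumptions, the two sets of z-scores differ only by the constant factor sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

toy = pd.Series([2.0, 4.0, 6.0, 8.0])  # hypothetical values
n = len(toy)

# scikit-learn: divides by the population standard deviation (ddof=0)
z_population = preprocessing.scale(toy.to_numpy())

# pandas: Series.std() defaults to the sample standard deviation (ddof=1)
z_sample = ((toy - toy.mean()) / toy.std()).to_numpy()

# The ratio is the same for every record and equals sqrt(n / (n - 1))
ratio = z_population / z_sample
print(ratio)
print(np.sqrt(n / (n - 1)))
```

With 22 utilities the factor is sqrt(22 / 21), about 1.024, which is why the two normalized data sets differ only slightly.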

## Figure 15.3


In [None]:
Z = linkage(utilities_df_norm, method='single')

fig = plt.figure(figsize=(10, 6))
fig.subplots_adjust(bottom=0.23)
plt.title('Hierarchical Clustering Dendrogram (Single linkage)')
plt.xlabel('Company')
dendrogram(Z, labels=utilities_df_norm.index, color_threshold=2.75)
plt.axhline(y=2.75, color='black', linewidth=0.5, linestyle='dashed')
plt.show()

In [None]:
Z = linkage(utilities_df_norm, method='average')

fig = plt.figure(figsize=(10, 6))
fig.subplots_adjust(bottom=0.23)
plt.title('Hierarchical Clustering Dendrogram (Average linkage)')
plt.xlabel('Company')
dendrogram(Z, labels=utilities_df_norm.index, color_threshold=3.6)
plt.axhline(y=3.6, color='black', linewidth=0.5, linestyle='dashed')
plt.show()
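
The two dendrograms differ because single and average linkage measure the distance between clusters differently. A minimal sketch on four made-up one-dimensional points (`points` is a hypothetical example) shows how the merge heights, stored in the third column of the linkage matrix, change with the method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: four records with a single variable
points = np.array([[0.0], [1.0], [3.0], [7.0]])

Z_single = linkage(points, method='single')
Z_average = linkage(points, method='average')

# Column 2 of the linkage matrix holds the distance at which clusters merge
print('single: ', Z_single[:, 2])
print('average:', Z_average[:, 2])
```

Single linkage uses the closest pair of records between two clusters, so {0, 1} joins 3 at distance 2. Average linkage uses the mean of all pairwise distances, so the same merge happens at (3 + 2) / 2 = 2.5.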

## Table 15.6

In [None]:
memb = fcluster(linkage(utilities_df_norm, 'single'), 6, criterion='maxclust')
memb = pd.Series(memb, index=utilities_df_norm.index)
for key, item in memb.groupby(memb):
    print(key, ': ', ', '.join(item.index))

In [None]:
memb = fcluster(linkage(utilities_df_norm, 'average'), 6, criterion='maxclust')
memb = pd.Series(memb, index=utilities_df_norm.index)
for key, item in memb.groupby(memb):
    print(key, ': ', ', '.join(item.index))
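
The `criterion` argument of `fcluster` controls how the tree is cut. The sketch below, on made-up one-dimensional points (`points` is hypothetical), contrasts `criterion='maxclust'`, which asks for at most a given number of clusters, with `criterion='distance'`, which cuts the dendrogram at a given height.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: four records with a single variable
points = np.array([[0.0], [1.0], [3.0], [7.0]])
Z_toy = linkage(points, method='single')

# Ask for at most two clusters
memb_maxclust = fcluster(Z_toy, 2, criterion='maxclust')

# Cut the tree at height 1.5: only merges at distance <= 1.5 are kept
memb_distance = fcluster(Z_toy, 1.5, criterion='distance')

print(memb_maxclust)
print(memb_distance)
```

With two clusters requested, the record at 7 is separated from the other three. Cutting at height 1.5 keeps only the merge of 0 and 1, which leaves three clusters.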

## Figure 15.4

In [None]:
# prefix each company name with its cluster label from the average-linkage solution
utilities_df_norm.index = ['{}: {}'.format(cluster, company)
                           for cluster, company in zip(memb, utilities_df_norm.index)]
sns.clustermap(utilities_df_norm, method='average', col_cluster=False,  cmap="mako_r")
plt.show()

## Table 15.9

In [None]:
# Load and preprocess data
utilities_df = dmba.load_data('Utilities.csv')
utilities_df.set_index('Company', inplace=True)
utilities_df = utilities_df.apply(lambda x: x.astype('float64'))

# Normalize the data (scikit-learn's scale uses the population standard deviation)
utilities_df_norm = utilities_df.apply(preprocessing.scale, axis=0)

kmeans = KMeans(n_clusters=6, random_state=0).fit(utilities_df_norm)

# Cluster membership
memb = pd.Series(kmeans.labels_, index=utilities_df_norm.index)
for key, item in memb.groupby(memb):
    print(key, ': ', ', '.join(item.index))

## Table 15.10

In [None]:
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=utilities_df_norm.columns)
pd.set_option('display.precision', 3)
print(centroids)
pd.reset_option('display.precision')

In [None]:
withinClusterSS = [0] * 6
clusterCount = [0] * 6
for cluster, distance in zip(kmeans.labels_, kmeans.transform(utilities_df_norm)):
    withinClusterSS[cluster] += distance[cluster]**2
    clusterCount[cluster] += 1
for cluster, withinClustSS in enumerate(withinClusterSS):
    print('Cluster {} ({} members): {:5.2f} within cluster'.format(cluster, 
        clusterCount[cluster], withinClustSS))

In [None]:
# calculate the distances of each data point to the cluster centers
distances = kmeans.transform(utilities_df_norm)

# reduce to the minimum squared distance of each data point to the cluster centers
minSquaredDistances = distances.min(axis=1) ** 2

# combine with cluster labels into a data frame
df = pd.DataFrame({'squaredDistance': minSquaredDistances, 'cluster': kmeans.labels_}, 
    index=utilities_df_norm.index)

# Group by cluster and print information
for cluster, data in df.groupby('cluster'):
    count = len(data)
    withinClustSS = data.squaredDistance.sum()
    print(f'Cluster {cluster} ({count} members): {withinClustSS:.2f} within cluster ')
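
As a sanity check on the within-cluster sums of squares, the per-cluster values should add up to the `inertia_` attribute of the fitted model. The following sketch verifies this on a made-up data set (`X_toy` and the explicit `n_init=10` are assumptions for the illustration, not taken from the textbook).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated pairs of points
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Squared distance of each record to its closest cluster center
squared = km_toy.transform(X_toy).min(axis=1) ** 2

# Sum the squared distances within each cluster
within_ss = pd.Series(squared).groupby(km_toy.labels_).sum()

print(within_ss)
print('Total:', within_ss.sum(), ' inertia_:', km_toy.inertia_)
```

Each point lies at distance 1 from its cluster center, so every cluster contributes 2 and the total matches `inertia_`.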

## Figure 15.5

In [None]:
centroids['cluster'] = ['Cluster {}'.format(i) for i in centroids.index]

# keep a handle to the new figure so the adjustment applies to it, not to an earlier figure
fig = plt.figure(figsize=(10,6))
fig.subplots_adjust(right=3)
ax = parallel_coordinates(centroids, class_column='cluster', colormap='Dark2', linewidth=5)
plt.legend(loc='center left', bbox_to_anchor=(0.95, 0.5))
plt.xlim(-0.5,7.5)
centroids

In [None]:
utilities_df_norm.groupby(kmeans.labels_).mean()

## Table 15.11

In [None]:
print(pd.DataFrame(pairwise.pairwise_distances(kmeans.cluster_centers_, metric='euclidean')))

In [None]:
pd.DataFrame(pairwise.pairwise_distances(kmeans.cluster_centers_, metric='euclidean')).sum(axis=0)

## Figure 15.6

In [None]:
inertia = []
for n_clusters in range(1, 7):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(utilities_df_norm)
    inertia.append(kmeans.inertia_ / n_clusters)
inertias = pd.DataFrame({'n_clusters': range(1, 7), 'inertia': inertia})
ax = inertias.plot(x='n_clusters', y='inertia')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Average Within-Cluster Squared Distances')
plt.ylim((0, 1.1 * inertias.inertia.max()))
ax.legend().set_visible(False)
plt.show()