<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:25%"><img src='https://www.nus.edu.sg/images/default-source/base/logo.png' style="width: 250px; height: 125px; "></th>
        <th style="text-align:center;"><h1>Machine Learning in Python</h1><h2>Lab 5b - Hierarchical Clustering </h2><h3></h3></th>
    </tr>
</table>

### 1. Introduction
You work at a bank, and your job is to analyze customer information (e.g. age, annual salary, etc.) to find patterns. This will help your sales team target the right customers effectively.

In [None]:
#importing the required libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.decomposition import PCA 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.preprocessing import StandardScaler, normalize 
from sklearn.metrics import silhouette_score 
import scipy.cluster.hierarchy as shc 

### 2. Load and Scale the Data

In [None]:
# Import dataset
dat = pd.read_csv('bank.csv')
dat

In [None]:
df = dat.loc[dat.sample(200, random_state = 0).index,['age', 'balance']]
df.head()

In [None]:
# View summary statistics (count, mean, std, min, quartiles, max) of each column
df.describe()

In [None]:
# Convert dataframe into numpy arrays
X =df.values

In [None]:
plt.scatter(X[:,0], X[:,1])
plt.xlabel('age')
plt.ylabel('balance')

We can see that the two attributes above have very different ranges. Hierarchical clustering is very sensitive to the ranges of the attributes, because attributes with larger ranges dominate the distance calculation. Thus, before feeding the data into the model, we need to scale it. In this example, we use the Z-score transformation to scale the data.
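
As a minimal sketch of what the Z-score transformation does, the cell below standardizes a small made-up array by hand and compares the result with `StandardScaler`. The array `X_toy` and its values are hypothetical and are not taken from `bank.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales (age, balance)
X_toy = np.array([[25, 500.0],
                  [40, 12000.0],
                  [60, 3000.0],
                  [33, 150.0]])

# Z-score by hand: subtract the column mean, divide by the column standard deviation
# (population standard deviation, ddof=0, which is what StandardScaler uses)
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation via scikit-learn
z_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(z_manual, z_sklearn))
print(z_manual.mean(axis=0).round(6), z_manual.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single attribute dominates the distances.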

In [None]:
# Scaling the data so that all the features/attributes become comparable 
scaler = StandardScaler() 
X_scaled = scaler.fit_transform(X) 

In [None]:
plt.scatter(X_scaled[:,0], X_scaled[:,1])
plt.xlabel('age_scaled')
plt.ylabel('balance_scaled')

### 3. Distance Matrix and Dendrograms

#### Step 1: Generate Distance Matrix
Here we use `cdist` from `scipy` to generate the full distance matrix. A single line of code gives you the pairwise (Euclidean, by default) distance between every pair of samples.
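
To make the structure of the distance matrix concrete, here is a small sketch on three hypothetical points (`pts` is made up for illustration). It also shows `pdist`, which returns only the upper triangle of the same matrix in "condensed" form.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# Three hypothetical points on a line through the origin
pts = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [6.0, 8.0]])

# Full 3 x 3 distance matrix: symmetric, with zeros on the diagonal
D = cdist(pts, pts)
print(D)

# Condensed form: only the distances d(0,1), d(0,2), d(1,2)
condensed = pdist(pts)
print(condensed)

# squareform converts the condensed vector back to the full matrix
print(np.allclose(squareform(condensed), D))
```

For `n` samples the full matrix has `n * n` entries, while the condensed form stores only `n * (n - 1) / 2` of them.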

In [None]:
# The full distance matrix 
from scipy.spatial.distance import cdist
print(cdist(X_scaled, X_scaled))

In [None]:
print(cdist(X_scaled, X_scaled).shape)

#### Step 2: Look for pairs of samples (i.e. clusters) with the lowest dissimilarity
We use the `linkage` function from the `scipy.cluster.hierarchy` package to find the clusters with the lowest dissimilarity and merge them step by step. You can choose among different linkage methods, e.g. single, complete, average and ward. Note that `linkage` defaults to `single`, so we pass `method='ward'` explicitly (Ward is the default for scikit-learn's `AgglomerativeClustering`). Ward picks the two clusters to merge such that the total within-cluster variance increases the least.

In [None]:
import scipy.cluster.hierarchy as shc
help(shc.linkage)

In [None]:
# Perform hierarchical/agglomerative clustering
Z = shc.linkage(X_scaled, method ='ward') 
Z # The linkage matrix: one row per merge

The linkage matrix `Z` above lists, for each merge step, the two clusters being merged (those with the minimum distance), the distance between them, and the number of samples in the newly formed cluster. It is impractical to read through this long list, so we will use a dendrogram to visualize the hierarchical clustering.
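
The layout of `Z` is easier to see on a tiny example. The sketch below runs `linkage` on four hypothetical one-dimensional points (`pts` is made up for illustration); each row of the result is `[cluster id 1, cluster id 2, merge distance, size of new cluster]`.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Four hypothetical 1-D samples with ids 0, 1, 2, 3
pts = np.array([[0.0], [1.0], [5.0], [7.0]])

Z_toy = shc.linkage(pts, method='ward')
print(Z_toy)

# n samples always produce n - 1 merges, each described by 4 numbers
print(Z_toy.shape)

# Newly formed clusters get ids n, n + 1, ... in merge order:
# samples 0 and 1 merge first into cluster 4, samples 2 and 3 into cluster 5,
# and the last row merges clusters 4 and 5 into one cluster of all 4 samples
print(Z_toy[:, :2].astype(int))
print(Z_toy[:, 3].astype(int))
```

The third column (merge distance) is what the dendrogram draws on its vertical axis.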

#### Step 3: Generate Dendrograms.
Using the `dendrogram` function from the `scipy.cluster.hierarchy` package and feeding in the linkage matrix `Z` generated in Step 2, we can easily generate the dendrogram below to visualize the linkage relationship between different points/clusters.

In [None]:
# Visualizing the hierarchical clustering through Dendrograms
plt.figure(figsize =(12, 12)) 
plt.title('Visualising the Hierarchical Clustering') 
Dendrogram = shc.dendrogram(Z)

From the above graph, it seems three clusters may be a good choice to start from.

In case you're wondering where the colors come from, have a look at the `color_threshold` argument of `dendrogram()`. Since we did not specify it, it defaults to a distance cut-off of 70% of the largest merge distance (the final merge), and each cluster formed below that cut-off is drawn in its own color.
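
The cut-off idea can be sketched without plotting. The cell below builds three hypothetical, well-separated blobs (synthetic data, not the bank data), computes the default 70% threshold by hand, and uses `fcluster` to cut the tree either at a distance or at a requested number of clusters.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Hypothetical data: three blobs of 20 points each around (0,0), (5,5), (10,10)
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
                 for c in (0.0, 5.0, 10.0)])

Z_toy = shc.linkage(pts, method='ward')

# Default color_threshold used by dendrogram(): 70% of the largest merge distance
default_threshold = 0.7 * Z_toy[:, 2].max()
print(default_threshold)

# Cut the tree at that distance; the number of clusters follows from the data
labels_by_distance = shc.fcluster(Z_toy, t=default_threshold, criterion='distance')
print(len(np.unique(labels_by_distance)))

# Alternatively, ask directly for at most 3 flat clusters
labels_by_count = shc.fcluster(Z_toy, t=3, criterion='maxclust')
print(np.bincount(labels_by_count)[1:])
```

Note that the number of colors in the dendrogram depends on the threshold, so it is a visual hint rather than a recommendation for `n_clusters`.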

### 4. Build and Evaluate the model

In [None]:
# Build Agglomerative Clustering model with number of clusters set as 3
ac3 = AgglomerativeClustering(n_clusters = 3)
ac3

In [None]:
# Fit the model to the data and predict the clusters
ac3.fit_predict(X_scaled)

In [None]:
# Visualize the Three Clusters
plt.figure(figsize =(6, 6)) 
plt.scatter(X_scaled[:,0], X_scaled[:,1], 
            c = ac3.fit_predict(X_scaled), cmap ='rainbow') 
plt.xlabel('age_scaled')
plt.ylabel('balance_scaled')
plt.show() 

In [None]:
# number of clusters
ac3.n_clusters_

In [None]:
# label of each data point/sample
ac3.labels_ 
# same as 'ac3.fit_predict(X_scaled)'

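We will use the Silhouette Score to evaluate the model. It ranges from -1 to 1, and higher values indicate samples that are close to their own cluster and far from the other clusters. It can be computed in one line of code with the `silhouette_score` function from `sklearn.metrics`.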
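
As a rough sketch of what the score measures, the cell below computes it by hand on five hypothetical points with made-up labels and compares the result with `silhouette_score`. For each sample, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the sample's score is `(b - a) / max(a, b)`.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score

# Hypothetical points and cluster labels (every cluster has at least 2 members)
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0], [5.0, 2.0]])
labels = np.array([0, 0, 1, 1, 1])

D = cdist(pts, pts)
scores = []
for i in range(len(pts)):
    same = (labels == labels[i])
    same[i] = False                      # exclude the sample itself
    a = D[i, same].mean()                # mean distance within own cluster
    b = min(D[i, labels == k].mean()     # mean distance to the nearest other cluster
            for k in set(labels) if k != labels[i])
    scores.append((b - a) / max(a, b))

# The overall score is the mean over all samples
manual = float(np.mean(scores))
print(manual, silhouette_score(pts, labels))
```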

In [None]:
# Calculate the Silhouette Score
from sklearn.metrics import silhouette_score 
silhouette_score(X_scaled, ac3.labels_)

This is a good starting point. In the section below, we will try a range of `n_clusters` values (i.e. the number of clusters) and find the model with the highest silhouette score.

### 5. Improve the Model

In [None]:
# Evaluate the Silhouette Scores for different K, i.e. n_clusters (ranging from 2 to 10)
k_range = range(2,11)
silhouette_scores =[]

for i in k_range:
    ac_i = AgglomerativeClustering(n_clusters = i,linkage='ward')
    silhouette_scores.append(silhouette_score(X_scaled, ac_i.fit_predict(X_scaled)))


In [None]:
silhouette_scores

In [None]:
# Plotting Silhouette Scores using a bar graph to compare the results 
plt.bar(k_range, silhouette_scores) 
plt.xlabel('Number of clusters', fontsize = 20) 
plt.ylabel('Silhouette Score', fontsize = 20)
plt.axis([1, 11, 0.3, 0.55])
plt.show() 

From the above, we can see that `n_clusters = 5` (i.e. five clusters) has the highest Silhouette Score.
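
Instead of reading the best value off the bar chart, it can also be picked programmatically. The sketch below repeats the same loop on hypothetical synthetic data from `make_blobs` (four well-separated blobs, not the bank data) and selects the `n_clusters` with the highest score using `np.argmax`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: four well-separated blobs
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X_toy, _ = make_blobs(n_samples=120, centers=centers,
                      cluster_std=0.5, random_state=0)

k_range = range(2, 11)
scores = []
for k in k_range:
    model = AgglomerativeClustering(n_clusters=k, linkage='ward')
    scores.append(silhouette_score(X_toy, model.fit_predict(X_toy)))

# Index of the highest score, mapped back to the corresponding number of clusters
best_k = k_range[int(np.argmax(scores))]
print(best_k)
```

The highest silhouette score is a useful guide, but when the scores are close together the choice should also take interpretability of the clusters into account.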

In [None]:
# Building the final model with n_clusters = 5
ac5 = AgglomerativeClustering(n_clusters = 5)
ac5

In [None]:
ac5.fit_predict(X_scaled)

In [None]:
# scatter plot
plt.figure(figsize =(6, 6)) 
plt.scatter(X_scaled[:,0], X_scaled[:,1], 
            c = ac5.fit_predict(X_scaled), cmap ='rainbow')
plt.xlabel('age_scaled')
plt.ylabel('balance_scaled')
plt.show() 

In [None]:
silhouette_score(X_scaled, ac5.fit_predict(X_scaled))

By increasing the number of clusters from three to five, we managed to improve the silhouette score from 0.38 to 0.40. Moreover, from the above graph we can easily identify two outlier clusters:
* customers with a very high balance <font color='green'>(Green)</font>
* very senior customers <font color='#FFAE33'>(Yellow)</font>

This will help us to understand and prepare the data. 
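
One possible way to interpret the clusters is to attach the labels to the unscaled data and summarize each cluster in its original units. The sketch below does this on a small hypothetical DataFrame (`toy` and its values are made up); the same pattern should carry over to `df` and `ac5.labels_`, assuming the row order is unchanged.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: four younger, low-balance and two senior, high-balance
toy = pd.DataFrame({'age': [25, 30, 35, 28, 80, 85],
                    'balance': [500, 520, 480, 510, 5000, 5200]})

# Cluster on the scaled values, as in the lab
X_toy_scaled = StandardScaler().fit_transform(toy.values)
toy['cluster'] = AgglomerativeClustering(n_clusters=2).fit_predict(X_toy_scaled)

# Summarize each cluster in the original (unscaled) units
profile = toy.groupby('cluster')[['age', 'balance']].mean()
sizes = toy['cluster'].value_counts()
print(profile)
print(sizes)
```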