# Hierarchical Clustering with Sklearn

This notebook shows how to train and evaluate a hierarchical (agglomerative) clustering model.

* Method: [Hierarchical Clustering](http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering)
* Dataset: Stock market data

## Imports

In [None]:
from itertools import product

import pandas as pd
import numpy as np

from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics

import seaborn as sb
import matplotlib.pyplot as plt
from matplotlib import rcParams

%matplotlib inline
rcParams['figure.figsize'] = 16, 8
sb.set_style('whitegrid')

## Load and Prepare the Data

In [None]:
DATA_FILE = "/Users/robert.dempsey/Dev/daamlobd/data/sample_stocks.csv"

In [None]:
# Import the data
data = pd.read_csv(DATA_FILE)
data.head(5)

In [None]:
# Check the data types
data.dtypes

In [None]:
# Create the X and y (the models below are fit on y only)
X = data[['dividendyield']]
y = data[['returns']]

## Identify the Number of Clusters and Linkage Type to Use

In [None]:
# Create a list of tuples to test cluster ranges with different linkages
cluster_range = range(2, 11)
linkage = ['average', 'complete', 'ward']

cluster_range_linkage = list(product(cluster_range, linkage))
print(cluster_range_linkage)

In [None]:
# Create one AgglomerativeClustering model per (number of clusters, linkage) pair
ag_models = [AgglomerativeClustering(n_clusters=n, linkage=link) for n, link in cluster_range_linkage]
print(ag_models[0])
print(ag_models[7])

In [None]:
# For each model, fit it to the data and get the Silhouette score (described below)
cluster_scores = list()

# Fit each model on the feature (y) and score it on that same feature
for ag_model in ag_models:
    model = ag_model.fit(y)
    s_score = metrics.silhouette_score(y, model.labels_, metric='euclidean')
    cluster_scores.append(s_score)

# Show one of the scores
cluster_scores[0]

In [None]:
# Plot a bar chart of the scores
chart_labels = ["{}_{}".format(i[0], i[1]) for i in cluster_range_linkage]

sb.barplot(y=chart_labels, x=cluster_scores)

**Observation**: in the graph above, the configuration with 3 clusters and complete linkage appears to have the best silhouette score.

In [None]:
# Get the index value of the max cluster score
max_score_index = cluster_scores.index(max(cluster_scores))

# Get the number of clusters and linkage type used for the model with the max score
params_to_use = cluster_range_linkage[max_score_index]

print("Number of clusters: {}".format(params_to_use[0]))
print("Linkage type: {}".format(params_to_use[1]))
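The selection step above can also be sketched as a small results table, which makes it easier to compare configurations side by side. This is a minimal sketch on synthetic data, not the stock dataset: the `toy` array, the random seed, and the narrower cluster range are assumptions made for illustration.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for a single numeric feature: three tight groups
rng = np.random.default_rng(0)
toy = np.concatenate(
    [rng.normal(loc=center, scale=0.5, size=30) for center in (0.0, 10.0, 20.0)]
).reshape(-1, 1)

# Score every (n_clusters, linkage) pair on the same data it was fit on
rows = []
for n_clusters, linkage_name in product(range(2, 6), ["average", "complete", "ward"]):
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage=linkage_name
    ).fit_predict(toy)
    rows.append({
        "n_clusters": n_clusters,
        "linkage": linkage_name,
        "silhouette": silhouette_score(toy, labels),
    })

results = pd.DataFrame(rows)

# idxmax returns the row label of the highest score
best = results.loc[results["silhouette"].idxmax()]
print(results.sort_values("silhouette", ascending=False).head())
print("Best:", best["n_clusters"], best["linkage"])
```

When several linkages produce the same partition their scores tie, and `idxmax` returns the first of them, so the reported linkage is then an artifact of ordering rather than a real preference.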

## Fit a Hierarchical Clustering Model

Arguments:
* n_clusters: number of clusters to find
* linkage: linkage criterion; determines which distance to use between sets of observations
  * ward: minimizes the variance of the clusters being merged.
  * average: uses the average of the distances between each pair of observations across the two sets.
  * complete: uses the maximum distance between any pair of observations across the two sets.
* affinity: metric used to compute the linkage (renamed `metric` in scikit-learn 1.2). Can be `euclidean`, `l1`, `l2`, `manhattan`, `cosine`, or `precomputed`. If linkage is `ward`, only `euclidean` is accepted.
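To see the `linkage` argument in action, here is a minimal sketch that fits one model per linkage type on a tiny hand-made array. The `toy` data is hypothetical; the groups are so well separated that all three linkages should recover the same partition, and differences only appear on less clean data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: three well-separated groups of 1-D values
toy = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [10.0], [10.1], [10.2]])

labels_by_linkage = {}
for linkage_name in ("ward", "average", "complete"):
    # The distance metric is left at its default (Euclidean)
    fitted = AgglomerativeClustering(n_clusters=3, linkage=linkage_name).fit(toy)
    labels_by_linkage[linkage_name] = fitted.labels_
    print(linkage_name, fitted.labels_)
```

The integer label assigned to each group is arbitrary and may differ between linkages; what matters is which samples end up together.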

In [None]:
# Fit the model
ag_model = AgglomerativeClustering(n_clusters=params_to_use[0], linkage=params_to_use[1])
model = ag_model.fit(y)

## Model Evaluation

### Silhouette Score

The silhouette score is the mean Silhouette Coefficient of all samples.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (``a``) and the mean nearest-cluster distance (``b``) for each sample.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
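For a single sample, the coefficient is `(b - a) / max(a, b)`. The sketch below computes it by hand on a hypothetical four-point dataset and compares the result with scikit-learn's `silhouette_samples` and `silhouette_score`. The helper `silhouette_by_hand` is illustrative only and assumes every cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two clusters of two points each
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(points, labels):
    """Per-sample silhouette; assumes each cluster has at least 2 members."""
    scores = []
    for i in range(len(points)):
        dists = np.linalg.norm(points - points[i], axis=1)
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself from its own cluster
        a = dists[same].mean()  # mean intra-cluster distance
        # mean distance to the nearest other cluster
        b = min(
            dists[labels == other].mean()
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return np.array(scores)

manual = silhouette_by_hand(points, labels)
print(manual)
print(silhouette_samples(points, labels))
print(manual.mean(), silhouette_score(points, labels))
```

For the first point, `a` is 1 (distance to its only cluster mate) and `b` is 10.5 (mean distance to the other cluster), so its coefficient is 9.5 / 10.5.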

In [None]:
s_score = metrics.silhouette_score(y, model.labels_, metric='euclidean')
print("Silhouette score: %0.2f" % s_score)

### Additional Model Information

In [None]:
print("Number of leaves: {}".format(model.n_leaves_))
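# Recent scikit-learn releases expose `n_connected_components_` and `metric`
# in place of `n_components_` and `affinity`; fall back to the old names if needed
print("Number of components: {}".format(getattr(model, "n_connected_components_", None) or model.n_components_))
print("Model affinity: {}".format(getattr(model, "metric", None) or model.affinity))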