# Hierarchical Clustering
## Kumar Rahul

We will use beer data to perform hierarchical clustering with the `clustermap` function from the `seaborn` package. `clustermap` builds on the hierarchical clustering routines in the `scipy` package.

Refer to http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html for the linkage methods available for hierarchical clustering.

The other option for hierarchical clustering is to use `AgglomerativeClustering` from `sklearn.cluster`. More about it: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering

In [None]:
import pandas as pd
import seaborn as sn


## 1. Preparing Data

Read data from a specified location


In [None]:
from IPython.core.interactiveshell import InteractiveShell
# Display the result of every expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"

In [None]:
raw_df = pd.read_csv("../data/Hclust_Beer data.csv",
                     sep=',', na_values=['', ' '])

# Normalize column names: lower case, spaces replaced by underscores
raw_df.columns = raw_df.columns.str.lower().str.replace(' ', '_')
raw_df.head()


## 2. Extract Features and Standardize

Two ways to extract the features:

* Use `DataFrame.filter` and pass the list of features to keep for scaling.
* Use `DataFrame.drop` and pass the list of columns that should be excluded.

The features can also be selected directly with `dataframeName[[<names of features>]]`.
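As a quick illustration of the three selection styles, here is a minimal sketch on a made-up toy frame (`toy_df` and its values are hypothetical; only the column names mirror this notebook). All three approaches should return the same feature columns.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "beer": ["A", "B", "C"],
    "cal": [100, 120, 140],
    "sod": [10, 15, 20],
    "alc": [4.0, 4.5, 5.0],
    "cost": [0.40, 0.45, 0.50],
})

# Keep the listed columns
by_filter = toy_df.filter(items=["cal", "sod", "alc", "cost"])

# Remove the columns that are not features
by_drop = toy_df.drop(columns=["id", "beer"])

# Select with a list of column names
by_brackets = toy_df[["cal", "sod", "alc", "cost"]]

print(by_filter.equals(by_drop) and by_drop.equals(by_brackets))
```

`drop` is convenient when most columns are features; `filter` or bracket selection is clearer when only a few are needed.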


In [None]:
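#feature_df = raw_df[['cal', 'sod', 'alc', 'cost']]

# Drop the identifier columns; the remaining columns are the features
feature_df = raw_df.drop(columns=['id', 'beer'])
col_names = feature_df.columns
#col_names

# Values of the second column (position 1) become the row labels
row_index = raw_df.iloc[:, 1]
#row_index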

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit_transform returns a NumPy array, so column names and row labels are restored below
feature_scaled_df = pd.DataFrame(scaler.fit_transform(feature_df))

feature_scaled_df.columns = col_names
feature_scaled_df.index = row_index

Use the `rename` method when a specific column or index label needs to be renamed.

In [None]:
#feature_scaled_df.rename(index={'Budweiser':'Bud'}, inplace=True)

Row and column referencing can be explored with the code chunk below. Uncomment it and change the values within `iloc` to see how positional referencing works:

In [None]:
#ref_row_col = raw_df.iloc[:,:]
#ref_row_col
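For readers who want to see a few `iloc` patterns side by side, the following is a small sketch on a hypothetical frame (`toy_df` and its values are made up). `iloc` selects by integer position, rows first and columns second.

```python
import pandas as pd

# Hypothetical toy frame, invented for illustration only.
toy_df = pd.DataFrame(
    {"beer": ["A", "B", "C"], "cal": [100, 120, 140], "alc": [4.0, 4.5, 5.0]}
)

first_two_rows = toy_df.iloc[:2, :]   # rows at positions 0 and 1, all columns
second_column = toy_df.iloc[:, 1]     # every row of the column at position 1 (a Series)
block = toy_df.iloc[1:, [0, 2]]       # rows from position 1 on, columns at positions 0 and 2

print(first_two_rows)
print(second_column)
print(block)
```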

In [None]:
feature_scaled_df


## 3. Cluster and Visualize

Refer to http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html for the available values of `method` and `metric`.
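To make the `method` and `metric` arguments more concrete, here is a minimal sketch that calls `scipy.cluster.hierarchy.linkage` directly on made-up points (the `points` array is hypothetical and is not the beer data). It then cuts the resulting tree into two flat clusters with `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points forming two well-separated groups.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# Complete linkage on Euclidean distances, the same names passed to clustermap.
merge_tree = linkage(points, method="complete", metric="euclidean")

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(merge_tree, t=2, criterion="maxclust")

print(merge_tree.shape)  # one row per merge, so (n - 1, 4)
print(labels)
```

Each row of `merge_tree` records which two clusters were merged, the distance at which they merged, and the size of the new cluster.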

Rather than standardizing the values above, we could have set the `z_score` parameter of `clustermap` to 1, which standardizes the column values.
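One detail is worth checking before swapping the two approaches. `StandardScaler` divides by the population standard deviation (`ddof=0`), while a z-score computed with pandas' default `std()` uses the sample standard deviation (`ddof=1`). The sketch below compares the two on made-up numbers. It assumes, without verifying it here, that the `z_score` option of `clustermap` follows the pandas convention.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values, invented for illustration only.
toy = pd.DataFrame({
    "cal": [100.0, 120.0, 140.0, 160.0],
    "alc": [4.0, 4.5, 5.5, 6.0],
})

# StandardScaler: (x - mean) / population std (ddof=0)
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Column-wise z-score with pandas' default sample std (ddof=1)
z_sample = (toy - toy.mean()) / toy.std()

n = len(toy)
ratio = scaled / z_sample  # the same constant, sqrt(n / (n - 1)), in every cell
print(ratio)
print(np.sqrt(n / (n - 1)))
```

Because the two results differ only by one constant factor, all Euclidean distances scale uniformly, so the merge order of the tree should be unaffected when no values are missing.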

The other option for hierarchical clustering is to use `AgglomerativeClustering` from `sklearn.cluster`. More about it:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering

In [None]:
sn.clustermap(feature_scaled_df, method='complete', metric='euclidean',
              row_cluster=True, col_cluster=False,
              linewidths=.5, figsize=(15, 15))
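For comparison with the heatmap approach, the following is a minimal sketch of `AgglomerativeClustering` on made-up points (the `points` array is hypothetical, not the beer data). It uses complete linkage and leaves the distance at its default, which is Euclidean. Unlike `clustermap`, it returns flat cluster labels rather than a plot.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up 2-D points forming two well-separated groups.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# Complete linkage; the number of clusters must be chosen up front here.
model = AgglomerativeClustering(n_clusters=2, linkage="complete")
labels = model.fit_predict(points)

print(labels)  # one cluster label per row of `points`
```

The label values themselves are arbitrary; only which rows share a label is meaningful.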

### Exercise:

Use `AgglomerativeClustering` from `sklearn.cluster` to build a hierarchical clustering of the beer data.