<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width = 400, align = "center"></a>

# <center>Hierarchical Clustering</center>

Welcome to the lab on hierarchical clustering with Python, using the SciPy and scikit-learn packages.

#  Hierarchical Clustering - Agglomerative

We will be looking at a clustering technique called <b>Agglomerative Hierarchical Clustering</b>. Remember that agglomerative clustering is the bottom-up approach. <br> <br>
In this lab, we focus on Agglomerative clustering, which is more popular than Divisive clustering. <br> <br>
We will also use Complete Linkage as the linkage criterion. <br>
<b> <i> NOTE: You can also try Average Linkage wherever Complete Linkage is used to see the difference! </i> </b>

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li><a href="#insurance_dataset">Insurance Dataset</a></li>
        <li><a href="#weather_station_clustering">Weather Station Clustering</a></li>
            <ol>
                <li><a href="#download_data">Download data into your Data Scientist Workbench</a></li>
                <li><a href="#load_dataset">Load the dataset</a></li>
                <li><a href="#data_cleaning">Data Cleaning</a></li>
                <li><a href="#data_visualization">Data Visualization</a></li>
                <li><a href="#data_sampling">Data Sampling</a></li>
                <li><a href="#data_clustering">Data Clustering using average temperature</a></li>
                <li><a href="#plot_dendrogram">Plot the first dendrogram</a></li>
                <li><a href="#clustering_location_temperature">Clustering based on location and temperature</a></li>
                <li><a href="#visualize_dendrogram">Visualization dendrogram</a></li>
                <li><a href="#clustering_results">Clustering results (Labels)</a></li>
                <li><a href="#visualization_clusters">Visualization of clusters</a></li>
            </ol>
    </ul>
</div>
<br>
<hr>

Let's import the libraries that we need for this lab.

In [None]:
import numpy as np 
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.datasets import make_blobs
%matplotlib inline

---
### Generating Random Data
We will be generating a set of data using the <b>make_blobs</b> function. <br> <br>
Input these parameters into make_blobs:
<ul>
    <li> <b>n_samples</b>: The total number of points equally divided among clusters. </li>
    <ul> <li> Choose a number from 10-1500 </li> </ul>
    <li> <b>centers</b>: The number of centers to generate, or the fixed center locations. </li>
    <ul> <li> Choose arrays of x,y coordinates for generating the centers. Have 1-10 centers (ex. centers=[[1,1], [2,5]]) </li> </ul>
    <li> <b>cluster_std</b>: The standard deviation of the clusters. The larger the number, the more spread out the points within each cluster. </li>
    <ul> <li> Choose a number between 0.5-1.5 </li> </ul>
</ul> <br>
Save the result to <b>X1</b> and <b>y1</b>.

In [None]:
X1, y1 = make_blobs(n_samples=50, centers=[[4,4], [-2, -1], [1, 1], [10,4]], cluster_std=0.9)
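To get a feel for what <b>cluster_std</b> controls, the following sketch draws two single-blob datasets around the same center and compares their spread. The names `tight` and `loose` and the fixed `random_state` are assumptions added here for repeatability; they are not part of the lab's own cell.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same single center, two different standard deviations.
# random_state is an assumption added so the sketch is repeatable.
tight, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=0.5, random_state=0)
loose, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=1.5, random_state=0)

# Per-feature sample standard deviation of each dataset.
tight_spread = tight.std(axis=0)
loose_spread = loose.std(axis=0)
print("cluster_std=0.5 ->", tight_spread)
print("cluster_std=1.5 ->", loose_spread)
```

The second dataset should be visibly more scattered around its center, which is why overlapping blobs become harder to separate as cluster_std grows.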

Create a scatter plot of the randomly generated data.

In [None]:
plt.scatter(X1[:, 0], X1[:, 1], marker='o') 

---
### Agglomerative Clustering
We will start by clustering the random data points we just created.

The <b> AgglomerativeClustering </b> class will require two inputs:
<ul>
    <li> <b>n_clusters</b>: The number of clusters to find. </li>
    <ul> <li> Value will be: 4 </li> </ul>
    <li> <b>linkage</b>: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pair of clusters that minimizes this criterion. </li>
    <ul> 
        <li> Value will be: 'complete' </li> 
        <li> <b>Note</b>: It is recommended you try everything with 'average' as well </li>
    </ul>
</ul> <br>
Save the result to a variable called <b> agglom </b>

In [None]:
agglom = AgglomerativeClustering(n_clusters = 4, linkage = 'complete')

Fit the model with <b> X1 </b> from the generated data above. The labels <b> y1 </b> can be passed as well, but they are ignored because clustering is unsupervised.

In [None]:
agglom.fit(X1,y1)
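The note about trying 'average' alongside 'complete' can be explored with a small comparison like the one below. It is only a sketch: `X_demo`, `labels_by_linkage`, and the fixed `random_state` are illustrative names and settings, and the adjusted Rand index is one possible way to measure how much the two partitions agree.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Regenerate data with the same centers as the lab; random_state is an assumption.
X_demo, y_demo = make_blobs(n_samples=50,
                            centers=[[4, 4], [-2, -1], [1, 1], [10, 4]],
                            cluster_std=0.9, random_state=0)

# Fit one model per linkage criterion and keep the resulting labels.
labels_by_linkage = {}
for link in ("complete", "average"):
    model = AgglomerativeClustering(n_clusters=4, linkage=link)
    labels_by_linkage[link] = model.fit_predict(X_demo)

# 1.0 means the two partitions are identical (up to renaming of the labels).
agreement = adjusted_rand_score(labels_by_linkage["complete"],
                                labels_by_linkage["average"])
print("ARI between complete and average linkage:", agreement)
```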

Run the following code to show the clustering! <br>
Remember to read the code and comments to gain more understanding on how the plotting works.

In [None]:
# Create a figure of size 6 inches by 4 inches.
plt.figure(figsize=(6,4))

# Min-max scale the data so every feature lies in [0, 1];
# otherwise the plotted labels would be scattered very far apart.

# Per-feature minimum and maximum of X1.
x_min, x_max = np.min(X1, axis=0), np.max(X1, axis=0)

# Rescale X1: (value - min) / (max - min).
X1 = (X1 - x_min) / (x_max - x_min)

# This loop displays all of the data points.
for i in range(X1.shape[0]):
    # Draw each point as its true blob label (y1[i]), colored by the
    # cluster it was assigned to (agglom.labels_[i]) via plt.cm.nipy_spectral.
    plt.text(X1[i, 0], X1[i, 1], str(y1[i]),
             color=plt.cm.nipy_spectral(agglom.labels_[i] / 10.),
             fontdict={'weight': 'bold', 'size': 9})

# Remove the x ticks and y ticks.
plt.xticks([])
plt.yticks([])
#plt.axis('off')

# Overlay the scaled data points as small dots.
plt.scatter(X1[:, 0], X1[:, 1], marker='.')
# Display the plot
plt.show()
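The scaling step in the cell above is plain min-max normalization. A tiny hand-checkable sketch, using a made-up three-point array named `points`, shows what it does to each column:

```python
import numpy as np

# Three hypothetical points with very different ranges per feature.
points = np.array([[2.0, 10.0],
                   [4.0, 30.0],
                   [6.0, 20.0]])

# Per-feature minimum and maximum.
col_min, col_max = points.min(axis=0), points.max(axis=0)

# (value - min) / (max - min) maps every column onto [0, 1].
scaled = (points - col_min) / (col_max - col_min)
print(scaled)
```

Each column now runs from 0 to 1 regardless of its original units, which keeps one feature from dominating the plot or the distance computation.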

---
### Dendrogram Associated for the Agglomerative Hierarchical Clustering
Remember that a <b>distance matrix</b> contains the <b> distance from each point to every other point of a dataset </b>. <br>
Use the function <b> distance_matrix, </b> which requires <b>two inputs</b>. Use the feature matrix <b> X1 </b> as both inputs and save the distance matrix to a variable called <b> dist_matrix </b>. <br> <br>
Remember that the distance values are symmetric, with a diagonal of 0's. This is one way of making sure your matrix is correct. <br> (Print out dist_matrix to make sure it's correct.)

In [None]:
dist_matrix = distance_matrix(X1,X1) 
print(dist_matrix)
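Rather than checking symmetry and the zero diagonal by eye, the two properties can be tested directly. The sketch below uses a small made-up array `pts` (three collinear points chosen so the distances are 5 and 10) instead of the lab's data:

```python
import numpy as np
from scipy.spatial import distance_matrix

# Hypothetical points: consecutive points are exactly 5 units apart.
pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dm = distance_matrix(pts, pts)

# A valid distance matrix equals its transpose and has zeros on the diagonal.
is_symmetric = np.allclose(dm, dm.T)
has_zero_diagonal = np.allclose(np.diag(dm), 0.0)
print(dm)
print(is_symmetric, has_zero_diagonal)
```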

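Using the <b> linkage </b> function from hierarchy, pass in the parameters:
<ul>
    <li> The distance matrix in condensed form (<b>linkage</b> treats a square 2-D array as raw observations, so convert it with <b>squareform</b> first) </li>
    <li> 'complete' for complete linkage </li>
</ul> <br>
Save the result to a variable called <b> Z </b>

In [None]:
from scipy.spatial.distance import squareform

# linkage expects a condensed (1-D) distance matrix, not the square form.
Z = hierarchy.linkage(squareform(dist_matrix, checks=False), 'complete')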
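SciPy's linkage function accepts either a condensed (1-D) distance vector or a 2-D array of raw observations. The sketch below, on random data named `obs` (an assumption for illustration), converts a square distance matrix to condensed form and checks that the resulting tree matches the one built from the observations themselves:

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform

# Eight hypothetical 2-D observations.
rng = np.random.default_rng(0)
obs = rng.normal(size=(8, 2))

# Square (8 x 8) distance matrix and its condensed form of length 8*7/2 = 28.
square = distance_matrix(obs, obs)
condensed = squareform(square, checks=False)

# Both inputs describe the same Euclidean distances, so the trees should match.
Z_from_condensed = hierarchy.linkage(condensed, "complete")
Z_from_obs = hierarchy.linkage(obs, "complete")
same_tree = np.allclose(Z_from_condensed, Z_from_obs)
print(condensed.shape, same_tree)
```

Passing the square matrix directly would instead make each row of the matrix an observation vector, which generally produces a different tree.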

Next, we will save the dendrogram to a variable called <b>dendro</b>. Calling the function also displays the dendrogram.
Using the <b> dendrogram </b> function from hierarchy, pass in the parameter:
<ul> <li> Z </li> </ul>

In [None]:
dendro = hierarchy.dendrogram(Z)
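The value returned by dendrogram is a dictionary describing the drawing, and the linkage matrix has one row per merge. A minimal sketch on five made-up points (`obs`, `Z_demo`, and `info` are illustrative names) shows both, using no_plot=True so nothing is drawn:

```python
import numpy as np
from scipy.cluster import hierarchy

# Five hypothetical points: two close pairs and one isolated point.
obs = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5], [10.0, 0.0]])

# Each row of the linkage matrix is one merge:
# [cluster index a, cluster index b, merge distance, size of the new cluster].
Z_demo = hierarchy.linkage(obs, "complete")

# no_plot=True returns the layout information without drawing anything.
info = hierarchy.dendrogram(Z_demo, no_plot=True)
print(Z_demo.shape)
print(info["leaves"])
```

The 'leaves' entry lists the original row indices in the order they appear along the dendrogram axis, which is what the later cells use to build tick labels.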

<h1 id="insurance_dataset">Insurance dataset</h1>

## Read data

In [None]:
import pandas as pd

filename = 'InsuranceClaims.csv'

# Read the CSV file into a DataFrame
pdf = pd.read_csv(filename)
print("Shape of dataset: ", pdf.shape)

pdf.head(5)

### Pre-processing

In [None]:
pdf.dtypes

In [None]:
pdf['claimamt'] = pd.to_numeric(pdf['claimamt'], errors='coerce')
pdf = pdf[pd.notnull(pdf["claimamt"]) & (np.isfinite(pdf['claimamt']))]
pdf = pdf.reset_index(drop=True)
print("Shape of dataset after cleaning: ", pdf.shape)
pdf.head(5)

In [None]:
# Normalization: rescale every column to the [0, 1] range.
from sklearn.preprocessing import MinMaxScaler
x = pdf.values  # returns a NumPy array
min_max_scaler = MinMaxScaler()
norp = min_max_scaler.fit_transform(x)
norp[0:5]

<h2 id="modeling_using_scipy">Modeling using Scipy</h2>

In [None]:
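from scipy.spatial.distance import euclidean

# Pairwise Euclidean distances between all normalized rows.
# (scipy.zeros is no longer available in recent SciPy; use np.zeros.)
leng = norp.shape[0]
D = np.zeros([leng, leng])
for i in range(leng):
    for j in range(leng):
        D[i,j] = euclidean(norp[i], norp[j])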
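The double loop above is easy to read but slow for large datasets. As a sketch of an alternative, SciPy's cdist computes the same matrix in one vectorized call; `feats` below is a small random stand-in for the normalized feature matrix, not the insurance data.

```python
import numpy as np
from scipy.spatial.distance import cdist, euclidean

# Hypothetical stand-in for a normalized feature matrix: 6 rows, 3 features.
rng = np.random.default_rng(0)
feats = rng.random((6, 3))
n = feats.shape[0]

# Loop version, mirroring the cell above.
D_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D_loop[i, j] = euclidean(feats[i], feats[j])

# Vectorized version: all pairwise Euclidean distances in one call.
D_vec = cdist(feats, feats, metric="euclidean")
print(np.allclose(D_loop, D_vec))
```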

In [None]:
import pylab
import scipy.cluster.hierarchy
# D is a square 2-D array, so linkage treats each row of D as an observation vector.
Z = hierarchy.linkage(D, 'single')

In [None]:
fig = pylab.figure(figsize=(28,10))
# Label each leaf with [holderage vehiclegroup vehicleage] of the corresponding row.
def llf(id):
    return '[%d %d %d]' % (pdf['holderage'][id], pdf['vehiclegroup'][id], pdf['vehicleage'][id])

dendro = hierarchy.dendrogram(Z, leaf_label_func=llf, leaf_rotation=90, leaf_font_size=12)

One way to extract flat clusters from the dendrogram is to fix the number of clusters in advance. You can then use:

In [None]:
from scipy.cluster.hierarchy import fcluster
k = 5
clusters = fcluster(Z, k, criterion='maxclust')
clusters


Alternatively, cut the dendrogram at a distance threshold:

In [None]:
from scipy.cluster.hierarchy import fcluster
max_d = 4.2
clusters = fcluster(Z, max_d, criterion='distance')
clusters
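The two fcluster criteria can be compared on a toy example where the answer is easy to verify by hand. In this sketch, `obs` holds five made-up points (two tight pairs and one outlier), and the threshold 2.0 is chosen for this toy data only:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two tight pairs (1 unit apart) and one isolated point.
obs = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
Z_demo = linkage(obs, "single")

# Ask for at most 3 flat clusters.
by_count = fcluster(Z_demo, 3, criterion="maxclust")

# Cut the tree at height 2.0: only merges at distance <= 2.0 are kept.
by_height = fcluster(Z_demo, 2.0, criterion="distance")
print(by_count)
print(by_height)
```

Both calls should put each tight pair in its own cluster and leave the outlier alone. With 'maxclust' you choose the number of groups; with 'distance' you choose how dissimilar members of a group may be, and the number of groups follows from the data.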

## Modeling using scikit-learn

In [None]:
dist_matrix = distance_matrix(norp,norp) 
print(dist_matrix)

In [None]:
agglom = AgglomerativeClustering(n_clusters = 5, linkage = 'complete')

In [None]:
agglom.fit(norp)

In [None]:
agglom.labels_
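On well-separated data, scikit-learn's AgglomerativeClustering and SciPy's linkage plus fcluster should recover the same partition, differing only in how clusters are numbered. The sketch below checks this on synthetic blobs; `X_demo`, the centers, and `random_state` are assumptions chosen so the clusters are far apart.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three tight, widely separated blobs (illustrative data only).
X_demo, _ = make_blobs(n_samples=60, centers=[[0, 0], [20, 20], [-20, 20]],
                       cluster_std=0.5, random_state=0)

# scikit-learn: labels start at 0.
sk_labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X_demo)

# SciPy: build the tree, then cut it into 3 flat clusters; labels start at 1.
sp_labels = fcluster(linkage(X_demo, "complete"), 3, criterion="maxclust")

# Adjusted Rand index compares partitions independently of label numbering.
ari = adjusted_rand_score(sk_labels, sp_labels)
print("ARI:", ari)
```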

In [None]:
pdf['cluster_'] = agglom.labels_
pdf.head()

In [None]:
# Create a figure of size 16 inches by 14 inches.
plt.figure(figsize=(16,14))

for color, label in zip('bgrmy', [0, 1, 2, 3, 4]):
    subset = pdf[pdf.cluster_ == label]
    for i in subset.index:
        plt.text(subset.nclaims[i], subset.claimamt[i],str(int(subset['vehiclegroup'][i])) )
    plt.scatter(subset.nclaims, subset.claimamt, s=subset.vehicleage*200, c=color, label='cluster'+str(label),alpha=0.3)
plt.legend()
plt.title('Clusters')
plt.xlabel('nclaims')
plt.ylabel('claimamt')

Let's summarize it:

In [None]:
pdf.groupby(['cluster_','vehiclegroup'])['cluster_'].count()

In [None]:
agg_claims = pdf.groupby(['cluster_','vehiclegroup'])[['holderage','vehicleage','claimamt','nclaims']].mean()
agg_claims

In [None]:
for color, label in zip('bgrmy', [0, 1, 2, 3, 4]):
    subset = agg_claims.loc[(label,),]
    for i in subset.index:
        plt.text(subset.loc[i, 'vehicleage'], subset.loc[i, 'holderage'], str(int(i)))
    plt.scatter(subset.vehicleage, subset.holderage, s=subset.claimamt*2, c=color, label='cluster'+str(label))
plt.legend()
plt.title('Clusters')
plt.xlabel('vehicleage')
plt.ylabel('holderage')


<h1 id="weather_station_clustering">Weather Station Clustering</h1>

## Hierarchical Clustering using Python & scikit-learn

Let's import all the libraries that we need.

In [None]:
import pandas as pd
import numpy as np
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
from sklearn.preprocessing import normalize
import pylab
import scipy
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot 
%matplotlib inline

### About the dataset

		
<h4 align = "center">
Environment Canada    
Monthly Values for July - 2015	
</h4>
<html>
<head>
<style>
table {
    font-family: arial, sans-serif;
    border-collapse: collapse;
    width: 100%;
}

td, th {
    border: 1px solid #dddddd;
    text-align: left;
    padding: 8px;
}

tr:nth-child(even) {
    background-color: #dddddd;
}
</style>
</head>
<body>

<table>
  <tr>
    <th>Name in the table</th>
    <th>Meaning</th>
  </tr>
  <tr>
    <td><font color = "green"><strong>Stn_Name</strong></font></td>
    <td><font color = "green"><strong>Station Name</strong></font></td>
  </tr>
  <tr>
    <td><font color = "green"><strong>Lat</strong></font></td>
    <td><font color = "green"><strong>Latitude (North+, degrees)</strong></font></td>
  </tr>
  <tr>
    <td><font color = "green"><strong>Long</strong></font></td>
    <td><font color = "green"><strong>Longitude (West - , degrees)</strong></font></td>
  </tr>
  <tr>
    <td>Prov</td>
    <td>Province</td>
  </tr>
  <tr>
    <td>Tm</td>
    <td>Mean Temperature (°C)</td>
  </tr>
  <tr>
    <td>DwTm</td>
    <td>Days without Valid Mean Temperature</td>
  </tr>
  <tr>
    <td>D</td>
    <td>Mean Temperature difference from Normal (1981-2010) (°C)</td>
  </tr>
  <tr>
    <td><font color = "black">Tx</font></td>
    <td><font color = "black">Highest Monthly Maximum Temperature (°C)</font></td>
  </tr>
  <tr>
    <td>DwTx</td>
    <td>Days without Valid Maximum Temperature</td>
  </tr>
  <tr>
    <td><font color = "black">Tn</font></td>
    <td><font color = "black">Lowest Monthly Minimum Temperature (°C)</font></td>
  </tr>
  <tr>
    <td>DwTn</td>
    <td>Days without Valid Minimum Temperature</td>
  </tr>
  <tr>
    <td>S</td>
    <td>Snowfall (cm)</td>
  </tr>
  <tr>
    <td>DwS</td>
    <td>Days without Valid Snowfall</td>
  </tr>
  <tr>
    <td>S%N</td>
    <td>Percent of Normal (1981-2010) Snowfall</td>
  </tr>
  <tr>
    <td><font color = "green"><strong>P</strong></font></td>
    <td><font color = "green"><strong>Total Precipitation (mm)</strong></font></td>
  </tr>
  <tr>
    <td>DwP</td>
    <td>Days without Valid Precipitation</td>
  </tr>
  <tr>
    <td>P%N</td>
    <td>Percent of Normal (1981-2010) Precipitation</td>
  </tr>
  <tr>
    <td>S_G</td>
    <td>Snow on the ground at the end of the month (cm)</td>
  </tr>
  <tr>
    <td>Pd</td>
    <td>Number of days with Precipitation 1.0 mm or more</td>
  </tr>
  <tr>
    <td>BS</td>
    <td>Bright Sunshine (hours)</td>
  </tr>
  <tr>
    <td>DwBS</td>
    <td>Days without Valid Bright Sunshine</td>
  </tr>
  <tr>
    <td>BS%</td>
    <td>Percent of Normal (1981-2010) Bright Sunshine</td>
  </tr>
  <tr>
    <td>HDD</td>
    <td>Degree Days below 18 °C</td>
  </tr>
  <tr>
    <td>CDD</td>
    <td>Degree Days above 18 °C</td>
  </tr>
  <tr>
    <td>Stn_No</td>
    <td>Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically).</td>
  </tr>
  <tr>
    <td>NA</td>
    <td>Not Available</td>
  </tr>


</table>

</body>
</html>

 

<h2 id="download_data">1- Download data into your Data Scientist Workbench</h2>

In [None]:
!wget -O weather-stations20140101-20141231.csv https://ibm.box.com/shared/static/mv6g5p1wpmpvzoz6e5zgo47t44q8dvm0.csv

<h2 id="load_dataset">2- Load the dataset</h2>

In [None]:
filename='weather-stations20140101-20141231.csv'

#Read csv
pdf = pd.read_csv(filename)
print ("Shape of dataset: ", pdf.shape)

pdf.head(5)

<h2 id="data_cleaning">3- Data Cleaning</h2>

In [None]:
pdf = pdf[pd.notnull(pdf["Tm"]) & np.isfinite(pdf['Tm'])]
pdf = pdf.reset_index(drop=True)
print("Shape of dataset after cleaning: ", pdf.shape)
pdf.head(10)

<h2 id="data_visualization">4- Data Visualization</h2>

In [None]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams

rcParams['figure.figsize'] = (14,10)

llon = -140
ulon = -50
llat = 40
ulat = 65

pdf = pdf[(pdf['Long'] > llon) & (pdf['Long'] < ulon) & (pdf['Lat'] > llat) &(pdf['Lat'] < ulat)]

my_map = Basemap(projection='merc',
            resolution = 'l', area_thresh = 1000.0,
            llcrnrlon=llon, llcrnrlat=llat, #min longitude (llcrnrlon) and latitude (llcrnrlat)
            urcrnrlon=ulon, urcrnrlat=ulat) #max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
#my_map.drawmapboundary()
my_map.fillcontinents(color = 'white', alpha = 0.3)
my_map.shadedrelief()

# Project station longitude/latitude to map (x, y) coordinates.

xs,ys = my_map(np.asarray(pdf.Long), np.asarray(pdf.Lat))
pdf['xm'] = xs.tolist()
pdf['ym'] = ys.tolist()

# Visualization: plot every station as a red dot.
for index, row in pdf.iterrows():
    # x, y = my_map(row.Long, row.Lat)
    my_map.plot(row.xm, row.ym, markerfacecolor=([1,0,0]), marker='o', markersize=5, alpha=0.75)
# plt.text(x, y, stn)
plt.show()



<h2 id="data_sampling">5- Data Sampling</h2>

In [None]:
hpdf = pdf.sample(frac=0.03, replace=False).reset_index(drop=True)
print ("Shape of sampled dataset: ", hpdf.shape)
hpdf.head()

<h2 id="data_clustering">6- Data Clustering using average temperature</h2>

In [None]:
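Temper = np.asarray(hpdf['Tm'])
nx = normalize(Temper.astype(float).reshape(-1, 1), axis=0)
# Pairwise absolute temperature differences (nx has shape (n, 1)).
D = np.zeros([nx.size, nx.size])
for i in range(nx.size):
    for j in range(nx.size):
        D[i,j] = abs(nx[i, 0] - nx[j, 0])
# D is a square 2-D array, so linkage treats each row of D as an observation vector.
Y = sch.linkage(D, method='centroid')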

<h2 id="plot_dendrogram">7- Plot the first dendrogram</h2>

In [None]:
fig = pylab.figure(figsize=(8,8))
ax1 = fig.add_axes([0.1,0.1,0.4,0.6])
Z1 = sch.dendrogram(Y, orientation='right')
ax1.set_xticks([])
# One label per leaf, in dendrogram order: (mean temperature, station name).
leaves = Z1['leaves']
lb = ['(%.2f, %s)' % (t, name) for t, name in zip(Temper[leaves], hpdf['Stn_Name'].iloc[leaves])]
ax1.set_yticklabels(lb)
fig.show()

<h2 id="clustering_location_temperature">8-Clustering based on location and temperature</h2>

In [None]:
# Normalization: scale each column (feature) to unit L2 norm.
from sklearn.preprocessing import normalize
numpyMatrix = hpdf[['Tm','Tn','Tx','xm','ym']].to_numpy()
norp = normalize(numpyMatrix, axis=0)
norp[0:5]
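Note that sklearn's normalize with axis=0 is not min-max scaling: it divides each column by its L2 norm. A small sketch with a made-up 2 x 2 matrix `M` makes the effect visible:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical matrix: first column has L2 norm 5, second has norm sqrt(2).
M = np.array([[3.0, 1.0],
              [4.0, 1.0]])

# axis=0 normalizes each column (feature) independently.
M_norm = normalize(M, axis=0)

# After normalization every column has unit L2 norm.
col_norms = np.linalg.norm(M_norm, axis=0)
print(M_norm)
print(col_norms)
```

Because the columns are only rescaled, not shifted, values keep their sign and relative proportions within a feature.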

In [None]:
from scipy.spatial.distance import euclidean

# Pairwise Euclidean distances between all normalized rows.
leng = norp.shape[0]
D = np.zeros([leng, leng])
for i in range(leng):
    for j in range(leng):
        D[i,j] = euclidean(norp[i], norp[j])

<h2 id="visualize_dendrogram">9-Visualization dendrogram.</h2>

In [None]:
fig = pylab.figure(figsize=(8,8))
ax1 = fig.add_axes([0.1,0.1,0.4,0.6])
# D is a square 2-D array, so linkage treats each row of D as an observation vector.
Y = sch.linkage(D, method='centroid')
Z1 = sch.dendrogram(Y, orientation='right')
ax1.set_xticks([])
#ax1.set_yticks([])
# One label per leaf, in dendrogram order: (Tx, Tm, Tn, station name, Lat, Long).
rows = hpdf.iloc[Z1['leaves']]
lb = ['(%.2f, %.2f, %.2f, %s, %.2f, %.2f)' % (r.Tx, r.Tm, r.Tn, r.Stn_Name, r.Lat, r.Long)
      for r in rows.itertuples()]
ax1.set_yticklabels(lb)
fig.show()

<h2 id="clustering_results">10- Clustering results (Labels)</h2>

In [None]:
labels = sch.fcluster(Y, 0.9*D.max(), 'distance')
hpdf["Clus_hier"]=labels-1
clusterNum=labels.max()
print (hpdf.Clus_hier)
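fcluster numbers its clusters from 1, which is why the cell above subtracts 1 before storing the labels. The sketch below illustrates this on five made-up 1-D values; note that it takes the threshold as a fraction of the largest merge height in the linkage matrix, which is an illustrative choice and differs from the 0.9*D.max() used in the lab.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical 1-D observations: two close pairs and one distant value.
obs = np.array([[0.0], [1.0], [10.0], [11.0], [50.0]])
Y_demo = sch.linkage(obs, method="complete")

# Column 2 of the linkage matrix holds the merge heights; the last merge is at 50.
threshold = 0.5 * Y_demo[:, 2].max()

# Keep only merges at or below the threshold, then shift labels to start at 0.
labels_demo = sch.fcluster(Y_demo, threshold, "distance")
zero_based = labels_demo - 1
print(labels_demo, zero_based)
```

Here the cut at height 25 keeps the first four values together and leaves 50 on its own, so two clusters remain.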

<h2 id="visualization_clusters">11-Visualization of clusters</h2>

In [None]:
rcParams['figure.figsize'] = (14,10)

my_map = Basemap(projection='merc',
            resolution = 'l', area_thresh = 1000.0,
            llcrnrlon=llon, llcrnrlat=llat, #min longitude (llcrnrlon) and latitude (llcrnrlat)
            urcrnrlon=ulon, urcrnrlat=ulat) #max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
#my_map.drawmapboundary()
my_map.fillcontinents(color = 'white', alpha = 0.3)
my_map.shadedrelief()

# To create a color map
colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, clusterNum))

# Plot each sampled station, colored by its cluster.
# (np.int was removed from NumPy; the built-in int is used instead.)
for index,row in hpdf.iterrows():
    #print(row.xm, row.ym, colors[int(row.Clus_hier)])
    my_map.plot(row.xm, row.ym, markerfacecolor=colors[int(row.Clus_hier)], marker='o', markersize=10, alpha=0.75)
# Label each cluster with its number at the cluster's mean map position.
for i in range(clusterNum):
    cluster=hpdf[["Stn_Name","Tm","xm","ym","Clus_hier"]][hpdf.Clus_hier==i]
    cenx=np.mean(cluster.xm)
    ceny=np.mean(cluster.ym)
    plt.text(cenx,ceny,str(i), fontsize=25, color='red',)
    #print("Cluster "+str(i)+', Avg Temp: '+ str(np.mean(cluster.Tm)))

### Thanks for completing this lesson!

Notebook created by: <a href = "https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>

<hr>

Copyright &copy; 2018 [Cognitive Class](https://cocl.us/DX0108EN_CC). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).