# Hierarchical Clustering on the Happiness Report
<div class="alert alert-block alert-info" style="margin-top: 20px">
1. [Introduction and Data Import](#0)<br>
2. [Feature Selection](#1)<br>
3. [Clustering using Scipy](#2)<br>
4. [Visualizing](#3)<br>
5. [Clustering using scikit-learn](#4)<br>
6. [Visualizing](#5)
</div>
<hr>

# Introduction and Data Import <a id="0"></a>

Importing necessary libraries.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn releases
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df=pd.read_csv("/kaggle/input/world-happiness/2015.csv")
df.head()

# Feature Selection <a id="1"></a>

Let's select our feature set:

In [None]:
featureset = df[["Standard Error","Economy (GDP per Capita)","Family","Health (Life Expectancy)","Freedom","Trust (Government Corruption)","Generosity","Dystopia Residual"]]

Normalization<br>
Next we normalize the feature set. MinMaxScaler rescales each feature individually to a given range, (0, 1) by default, so that after the transformation the smallest value of every feature is zero and the largest is one.

In [None]:
from sklearn.preprocessing import MinMaxScaler
x = featureset.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
feature_mtx[0:5]
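
As a quick sanity check on what MinMaxScaler does, the sketch below applies the min-max formula (x - min) / (max - min) by hand and compares it with the scaler's output. The array `toy` is made up for illustration and is not taken from the happiness data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 4 rows, 2 features on very different scales
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0],
                [5.0, 1000.0]])

# Manual min-max scaling, column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# Same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

# True if both approaches agree up to floating-point tolerance
print(np.allclose(manual, scaled))
```

Scaling matters here because Euclidean distance would otherwise be dominated by whichever feature has the largest numeric range.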

# Clustering using Scipy <a id="2"></a>

In this part we use the SciPy package to cluster the dataset.<br>
First, we calculate the pairwise Euclidean distance matrix.

In [None]:
from scipy.spatial.distance import euclidean

# Build the full pairwise Euclidean distance matrix (scipy.zeros is deprecated; use NumPy)
leng = feature_mtx.shape[0]
D = np.zeros([leng, leng])
for i in range(leng):
    for j in range(leng):
        D[i, j] = euclidean(feature_mtx[i], feature_mtx[j])
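
The double loop above can also be written with SciPy's `pdist` and `squareform`. The sketch below is only an illustration: `X` is a small random matrix standing in for `feature_mtx`, and the loop is repeated just to confirm that both approaches give the same matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, euclidean

rng = np.random.default_rng(0)
X = rng.random((6, 3))  # hypothetical stand-in for feature_mtx

# Condensed form: one entry per unordered pair, n*(n-1)/2 values
condensed = pdist(X, metric="euclidean")

# Square form: full n x n symmetric matrix with a zero diagonal
square = squareform(condensed)

# Reference result from an explicit double loop
n = X.shape[0]
loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        loop[i, j] = euclidean(X[i], X[j])

print(condensed.shape, square.shape)
print(np.allclose(square, loop))
```

The condensed vector is the form that `scipy.cluster.hierarchy.linkage` expects when it is given precomputed distances.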

In [None]:
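import pylab
from scipy.spatial.distance import squareform

# linkage expects a condensed distance vector; a square matrix would be read as raw observations
Z = hierarchy.linkage(squareform(D), 'complete')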

Hierarchical clustering does not require the number of clusters to be specified in advance. In some applications, however, we want a partition into disjoint clusters, as in flat clustering. One way to get it is to cut the dendrogram at a chosen distance threshold:

In [None]:
from scipy.cluster.hierarchy import fcluster
max_d = 3  # cut height, in the same units as the merge distances in Z[:, 2]
clusters = fcluster(Z, max_d, criterion='distance')
clusters

Alternatively, you can specify the maximum number of clusters directly:

In [None]:
from scipy.cluster.hierarchy import fcluster
k = 5
clusters = fcluster(Z, k, criterion='maxclust')
clusters
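
To make the difference between the two criteria concrete, here is a minimal sketch on made-up one-dimensional points (not the happiness data). The names `points` and `Z_toy` are illustrative. The groups are far apart, so cutting by height and asking for three clusters should produce the same partition.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Hypothetical toy data: three well-separated groups of 1-D points
points = np.array([[0.0], [0.1], [0.2],
                   [5.0], [5.1],
                   [10.0], [10.2]])

Z_toy = hierarchy.linkage(pdist(points), method="complete")

# Cut by height: merges that happen above distance 1.0 are not applied
by_distance = hierarchy.fcluster(Z_toy, t=1.0, criterion="distance")

# Cut by count: ask for at most 3 flat clusters
by_count = hierarchy.fcluster(Z_toy, t=3, criterion="maxclust")

# Number of distinct clusters found by each criterion
print(len(set(by_distance)), len(set(by_count)))
```

On real data the two criteria rarely line up this neatly, so it helps to inspect the merge heights in the third column of the linkage matrix before choosing a threshold.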

# Visualizing <a id="3"></a>

In [None]:
fig = pylab.figure(figsize=(20, 200))

def llf(idx):
    # Label each leaf with country, region and happiness rank
    return '[%s, %s, %s]' % (df['Country'][idx], df['Region'][idx], int(float(df['Happiness Rank'][idx])))

dendro = hierarchy.dendrogram(Z, leaf_label_func=llf, leaf_rotation=0, leaf_font_size=12, orientation='right')

# Clustering using scikit-learn <a id="4"></a>

Let's repeat the clustering, this time using the scikit-learn package:

In [None]:
dist_matrix = distance_matrix(feature_mtx,feature_mtx) 
print(dist_matrix)

Now we can use the 'AgglomerativeClustering' class from the scikit-learn library to cluster the dataset. AgglomerativeClustering performs hierarchical clustering with a bottom-up approach: each observation starts in its own cluster, and clusters are merged step by step. The linkage criterion determines which distance between clusters is used to decide the next merge:

- Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
- Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.
- Average linkage minimizes the average of the distances between all observations of pairs of clusters.
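
The linkage options listed above can be compared side by side. The following is a minimal sketch on synthetic blobs, not on the happiness data; `X_toy` and `sizes` are illustrative names, and the blob parameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical toy data: 60 points scattered around 3 centres
X_toy, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

sizes = {}
for linkage in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X_toy)
    # Cluster sizes for this linkage; differences reflect the merge strategy
    sizes[linkage] = np.bincount(labels)
    print(linkage, sizes[linkage])
```

Comparing cluster sizes is only a rough check, but large differences between linkages are a hint that the result is sensitive to the merge strategy.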

In [None]:
agglom = AgglomerativeClustering(n_clusters = 3, linkage = 'complete')
agglom.fit(feature_mtx)
agglom.labels_

We can then add a new column to the dataframe that shows the cluster assigned to each row:

In [None]:
df['cluster_'] = agglom.labels_
df.head()

# Visualizing <a id="5"></a>

In [None]:
import matplotlib.cm as cm
n_clusters = max(agglom.labels_)+1
colors = cm.rainbow(np.linspace(0, 1, n_clusters))
cluster_labels = list(range(0, n_clusters))


plt.figure(figsize=(15,15))

for color, label in zip(colors, cluster_labels):
    subset = df[df.cluster_ == label]
    # Annotate each point with its country name
    for i in subset.index:
        plt.text(subset["Happiness Score"][i], subset.Region[i], str(subset.Country[i]), rotation=25)
    # Pass the single RGBA value through color= (c= is ambiguous for 4-element arrays)
    plt.scatter(subset["Happiness Score"], subset.Region, color=color, label='cluster' + str(label), alpha=0.5)
plt.legend()
plt.title('Clusters')
plt.xlabel('Happiness Score')
plt.ylabel('Region')

Broadly, the countries with the highest happiness scores are red, those with the lowest scores are purple, and the rest are light blue.<br>
There is a big 'but', because the USA and a few other countries with high happiness scores are light blue, while a few countries with lower scores are red. Isn't that interesting?

Now we can look at the characteristics of each cluster:

In [None]:
df.groupby(['cluster_','Region'])['cluster_'].count()

In [None]:
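# Select several columns with a list (double brackets); the bare-tuple form fails on recent pandas
agg_reg = df.groupby(['cluster_','Region'])[['Happiness Score','Economy (GDP per Capita)','Freedom','Health (Life Expectancy)']].mean()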
agg_reg

In [None]:
for label in cluster_labels:
    subset=agg_reg.loc[(label,),]
    print(subset)
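
The grouped table has a two-level index (cluster, then region), which is why each cluster can be sliced out with `.loc`. A minimal sketch on a made-up frame shows the pattern; `toy_df` and its values are purely illustrative.

```python
import pandas as pd

# Hypothetical toy frame with the same key columns as the real data
toy_df = pd.DataFrame({
    "cluster_": [0, 0, 1, 1],
    "Region": ["A", "A", "A", "B"],
    "Happiness Score": [7.0, 6.0, 4.0, 5.0],
    "Freedom": [0.6, 0.4, 0.3, 0.5],
})

# Mean of the selected columns per (cluster, region) pair
toy_agg = toy_df.groupby(["cluster_", "Region"])[["Happiness Score", "Freedom"]].mean()
print(toy_agg)

# Selecting on the first index level leaves a frame indexed by Region only
print(toy_agg.loc[1])
```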

In [None]:
plt.figure(figsize=(15,15))
for color, label in zip(colors, cluster_labels):
    subset = agg_reg.loc[label]  # rows for this cluster, indexed by Region
    for region in subset.index:
        plt.text(subset.loc[region, "Happiness Score"], subset.loc[region, "Freedom"],
                 'Region=' + str(region) + ', Health=' + str(subset.loc[region, "Health (Life Expectancy)"]))
    plt.scatter(subset["Happiness Score"], subset["Freedom"], color=color, label='cluster' + str(label))
plt.legend()
plt.title('Clusters')
plt.xlabel('Happiness Score')
plt.ylabel('Freedom')

Thank you for taking the time to look at my kernel.

Please leave a comment if you like it or if you think the kernel needs improvement.