## Clustering Assignment

#### Problem Statement
HELP International is an international humanitarian NGO committed to fighting poverty and providing people in underdeveloped countries with basic amenities and relief during disasters and natural calamities. It runs many operational projects, along with advocacy drives to raise awareness and funds.

After its recent funding programmes, the NGO has raised around $10 million. The CEO now needs to decide how to use this money strategically and effectively. The main difficulty in making this decision is choosing the countries that are in the direst need of aid.

### Objective
1. To categorise the countries using some socio-economic and health factors that determine the overall development of the country. 
2. To suggest the countries which the CEO needs to focus on the most. 

In [None]:
#importing all necessary libraries here.
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
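# Note: `sklearn.cluster.hierarchical` is not importable in current scikit-learn; the public class is AgglomerativeClustering.
from sklearn.cluster import AgglomerativeClustering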

from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

### EDA Steps


In [None]:
#Load and inspect data
country_df = pd.read_csv('Country-data.csv')
country_df.info()

In [None]:
country_df.describe()

### Data Conversion


In [None]:

# exports, health and imports are treated as percentages of gdpp; convert them to absolute per-capita values.
country_df['exports_total'] = country_df['exports'] * country_df['gdpp'] // 100
country_df['health_total'] = country_df['health'] * country_df['gdpp'] // 100
country_df['imports_total'] = country_df['imports'] * country_df['gdpp'] // 100
# Floor division (//) drops the fractional part rather than rounding; decimal precision is not needed for clustering.
country_df.head(10)
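
The conversion above can be checked on a couple of made-up rows. The sketch below is illustrative only: `toy_df` and its values are hypothetical and do not come from `Country-data.csv`. It shows that floor division drops the fraction, while an explicit `round()` goes to the nearest integer.

```python
import pandas as pd

# Made-up rows for illustration (not taken from Country-data.csv).
toy_df = pd.DataFrame({
    'country': ['A', 'B'],
    'exports': [10.0, 45.5],   # assumed to be a percentage of gdpp
    'gdpp': [558, 12200],
})

product = toy_df['exports'] * toy_df['gdpp']

# Floor division drops the fractional part (55.8 becomes 55.0).
toy_df['exports_floor'] = product // 100
# Plain division keeps the decimals.
toy_df['exports_exact'] = product / 100
# Explicit rounding goes to the nearest integer (55.8 becomes 56.0).
toy_df['exports_round'] = (product / 100).round()

print(toy_df)
```

Either variant should be adequate for clustering after scaling; the difference only matters when the absolute values are reported.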

In [None]:
country_df.describe()

In [None]:
# Plot the correlation matrix to see which features are strongly related and which are largely independent.
features = ['gdpp', 'income', 'exports_total', 'health_total', 'imports_total', 'inflation', 'life_expec', 'total_fer', 'child_mort']
corr_mat = country_df[features].corr()
plt.figure(figsize = (8,6))
sns.heatmap(corr_mat, annot=False)
plt.show()

In [None]:
# We will use distribution plots (histogram + KDE) to visualize how each feature is distributed.
plt.figure(figsize=(20,12))
features = ['gdpp', 'income', 'inflation', 'exports_total', 'health_total', 'imports_total', 'child_mort', 'life_expec', 'total_fer']
for idx, col in enumerate(features):
    ax = plt.subplot(3, 3, idx + 1)
    # sns.distplot is deprecated; histplot with kde=True is the replacement.
    sns.histplot(country_df[col], kde=True)
plt.show()

### Outliers

In [None]:
# We will use box plots to visualize each feature for univariate analysis.
plt.figure(figsize=(20,12))
features = ['gdpp', 'income', 'inflation', 'exports_total', 'health_total', 'imports_total', 'child_mort', 'life_expec', 'total_fer']
for idx, col in enumerate(features):
    ax = plt.subplot(3, 3, idx + 1)
    sns.boxplot(x=country_df[col])

### Capping


In [None]:
# The caps below were read off the box plots above, to pull the extreme points closer to the rest.
# The thresholds are intentionally not derived from the IQR.
# clip() avoids chained assignment, which raises SettingWithCopyWarning and may not modify country_df.
country_df['gdpp'] = country_df['gdpp'].clip(upper=55000)
country_df['income'] = country_df['income'].clip(upper=60000)
country_df['exports_total'] = country_df['exports_total'].clip(upper=50000)
country_df['health_total'] = country_df['health_total'].clip(upper=6000)
country_df['imports_total'] = country_df['imports_total'].clip(upper=50000)
country_df['inflation'] = country_df['inflation'].clip(upper=20)
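
Capping at a fixed upper bound can be written in more than one way. The sketch below uses made-up values (the `toy` frame is hypothetical) to show that a boolean mask with `.loc` and `Series.clip` give the same result, and that both modify the frame itself rather than a temporary copy.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({'gdpp': [500, 12000, 48000, 105000]})

# Option 1: boolean mask with .loc (a single indexing step, no chained assignment).
capped_loc = toy.copy()
capped_loc.loc[capped_loc['gdpp'] >= 55000, 'gdpp'] = 55000

# Option 2: Series.clip with an upper bound.
capped_clip = toy.copy()
capped_clip['gdpp'] = capped_clip['gdpp'].clip(upper=55000)

print(capped_loc['gdpp'].tolist())
print(capped_clip['gdpp'].tolist())
```

Only values above the threshold change; everything else is left as it was.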

In [None]:
# Re-draw the box plots to check the effect of capping.
plt.figure(figsize=(20,12))
features = ['gdpp', 'income', 'inflation', 'exports_total', 'health_total', 'imports_total', 'child_mort', 'life_expec', 'total_fer']
for idx, col in enumerate(features):
    ax = plt.subplot(3, 3, idx + 1)
    sns.boxplot(x=country_df[col])

### Hopkins Statistics
Let's now check whether the dataset has enough cluster tendency to be worth clustering, using the Hopkins statistic.

In [None]:
# Calculating the Hopkins statistic
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan

def hopkins(X):
    d = X.shape[1]    # number of columns (features)
    n = len(X)        # number of rows (countries)
    m = int(0.1 * n)  # sample 10% of the rows
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)

    rand_X = sample(range(0, n, 1), m)

    ujd = []  # distances from uniform random points to their nearest data point
    wjd = []  # distances from sampled data points to their nearest other data point
    for j in range(0, m):
        # A random point is not in the data, so its nearest neighbour is the first result.
        u_dist, _ = nbrs.kneighbors(uniform(X.min(axis=0), X.max(axis=0), d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        # A sampled data point is its own nearest neighbour (distance 0), so take the second result.
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H

In [None]:
# Hopkins statistic value
# Note: The country (index) column is dropped. The exports/imports/health columns are dropped too, since the corresponding '_total' columns are used instead.
# The statistic is computed only on the columns that will be used for clustering.
hopkins(country_df.drop(['country', 'exports', 'imports', 'health'], axis=1))

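Based on the Hopkins statistic above, the data appears suitable for clustering (with this formulation, values close to 1 indicate a strong cluster tendency, while values near 0.5 indicate roughly uniform data). We will proceed with K-Means clustering next.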
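
To build some intuition for how the statistic behaves, the sketch below computes it on synthetic data. It is a simplified illustration, not the notebook's own function: `hopkins_sketch`, its parameters and the generated datasets are all hypothetical. Tightly clustered points should score close to 1, and uniformly scattered points should score near 0.5.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_sketch(X, sample_frac=0.1, seed=0):
    """Illustrative Hopkins statistic for a 2-D NumPy array X (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = max(1, int(sample_frac * n))
    nbrs = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distance from uniform random points (inside the data's bounding box)
    # to their nearest real data point.
    uniform_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nbrs.kneighbors(uniform_pts, n_neighbors=1)[0][:, 0]

    # w: distance from sampled real points to their nearest *other* real point.
    # Column 0 is the point itself at distance 0, so column 1 is used.
    idx = rng.choice(n, size=m, replace=False)
    w = nbrs.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(42)
# Three tight synthetic blobs centred at (0, 0), (5, 5) and (10, 10).
clustered = np.vstack([rng.normal(loc=c, scale=0.1, size=(100, 2)) for c in (0, 5, 10)])
# Points scattered uniformly over a square.
uniform_data = rng.uniform(0, 10, size=(300, 2))

h_clustered = hopkins_sketch(clustered)
h_uniform = hopkins_sketch(uniform_data)
print(f"clustered: {h_clustered:.2f}, uniform: {h_uniform:.2f}")
```

Because the statistic depends on a random sample, repeated runs with different seeds give slightly different values; averaging over several runs is a common way to stabilise it.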

### K-Means Clustering

#### Prepare & Scale

In [None]:
scaler = StandardScaler()
# We drop the unused columns again before scaling.
# Note: The country (index) column is dropped, and exports/imports/health are replaced by the corresponding '_total' columns.
scaled_country_df = scaler.fit_transform(country_df.drop(['country', 'exports', 'imports', 'health'], axis=1)) 
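
Scaling matters here because K-Means relies on Euclidean distance, so a column with a large numeric range (such as `gdpp`) would otherwise dominate columns with a small range (such as `child_mort`). The sketch below, on made-up numbers, shows what `StandardScaler` does: each column ends up with mean 0 and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 mimics a gdpp-like scale, column 1 a child_mort-like scale.
toy = np.array([
    [500.0, 90.0],
    [12000.0, 20.0],
    [48000.0, 4.0],
])

scaled = StandardScaler().fit_transform(toy)

# After scaling, both columns are on a comparable scale.
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```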

#### Silhouette Score

In [None]:
ss = []
for k in range(2,10):
    kmeans = KMeans(n_clusters = k, random_state=20).fit(scaled_country_df)
    ss.append([k, silhouette_score(scaled_country_df, kmeans.labels_)])
    
plt.plot(pd.DataFrame(ss)[0], pd.DataFrame(ss)[1]);

#### Elbow Method to choose K

In [None]:
ssd = []
for k in range(1, 10):
    # random_state is fixed so that the elbow curve is reproducible across runs.
    model = KMeans(n_clusters = k, max_iter = 50, random_state = 20).fit(scaled_country_df)
    ssd.append([k, model.inertia_])

plt.plot(pd.DataFrame(ssd)[0], pd.DataFrame(ssd)[1]);

Based on both the elbow curve and the silhouette score analysis, k = 3 seems to be the best choice for the number of k-means clusters.
Note that, going by the silhouette score alone, k = 2 also looks viable.
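
The silhouette comparison can also be done programmatically instead of reading it off the plot. The following is a minimal sketch on synthetic, well-separated blobs (the data, the chosen centres and the names `scores` and `best_k` are assumptions made for illustration); on real data the best-scoring k should still be weighed against the elbow curve and interpretability.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (for illustration only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.6,
    random_state=20,
)

# Silhouette score for each candidate k (it is undefined for k = 1).
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=20).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print(best_k)
```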

#### Run k-means clustering \[K = 3\]

In [None]:
kmean = KMeans(n_clusters = 3, max_iter = 50, random_state = 20)
kmean.fit(scaled_country_df)

In [None]:
#Adding the cluster IDs to the original dataframe.
country_df = pd.concat([country_df, pd.Series(data=kmean.labels_, name='kmeans_clusterid')], axis = 1)
country_df.head()

In [None]:
country_df['kmeans_clusterid'].value_counts()

In [None]:
# Plotting the clusters with features gdpp, income, child_mort
plt.figure(figsize=(21, 7))
plt.subplot(131)
sns.scatterplot(data = country_df, x = 'gdpp', y = 'income', hue ='kmeans_clusterid', legend = 'full')
plt.subplot(132)
sns.scatterplot(data = country_df, x = 'gdpp', y = 'child_mort', hue ='kmeans_clusterid', legend = 'full')
plt.subplot(133)
sns.scatterplot(data = country_df, x = 'child_mort', y = 'income', hue ='kmeans_clusterid', legend = 'full')

plt.show()

#### Profiling Clusters

In [None]:
country_df[['kmeans_clusterid', 'gdpp', 'income']].groupby('kmeans_clusterid').mean().plot(kind = 'bar')

In [None]:
country_df[['kmeans_clusterid', 'child_mort']].groupby('kmeans_clusterid').mean().plot(kind = 'bar')

In [None]:
plt.figure(figsize=(21,7))
plt.subplot(131)
sns.boxplot(x='kmeans_clusterid', y='gdpp', data=country_df[['kmeans_clusterid', 'gdpp']])
plt.subplot(132)
sns.boxplot(x='kmeans_clusterid', y='income', data=country_df[['kmeans_clusterid', 'income']])
plt.subplot(133)
sns.boxplot(x='kmeans_clusterid', y='child_mort', data=country_df[['kmeans_clusterid', 'child_mort']])
plt.show()

In [None]:
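# The top 5 countries for aid from this clustering are obtained by sorting cluster 2 by decreasing child mortality and then increasing gdpp.
# Note: k-means cluster IDs are arbitrary, so confirm from the profiling plots above that cluster 2 is the low-income, high-mortality group.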
country_df[country_df['kmeans_clusterid']==2].sort_values(by=["child_mort", 'gdpp'], ascending=[False, True])[['country', 'child_mort', 'gdpp', 'income']].head(5)

### Hierarchical Clustering

In [None]:
#Hierarchical clustering with single linkage.
plt.figure(figsize=(30, 10))
linkages = linkage(scaled_country_df, method="single", metric='euclidean')
dendrogram(linkages)
plt.show()

In [None]:
#Hierarchical clustering with complete linkage.
plt.figure(figsize=(30, 10))
linkages = linkage(scaled_country_df, method="complete", metric='euclidean')
dendrogram(linkages)
plt.show()


We will use cut_tree on the complete-linkage tree to obtain a cluster label for each country, with k = 3.
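
As a small illustration of what `cut_tree` returns, the sketch below builds a complete-linkage tree on six made-up points that form three obvious pairs, then cuts it into three clusters. The `points` array is hypothetical; the calls are the same SciPy functions imported at the top of the notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Six made-up 2-D points forming three obvious pairs.
points = np.array([
    [0.0, 0.0], [0.1, 0.0],
    [5.0, 5.0], [5.1, 5.0],
    [10.0, 0.0], [10.1, 0.0],
])

# Build the merge tree with complete linkage.
Z = linkage(points, method='complete', metric='euclidean')

# cut_tree returns an (n_samples, 1) array; flatten it to one label per row.
labels = cut_tree(Z, n_clusters=3).reshape(-1)
print(labels)
```

Each entry of `labels` lines up with the corresponding input row, which is why the labels can be assigned directly as a new DataFrame column.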

#### Run Hierarchical clustering \[K = 3\]

In [None]:
cluster_labels = cut_tree(linkages, n_clusters=3).reshape(-1,)
country_df['hierarchical_clusterid'] = cluster_labels
country_df.head()

In [None]:
country_df['hierarchical_clusterid'].value_counts()

In [None]:
# Plotting the clusters with features gdpp, income, child_mort
plt.figure(figsize=(21, 7))
plt.subplot(131)
sns.scatterplot(data = country_df, x = 'gdpp', y = 'income', hue ='hierarchical_clusterid', legend = 'full')
plt.subplot(132)
sns.scatterplot(data = country_df, x = 'gdpp', y = 'child_mort', hue ='hierarchical_clusterid', legend = 'full')
plt.subplot(133)
sns.scatterplot(data = country_df, x = 'child_mort', y = 'income', hue ='hierarchical_clusterid', legend = 'full')

plt.show()

#### Profiling Hierarchical Clusters

In [None]:
country_df[['hierarchical_clusterid', 'gdpp', 'income']].groupby('hierarchical_clusterid').mean().plot(kind = 'bar')

In [None]:
country_df[['hierarchical_clusterid', 'child_mort']].groupby('hierarchical_clusterid').mean().plot(kind = 'bar')

In [None]:
plt.figure(figsize=(21,7))
plt.subplot(131)
sns.boxplot(x='hierarchical_clusterid', y='gdpp', data=country_df[['hierarchical_clusterid', 'gdpp']])
plt.subplot(132)
sns.boxplot(x='hierarchical_clusterid', y='income', data=country_df[['hierarchical_clusterid', 'income']])
plt.subplot(133)
sns.boxplot(x='hierarchical_clusterid', y='child_mort', data=country_df[['hierarchical_clusterid', 'child_mort']])
plt.show()

In [None]:
# The top 5 countries from this clustering method are obtained by sorting cluster 0 by decreasing child mortality and then increasing gdpp.
# We also prefilter the countries within this cluster to those with child mortality of at least 100, gdpp of at most 1000 and income of at most 1500.
country_df[(country_df['hierarchical_clusterid']==0) & (country_df['child_mort'] >= 100) & (country_df['gdpp'] <= 1000) & (country_df['income'] <= 1500)].sort_values(by=["child_mort", 'gdpp'], ascending=[False, True])[['country', 'child_mort', 'gdpp', 'income']].head(5)

### Final Analysis
1. Publish the classifications obtained for the countries.
2. Choose a cluster/clusters that should be preferred for aid, based on the socio-economic/health factors.


Further, we took the countries with the highest child mortality rates and the lowest GDPP to determine the 5 best candidate countries to receive aid. The list below is taken from the hierarchical clustering output.

In [None]:
# The top 5 countries from this clustering method are obtained by sorting cluster 0 by decreasing child mortality and then increasing gdpp.
country_df[(country_df['hierarchical_clusterid']==0)].sort_values(by=["child_mort", 'gdpp'], ascending=[False, True])[['country']].head(5)