# <span style="color:navy"> HELP International - Humanitarian NGO 
#### <span style="color:navy"> Clustering case study to identify the countries in need of Financial aid

### <span style="color:navy"> Problem Statement
HELP International is an international humanitarian NGO that is committed to fighting poverty and 
providing the people of backward countries with basic amenities and relief during the time of disasters and 
natural calamities. It runs a lot of operational projects from time to time along with advocacy drives to 
raise awareness as well as for funding purposes.

After the recent funding programmes, they have been able to raise around $10 million. 
Now the CEO of the NGO needs to decide how to use this money strategically and effectively. 
The significant issues that come while making this decision are mostly related to choosing the countries 
that are in the direst need of aid. 

The case study is to categorise the countries using some socio-economic and health factors that determine the
overall development of the country and to suggest the countries which the CEO needs to focus on the most.  

#### <span style="color:navy"> Data Dictionary
* **country**: Name of the country
* **child_mort**: Deaths of children under 5 years of age per 1000 live births
* **exports**: Exports of goods and services, given as a percentage of the total GDP
* **health**: Total health spending, given as a percentage of the total GDP
* **imports**: Imports of goods and services, given as a percentage of the total GDP
* **income**: Net income per person
* **inflation**: The measurement of the annual growth rate of the total GDP
* **life_expec**: The average number of years a newborn child would live if the current mortality patterns were to remain the same
* **total_fer**: The number of children that would be born to each woman if the current age-fertility rates were to remain the same
* **gdpp**: The GDP per capita, calculated as the total GDP divided by the total population


## <span style="color:navy"> 1. Import Data and Initial Analysis
    
### 1.1 Data Import

In [None]:
# Import the required packages for analysis and model building
import pandas as pd
import numpy as np 

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/help-international/Country-data.csv')

In [None]:
df.head()

### 1.2 Data Inspection

In [None]:
df.info()

In [None]:
df.describe(percentiles= [.1, .05, .25, .5, .75, .95, .99])

## <span style="color:navy"> 2. Initial Data Analysis

The second step is to do the initial analysis on the data. The following points should be taken care of before proceeding 
with the model building. 
* Find Missing values if any
* Drop the unnecessary variables
* Identify the Categorical and Continuous Features
* Check the data types of all the columns and make changes if needed

### 2.1 Missing Values Analysis

In [None]:
# Check if there are any missing values
df.isnull().sum()

Since there are no missing values, we can proceed to Data Preparation

### 2.2 Data Preparation 

* Convert the columns 'health', 'imports' and 'exports' from a percentage of gdpp into the actual values of 'health', 'imports' and 'exports'.
* Set the index of the dataset to 'country' to enable better data visualization and to remove the categorical variable from the analysis

In [None]:
# Convert the variables from a percentage of gdpp to actual values

df['exports']=(df['exports']*df['gdpp'])/100
df['health']=(df['health']*df['gdpp'])/100
df['imports']=(df['imports']*df['gdpp'])/100

#Set the index value as Country to do analysis of numerical columns
df.set_index('country',inplace=True)

df.head()

## <span style="color:navy"> 3. Visualising the Data

The next step is to visualise the data using `matplotlib` and `seaborn`.

This is one of the most important steps - **understanding the data**. This step will help us understand the properties of the data. 
- Helps to identify any outliers. 
- If there is some obvious multicollinearity going on, this can be identified here. 
- Here's where we can also identify which features are strongly associated with gdpp, income and child_mort, the features used for profiling later (clustering has no outcome variable)

### 3.1 Pairplot to see the distribution of the features

In [None]:
# sns.pairplot creates its own figure, so the size is set through `height` (per facet) instead of plt.figure
sns.pairplot(df, height=2.75)

### 3.2 Heatmap to see the correlation between features

In [None]:
# Heatmap to determine the correlation between the features. 
sns.heatmap(df.corr(), annot = True, cmap="YlGnBu")

### 3.3 Scatter plot to see the distribution of the Countries

We can define a function to plot the scatter plot of countries based on child_mort, income, gdpp and life_expec
* child_mort is along the x_axis
* income is along the y_axis
* The size of the scatter point is based on the gdpp (higher the gdpp, bigger are the points)
* The color of the scatter point is based on the life_expec (higher the life_expec, lighter the shade)

In [None]:
def plotdata(data):
    color = data.life_expec
    area = 5e-2 * data.gdpp
    data.plot.scatter('child_mort','income',
                      s=area,c=color,
                      colormap='Purples_r', vmin=45, vmax=90,
                      linewidths=1,edgecolors='k',
                      figsize=(20,15))
    
    # Label each point with its country name; India gets a larger font so it is easy to spot
    for i, txt in enumerate(data.index):
        size = 25 if txt == 'India' else 10
        plt.annotate(txt, (data.child_mort.iloc[i], data.income.iloc[i]),
                     fontsize=size, ha='left', rotation=25)
    
    plt.title('Countries based on child_mort, income, gdpp and life_expec',fontsize=20)
    plt.xlabel('child_mort',  fontsize=20)
    plt.ylabel('income', fontsize=20)
    plt.tight_layout()    
    plt.show()

In [None]:
plotdata(df)

### 3.4 Distplot to see the distribution of the features

In [None]:
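# Histogram with a KDE curve to see the distribution of each feature
# (sns.distplot is deprecated; sns.histplot is its replacement)
plt.figure(figsize = (15,15))
features = df.columns   # all nine numeric features, including gdpp
for i, col in enumerate(features):
    plt.subplot(3,3,i+1)
    sns.histplot(df[col], kde=True, stat='density')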

### Observations: 
* The Distribution plot highlights the presence of outliers at both the upper and lower ends of the socio-economic spectrum 
* The Distribution plot also highlights the general skewness in the data, implying that there is a wide disparity between the two extremes: wealthy and poor countries.

## <span style="color:navy"> 4. Outlier analysis

As a next step, we have to handle the outliers. As our objective is to find the countries that are in need of financial aid, we can **cap the outliers for the countries that have higher gdpp, income, exports, imports and health**. These are essentially rich countries that will not need aid, and hence can be capped. 

**Capping the Outliers**:
* The outliers are capped at the upper end for the following features - **gdpp, income, exports, imports and health**
* The outliers for the features **child_mort, life_expec and total_fer** should **NOT** be capped, as these extreme values identify the countries that are essential for our analysis

### 4.1 Box plot to identify outliers (Before capping the outliers)

In [None]:
# Box plot to identify the outliers in all nine features (including gdpp)
plt.figure(figsize=(20, 15))
for i, x_var in enumerate(df.columns):
    plt.subplot(3,3,i+1)
    sns.boxplot(x = x_var, data = df)

### 4.2 A special case of Outlier - inflation

* As the sorted table below shows, inflation has a special outlier: only one country, Nigeria, has an inflation of more than 100.  
* When a clustering model was built without capping the inflation outlier, Nigeria ended up in a separate cluster of its own because its inflation is abnormally high. 
* For our analysis, we can also cap the inflation outlier to bring Nigeria on par with the other socio-economically weak countries. But we have to consider this in the final recommendation. 

In [None]:
# To find out the one outlier Country for the inflation feature. 
df.sort_values('inflation', ascending=False).head(10)

### 4.3 Handle Outliers - Capping the upper limit

In [None]:
# Copy the original dataset to a different dataset to cap the Outliers. This is essential to keep the original dataset
df_capped = df.copy()
cap_outliers = ['exports', 'health', 'imports', 'income', 'gdpp', 'inflation']

In [None]:
# For each of the features in cap_outliers, cap the outliers in the upper end. 
# Only the outliers in the upper end need to be capped. 
for var in cap_outliers:
    cap = df[var].quantile(0.95)                      # 95th percentile of the original data
    df_capped[var] = df_capped[var].clip(upper=cap)   # avoids chained assignment

In [None]:
df_capped.describe()

### 4.4 Box plot after capping the outliers

In [None]:
# Plot the boxplot for all nine features (including gdpp)
plt.figure(figsize=(20, 15))
for i, x_var in enumerate(df_capped.columns):
    plt.subplot(3,3,i+1)
    sns.boxplot(x = x_var, data = df_capped)

## <span style="color:navy"> 5. Preparing for the model building

The following are the steps needed to prepare the dataset for model building:
 1. Perform scaling on the dataset 
 2. Verify the Hopkins statistic to determine the cluster tendency of the dataset
 3. Determine the optimum number of K using Elbow-curve method and Silhouette analysis. 

### 5.1 Rescaling the dataset 

In [None]:
# Importing the scaling library - StandardScaler
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

In [None]:
# Scaling the dataset with Standard Scaler 
df_scaled=pd.DataFrame(scaler.fit_transform(df_capped),columns=df_capped.columns, index=df_capped.index)
df_scaled.head()

### 5.2 Hopkins Statistic

The Hopkins statistic measures the cluster tendency of a dataset, in other words, how well the data can be clustered.

* If the value is between 0.01 and 0.3, the data is regularly spaced.
* If the value is around 0.5, the data is random.
* If the value is between 0.7 and 0.99, the data has a high tendency to cluster.
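
For reference, the statistic computed by the `hopkins` function below can be written as follows. This is a sketch of the usual definition; the symbols $u_j$, $w_j$ and $m$ are notation introduced here, chosen to mirror the lists `ujd` and `wjd` and the sample size `m` in the code.

$$H=\frac{\sum_{j=1}^{m}u_j}{\sum_{j=1}^{m}u_j+\sum_{j=1}^{m}w_j}$$

Here $u_j$ is the distance from the $j$-th uniformly generated random point to its nearest data point, and $w_j$ is the distance from the $j$-th sampled data point to its nearest neighbouring data point. When the data is clustered, the $w_j$ are small compared with the $u_j$, so $H$ approaches 1; when the data is uniformly random, the two sums are similar and $H$ is close to 0.5.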

In [None]:
#Calculating the Hopkins statistic
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan
 
def hopkins(X):
    d = X.shape[1] # columns
    n = len(X) # rows
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
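    for j in range(0, m):
        # Uniform random point: its nearest data point is the first neighbour (index 0)
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][0])
        # Sampled data point: index 0 is the point itself (distance 0), so take index 1
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])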
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H

In [None]:
hopkins(df_capped)

In [None]:
hopkins(df_scaled)

### 5.3 Finding the Optimal Number of Clusters

Methods to define the number of clusters
* Visual methods - Elbow criterion
* Mathematical methods - Silhouette coefficient
* Experimentation and interpretation

**5.3.1 Elbow criterion method - Visual method**
* Plot the number of clusters against within-cluster sum-of-squared-errors (SSE)
* Sum of squared distances from every data point to their cluster center
* Identify an "elbow" in the plot (Elbow - a point representing an "optimal" number of clusters)
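
To make the SSE definition above concrete, the following sketch recomputes it by hand and compares it with the `inertia_` attribute of scikit-learn's `KMeans`. The data is synthetic (two made-up blobs), and the names `X`, `km` and `manual_sse` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: two well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(8, 1, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Squared distance of each point to the centre of its assigned cluster, summed over all points
assigned_centres = km.cluster_centers_[km.labels_]
manual_sse = ((X - assigned_centres) ** 2).sum()

print(manual_sse, km.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why `inertia_` can be plotted directly as the SSE in the elbow curve.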

In [None]:
# Elbow-curve/SSD
ssd = []
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50)
    kmeans.fit(df_scaled)
    ssd.append(kmeans.inertia_)
    
# plot the SSDs for each n_clusters starting from 2 clusters
xi = list(range(len(range_n_clusters)))
plt.plot(xi, ssd) 
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances') 
plt.xticks(xi, range_n_clusters)
plt.title('Elbow-curve Analysis')
plt.show()

**5.3.2 Silhouette coefficient - Statistical method**

$$\text{silhouette score}=\frac{p-q}{\max(p,q)}$$

where

* $p$ is the mean distance from the data point to the points in the nearest cluster that it is not a part of
* $q$ is the mean distance from the data point to all the other points in its own cluster

Interpretation:

* The silhouette score lies between -1 and 1. 
* A score closer to 1 indicates that the data point is very similar to the other data points in its cluster.
* A score closer to -1 indicates that the data point is not similar to the data points in its cluster.
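
As a small worked example of the formula above, the sketch below computes the silhouette of one point by hand and compares it with scikit-learn's `silhouette_samples`. The four 1-D points and their cluster labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four 1-D points in two hand-assigned clusters (toy data)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Silhouette of the first point (x = 0), following the definitions of p and q
q = abs(0.0 - 1.0)                                # mean distance to the other points in its own cluster
p = np.mean([abs(0.0 - 10.0), abs(0.0 - 11.0)])   # mean distance to the points in the nearest other cluster
manual = (p - q) / max(p, q)

sk = silhouette_samples(X, labels)[0]
print(manual, sk)
```

Here $q = 1$ and $p = 10.5$, so the score is $9.5 / 10.5 \approx 0.905$, close to 1 because the point sits far from the other cluster. `silhouette_score`, used in the analysis below, is the mean of these per-point values.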

In [None]:
# silhouette analysis
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
s_score = []

for num_clusters in range_n_clusters:
    # initialise kmeans
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50)
    kmeans.fit(df_scaled)
    cluster_labels = kmeans.labels_
    
    # silhouette score
    silhouette_avg = silhouette_score(df_scaled, cluster_labels)
    s_score.append(silhouette_avg)
    
    print("For n_clusters={0}, the silhouette score is {1}".format(num_clusters, silhouette_avg)) 
    
# plot the Silhouette score for each n_clusters starting from 2 clusters
xi = list(range(len(range_n_clusters)))
plt.plot(xi, s_score) 
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette scores') 
plt.xticks(xi, range_n_clusters)
plt.title('Silhouette Score Analysis')
plt.show()

**5.3.3 Experimental approach - Analyze segments**
* Build clustering at and around elbow solution (k=3 or k=4 in our case)
* Analyze their properties - Build models with multiple k's and analyse the results 
* Compare against each other and choose one which makes most business sense

**5.3.4 Observations**
* We can see that the Silhouette score is high for 3 and 4 clusters, after which the score drops. 
* From an analysis perspective too, dividing the countries into 3 or 4 clusters gives a good basis for a recommendation. 
* Fewer than 3 clusters will not help us make a good recommendation, as many countries will be lumped together
* With more than 4 clusters, it becomes difficult to profile the clusters based on the features 

## <span style="color:navy"> 6. K-Means Clustering 
    
K-Means clustering is done for 3 clusters and 4 clusters. 

### 6.1 Create functions to plot the cluster visualizations
*  Once we define the functions for plotting, we can use the same function to plot for multiple datasets based on the k.

In [None]:
# This function is to plot the scatter plot of countries based on child_mort, income and gdpp. 
# The child_mort is along the x-axis, income along the y-axis and the size of the points relational to the gdpp

def plotdata_cluster(data, plot_title):
    area = 5e-2 * data.gdpp
    colors = data.cluster_id.map({0: 'skyblue', 1: 'gold', 2: 'coral', 3: 'palegreen'})
    
    data.plot.scatter('child_mort','income',
                      s=area, c=colors,
                      linewidths=1,edgecolors='k',
                      figsize=(20,15))
        
    # Label each point with its country name; India gets a larger font so it is easy to spot
    for i, txt in enumerate(data.index):
        size = 25 if txt == 'India' else 12
        plt.annotate(txt, (data.child_mort.iloc[i], data.income.iloc[i]),
                     fontsize=size, ha='left', rotation=25)
    
    plt.title(plot_title,fontsize=20)
    plt.xlabel('child_mort',  fontsize=20)
    plt.ylabel('income', fontsize=20)
    plt.tight_layout()

In [None]:
# This function is to plot the scatter plot of countries based on child_mort, life_expec and health (expenditure). 
# The child_mort is along the x-axis, life_expec along the y-axis and the size of the points relational to the health.

def plotdata_health(data, plot_title):
    area = 50e-1 * data.health
    colors = data.cluster_id.map({0: 'skyblue', 1: 'gold', 2: 'coral', 3: 'palegreen'})
    
    data.plot.scatter('child_mort','life_expec',
                      s=area, c=colors,
                      linewidths=1,edgecolors='k',
                      figsize=(20,15))
        
    # Label each point with its country name; India gets a larger font so it is easy to spot
    for i, txt in enumerate(data.index):
        size = 25 if txt == 'India' else 12
        plt.annotate(txt, (data.child_mort.iloc[i], data.life_expec.iloc[i]),
                     fontsize=size, ha='left', rotation=25)
    
    #plt.title('Countries clusters based on child_mort, life_expec and health expenditure',fontsize=20)
    plt.title(plot_title,fontsize=20)
    plt.xlabel('child_mort',  fontsize=20)
    plt.ylabel('life_expec', fontsize=20)
    plt.tight_layout()

### 6.2 K-means with 3 clusters

In [None]:
# K-Means clustering with 3 clusters
# Set the random_state to a fixed value so that the labels do not change for each iteration
kmeans_3=KMeans(n_clusters=3, max_iter=100, random_state=50)    # k=3 and iteration=100
kmeans_3.fit(df_scaled)                                         # fitting the dataset

In [None]:
kmeans_3.labels_

In [None]:
# Appending the cluster labels to the original capped dataset 
df_cap_kmeans3 = df_capped.copy()
df_cap_kmeans3['cluster_id']=kmeans_3.labels_
#df_cap_kmeans3.head()

In [None]:
# Plot the scatter plot of countries for Kmeans clustering with 3 clusters
# child_mort along the x axis, income along the y-axis and size of the scatter point relative to the gdpp 
plot_title = 'Countries based on child_mort, income and gdpp'
plotdata_cluster(df_cap_kmeans3, plot_title)

#### 6.2.1 Observations for Kmeans clustering - 3 Clusters
* The rich countries have high income, low child_mort and high gdpp (based on the size of the scatter points)
* The cluster of interest to the NGO is cluster 2 in the bottom right of the plot, with very low income, very high child_mort and very low gdpp
* Child mortality seems to be an important factor: Equatorial Guinea is placed in cluster 2 even though its income and gdpp are relatively high
* Can you spot India in the plot? The gdpp is very low, indicating the economic disparity between the rich and the poor in the country 

### 6.3 K-means with 4 clusters

In [None]:
# Final model: K Means clustering with k=4
kmeans_4=KMeans(n_clusters=4, max_iter=100, random_state=50)    # k=4 and iteration=100
kmeans_4.fit(df_scaled) # fitting the dataset

In [None]:
# Cluster labels for Kmeans with k=4
kmeans_4.labels_

In [None]:
# Appending the cluster labels to the original capped dataset  
df_cap_kmeans4 = df_capped.copy()
df_cap_kmeans4['cluster_id']=kmeans_4.labels_

In [None]:
# Plot the scatter plot of countries for Kmeans clustering with 4 clusters
# child_mort along the x axis, income along the y-axis and size of the scatter point relative to the gdpp 
plot_title = 'Countries based on child_mort, income and gdpp - 4 Clusters'
plotdata_cluster(df_cap_kmeans4, plot_title)

#### 6.3.1 Observations for Kmeans clustering - 4 Clusters

* It is interesting to note from the scatter plot that although the countries in cluster 0 (developing countries) have a similar gdpp to the countries in cluster 2 (poor countries), their child_mort rate is much lower. 

In [None]:
# Plot the scatter plot of countries based on child_mort, life_expec and health (expenditure). 
# The child_mort is along the x-axis, life_expec along the y-axis and the size of the points is proportional to the health expenditure.

plot_title = 'Countries based on child_mort, life_expec and health expenditure - 4 Clusters'
plotdata_health(df_cap_kmeans4, plot_title)

#### 6.3.2 Observations for Kmeans clustering - 4 Clusters - Health Plot
* The rich countries have high life_expec, low child_mort and high health expenditure (based on the size of the scatter points)
* The cluster of interest to the NGO is cluster 2 in the centre and bottom right of the plot, with very low life_expec, very high child_mort and very low health expenditure.
* Haiti is an outlier country with a child_mort greater than 200 and a very low life expectancy as well. 
* Lesotho is next to Haiti in terms of worst life expectancy.
* Again, can you spot India in the plot? The amount spent on health per capita is still very low. 

## <span style="color:navy"> 7. Segment Profiling and analysis
    
We can use the following methods to compare the segments and do profiling
* Snake Plots
* Relative importance heat map

### 7.1 Snake plots to understand and compare segments
* The snake plot is an important market research technique to compare different segments
* It provides a visual representation of each segment's attributes
* The data should first be normalized (centered and scaled)
* Plot each cluster's average normalized values of each attribute
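
The reshaping behind a snake plot can be sketched on a toy table. The countries `A` to `D` and their scaled values are hypothetical, and `toy`, `toy_long` and `snake` are illustrative names; the point is only to show what the long format looks like and which values the plotted lines pass through.

```python
import pandas as pd

# Toy scaled data with hypothetical countries A-D and two features
toy = pd.DataFrame(
    {'child_mort': [1.2, 0.8, -1.0, -1.0],
     'income':     [-0.9, -1.1, 1.0, 1.0],
     'cluster_id': [0, 0, 1, 1]},
    index=pd.Index(['A', 'B', 'C', 'D'], name='country'))

# Long format: one row per (country, feature) pair, as sns.lineplot expects
toy_long = pd.melt(toy.reset_index(), id_vars=['country', 'cluster_id'],
                   var_name='Feature', value_name='Value')

# The line drawn for each cluster passes through these per-feature means
snake = toy_long.groupby(['cluster_id', 'Feature'])['Value'].mean().unstack()
print(snake)
```

With its default settings, `sns.lineplot` aggregates repeated x values by their mean, so each cluster's line corresponds to one row of `snake`.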

In [None]:
# Define a function to create the snake plots for segment analysis
def snake_plot(data, plot_title):
    sns.set_style('whitegrid')
    df_melt = pd.melt(data.reset_index(), id_vars = ['country','cluster_id'], 
                      var_name = 'Feature', value_name = 'Value')

    plt.figure(figsize = (12,6))
    sns.lineplot(x = 'Feature', y = 'Value', data = df_melt,
                 hue = 'cluster_id', palette = "muted")
    
    plt.title(plot_title,fontsize=20)
    plt.xlabel('Features',  fontsize=20)
    plt.ylabel('Scaled Values', fontsize=20)
    plt.tight_layout()    
    
    # Print the number of countries falling under each cluster
    print(plot_title)
    print(data.cluster_id.value_counts())

In [None]:
# Create Dataset for Snake plots - Scaled dataset should be used for the snake plot
df_sp = df_scaled.copy()

In [None]:
# Use the labels from the KMeans clustering with 3 clusters for snake plot
df_sp['cluster_id'] = kmeans_3.labels_ 

# snake plot for 3 clusters
title = 'Segment Analysis - KMeans with 3 clusters'
snake_plot(df_sp, title)

In [None]:
# Use the labels from the KMeans clustering with 4 clusters for snake plot
df_sp['cluster_id'] = kmeans_4.labels_ 

# snake plot for 4 clusters
plot_title = 'Segment Analysis - KMeans with 4 clusters'
snake_plot(df_sp, plot_title)

#### 7.1.1 Observations from Snake plot - 4 clusters 

The observations from the snake plot are very clear:

* Developed countries have high exports, imports, health, gdpp, income and life expectancy.
  This includes the rich and very rich countries (clusters 1 and 3)
* Developing countries have low exports, imports, health, gdpp and income. What sets them apart from the poor countries is that they have a lower child mortality rate, lower total fertility and higher life expectancy
* Under-developed or poor countries have very low exports, imports, health, gdpp, income and life_expec.
* For the under-developed countries, child_mort moves together with total_fer.
    This suggests that the survival rate of children under 5 years of age is associated with the total number of children born to each mother. Providing affordable contraceptives could help in tackling this issue. 

### 7.2 Relative importance of segment attributes
* Useful technique to identify relative importance of each segment's attribute
* Calculate mean values of each cluster
* Calculate mean values of population
* Calculate importance score by dividing them and subtracting 1 (ensures 0 is returned when cluster mean = population mean)
* Analyze and plot relative importance
* The further a ratio is from 0, the more important that attribute is for a segment relative to the total population.
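
The steps above can be traced on a toy table with two clusters. The values are made up so that the arithmetic is easy to follow, and `toy`, `features` and the other names are illustrative only.

```python
import pandas as pd

# Toy data: two clusters of two hypothetical countries each
toy = pd.DataFrame({'gdpp': [500, 1_500, 4_000, 6_000],
                    'child_mort': [90, 70, 25, 15],
                    'cluster_id': [0, 0, 1, 1]})

features = toy.drop(columns='cluster_id')
cluster_mean = features.groupby(toy['cluster_id']).mean()   # mean of each cluster
pop_mean = features.mean()                                   # mean of the whole population

# 0 means the cluster mean equals the population mean
relative_imp = cluster_mean / pop_mean - 1
print(relative_imp.round(2))
```

For `gdpp` the cluster means are 1000 and 5000 against a population mean of 3000, giving scores of about -0.67 and 0.67; for `child_mort` the scores are 0.6 and -0.6. The sign shows the direction in which a cluster deviates from the population.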

In [None]:
# Relative Importance plot to get the relative importance of each feature for the clusters 
def relative_imp_plot(data, plot_title):
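    cluster_mean = data.groupby(['cluster_id']).mean()
    # Exclude cluster_id from the population mean so that it does not show up as an empty column
    pop_mean = data.drop(columns='cluster_id').mean()
    relative_imp = cluster_mean / pop_mean - 1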
    
    plt.figure(figsize=(10, 4))
    plt.title(plot_title, fontsize=20)
    sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdBu', vmin=-4, vmax=4)
    plt.show()

In [None]:
# Relative Importance plot to get the relative importance of each feature for the clusters - Kmeans 4 clusters
plot_title = 'Relative Importance plot - KMeans with 4 clusters'
relative_imp_plot(df_cap_kmeans4, plot_title)

In [None]:
# Relative Importance plot to get the relative importance of each feature for the clusters - Kmeans 3 clusters
plot_title = 'Relative Importance plot - KMeans with 3 clusters'
relative_imp_plot(df_cap_kmeans3, plot_title)

### 7.3 Choosing the number of K based on experimentation: 

* We have built the model for clusters = 3 and clusters = 4. 
* From the cluster counts printed with the snake plots above, we can see that the majority of the countries fall under Cluster 0 - the developing countries
* The cluster of interest to us is Cluster 2, which has the poorest countries in terms of both economy and health
* There are no major differences between clustering with k = 3 or k = 4 in terms of the developing and poor countries. 
* **We will use k = 4 for our final model** as it helps to understand the effect of outliers better
* Cluster size of 4 also will help understand the differences between rich and very rich countries 

## <span style="color:navy"> 8. Further analysis with Final K-Means Model - 4 Clusters

We can further our analysis with number of clusters as 4. 
Our next step is to analyse the individual features for the 4 clusters to do the profiling in addition to section 7.

### 8.1 Analysis of means for all features

In [None]:
cluster_analysis = df_cap_kmeans4.groupby('cluster_id').agg({'child_mort': 'mean', 'exports': 'mean', 
                             'health': 'mean','imports': 'mean', 'income': 'mean', 'inflation': 'mean', 
                             'life_expec': 'mean', 'total_fer': 'mean', 'gdpp': ['mean', 'count']}).round(0)
cluster_analysis

In [None]:
# Analyse the following features for the summary and boxplot analysis
summary_cols = ['gdpp', 'income', 'life_expec',
                'child_mort', 'total_fer','inflation' ]

In [None]:
# Summary Analysis for mean of the features for each of the clusters
sns.set(style='whitegrid')
plt.figure(figsize=(20, 10))
for i, x_var in enumerate(summary_cols):
    plt.subplot(2,3,i+1)
    sns.barplot(x = cluster_analysis.reset_index().cluster_id, 
                y = cluster_analysis[x_var]['mean'])
    plt.ylabel(x_var, fontsize=15)
    plt.xlabel('cluster_id', fontsize=15)
plt.tight_layout()    
plt.show()    

#### 8.1.1 Observations from the Summary Analysis (Mean)
We can make the following observations from the Mean summary analysis above:
* For the cluster of interest - Cluster 2 (Poor Countries), we can see that 
    * gdpp, income and life_expec are very low
    * child_mort, total_fer and inflation are very high  

### 8.2 Box plot of features for Profiling columns - 'gdpp', 'income' and 'child_mort'

In [None]:
profiling_cols = ['gdpp', 'income', 'child_mort']

sns.set(style='white')
# plt.title('Cluster Profiling based on gdpp, income and child_mort')
plt.figure(figsize=(12,6))

for i, x_var in enumerate(profiling_cols):
    plt.subplot(1,3,i+1)
    sns.boxplot(x = df_cap_kmeans4.cluster_id, 
                y = df_cap_kmeans4[x_var])
    plt.ylabel(x_var, fontsize=20)
    plt.xlabel('cluster_id', fontsize=20)
    plt.tight_layout()    
plt.show()    

### 8.3 Profiling based on columns - 'gdpp', 'income' and 'child_mort'

Based on the three features, we can arrive at the below profiling for the countries
1. **<span style="color:blue">Cluster 0 - Developing Countries** - Have low gdpp, low income and low child_mort.
2. **<span style="color:blue">Cluster 1 - Developed Countries** - Have high gdpp, high income and low child_mort.
3. **<span style="color:blue">Cluster 2 - Under-Developed or Poor Countries** - Have very low gdpp, very low income and very high child_mort.
4. **<span style="color:blue">Cluster 3 - Very Rich Countries** - Have very high gdpp, very high income and very low child_mort. 

## <span style="color:navy"> 9. Recommended list of Countries in need of aid - KMeans Clustering
    
From the analysis in sections 7 and 8, we can conclude that the **countries which are in dire need of financial aid are
from cluster 2**  
    
**For the final list of recommended countries, we will get the data based on original dataset and not the scaled dataset**

### 9.1 List of countries from Cluster 2

The countries are sorted in the following order: 
* 'child_mort' - Descending
* 'gdpp' - Ascending
* 'income' - Ascending
* 'life_expec' - Ascending

#### 9.1.1 Scatter plot of countries from Clusters 2 and 0

Get the countries from Cluster 2 (Poor Countries) and Cluster 0 (Developing Countries) for comparison

In [None]:
# This function plots the scatter plot of countries based on child_mort, income and gdpp. 
# The child_mort is along the x-axis, income along the y-axis and the size of the points is proportional to the gdpp
def plotdata_aid(data, plot_title):
    area = 50e-2 * data.gdpp
    colors = data.cluster_id.map({0: 'skyblue', 1: 'gold', 2: 'coral', 3: 'palegreen'})
    
    data.plot.scatter('child_mort','income',
                      s=area,c=colors,
                      linewidths=1,edgecolors='k',
                      figsize=(20,15))
        
    # Label each point with its country name; India gets a larger font so it is easy to spot
    for i, txt in enumerate(data.index):
        size = 20 if txt == 'India' else 12
        plt.annotate(txt, (data.child_mort.iloc[i], data.income.iloc[i]),
                     fontsize=size, ha='left', rotation=25)
    
    plt.title(plot_title,fontsize=20)
    plt.xlabel('child_mort',fontsize=20)
    plt.ylabel('income',fontsize=20)

In [None]:
# Plot the scatter plot for the Countries in need of financial aid 
# The child_mort is along the x-axis, income along the y-axis and the size of the points relational to the gdpp

plot_title = 'Socio-economically poor and developing Countries - KMeans Clusters 2 and 0'
plotdata_aid(df_cap_kmeans4[(df_cap_kmeans4.cluster_id == 2) | (df_cap_kmeans4.cluster_id == 0)], plot_title)

### 9.2 Final List of Recommended Countries: K Means Clustering

**The final list of 10 Countries recommended for financial aid by KMeans clustering are:**

**<span style="color:blue">'Haiti', 'Sierra Leone', 'Chad', 'Central African Republic', 'Mali',
       'Nigeria', 'Congo, Dem. Rep.', 'Lesotho', 'Burundi', 'Liberia'**
       
To get the final list of countries to provide financial aid, we can rank the countries based on the important features.

We can ignore some of the features:
* health, exports and imports, as they were derived from gdpp
* total_fer, as it moves together with child_mort, which is already included

The ranks for the following features for each Country can be obtained and the Countries in need can be ordered according to these ranks.  
* **'child_mort' - Top 5 Countries with highest child mortality rate**
* **'life_expec' - Top 3 Countries with lowest life expectancy**
* **'gdpp'       - Top 3 countries with lowest gdpp**
* **'income'     - Top 3 countries with lowest income**
* **'inflation'  - Nigeria has abnormally high inflation rate.** 
A very high inflation rate can be detrimental to the development of a country. It also means financial struggle for the poorest people in the country. So, Nigeria can be included in the final list because of its very high inflation, to cater to the needs of its poorest people.
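
The ranking rule can be sketched on a toy table before applying it to the real data. The countries `A` to `D` and their values are hypothetical, and the thresholds here (top 1 on either criterion) are chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical countries with made-up values
toy = pd.DataFrame({'child_mort': [150, 120, 120, 60],
                    'gdpp':       [900, 300, 700, 400]},
                   index=['A', 'B', 'C', 'D'])

# Dense ranking: ties share a rank and the next rank is not skipped
toy['cmrank'] = toy.child_mort.rank(method='dense', ascending=False).astype(int)
toy['gdrank'] = toy.gdpp.rank(method='dense', ascending=True).astype(int)

# Keep a country if it is in the top 1 on either criterion
shortlist = toy[(toy.cmrank <= 1) | (toy.gdrank <= 1)]
print(toy)
print(shortlist.index.tolist())
```

Country `A` is selected for the highest child mortality and `B` for the lowest gdpp. Combining the conditions with `|` means a country needs to rank badly on only one feature to be shortlisted.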

In [None]:
# Get the countries that are in need of aid
df_aid = df.copy()
df_aid['cluster_id'] = kmeans_4.labels_

In [None]:
# Get the countries from Cluster 2 (Poor Countries)
df_aid = df_aid.query('cluster_id == 2').sort_values(['child_mort', 'gdpp', 'income', 'life_expec' ], 
                                                   ascending = (False, True, True, True))
df_aid.head()

In [None]:
# Get the rank for individual features for each country
df_aid['cmrank'] = df_aid.child_mort.rank(method='dense',ascending = False).astype(int)
df_aid['lerank'] = df_aid.life_expec.rank(method='dense',ascending = True).astype(int)
df_aid['gdrank'] = df_aid.gdpp.rank(method='dense',ascending = True).astype(int)
df_aid['inrank'] = df_aid.income.rank(method='dense',ascending = True).astype(int)
df_aid['ifrank'] = df_aid.inflation.rank(method='dense',ascending = False).astype(int)

In [None]:
df_aid.head(5)

In [None]:
# Selection criteria (a country qualifies if it meets any one of them):
# 'child_mort' - top 5 countries with the highest child mortality rate
# 'life_expec' - top 3 countries with the lowest life expectancy
# 'gdpp'       - top 3 countries with the lowest gdpp
# 'income'     - top 3 countries with the lowest income
# 'inflation'  - country with the highest inflation rate (Nigeria)

df_final = df_aid[(df_aid.cmrank <=5) | (df_aid.lerank <=3) | 
                  (df_aid.gdrank <=3) | (df_aid.inrank <=3) | (df_aid.ifrank <=1)]
df_final

In [None]:
df_final.index

## <span style="color:navy"> 10. Hierarchical Clustering
    
We will follow a process for hierarchical clustering similar to the one used for KMeans clustering.

The steps are: 
* Plot Hierarchical Clustering Dendrogram - Single linkage
* Plot Hierarchical Clustering Dendrogram - Complete linkage
* Build model with n_clusters = 3 (Use clusters derived from Complete linkage)
* Build model with n_clusters = 4 (Use clusters derived from Complete linkage)
* Scatter plot visualization for n_clusters = 3 and n_clusters = 4 
* Snake plots for segment analysis for n_clusters = 3 and n_clusters = 4
* Final Recommendations for Countries in need of aid
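As a rough illustration of the "build the linkage, then cut the tree" steps listed above, here is a minimal sketch using SciPy's `linkage` and `cut_tree`. The six 2-D points are hypothetical and chosen to form three well-separated pairs; they are not the case-study data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Hypothetical toy points forming three well-separated pairs (NOT the case-study data)
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0],
                   [25.0, 0.0], [25.0, 1.0]])

# Build the merge tree with complete linkage on Euclidean distances
mergings = linkage(points, method="complete", metric="euclidean")

# Cut the tree into 3 clusters; cut_tree returns an (n, 1) array, so flatten it
labels = cut_tree(mergings, n_clusters=3).reshape(-1)
print(labels)
```

The same merge tree can be cut at different values of `n_clusters` without recomputing the linkage, which is why one set of mergings can serve both the 3-cluster and 4-cluster models.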

### 10.1  Define common functions for Plotting Hierarchical Clustering Dendrogram

In [None]:
# Plot the Hierarchical Clustering Dendrogram for Single linkage
def plot_dendrogram_single(data):
    plt.figure(figsize=(15,8))             # Setting the size of the figure
    sns.set_style('white')                  # Setting style

    # setting the labels on axes and title
    plt.title('Hierarchical Clustering Dendrogram - Single linkage',fontsize=20)
    plt.xlabel('Country',fontsize=20)
    plt.ylabel('Values',fontsize=20)

    mergings_s = linkage(data, method = "single", metric='euclidean') # Use the df_scaled dataset
    dendrogram(mergings_s, labels=data.index, leaf_rotation=90, leaf_font_size=6)
    plt.show()
    return mergings_s

In [None]:
# Plot the Hierarchical Clustering Dendrogram for Complete linkage
def plot_dendrogram_complete(data):
    plt.figure(figsize=(15,8))             # Setting the size of the figure
    sns.set_style('white')                  # Setting style

    # setting the labels on axes and title
    plt.title('Hierarchical Clustering Dendrogram - Complete linkage',fontsize=20)
    plt.xlabel('Country',fontsize=20)
    plt.ylabel('Values',fontsize=20)

    mergings_c = linkage(data, method = "complete", metric='euclidean')
    dendrogram(mergings_c, labels=data.index, leaf_rotation=90, leaf_font_size=6)
    plt.show()
    return mergings_c

### 10.2  Plot Hierarchical Clustering Dendrogram - Single linkage

In [None]:
# Plot the dendrogram for the normalized dataset with single linkage
mergings_s = plot_dendrogram_single(df_scaled)

### 10.3  Plot Hierarchical Clustering Dendrogram - Complete linkage

In [None]:
# Plot the dendrogram for the normalized dataset with complete linkage
mergings_c = plot_dendrogram_complete(df_scaled)

**Observations from the Dendrograms:**
* As the dendrograms above show, the outliers seem to have affected the clustering output to a large extent. 
* So far we have capped only the upper-end outliers for the features - cap_outliers = ['exports', 'health', 'imports', 'income', 'gdpp', 'inflation']
* We can now **cap the outliers at both the upper and lower ends** and then redo the hierarchical clustering. 
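Capping at both ends can be sketched with `Series.clip` and the 5th and 95th percentiles. The toy column below (the integers 1 to 101) is a hypothetical stand-in for one of the `cap_outliers` features, not real data.

```python
import pandas as pd

# Hypothetical toy column holding the integers 1..101 (NOT the case-study data)
values = pd.Series(range(1, 102), name="toy_feature")

lower = values.quantile(0.05)   # 5th percentile, used as the lower cap
upper = values.quantile(0.95)   # 95th percentile, used as the upper cap

# Values below `lower` are raised to it; values above `upper` are lowered to it
capped = values.clip(lower=lower, upper=upper)
print(lower, upper, capped.min(), capped.max())
```

Capping keeps every row in the dataset, so no country is dropped; only the extreme values are pulled in towards the bulk of the distribution.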

### 10.4  Plot Hierarchical Clustering Dendrogram - Complete linkage after capping lower ranges

Let us now plot the hierarchical clustering dendrogram for complete linkage after capping the outliers at both the lower and upper ends. 

As the resulting dendrogram shows, the clustering is now similar to the results from KMeans clustering and gives a clearer picture. 

In [None]:
cap_outliers

In [None]:
# Cap the outliers at both the lower (5th percentile) and upper (95th percentile) ends,
# then redo the hierarchical clustering
df_capped1 = df.copy()

for var in cap_outliers:
    q_low = df[var].quantile(0.05)
    q_high = df[var].quantile(0.95)
    # clip() avoids chained assignment, which may fail to modify df_capped1
    df_capped1[var] = df_capped1[var].clip(lower=q_low, upper=q_high)

df_scaled1=pd.DataFrame(scaler.fit_transform(df_capped1),columns=df_capped1.columns, index=df_capped1.index)
df_scaled1.head()

In [None]:
mergings_cc = plot_dendrogram_complete(df_scaled1)

### 10.5  Build model with n_clusters as 3 - Complete linkage

In [None]:
# 3 clusters - use the mergings from Complete linkage
cluster_labels_3 = cut_tree(mergings_cc, n_clusters=3).reshape(-1, )
cluster_labels_3

In [None]:
# To visualise let us use the df_capped1 dataset with the Outliers capped for upper outliers and lower outliers 
df_cap_hier3 = df_capped1.copy()

# assign cluster labels
df_cap_hier3['cluster_id'] = cluster_labels_3
df_cap_hier3.head()

In [None]:
# Cluster plot for Hierarchical cluster - 3 Clusters
plot_title = 'Cluster plot for Hierarchical cluster - 3 Clusters'
plotdata_cluster(df_cap_hier3, plot_title)

### 10.6  Build model with n_clusters as 4 - Complete linkage

In [None]:
# 4 clusters - use the mergings from Complete linkage
cluster_labels_4 = cut_tree(mergings_cc, n_clusters=4).reshape(-1, )
cluster_labels_4

In [None]:
# To visualise let us use the df_capped1 dataset with the Outliers capped for upper outliers and lower outliers 
df_cap_hier4 = df_capped1.copy()

# assign cluster labels
df_cap_hier4['cluster_id'] = cluster_labels_4
df_cap_hier4.head()

In [None]:
# Cluster plot for Hierarchical cluster - 4 Clusters
plot_title = 'Cluster plot for Hierarchical cluster - 4 Clusters'
plotdata_cluster(df_cap_hier4, plot_title)

### 10.7 Snake plots to analyze segments

In [None]:
# Snake plot for Hierarchical clustering with 3 clusters 
# Use the labels from the Hierarchical clustering with 3 clusters for snake plot
df_sp['cluster_id'] = cluster_labels_3

# snake plot for 3 clusters
title = 'Segment Analysis - Hierarchical with 3 clusters'
snake_plot(df_sp, title)

In [None]:
# Snake plot for Hierarchical clustering with 4 clusters 
# Use the labels from the Hierarchical clustering with 4 clusters for snake plot
df_sp['cluster_id'] = cluster_labels_4

# snake plot for 4 clusters
title = 'Segment Analysis - Hierarchical with 4 clusters'
snake_plot(df_sp, title)

### 10.8 Relative Importance plots to analyse segments

In [None]:
# Relative importance plot for the Hierarchical clustering with 4 clusters
plot_title = 'Relative importance plot for the Hierarchical clustering with 4 clusters'
relative_imp_plot(df_cap_hier4, plot_title)

In [None]:
# Relative importance plot for the Hierarchical clustering with 3 clusters
plot_title = 'Relative importance plot for the Hierarchical clustering with 3 clusters'
relative_imp_plot(df_cap_hier3, plot_title)

### 10.9 Final list of Recommended Countries - Hierarchical clustering

1. Copy the original dataset without outlier capping
2. Assign the labels from hierarchical clustering with n_clusters=4
3. Use the Cluster with under developed countries (cluster 0)
4. Rank the countries based on the socio-economic factors and sort to get the final list of poorest countries

In [None]:
# The Countries in need of aid (Poor Countries) are in Cluster 0 
df_aid_hier4 = df.copy()
df_aid_hier4['cluster_id'] = cluster_labels_4
df_aid_hier4 = df_aid_hier4.query('cluster_id == 0').sort_values(['child_mort', 'life_expec', 'income', 'gdpp' ], 
                                                   ascending = (False, True, True, True))
df_aid_hier4.head(20)

In [None]:
# Get the rank for individual features for each country
df_aid_hier4['cmrank'] = df_aid_hier4.child_mort.rank(method='dense',ascending = False).astype(int)
df_aid_hier4['lerank'] = df_aid_hier4.life_expec.rank(method='dense',ascending = True).astype(int)
df_aid_hier4['gdrank'] = df_aid_hier4.gdpp.rank(method='dense',ascending = True).astype(int)
df_aid_hier4['inrank'] = df_aid_hier4.income.rank(method='dense',ascending = True).astype(int)
df_aid_hier4['ifrank'] = df_aid_hier4.inflation.rank(method='dense',ascending = False).astype(int)

In [None]:
df_aid_hier4.head()

In [None]:
# Selection criteria (a country qualifies if it meets any one of them):
# 'child_mort' - top 5 countries with the highest child mortality rate
# 'life_expec' - top 3 countries with the lowest life expectancy
# 'gdpp'       - top 3 countries with the lowest gdpp
# 'income'     - top 3 countries with the lowest income
# 'inflation'  - country with the highest inflation rate (Nigeria)

df_final_hier4 = df_aid_hier4[(df_aid_hier4.cmrank <=5) | (df_aid_hier4.lerank <=3) | 
                              (df_aid_hier4.gdrank <=3) | (df_aid_hier4.inrank <=3) | (df_aid_hier4.ifrank <=1)]
df_final_hier4

**Final Recommended Countries from KMeans Clustering**

In [None]:
print (f'The following {len(df_final.index)} countries are ranked after KMeans clustering')
df_final.index

**Final Recommended Countries from Hierarchical Clustering**

In [None]:
print (f'The following {len(df_final_hier4.index)} countries are ranked after Hierarchical clustering')
df_final_hier4.index

### 10.10 Final List of Recommended Countries: Hierarchical Clustering

**The list of 10 countries recommended for financial aid by Hierarchical clustering after ranking is:**

**<span style="color:blue">'Haiti', 'Sierra Leone', 'Chad', 'Central African Republic', 'Mali',
       'Nigeria', 'Congo, Dem. Rep.', 'Lesotho', 'Burundi', 'Liberia'**

**Let us now plot the final recommended countries in the graph for developing and under-developed countries**

In [None]:
# This function is to plot the scatter plot of countries based on child_mort, income and gdpp. 
# child_mort is on the x-axis, income on the y-axis, and the point size is proportional to gdpp
def plotdata_aid_final(data, plot_title):
    area = 50e-2 * data.gdpp
    colors = data.cluster_id.map({0: 'skyblue', 1: 'gold', 2: 'coral', 3: 'palegreen'})
    
    aid_countries = ['Haiti', 'Sierra Leone', 'Chad', 'Central African Republic', 'Mali',
                     'Nigeria', 'Congo, Dem. Rep.', 'Lesotho', 'Burundi', 'Liberia']
    data.plot.scatter('child_mort','income',
                      s=area,c=colors,
                      linewidths=1,edgecolors='k',
                      figsize=(20,15))
        
    # labeling different cluster points with country names 
    for i, txt in enumerate(data.index):
        if txt in aid_countries:
            # .iloc makes the positional lookup explicit (the index holds country names)
            plt.annotate(txt, (data.child_mort.iloc[i], data.income.iloc[i]), fontsize=20, ha='left', rotation=25)
        #plt.annotate(txt, (data.child_mort[i],data.income[i]), fontsize=12, ha='left', rotation=25)
    
    plt.title(plot_title,fontsize=20)
    plt.xlabel('child_mort',fontsize=20)
    plt.ylabel('income',fontsize=20)

In [None]:
# Plot the scatter plot for the Countries in need of financial aid 
# child_mort is on the x-axis, income on the y-axis, and the point size is proportional to gdpp

plot_title = 'Final list of countries recommended for aid'
plotdata_aid_final(df_cap_kmeans4[(df_cap_kmeans4.cluster_id == 2) | (df_cap_kmeans4.cluster_id == 0)], plot_title)

## <span style="color:navy"> 11. Conclusion and Final Recommendations:

* After ranking the countries on socio-economic factors, the countries recommended for financial aid by KMeans clustering and by Hierarchical clustering are the same.
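The agreement between the two methods can also be checked programmatically rather than by eye. A minimal sketch, assuming the two recommendation lists are available as pandas indexes (as `df_final.index` and `df_final_hier4.index` are in this notebook); here they are typed in by hand from the lists reported above, with the second one deliberately in a different order.

```python
import pandas as pd

# Lists copied from the recommendations reported in this notebook;
# in the notebook itself these would be df_final.index and df_final_hier4.index
kmeans_countries = pd.Index(['Haiti', 'Sierra Leone', 'Chad', 'Central African Republic', 'Mali',
                             'Nigeria', 'Congo, Dem. Rep.', 'Lesotho', 'Burundi', 'Liberia'])
hier_countries = pd.Index(['Liberia', 'Burundi', 'Lesotho', 'Congo, Dem. Rep.', 'Nigeria',
                           'Mali', 'Central African Republic', 'Chad', 'Sierra Leone', 'Haiti'])

# Countries recommended by one method but not the other (order is ignored)
only_kmeans = kmeans_countries.difference(hier_countries)
only_hier = hier_countries.difference(kmeans_countries)

same_recommendation = len(only_kmeans) == 0 and len(only_hier) == 0
print(same_recommendation)
```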

### 11.1 Here is the final list of recommended countries: 

**The final list of 10 countries recommended for financial aid from both the KMeans clusters and the Hierarchical clusters is:**

**<span style="color:blue">'Haiti', 'Sierra Leone', 'Chad', 'Central African Republic', 'Mali',
       'Nigeria', 'Congo, Dem. Rep.', 'Lesotho', 'Burundi', 'Liberia'**

### 11.2 Here are the steps we followed: 

1. Performed initial data analysis and outlier handling
2. Determined the optimal number of clusters using a visual method (elbow curve) and a statistical method (silhouette score)
3. Built KMeans clustering models with n_clusters = 3 and n_clusters = 4
4. Performed segment analysis and summary analysis for the KMeans clustering models with n_clusters = 3 and n_clusters = 4
5. Built Hierarchical clustering models with n_clusters = 3 and n_clusters = 4
6. Performed segment analysis and summary analysis for the Hierarchical clustering models with n_clusters = 3 and n_clusters = 4
7. Identified and visualized the final cluster of countries in need of financial aid
8. From that cluster, arrived at the final list of countries based on the ranking below

    * **'child_mort' - Top 5 Countries with highest child mortality rate**
    * **'life_expec' - Top 3 Countries with lowest life expectancy**
    * **'gdpp'       - Top 3 countries with lowest gdpp**
    * **'income'     - Top 3 countries with lowest income**
    * **'inflation'  - Nigeria has an abnormally high inflation rate.**

### 11.3 Important Observations: 

* Developed countries have high exports, imports, health, gdpp, income and life expectancy.
  This includes the rich and very rich countries (clusters 1 and 3).
* Developing countries have low exports, imports, health, gdpp and income. What sets them apart from the poor countries is a lower child mortality rate, lower total fertility and higher life expectancy.
* Under-developed or poor countries have very low exports, imports, health, gdpp, income and life expectancy.
* For the under-developed countries, child_mort increases roughly in proportion to total_fer.
    This suggests that the survival of children under 5 years of age is linked to the number of children born to each mother. Providing affordable contraceptives could help in tackling this issue.
* Haiti has very high child_mort as well as very low life expectancy.
* Haiti and Lesotho have extremely low life expectancy.
* Nigeria has an abnormally high inflation rate. A very high inflation rate can be detrimental to the development of a country and means financial struggle for its poorest people. Nigeria can therefore be included in the final list, to cater to the needs of its poorest people, because of its very high inflation.
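One way to check the child_mort and total_fer observation numerically is to compute a correlation coefficient within the poor cluster. A minimal sketch, assuming a dataframe with `child_mort` and `total_fer` columns such as `df_aid_hier4`; the four rows below are made-up placeholder values, not figures from the dataset.

```python
import pandas as pd

# Hypothetical placeholder rows, NOT taken from the case-study dataset;
# in the notebook this would be the poor-cluster dataframe (e.g. df_aid_hier4)
poor = pd.DataFrame(
    {"total_fer": [4.0, 5.0, 6.0, 7.0],
     "child_mort": [60.0, 90.0, 110.0, 150.0]},
    index=["A", "B", "C", "D"],
)

# Pearson correlation between fertility and child mortality within the cluster
r = poor["total_fer"].corr(poor["child_mort"])
print(round(r, 3))
```

A coefficient close to 1 would support the association between the two features, though it would not by itself establish a causal link.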

Notes: The following dataframes were used for the analysis and visualizations:
* df - original dataframe
* df_capped - copy of df with the upper outliers capped
* df_scaled - copy of df_capped normalized using StandardScaler
* df_cap_kmeans3 - copy of df_capped with labels added from KMeans clustering with 3 clusters
* df_cap_kmeans4 - copy of df_capped with labels added from KMeans clustering with 4 clusters
* df_sp - copy of df_scaled with labels added for snake plots
* df_aid - original dataframe with labels copied from KMeans with 4 clusters for plotting countries in need of aid
* df_capped1 - original dataframe capped at both upper and lower ends for the imports, exports, health, income, inflation and gdpp columns
* df_scaled1 - copy of the df_capped1 dataset normalized using StandardScaler
* df_cap_hier3 - copy of df_capped1 with labels added from Hierarchical clustering with 3 clusters
* df_cap_hier4 - copy of df_capped1 with labels added from Hierarchical clustering with 4 clusters
* df_final - Final dataframe with only the final list of countries recommended