# <font color = Blue> Clustering of countries (Humanitarian Aid to Underdeveloped Countries)

**Clustering based on socio-economic factors for financial aid by an NGO, using `K-means` and `Hierarchical` clustering methods**



**Background:**

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. It runs a lot of operational projects from time to time along with advocacy drives to raise awareness as well as for funding purposes.

After the recent funding programmes, they have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. The significant issues that come while making this decision are mostly related to choosing the countries that are in the direst need of aid. 

And this is where you come in as a data analyst. Your job is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most.

**<font color= Brown> Problem Statement:**

Main task is to cluster the countries by the factors mentioned above and then present your solution and recommendations to the CEO using a PPT.  The following approach is suggested :
1. Start off with the necessary data inspection and EDA tasks suitable for this dataset - data cleaning, univariate analysis, bivariate analysis etc.
2. Outlier Analysis: You must perform the Outlier Analysis on the dataset. However, you do have the flexibility of not removing the outliers if it suits the business needs or a lot of countries are getting removed. Hence, all you need to do is find the outliers in the dataset, and then choose whether to keep them or remove them depending on the results you get.
3. Try both K-means and Hierarchical clustering(both single and complete linkage) on this dataset to create the clusters. [Note that both the methods may not produce identical results and you might have to choose one of them for the final list of countries.]
4. Analyse the clusters and identify the ones which are in dire need of aid. You can analyse the clusters by comparing how these three variables - [gdpp, child_mort and income] vary for each cluster of countries to recognise and differentiate the clusters of developed countries from the clusters of under-developed countries.
5. Also, you need to perform visualisations on the clusters that have been formed.  You can do this by choosing any two of the three variables mentioned above on the X-Y axes and plotting a scatter plot of all the countries and differentiating the clusters. Make sure you create visualisations for all the three pairs. You can also choose other types of plots like boxplots, etc. 
6. Both K-means and Hierarchical may give different results because of previous analysis (whether you chose to keep or remove the outliers, how many clusters you chose,  etc.) Hence, there might be some subjectivity in the final number of countries that you think should be reported back to the CEO since they depend upon the preceding analysis as well. Here, make sure that you report back at least 5 countries which are in direst need of aid from the analysis work that you perform.



**Technical Approach:**

1. Data inspection
2. EDA tasks suitable for this dataset
    - Data cleaning
    - Univariate analysis
    - Bivariate analysis 
3. Data preparation for clustering
    - Outlier Treatment
    - Feature Scaling
    - Hopkins check
4. Perform Clustering
    - K-Means Clustering
        - Choose K using both Elbow and Silhouette score
        - Run K-Means with the chosen K
        - Visualise the clusters
        - Cluster profiling using “gdpp, child_mort and income”
    - Hierarchical Clustering
        - Use both Single and Complete linkage
        - Choose one method based on the results
        - Visualise the clusters
        - Cluster profiling using “gdpp, child_mort and income”
5. Final Model 
    - Final Model Selection and labelling
    - Select model based on cluster results
6. Conclusion
    - Top 5 countries selection for financial aid
    - Top 5 countries selection on some socio-economic and health factors

### 1. Import required libraries

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing required packages
import numpy as np
import pandas as pd

import sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from kmodes.kmodes import KModes
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
%matplotlib inline
from pylab import rcParams

### 2. Read and Understand the data set

The dataset containing the socio-economic factors and the corresponding data dictionary are provided.

In [None]:
# reading datasets
df = pd.read_csv('../input/pca-kmeans-hierarchical-clustering/Country-data.csv')
df.head()

#### 2.1 Data Inspection

In [None]:
# Checking number of columns and rows of the dataframe using shape function
df.shape

In [None]:
# Checking basic information of the dataset
df.info()

In [None]:
# Checking the descriptive statistics
df.describe()

In [None]:
# Checking the columns names
df.columns

Insights:
- The dataset has 167 rows (one per country) and 10 columns (the country name plus 9 numeric features).
- There are no missing values or duplicate rows in the dataset.
- According to the data dictionary, `exports`, `health` and `imports` are given as a percentage of GDP per capita, so let's convert them to absolute values.

#### 2.2 Data Cleaning

Checking the following details in this section:
1. Missing/null values
2. Duplicates, and dropping them if any

**2.2.1 Missing/Null Values**

Checking the missing or null values in the columns

In [None]:
# Checking the missing/null values in the dataset
df.isnull().sum()

In [None]:
# Checking the % of missing/null values in the NGO dataset
round(100*df.isnull().sum()/len(df.index),2)

Insight:
- There are `no missing` values in the dataset

**2.2.2 Duplicate Values Check**

Using the `duplicated` function to find duplicate rows in the dataset

In [None]:
# Checking the duplicates
sum(df.duplicated(subset = 'country'))==0

In [None]:
# Checking the duplicates and dropping the entire duplicated rows if any
df.drop_duplicates(subset = None, inplace =True)
df.shape

Insight: 
- There are `no duplicate` rows in the dataset

**2.3 Data Conversion/Transformation**

We will convert the 3 variables `imports`, `exports` and `health` from percentages of GDP per capita to absolute per-capita values, since the percentage values alone don’t give a clear picture of a country. 

**For example**: Austria and Belarus have almost the same exports % but their gdpp has a huge gap, so the percentage alone doesn’t give an accurate idea of which country is more developed.
Later we will drop the `country` column and scale the remaining data.
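
To make the point concrete, here is a minimal sketch of the conversion on two made-up rows. The country names `A` and `B` and all the numbers are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only (not taken from Country-data.csv)
toy = pd.DataFrame({
    'country': ['A', 'B'],
    'exports': [50.0, 50.0],   # exports as % of GDP per capita
    'gdpp': [40000, 5000],     # GDP per capita
})

# Percentage of gdpp -> absolute per-capita value
toy['exports_abs'] = toy['exports'] * toy['gdpp'] / 100
print(toy)
```

Both rows share the same exports percentage, yet the absolute per-capita exports differ by a factor of eight, which is the kind of gap the percentage hides.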

In [None]:
#Checking the values before transformation
df.head()

In [None]:
# Converting imports, exports and health spending percentages to absolute values.

df['imports'] = df['imports'] * df['gdpp']/100
df['exports'] = df['exports'] * df['gdpp']/100
df['health'] = df['health'] * df['gdpp']/100

df.head()

In [None]:
# Checking the new shape of the dataframe
df.shape

In [None]:
# Checking the distribution of the columns that will be considered for upper capping
cols = ['exports', 'health', 'imports', 'total_fer','gdpp']
df[cols].describe(percentiles= [0.01,0.25,0.5,0.75,0.99])



### 3. Exploratory Data Analytics (EDA)

EDA means understanding the dataset by summarising its main characteristics, often by plotting them visually. Without building any model or making any predictions, let's first look at the data by itself.

In [None]:
# describe dataset and Checking outliers at 25%,50%,75%,90%,95% and 99%
df.describe(percentiles=[.25,.5,.75,.90,.95,.99])

#### 3.1. Univariate Analysis

We need to choose the countries that are in the direst need of aid. Hence, we need to identify those countries using socio-economic and health factors that determine the overall development of the country.

`child_mort`, `gdpp` and `income` are the key variables for deciding on financial aid, so they are highlighted in `orange` in the plots.

In [None]:
# Checking the outliers - how the values in each column are distributed, using boxplots
# The data is passed as x= explicitly, since newer seaborn versions discourage positional data arguments
fig, axs = plt.subplots(3,3, figsize = (15,12))

plt1 = sns.boxplot(x=df['child_mort'], ax = axs[0,0], color = 'orange')
plt2 = sns.boxplot(x=df['exports'], ax = axs[0,1])
plt3 = sns.boxplot(x=df['health'], ax = axs[0,2])
plt4 = sns.boxplot(x=df['imports'], ax = axs[1,0])
plt5 = sns.boxplot(x=df['income'], ax = axs[1,1], color = 'orange')
plt6 = sns.boxplot(x=df['inflation'], ax = axs[1,2])
plt7 = sns.boxplot(x=df['life_expec'], ax = axs[2,0])
plt8 = sns.boxplot(x=df['total_fer'], ax = axs[2,1])
plt9 = sns.boxplot(x=df['gdpp'], ax = axs[2,2], color = 'orange')

plt.show()

Insight: 

We observe the following about the outliers.
- Since we have a limited number of countries (167), removing outliers would shrink the dataset, and under-developed countries that are in real need of aid might be dropped from it.
- So we will cap them instead. But simply capping them at the `Upper and Lower Hinge` of the box plot may shift the `cluster centroid`.
- Considering all of this, we will cap only the extreme values, at the `0.01 and 0.99` quantiles (1st and 99th percentiles). This reduces the risk of overlapping clusters.
- There are some exclusions in the outlier treatment.
    - `child_mort`, `inflation`: High child mortality and high inflation are exactly what we are concerned about, so we will not upper-cap these features.
    - `exports`, `health`, `imports`, `total_fer` and `gdpp`: These have outliers at the upper end. We will cap them at the 0.99 quantile.
    - `life_expec`: It has outliers below the lower hinge, but these are again what we are concerned about, so we will not cap them.

- There are very few outliers (<= 5) for all variables except for income and gdpp.
- All variables have outliers on the upper side (higher values) except for life_expec, which has outliers on the lower side, indicating that life expectancy in most countries is above 50 except for 3 countries.
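
The outlier counts read off the boxplots can also be checked numerically. The sketch below assumes the usual boxplot convention (points more than 1.5 × IQR beyond the quartiles count as outliers); the helper name `count_iqr_outliers` and the toy values are hypothetical.

```python
import numpy as np

def count_iqr_outliers(values):
    """Count points outside the boxplot whiskers (1.5 * IQR beyond the quartiles)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical toy column with one extreme value
toy_values = [1, 2, 3, 4, 100]
n_outliers = count_iqr_outliers(toy_values)
print(n_outliers)
```

On the real data, something like `df.drop('country', axis=1).apply(count_iqr_outliers)` should give one count per column.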


**Top 10 poor countries**: We select the 10 worst-off countries for each variable to check how the variables are affected.

In [None]:
df.columns

In [None]:

fig, axs = plt.subplots(3,3,figsize = (15,15))

# poor top 10 countries represented as `pt10`

# Child Mortality Rate 
pt10_child_mort = df[['country','child_mort']].sort_values('child_mort', ascending = False).head(10)
plt1 = sns.barplot(x='country', y='child_mort', data= pt10_child_mort, ax = axs[0,0])
plt1.set(xlabel = '', ylabel= 'Child Mortality Rate')

# Exports
pt10_exports = df[['country','exports']].sort_values('exports', ascending = True).head(10)
plt2 = sns.barplot(x='country', y='exports', data= pt10_exports, ax = axs[2,1])
plt2.set(xlabel = '', ylabel= 'Exports')

# Health 
pt10_health = df[['country','health']].sort_values('health', ascending = True).head(10)
plt3 = sns.barplot(x='country', y='health', data= pt10_health, ax = axs[1,0])
plt3.set(xlabel = '', ylabel= 'Health')

# Imports
pt10_imports = df[['country','imports']].sort_values('imports', ascending = True).head(10)
plt4 = sns.barplot(x='country', y='imports', data= pt10_imports, ax = axs[2,2])
plt4.set(xlabel = '', ylabel= 'Imports')

# Per capita Income 
pt10_income = df[['country','income']].sort_values('income', ascending = True).head(10)
plt5 = sns.barplot(x='country', y='income', data= pt10_income, ax = axs[1,2])
plt5.set(xlabel = '', ylabel= 'Per capita Income')

# Inflation
pt10_inflation = df[['country','inflation']].sort_values('inflation', ascending = False).head(10)
plt6 = sns.barplot(x='country', y='inflation', data= pt10_inflation, ax = axs[2,0])
plt6.set(xlabel = '', ylabel= 'Inflation')

# Fertility Rate
pt10_total_fer = df[['country','total_fer']].sort_values('total_fer', ascending = False).head(10)
plt7 = sns.barplot(x='country', y='total_fer', data= pt10_total_fer, ax = axs[0,1])
plt7.set(xlabel = '', ylabel= 'Fertility Rate')

# Life Expectancy
pt10_life_expec = df[['country','life_expec']].sort_values('life_expec', ascending = True).head(10)
plt8 = sns.barplot(x='country', y='life_expec', data= pt10_life_expec, ax = axs[0,2])
plt8.set(xlabel = '', ylabel= 'Life Expectancy')

# The GDP per capita 
pt10_gdpp = df[['country','gdpp']].sort_values('gdpp', ascending = True).head(10)
plt9 = sns.barplot(x='country', y='gdpp', data= pt10_gdpp, ax = axs[1,1])
plt9.set(xlabel = '', ylabel= 'GDP per capita')


for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation = 90)
    
plt.tight_layout()
plt.show()
    


Insight: 

- From the above bar plots we can see that the countries common to the `gdpp`, `child_mort` and `income` profiles are: `Congo, Dem. Rep.`, `Niger`, `Sierra Leone`, `Central African Republic`

In [None]:
# Distribution plot of each feature
plt.figure(figsize=(15, 15))
features = ['child_mort', 'exports', 'health', 'imports', 'income','inflation', 'life_expec', 'total_fer', 'gdpp']
for i, feature in enumerate(features):
    ax = plt.subplot(3, 3, i+1)
    # histplot with a KDE curve replaces the deprecated sns.distplot
    sns.histplot(df[feature], kde=True, stat='density')
    plt.xticks(rotation=20)

Insights:
- `life_expec` is left-skewed (its long tail is towards low values) and all other features are right-skewed
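
The direction of skew can be checked numerically rather than read off the plots. Below is a minimal sketch of the moment-based skewness coefficient: a positive value means a long right tail, a negative value a long left tail. The helper name `sample_skewness` and the toy lists are hypothetical.

```python
import numpy as np

def sample_skewness(values):
    """Third standardised moment: > 0 for a long right tail, < 0 for a long left tail."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

right_tailed = [1, 1, 1, 2, 10]    # a few large values stretch the right tail
left_tailed = [1, 9, 10, 10, 10]   # a few small values stretch the left tail

print(sample_skewness(right_tailed), sample_skewness(left_tailed))
```

On the notebook's dataframe, `df[features].skew()` gives a comparable per-column figure (pandas applies a small-sample correction, so the numbers differ slightly but the signs agree).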

#### 3.2.  Bi-variate analysis

In [None]:
# pairplot for continuous data type
sns.pairplot(df.select_dtypes(['int64','float64']), diag_kind='kde', corner=True)
plt.show()

In [None]:
# Correlation of the numeric columns (the text column `country` is dropped first)
plt.figure(figsize= (12,8))
sns.heatmap(df.drop('country', axis=1).corr(), annot = True, cmap = "YlGnBu")

Insights:

We observe the following from the plots.

- Most of the features are not normally distributed
- From the pairplot and heatmap we can see that several features are highly correlated
    - `gdpp` and `income` are the most highly correlated, with a correlation of 0.9
    - `child_mort` and `life_expec` are highly correlated, with a correlation of -0.89
    - `child_mort` and `total_fer` are highly correlated, with a correlation of 0.85
    - `imports` and `exports` are highly correlated, with a correlation of 0.74
    - `life_expec` and `total_fer` are highly correlated, with a correlation of -0.76
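
Pairs like these can also be pulled out of the correlation matrix programmatically instead of by eye. The following is a small sketch; the helper `high_corr_pairs`, the 0.7 threshold and the toy columns are assumptions made for illustration.

```python
import pandas as pd

def high_corr_pairs(frame, threshold=0.7):
    """Return (col_1, col_2, r) for every column pair with |r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:          # upper triangle only, no self-pairs
            r = corr.loc[first, second]
            if abs(r) >= threshold:
                pairs.append((first, second, round(float(r), 2)))
    return pairs

# Hypothetical toy data: b is a multiple of a, c is uncorrelated with both
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, -1, -1, 1]})
pairs = high_corr_pairs(toy)
print(pairs)
```

For the notebook's data, the call would be along the lines of `high_corr_pairs(df.drop('country', axis=1))`.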

#### 3.3. Outlier Treatment

- `child_mort`: It has outliers beyond the upper boundary, but we will not cap them as high child mortality is exactly what we are concerned about
- `exports`, `health`, `imports`, `total_fer` and `gdpp`: These have outliers at the upper end. We will cap them at the 0.99 quantile
- `life_expec`: It has outliers below the lower boundary, but we are interested in these values so we will not cap them.

In [None]:
# Selecting the numerical columns and dropping the country
df1 = df.drop('country', axis =1)

In [None]:
# List the columns for upper capping and inspect their distribution
cols = ['exports', 'health', 'imports', 'total_fer','gdpp']
df1[cols].describe(percentiles= [0.01,0.25,0.5,0.75,0.99])

In [None]:
# Upper capping at the 0.99 quantile
cap = 0.99
for col in cols:
    HL = round(df1[col].quantile(cap),2)
    # clip replaces every value above HL with HL
    df1[col] = df1[col].clip(upper=HL)

In [None]:
# Descriptive statistics after capping
df1[cols].describe(percentiles= [0.01,0.25,0.5,0.75,0.99])

In [None]:
# check outliers after capping
df1[cols].plot.box(subplots = True, figsize = (18,6), fontsize = 12)
plt.tight_layout(pad=3)
plt.show()

Insights: 
- Some outliers still exist, but we choose to keep them and proceed to modelling.

### 4. Feature Scaling

The ranges of the features differ widely, which indicates the need to rescale the data before we build the model. Since clustering relies on the Euclidean distance between data points, it is important to ensure that attributes with a larger range of values do not outweigh attributes with a smaller range. Here we use min-max scaling to bring all attributes onto the same [0, 1] scale.
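
A quick numerical illustration of why this matters is sketched below. It re-implements min-max scaling in plain NumPy (the same formula `MinMaxScaler` applies with its default range); the three toy rows and the helper name `min_max_scale` are hypothetical, and the helper assumes no column is constant.

```python
import numpy as np

# Hypothetical toy rows: columns are [income, total_fer]
toy = np.array([
    [1000.0, 6.0],
    [1200.0, 1.5],
    [50000.0, 1.6],
])

def min_max_scale(X):
    """Rescale every column to [0, 1]; assumes no column is constant."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

scaled = min_max_scale(toy)

# Euclidean distance between rows 0 and 1, before and after scaling
raw_dist = np.linalg.norm(toy[0] - toy[1])
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])
print(raw_dist, scaled_dist)
```

Before scaling, the distance between the first two rows is driven almost entirely by the income gap of 200; after scaling, the large difference in fertility dominates instead.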

In [None]:
# dataset columns
df1.columns

In [None]:
# Create a scaling object
scaler = MinMaxScaler()

# fit_transform
df_scaled = scaler.fit_transform(df1)
df_scaled.shape

In [None]:
df_scaled

### 5. Hopkins Statistics

Before we apply any clustering algorithm, it's important to check whether the data contains meaningful clusters, which in general means that the data is not random. Evaluating whether a dataset is suitable for clustering is known as assessing its clustering tendency. To check cluster tendency, we use the Hopkins test, which examines whether the data points differ significantly from uniformly distributed data in the multidimensional space.
- If the value is between 0.01 and 0.3, the data is regularly spaced.
- If the value is around 0.5, it is random.
- If the value is between 0.7 and 0.99, it has a high tendency to cluster.

In [None]:
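# Function to compute the Hopkins statistic

from random import sample
from numpy.random import uniform
from math import isnan
 
def hopkins(X):
    d = X.shape[1]    # number of columns (features)
    n = len(X)        # number of rows
    m = int(0.1 * n)  # sample size: 10% of the rows
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []  # distances from uniform random points to their nearest data point
    wjd = []  # distances from sampled data points to their nearest other data point
    for j in range(0, m):
        # A uniform random point is not in the data, so its nearest neighbour is the first match
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        # A sampled data point matches itself at distance 0, so take the second match
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H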

In [None]:
# Create a dataframe of the scaled features
df_scaled = pd.DataFrame(df_scaled, columns = df1.columns)

# Evaluate Hopkins Statistics
print('Hopkins statistics is: ', round(hopkins(df_scaled),2))

Insight: 
- A Hopkins statistic over 0.70 is a good score and indicates that the data is suitable for cluster analysis. 
- A Hopkins statistic close to 1 tends to indicate that the data is highly clustered, random data tends to give values around 0.5, and regularly spaced data tends to give values close to 0.
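
To build some intuition for these thresholds, the sketch below computes a Hopkins-style statistic on two synthetic datasets: two tight blobs and uniform noise. It is a self-contained NumPy re-implementation written for illustration (brute-force distances, hypothetical function name `hopkins_sketch`), not the notebook's `hopkins` function, and the synthetic data is made up.

```python
import numpy as np

def hopkins_sketch(X, m=50, seed=0):
    """Hopkins-style statistic: near 1 for clustered data, near 0.5 for random data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    idx = rng.choice(n, size=m, replace=False)                  # m real points
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))  # m uniform points in the bounding box

    # u: distance from each uniform point to its nearest real point
    u = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    # w: distance from each sampled real point to its nearest OTHER real point
    dw = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dw[np.arange(m), idx] = np.inf                              # exclude the point itself
    w = dw.min(axis=1)
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(42)
clustered = np.vstack([rng.normal(0.2, 0.01, (250, 2)),
                       rng.normal(0.8, 0.01, (250, 2))])
uniform_data = rng.uniform(0, 1, (500, 2))

h_clustered = hopkins_sketch(clustered)
h_uniform = hopkins_sketch(uniform_data)
print(h_clustered, h_uniform)
```

The blob data should score close to 1 and the uniform data somewhere around 0.5, matching the interpretation above.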

### 6.  Clustering


#### 6.1 K-Means Clustering

K-means is an unsupervised learning algorithm and one of the most widely used because of its simplicity. It starts with a first group of randomly selected centroids, which serve as the starting points for the clusters, and then iteratively reassigns the points and recomputes the centroids to optimise their positions. The ‘means’ in K-means refers to averaging the data, that is, finding the centroid.
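
The loop described above can be sketched in plain NumPy. This is only an illustration of the idea, not how `sklearn.cluster.KMeans` works internally (sklearn uses k-means++ initialisation by default); the function name `kmeans_sketch` and the toy `gdpp` values are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=50, seed=0):
    """Assign each point to its nearest centroid, then move centroids to the cluster means."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct, randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of the assigned points (keep the old one if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Hypothetical one-feature toy data: five low-gdpp and five high-gdpp countries
toy_gdpp = np.array([[300.0], [500.0], [700.0], [900.0], [1100.0],
                     [40000.0], [42000.0], [44000.0], [46000.0], [48000.0]])
labels, centroids = kmeans_sketch(toy_gdpp, k=2)
print(labels, centroids.ravel())
```

With two groups this far apart, the loop separates the low and high values and the centroids settle at the two group means.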

**6.1.1 Run K-Means and choose K using both Elbow and Silhouette score**

**A. Sum of Squared Distances (SSD), or Elbow curve**: In the elbow curve method, the total within-cluster sum of squares measures the compactness of the clustering, and it should be as small as possible for better clustering results. The 'elbow' is the number of clusters beyond which adding more clusters reduces it only marginally.
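
As a worked example of the quantity being plotted, the sketch below computes the within-cluster sum of squared distances by hand for a tiny one-dimensional dataset. The helper `within_cluster_ssd` and the toy points are hypothetical; for a fitted model, `kmeans.inertia_` reports the same kind of quantity.

```python
import numpy as np

def within_cluster_ssd(X, labels):
    """Sum of squared distances from each point to the mean of its own cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

toy = np.array([[0.0], [2.0], [10.0], [14.0]])
ssd_one = within_cluster_ssd(toy, [0, 0, 0, 0])   # everything in a single cluster
ssd_two = within_cluster_ssd(toy, [0, 0, 1, 1])   # two natural groups
print(ssd_one, ssd_two)
```

Going from one cluster to two cuts the SSD sharply here; once the natural groups are separated, further splits bring much smaller gains, which is what produces the elbow.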

In [None]:
# Checking the dataframe
df_scaled.head()

In [None]:
# Check shape
df_scaled.shape

In [None]:
# Creating the list of candidate cluster counts and recording the SSD (inertia) for each
num_clusers = list(range(1,11))
ssd = []
for n_clusters in num_clusers:
    kmeans = KMeans(n_clusters=n_clusters, max_iter= 50)
    kmeans.fit(df_scaled)
    ssd.append(kmeans.inertia_)

In [None]:
# Plotting the elbow curve
plt.figure(figsize=(10,6))
plt.plot(num_clusers,ssd, marker = 'o')
plt.title('Elbow Method', fontsize = 16)
plt.xlabel('Number of clusters',fontsize=12)
plt.ylabel('Sum of Squared distance',fontsize=12)
# Reference lines marking k = 3 (ssd[0] is the largest value, ssd[-1] the smallest)
plt.vlines(x=3, ymin=ssd[-1], ymax=ssd[0], colors="r", linestyles="-")
plt.hlines(y=ssd[2], xmax=9, xmin=1, colors="r", linestyles="--")

plt.show()

**B. Silhouette Analysis**: The average silhouette approach measures the quality of a clustering, that is, how well each object lies within its cluster. A high average silhouette width indicates a good clustering.

$$\text{silhouette score}=\frac{p-q}{\max(p,q)}$$

$p$ is the mean distance from the data point to the points in the nearest cluster that it is not a part of (inter-cluster distance).

$q$ is the mean distance from the data point to all the other points in its own cluster (intra-cluster distance).

* The silhouette score ranges from -1 to 1. 

* A score close to 1 indicates that the data point is very similar to the other data points in its cluster, 

* A score close to 0 indicates that the data point lies on or near the boundary between two clusters, 

* A score close to -1 indicates that the data point is not similar to the data points in its cluster and has probably been assigned to the wrong cluster.
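
The formula can be worked through by hand for a single point. The sketch below does this in NumPy for a toy one-dimensional dataset; the helper `silhouette_of_point` and the toy values are hypothetical, and the helper assumes the point's own cluster has at least two members.

```python
import numpy as np

def silhouette_of_point(X, labels, index):
    """Silhouette (p - q) / max(p, q) for the single point X[index]."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X - X[index], axis=1)

    own = labels == labels[index]
    own[index] = False                      # do not compare the point with itself
    q = dists[own].mean()                   # mean intra-cluster distance

    # p: smallest mean distance to the points of any other cluster
    p = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[index])
    return (p - q) / max(p, q)

toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
toy_labels = [0, 0, 0, 1, 1]
s = silhouette_of_point(toy, toy_labels, index=0)
print(s)
```

For the first point, $q = (1 + 2)/2 = 1.5$ and $p = (10 + 11)/2 = 10.5$, so the score is $9/10.5 \approx 0.86$, which indicates a well-placed point. `silhouette_score` from sklearn averages this per-point value over the whole dataset.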

In [None]:
# silhouette analysis
num_clusters = list(range(2,11))
ss = []
for cluster in num_clusters:
    
    # initialise kmeans
    kmeans = KMeans(n_clusters= cluster, max_iter=50)
    kmeans.fit(df_scaled)
    
    cluster_labels = kmeans.labels_
    
    # silhouette score
    silhouette_avg = round(silhouette_score(df_scaled, cluster_labels),4)
    ss.append(silhouette_avg)
    print("For n_clusters={0}, the silhouette score is {1}".format(cluster, silhouette_avg))


plt.plot(num_clusters, ss)
plt.title('Silhouette Score', fontsize = 16)
plt.show()

Insight: 
- The maximum silhouette score is at 3 clusters
- In the elbow method, the reduction in the sum of squared distances levels off after 3 clusters.

Based on these two methods, we choose `3 clusters` as the optimal number.


**6.1.2 Run K-Means with the chosen K**

Choosing the model from the above results, we could see that using 3 Clusters provided a better output in terms of a balanced cluster size. So we will consider the 'K-Means with 3 Clusters' as our FINAL MODEL

In [None]:
# K-Mean with k =3
kmeans = KMeans(n_clusters = 3, max_iter = 50, random_state= 50)
kmeans.fit(df_scaled)

In [None]:
kmeans.labels_

First we need to assign the Cluster IDs that we generated to each of the datapoints that we have with us

In [None]:
# Adding the cluster labels to a copy of the dataframe
df_country = df.copy()
df_country['KMean_clusterid']= pd.Series(kmeans.labels_)
df_country.head()

In [None]:
# Checking the no of countries in each cluster
df_country.KMean_clusterid.value_counts()

**6.1.3 Visualisation and Cluster Analysis (K Mean Cluster)**

In [None]:
# Scatter plot on various variables to visualize the clusters based on them

plt.figure(figsize=(18, 5))
plt.subplot(1, 3, 1)
sns.scatterplot(x='gdpp', y='child_mort', hue='KMean_clusterid', data=df_country, palette="bright", alpha=.4)

plt.subplot(1, 3, 2)
sns.scatterplot(x='income', y='child_mort', hue='KMean_clusterid',data=df_country, palette="bright", alpha=.4)

plt.subplot(1, 3, 3)
sns.scatterplot(x='gdpp', y='income', hue='KMean_clusterid', data=df_country, palette="bright", alpha=.4)

plt.show()

Note: The clusters are clearly visible

In [None]:
# visualising clusters
plt.figure(figsize=(18,5))
plt.subplot(1,3,1)
sns.barplot(x = 'KMean_clusterid', y = 'gdpp', data=df_country, palette="bright")
plt.title('GDP Percapita')

plt.subplot(1,3,2)
sns.barplot(x = 'KMean_clusterid', y = 'child_mort', data=df_country, palette="bright")
plt.title('Child Mortality Rate')

plt.subplot(1,3,3)
sns.barplot(x = 'KMean_clusterid', y = 'income', data=df_country, palette="bright")
plt.title('Income Per Person')

plt.tight_layout()

plt.show()

INSIGHT: 
- The plots clearly show that cluster 2 has the highest child mortality and the lowest income & GDPP, so it represents the undeveloped countries

**6.1.4 Clustering profiling using “gdpp, child_mort and income”**

In [None]:
# Checking the cluster means (numeric columns only, since `country` is a text column)
df_country.groupby(['KMean_clusterid']).mean(numeric_only=True).sort_values(['child_mort','income','gdpp'],ascending = [False,True,True])

In [None]:
#New dataframe for group by & analysis

df_country_analysis = df_country.groupby(['KMean_clusterid']).mean(numeric_only=True).sort_values(['child_mort','income','gdpp'],ascending = [False,True,True])
df_country_analysis

INSIGHT: 
The cluster means of `gdpp, child_mort and income` show that the clusters are well separated.

From the cluster means we can see that:
> - Cluster 0: Developing
> - Cluster 1: Developed
> - Cluster 2: Undeveloped

We are interested in cluster 2 of the K-Means clusters

In [None]:
# Creating a new field for count of observations in each cluster

df_country_analysis['Observations']=df_country[['KMean_clusterid','child_mort']].groupby(['KMean_clusterid']).count()
df_country_analysis

In [None]:
# Creating a new field for proportion of observations in each cluster

df_country_analysis['Proportion']=round(df_country_analysis['Observations']/df_country_analysis['Observations'].sum(),2)


#Summary View
df_country_analysis[['child_mort','income','gdpp','Observations','Proportion']]

In [None]:
# Plot 1: mean gdpp and income for each K-Means cluster
df_country_plot1 = df_country[['KMean_clusterid', 'gdpp', 'income']].copy()
df_country_plot1 = df_country_plot1.groupby('KMean_clusterid').mean()
# figsize is passed to plot.bar directly; a separate plt.figure() call would only create an empty figure
df_country_plot1.plot.bar(figsize=(15, 10))

plt.tight_layout()
plt.show()

In [None]:
# Plot 2: mean child_mort for each K-Means cluster
df_country_plot2 = df_country[['KMean_clusterid', 'child_mort']].copy()
df_country_plot2 = df_country_plot2.groupby('KMean_clusterid').mean()
# figsize is passed to plot.bar directly; a separate plt.figure() call would only create an empty figure
df_country_plot2.plot.bar(figsize=(15, 10))

plt.tight_layout()
plt.show()


Interpretation of Clusters: 
- Cluster 2 has the highest average child mortality rate of ~92 compared with the other clusters, and the lowest average GDPP & income of ~1909 & ~3897 respectively. 
- These figures make this cluster the best candidate for financial aid from the NGO. We can also see that Cluster 2 comprises ~29% of the overall data, with ~48 of the 167 total observations

In [None]:
# Sort the undeveloped cluster by gdpp and income (ascending), then child_mort (descending)
K_cluster_Undeveloped = df_country[df_country['KMean_clusterid']== 2]
K_top5 = K_cluster_Undeveloped.sort_values(by = ['gdpp','income','child_mort'],
                                                     ascending=[True, True, False]).head(5)

print('Top 5 countries in dire need of aid based on the K-Means clusters are:', K_top5['country'].values)

Insights:
- The top 5 countries in dire need of aid based on the K-Means clusters are: 
    1. 'Burundi' 
    2. 'Liberia' 
    3. 'Congo, Dem. Rep.' 
    4. 'Niger' 
    5. 'Sierra Leone'

#### 6.2 Hierarchical Clustering

Hierarchical clustering is an unsupervised learning technique that groups data over a variety of scales by creating a cluster tree or dendrogram. The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one level are joined as clusters at the next level.

As mentioned in the 'Approach' section, we will use Hierarchical Clustering to identify an appropriate number of clusters with a good split of the data (minimum intra-cluster distance & maximum inter-cluster distance).

We will perform hierarchical clustering using single linkage and complete linkage, and select the one which yields the better result.

In [None]:
# new dataset check
df_scaled.head()

**6.2.1 Use both Single and Complete linkage**

**A. <font color = brown> Single Linkage:</font>** Here, the distance between 2 clusters is defined as the `shortest` distance between points in the two clusters.

In [None]:
#  Utilise the single linkage method for clustering this dataset 
plt.figure(figsize = (18,8))
mergings = linkage(df_scaled, method="single", metric='euclidean')
dendrogram(mergings)
plt.show()

Insights:

- The clusters of the single linkage are not truly satisfying. The single linkage method appears to be placing each outlier in its own cluster.

- As you can clearly see, single linkage doesn't produce a good enough result for us to analyse the clusters. Hence, we need to go ahead and utilise the complete linkage method and then analyse the clusters once again.

**B. <font color = brown> Complete Linkage:</font>** Here, the distance between 2 clusters is defined as the `maximum` distance between points in the two clusters.
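
The single and complete linkage definitions differ only in which pairwise distance they pick. The sketch below computes both for two toy clusters; the helper `linkage_distances` and the points are hypothetical, and it only illustrates the distance definition, not the full merging procedure that `scipy.cluster.hierarchy.linkage` performs.

```python
import numpy as np

def linkage_distances(cluster_a, cluster_b):
    """Return (single, complete) linkage distances between two clusters of points."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    # All pairwise Euclidean distances between points of a and points of b
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return pairwise.min(), pairwise.max()   # shortest pair, farthest pair

cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[4.0, 0.0], [6.0, 0.0]]
single, complete = linkage_distances(cluster_a, cluster_b)
print(single, complete)
```

Because single linkage only looks at the closest pair, a lone point sitting near a cluster can chain onto it, whereas complete linkage tends to produce more compact groups.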

In [None]:
#  Utilise the complete linkage method for clustering this dataset.

plt.figure(figsize = (18,8))
mergings = linkage(df_scaled, method="complete", metric='euclidean')
dendrogram(mergings)
plt.show()

Insight:
- From the above dendrograms, it is evident that 'Complete Linkage' gives a better cluster formation. 
- So we will use Complete linkage output for our further analysis.
- We will build two iterations of clustering with 3 & 4 clusters and analyse the output

**6.2.2 Choose one method based on the results**

In [None]:
# Creating the labels
cluster_labels = cut_tree(mergings, n_clusters=3).reshape(-1, )
cluster_labels

In [None]:
# assign cluster labels
#df_country = df.copy()
df_country['H_ClusterID'] = pd.Series(cluster_labels)
df_country.head()

**6.2.3 Visualise the clusters**

In [None]:
# Checking the new dataframe shape
df_country.shape

In [None]:
# Scatter plot on various variables to visualize the clusters based on them

plt.figure(figsize=(18, 5))
plt.subplot(1, 3, 1)
sns.scatterplot(x='gdpp', y='child_mort', hue='H_ClusterID', data=df_country, palette="bright", alpha=.4)

plt.subplot(1, 3, 2)
sns.scatterplot(x='income', y='child_mort', hue='H_ClusterID',data=df_country, palette="bright", alpha=.4)

plt.subplot(1, 3, 3)
sns.scatterplot(x='gdpp', y='income', hue='H_ClusterID', data=df_country, palette="bright", alpha=.4)

plt.show()

Note: The clusters are clearly visible

In [None]:
# visualising clusters
plt.figure(figsize=(18,5))
plt.subplot(1,3,1)
sns.barplot(x = 'H_ClusterID', y = 'gdpp', data=df_country, palette="bright")
plt.title('GDP Percapita')

plt.subplot(1,3,2)
sns.barplot(x = 'H_ClusterID', y = 'child_mort', data=df_country, palette="bright")
plt.title('Child Mortality Rate')

plt.subplot(1,3,3)
sns.barplot(x = 'H_ClusterID', y = 'income', data=df_country, palette="bright")
plt.title('Income Per Person')

plt.tight_layout()

plt.show()

INSIGHT: 
- The plots clearly show that cluster 0 has the highest child mortality and the lowest income & GDPP, so it represents the undeveloped countries

In [None]:
# Checking the cluster count
df_country.H_ClusterID.value_counts()

In [None]:
# Checking the countries in Cluster 2 to see which are the countries in that segment.
cluster_2 = df_country[df_country['H_ClusterID']==2]
cluster_2.head()

In [None]:
# Checking the countries in Cluster 1 to see which are the countries in that segment.
cluster_1 = df_country[df_country['H_ClusterID']== 1]
cluster_1.head()

Insight:
- Clusters 2 & 1 seem to contain developed or developing countries, so the segmentation is good in the sense that all the under-developed countries fall into cluster 0. 

**6.2.4 Clustering profiling using “gdpp, child_mort and income”**

In [None]:
#New dataframe for group by & analysis
df_country_analysis = df_country.groupby(['H_ClusterID']).mean(numeric_only=True)
df_country_analysis

In [None]:
# Creating a new field for count of observations in each cluster
df_country_analysis['Observations'] = df_country[['H_ClusterID', 'child_mort']].groupby(['H_ClusterID']).count()
df_country_analysis

In [None]:
# Creating a new field for proportion of observations in each cluster
df_country_analysis['Proportion'] = round(df_country_analysis ['Observations'] / (df_country_analysis ['Observations'].sum()),2)
df_country_analysis

Insights:

From the cluster means we can see that:
> - Cluster 0: Undeveloped
> - Cluster 1: Developing
> - Cluster 2: Developed

We are interested in cluster 0, as the objective is to find the top 5 undeveloped countries

In [None]:
# Plot 1: mean income and gdpp per hierarchical cluster
df_country_plot1 = df_country[['H_ClusterID', 'gdpp', 'income']].copy()
df_country_plot1 = df_country_plot1.groupby('H_ClusterID').mean()
# Pass figsize to plot.bar(); a separate plt.figure() call would only create an empty figure
df_country_plot1.plot.bar(figsize=(15, 10))

plt.tight_layout()
plt.show()

In [None]:
# Plot 2: mean child_mort per hierarchical cluster
df_country_plot2 = df_country[['H_ClusterID', 'child_mort']].copy()
df_country_plot2 = df_country_plot2.groupby('H_ClusterID').mean()
# Pass figsize to plot.bar(); a separate plt.figure() call would only create an empty figure
df_country_plot2.plot.bar(figsize=(15, 10))

plt.tight_layout()
plt.show()


Interpretation of Clusters: 
- Cluster 0 has the highest average child mortality rate (~42) of all the clusters, and the lowest average GDPP and income (~7551 and ~12641 respectively).
- These figures make this cluster the best candidate for financial aid from the NGO.
- However, Cluster 1 contains ~148 of the 167 observations, i.e. ~89% of the data, which is a problem.
- With 89% of the data points in a single cluster, hierarchical clustering is not giving us a useful segmentation.
- We also saw that increasing the number of clusters does not solve this problem. We will therefore perform K-Means clustering and see how it compares.
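
The imbalance described above can be quantified with a quick check on the share of the largest cluster. This is a minimal sketch: `largest_cluster_share` is a hypothetical helper, the 0.8 threshold is an arbitrary assumption, and the label vector is synthetic. Only the 148-of-167 figure comes from the text; the split of the remaining 19 observations is invented for illustration.

```python
import pandas as pd

def largest_cluster_share(labels):
    """Return (cluster id, share) of the most populated cluster."""
    shares = pd.Series(labels).value_counts(normalize=True)
    return int(shares.idxmax()), float(shares.max())

# Synthetic labels: 148 of 167 in one cluster (figure from the text);
# the 12/7 split of the remaining 19 is an assumption.
labels = [1] * 148 + [0] * 12 + [2] * 7
cluster_id, share = largest_cluster_share(labels)
print(cluster_id, round(share, 2))

DOMINANCE_THRESHOLD = 0.8  # assumed cut-off, not from the notebook
is_imbalanced = share > DOMINANCE_THRESHOLD
print(is_imbalanced)
```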

In [None]:
# Sort by 'gdpp' and 'income' (ascending), then 'child_mort' (descending)
H_cluster_Undeveloped = df_country[df_country['H_ClusterID'] == 0]
H_top5 = H_cluster_Undeveloped.sort_values(by=['gdpp', 'income', 'child_mort'],
                                           ascending=[True, True, False]).head(5)

print('Top 5 countries in dire need of aid based on hierarchical clustering are:', H_top5['country'].values)

From hierarchical clustering, the top 5 undeveloped countries are:
1. 'Burundi' 
2. 'Liberia' 
3. 'Congo, Dem. Rep.'
4. 'Niger' 
5. 'Sierra Leone'

### 7. Clustering Model Selection

In [None]:
# Get the % of cluster distribution
H_cluster_per = df_country.H_ClusterID.value_counts(normalize=True) * 100
print('Hierarchical Clustering Countries %:')
print(H_cluster_per)

K_cluster_per = df_country.KMean_clusterid.value_counts(normalize=True) * 100
print('\nKMean Clustering Countries %:')
print(K_cluster_per)

Insights:
- The figures above show that K-Means distributes the countries more evenly across clusters.
- So we will build the final clusters with K-Means and profile them using labels assigned accordingly.

In [None]:
# barplot for cluster distribution
plt.figure(figsize=(16,4))
plt.subplot(1,2,1)
sns.barplot(x= H_cluster_per.index, y = H_cluster_per)
plt.title('Hierarchical Cluster Distribution')
plt.xlabel('ClusterID')
plt.ylabel('Percentage Distribution')

plt.subplot(1,2,2)
sns.barplot(x= K_cluster_per.index, y = K_cluster_per)
plt.title('KMean Cluster Distribution')
plt.xlabel('ClusterID')
plt.ylabel('Percentage Distribution')
plt.show()

**Cluster Summary**

- From the above analysis, K-Means gives better-distributed clusters, so we select K-Means as the final model and profile the clusters with labels assigned accordingly.
- Note that both models yield the same top 5 undeveloped countries.
- Comparing the K-Means cluster averages, we can conclude that
    - Cluster 1 belongs to `Undeveloped` Countries,
    - Cluster 2 belongs to `Developed` Countries
    - Cluster 0 belongs to `Developing` Countries.

To differentiate the clusters of developed countries from the clusters of under-developed countries, the notation can be changed and the clusters labeled as {0: Developed, 1: Developing, 2: Undeveloped}.
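
K-means assigns cluster ids arbitrarily, so the id that corresponds to each development level can change between runs. One way to avoid hard-coding the mapping is to rank the clusters by a feature mean. The following is a minimal sketch, assuming that mean `gdpp` is an acceptable ranking criterion; `label_by_gdpp` and the toy values are hypothetical and not part of the original analysis.

```python
import pandas as pd

def label_by_gdpp(df, cluster_col,
                  names=("Undeveloped", "Developing", "Developed")):
    """Map cluster ids to names, ordered by ascending mean gdpp."""
    order = df.groupby(cluster_col)["gdpp"].mean().sort_values().index
    mapping = {int(cid): name for cid, name in zip(order, names)}
    return df[cluster_col].map(mapping), mapping

# Toy data with made-up values (NOT taken from the country dataset)
demo = pd.DataFrame({
    "KMean_clusterid": [0, 0, 1, 1, 2, 2],
    "gdpp": [4000, 6000, 400, 600, 30000, 50000],
})
demo["ClusterLabels"], mapping = label_by_gdpp(demo, "KMean_clusterid")
print(mapping)
```

With this approach the labels follow the data, so a re-run that shuffles the ids still tags the lowest-`gdpp` cluster as `Undeveloped`.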


In [None]:
# Final labels
df_country['ClusterLabels'] = df_country['KMean_clusterid'].map({0: 'Developed', 1:'Developing', 2: 'Undeveloped'})
df_country.head()                                                      

In [None]:
# select final data frame for profiling
K_cluster = df_country[['country','gdpp','child_mort','income','ClusterLabels']]
K_cluster.head()

**Final Cluster:** 
- Based on the above interpretation, we rename the clusters accordingly.
- Cluster 2 now becomes 'Undeveloped Countries', which is the cluster of interest.
- We will analyse the 'Undeveloped Countries' cluster further and examine its metrics to identify the final set of countries that need financial support from the NGO.

### 8. Analysing the 'Under Developed Countries (UDC)' Cluster

In [None]:
# Subset data frame based on undeveloped countries
K_cluster_UDC = K_cluster[K_cluster['ClusterLabels'] == 'Undeveloped']
K_cluster_UDC.head()

We will sort the features in the order `gdpp`, `income` and `child_mort`, based on the following understanding:
 - `gdpp` and `income` are highly `+ve` correlated.
 - `gdpp` and `income` both have `-ve` correlation with `child_mort`.
 - Financial aid is expected to improve `gdpp` and `income` directly, which in turn should help reduce `child_mort`.
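
The priority ordering can be sketched as a multi-key sort: lowest `gdpp` first, ties broken by lowest `income`, then by highest `child_mort`. The countries and values below are placeholders invented for illustration, and `rank_by_need` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

def rank_by_need(df, n=5):
    """Sort so that the poorest, highest-mortality rows come first."""
    return df.sort_values(
        by=["gdpp", "income", "child_mort"],
        ascending=[True, True, False],
    ).head(n)

# Placeholder rows with made-up values (NOT taken from the country dataset)
demo = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdpp": [300, 300, 250, 900],
    "income": [800, 800, 1000, 2000],
    "child_mort": [90.0, 120.0, 60.0, 40.0],
})
ranked = rank_by_need(demo, n=3)
print(ranked["country"].tolist())
```

Because `gdpp` is the first key, `income` and `child_mort` only matter when two countries share the same `gdpp` value.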

In [None]:
# Sort by 'gdpp' and 'income' (ascending), then 'child_mort' (descending)
K_top5 = K_cluster_UDC.sort_values(by=['gdpp', 'income', 'child_mort'],
                                   ascending=[True, True, False]).head(5).copy()

K_top5 = K_top5[['country', 'gdpp', 'income', 'child_mort']]
# Final country list
K_top5

Insight:
- The 5 countries in the direst need of aid are: `['Burundi' 'Liberia' 'Congo, Dem. Rep.' 'Niger' 'Sierra Leone']`

In [None]:
# Plot for the final top 5 countries based on child_mort, gdpp and income

KMean_plot = K_top5.set_index('country')
KMean_plot.plot.bar(figsize = (8,4))

plt.show()

In [None]:
# Bivariate Analysis of Cluster 'Under_Developed_Countries' (recommended 5)

# Scatter plots on pairs of variables for the recommended countries.
# hue already colours the points per country, so no fixed colour is passed.

plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
sns.scatterplot(x='gdpp', y='child_mort', hue='country',
                data=K_top5, legend='full', palette="bright", s=300)
plt.subplot(1, 3, 2)
sns.scatterplot(x='gdpp', y='income', hue='country',
                data=K_top5, legend='full', palette="bright", s=300)
plt.subplot(1, 3, 3)
sns.scatterplot(x='income', y='child_mort', hue='country',
                data=K_top5, legend='full', palette="bright", s=300)
plt.show()


In [None]:
# Descriptive Statistics
K_top5.describe()

### 9. <font color = Green> Conclusion

In [None]:
# Top countries recommended for financial aid based on K-Means clustering analysis
print('Top 5 countries in dire need of aid:', K_top5['country'].values)

**Final Result**:-

- Clustering was performed on the socio-economic data provided for various countries to identify the countries to recommend for financial aid from the NGO.
- Based on the clustering analysis, I have identified the top countries in the 'Undeveloped Countries' cluster that are in dire need of financial aid. This output is based purely on the dataset used and the analytical methods applied.

**Countries requiring financial aid, in order of priority:**
<font color = 'Brown'> 
1. Burundi
2. Liberia
3. Congo, Dem. Rep.
4. Niger
5. Sierra Leone 
</font>