# Introduction 


HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Your job as a data scientist is to categorise the countries using socio-economic and health factors that determine the overall development of a country, and then to suggest the countries the CEO should focus on most.


## Content:

1. [Import Libraries](#1)
2. [Load the Data](#2)
3. [Variable Descriptions](#3)
4. [Univariate Analysis](#4)
5. [Bivariate Analysis](#5)
6. [Outliers Detection](#6)
7. [Feature Scaling](#7)
8. [Hierarchical Clustering](#8)
9. [K-Means Clustering](#9)
10. [Visualisation](#10)
  * [Visualisation of Hierarchical Clustering](#11)
  * [Visualisation of K-Means Clustering](#12)
11. [Results Expected](#13)
  * [Results for Hierarchical Clustering](#14)
  * [Results for K-Means Clustering](#15)


<a id = 1> </a>
# Importing the Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

<a id = 2> </a>
# Load the Data

In [None]:
df = pd.read_csv('../input/unsupervised-learning-on-country-data/Country-data.csv')

In [None]:
df

In [None]:
df.columns

<a id = 3> </a>
# Variable Descriptions

country: Name of the country  
child_mort: Deaths of children under 5 years of age per 1000 live births  
exports: Exports of goods and services per capita, given as a percentage of GDP per capita  
health: Total health spending per capita, given as a percentage of GDP per capita  
imports: Imports of goods and services per capita, given as a percentage of GDP per capita  
income: Net income per person  
inflation: The annual growth rate of the total GDP  
life_expec: The average number of years a newborn child would live if current mortality patterns were to remain the same  
total_fer: The number of children that would be born to each woman if current age-fertility rates were to remain the same  
gdpp: The GDP per capita, calculated as the total GDP divided by the total population

<a id = 4> </a>
# Univariate Analysis

In [None]:
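# Plot the distribution of every numeric column (column 0 is the country name).
# histplot with kde=True replaces the deprecated sns.distplot.
for col in df.columns[1:]:
    sns.histplot(df[col], kde=True, stat="density", color="y")
    plt.grid(True)
    plt.show()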

Here we can see that the distributions of exports, imports and inflation look roughly normal.
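
A visual impression of normality can be cross-checked with a number. The sketch below is only an illustration on synthetic data: the frame `demo` and its columns `symmetric` and `skewed` are hypothetical and not part of the dataset. It uses the pandas `skew()` method, where values near 0 suggest a symmetric distribution and large positive values indicate a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical, not the country dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=2000),
    "skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

# Sample skewness per column: about 0 for symmetric data, positive for a right tail.
skewness = demo.skew()
print(skewness.round(2))
```

On the notebook's data, the same check could be run as `df.iloc[:, 1:].skew()`.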

<a id = 5> </a>
# Bivariate Analysis

In [None]:
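# pairplot creates its own figure, so a preceding plt.figure() call only adds an empty one.
# The panel size can be changed with pairplot's height argument.
sns.pairplot(df)
plt.show()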

<a id = 6> </a>
# Outliers Detection

In [None]:
for col in df.columns[1:]:
    plt.figure(figsize = (10, 4))
    sns.boxplot(x= df[col],  color="y")
    plt.grid(True)
    plt.show();

I will keep the outliers here.
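
To see how many points the boxplots flag, the whisker rule can be applied directly. The sketch below assumes the common 1.5 × IQR rule (the seaborn boxplot default); the helper `count_iqr_outliers` and the frame `demo` are hypothetical names used only for illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Synthetic stand-in data (hypothetical): column "a" contains one extreme value.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
outlier_counts = {col: count_iqr_outliers(demo[col]) for col in demo.columns}
print(outlier_counts)
```

On the notebook's data, the helper could be applied to each column in `df.columns[1:]`.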

In [None]:
# Cluster on three features: child_mort (column 1), income (column 5) and gdpp (column 9).
X = df.iloc[:, [1,5,9]].values

<a id = 7> </a>
# Feature Scaling
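I will use StandardScaler, which rescales each feature to zero mean and unit variance. The scaled values are not bounded: most fall between -3 and 3, but outliers can lie outside that range.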

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
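
The effect of standardisation can be verified numerically. The following is a minimal sketch on synthetic data (the array `raw` and its injected extreme value are assumptions for illustration): after scaling, each column has mean 0 and standard deviation 1, yet a single outlier still ends up far outside the range from -3 to 3.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 3 features, one extreme value.
rng = np.random.default_rng(1)
raw = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
raw[0, 0] = 1000.0

scaled = StandardScaler().fit_transform(raw)

col_means = scaled.mean(axis=0)   # approximately 0 for every column
col_stds = scaled.std(axis=0)     # approximately 1 for every column
max_abs = np.abs(scaled).max()    # largest absolute z-score, driven by the outlier

print(col_means.round(6), col_stds.round(6), round(float(max_abs), 2))
```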

<a id = 8> </a>
# Hierarchical Clustering

## Dendrogram

In [None]:
import scipy.cluster.hierarchy as sch

dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Countries')
plt.ylabel('Euclidean Distances')
plt.show()

Looking at this dendrogram, the optimal number of clusters appears to be 3.
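
Reading the number of clusters off a dendrogram is a judgment call. One common heuristic, shown here as a sketch and not necessarily the reasoning used above, is to cut the tree at the largest jump between consecutive merge heights. The synthetic `points` (three well-separated blobs) and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in data (hypothetical): three well-separated 2-D blobs.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

Z = linkage(points, method="ward")
merge_heights = Z[:, 2]          # height of each successive merge (non-decreasing for Ward)
gaps = np.diff(merge_heights)    # jump between consecutive merges

# After merge number i (0-based) there are n - (i + 1) clusters left.
# Cutting inside the largest gap keeps the clusters that exist just before the big jump.
n_points = points.shape[0]
idx = int(np.argmax(gaps))
suggested_k = n_points - (idx + 1)

# Flat cluster labels for that number of clusters.
labels = fcluster(Z, t=suggested_k, criterion="maxclust")
print(suggested_k, np.unique(labels))
```

On the notebook's data, `points` would be replaced by the scaled matrix `X`.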

In [None]:
from sklearn.cluster import AgglomerativeClustering

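# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the affinity parameter was deprecated and later removed from scikit-learn).
hc = AgglomerativeClustering(n_clusters = 3, linkage = 'ward')
y_hc = hc.fit_predict(X)

# Add an extra column with the cluster labels.
df['cluster_hierarchical'] = hc.labels_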
df.head()

<a id = 9> </a>
# K-Means Clustering

## Figure out optimal number of clusters

In [None]:
from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) for k = 1..10.
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i , init = 'k-means++', random_state = 5)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)


## Elbow Graph

In [None]:
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

Looking at the elbow graph, the optimal number of clusters appears to be 3.
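
Because the elbow is picked by eye, a second criterion can help confirm the choice. The sketch below uses the silhouette score from scikit-learn as one such check; it is an addition for illustration, not part of the original analysis, and the synthetic `points` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (hypothetical): three well-separated 2-D blobs.
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
points = np.vstack([c + rng.normal(scale=0.7, size=(40, 2)) for c in centers])

# Mean silhouette score for each candidate k (it is undefined for k = 1).
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=5)
    scores[k] = silhouette_score(points, model.fit_predict(points))

# The k with the highest mean silhouette is the best-separated partition.
best_k = max(scores, key=scores.get)
print(best_k, {k: round(float(v), 3) for k, v in scores.items()})
```

On the notebook's data, `points` would be replaced by the scaled matrix `X`.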

In [None]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 5)
y_kmeans = kmeans.fit_predict(X)

df['cluster_kmeans'] = kmeans.labels_
df.head()

<a id = 10> </a>
# Visualisation

I will focus on three variables:  
gdpp, child_mort and income

<a id = 11> </a>
## Visualisation of Hierarchical Clustering

In [None]:
# One row of three scatter plots, coloured by hierarchical cluster label.
fig, axs = plt.subplots(1,3, figsize = (20,5))

sns.scatterplot(data=df, x="gdpp", y="child_mort", hue="cluster_hierarchical", ax=axs[0])
sns.scatterplot(data=df, x="gdpp", y="income", hue="cluster_hierarchical", ax=axs[1])
sns.scatterplot(data=df, x="income", y="child_mort", hue="cluster_hierarchical", ax=axs[2])
plt.show()


* Cluster No 0 has low child_mort, high gdpp and high income.  
* Cluster No 1 has medium child_mort, medium gdpp and medium income.  
* Cluster No 2 has high child_mort, low gdpp and low income.

The CEO wants to help the countries that are in the direst need of aid, so the target group is Cluster No 2.
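
Cluster labels are arbitrary integers and can differ between runs or library versions, so it may be safer to identify the neediest cluster from its profile instead of hard-coding its number. The sketch below is only an illustration: the frame `demo`, its values and the column name `cluster` are made up.

```python
import pandas as pd

# Made-up stand-in data (hypothetical): six countries in three clusters.
demo = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "child_mort": [5.0, 4.0, 30.0, 25.0, 110.0, 95.0],
    "income": [40000, 45000, 9000, 11000, 1200, 1500],
    "gdpp": [38000, 42000, 4000, 5000, 500, 600],
    "cluster": [1, 1, 0, 0, 2, 2],
})

# Mean of each feature per cluster.
profile = demo.groupby("cluster")[["child_mort", "income", "gdpp"]].mean()

# Take the cluster with the highest mean child mortality as the neediest one.
neediest = profile["child_mort"].idxmax()
print(profile)
print("neediest cluster:", neediest)
```

On the notebook's data, the grouping column would be `cluster_hierarchical` or `cluster_kmeans`.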


<a id = 12> </a>
## Visualisation of K-Means Clustering

In [None]:
# One row of three scatter plots, coloured by K-Means cluster label.
fig, axs = plt.subplots(1,3, figsize = (20,5))

sns.scatterplot(data=df, x="gdpp", y="child_mort", hue="cluster_kmeans", ax=axs[0])
sns.scatterplot(data=df, x="gdpp", y="income", hue="cluster_kmeans", ax=axs[1])
sns.scatterplot(data=df, x="income", y="child_mort", hue="cluster_kmeans", ax=axs[2])
plt.show()


* Cluster No 0 has medium child_mort, medium gdpp and medium income.  
* Cluster No 1 has low child_mort, high gdpp and high income.  
* Cluster No 2 has high child_mort, low gdpp and low income.

The CEO wants to help the countries that are in the direst need of aid, so the target group is Cluster No 2.
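
Since the two methods number their clusters differently, comparing the raw labels is not meaningful. One way to measure how well the two partitions agree is the adjusted Rand index, which ignores label names. The sketch below uses small made-up label lists (`labels_a`, `labels_renamed`, `labels_b` are hypothetical) to show the behaviour.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up label vectors (hypothetical) for seven observations.
labels_a = [0, 0, 1, 1, 2, 2, 2]
labels_renamed = [1, 1, 0, 0, 2, 2, 2]   # same grouping as labels_a, different names
labels_b = [1, 1, 0, 0, 2, 2, 0]         # last observation is grouped differently

# Identical partitions score 1 regardless of how the clusters are numbered.
ari_same = adjusted_rand_score(labels_a, labels_renamed)
# Partitions that differ score below 1.
ari_diff = adjusted_rand_score(labels_a, labels_b)
print(ari_same, round(ari_diff, 3))
```

On the notebook's data, the comparison could be made with `adjusted_rand_score(df['cluster_hierarchical'], df['cluster_kmeans'])`.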

<a id = 13> </a>
# Results Expected
I will show the top 10 countries in the neediest cluster, sorted by lowest gdpp, then lowest income, then highest child_mort.

<a id = 14> </a>
## Results for Hierarchical Clustering

In [None]:
df[df['cluster_hierarchical'] == 2].sort_values(['gdpp','income','child_mort'],ascending=[True,True,False]).head(10)


<a id = 15> </a>
## Results for K-Means Clustering

In [None]:
df[df['cluster_kmeans'] == 2].sort_values(['gdpp','income','child_mort'],ascending=[True,True,False]).head(10)