# Identifying the Countries in Direst Need of Aid Using Clustering

#### Problem Statement

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. It runs a lot of operational projects from time to time along with advocacy drives to raise awareness as well as for funding purposes.

After its recent funding programmes, the NGO has raised around $10 million. The CEO now needs to decide how to use this money strategically and effectively. The main difficulty in making this decision is choosing the countries that are in the direst need of aid.

Your job is to categorise the countries using socio-economic and health factors that determine a country's overall development, and then to suggest the countries the CEO should focus on the most. The dataset containing those socio-economic factors is read in below, and its data dictionary follows:

#### Data dictionary

- country: Name of the country
- child_mort: Deaths of children under 5 years of age per 1000 live births
- exports: Exports of goods and services per capita, given as a percentage of GDP per capita
- health: Total health spending per capita, given as a percentage of GDP per capita
- imports: Imports of goods and services per capita, given as a percentage of GDP per capita
- income: Net income per person
- inflation: Annual growth rate of the GDP deflator
- life_expec: The average number of years a newborn child would live if current mortality patterns remained the same
- total_fer: The number of children that would be born to each woman if current age-fertility rates remained the same
- gdpp: GDP per capita, calculated as total GDP divided by total population


# Reading and Understanding the Dataset

In [9]:
# Importing the required libraries
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import cut_tree

In [2]:
# Reading the dataset
country_data = pd.read_csv('Country-data.csv')
country_data.head()

,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,10.0,7.58,44.9,1610,9.44,56.2,5.82,553
1,Albania,16.6,28.0,6.55,48.6,9930,4.49,76.3,1.65,4090
2,Algeria,27.3,38.4,4.17,31.4,12900,16.1,76.5,2.89,4460
3,Angola,119.0,62.3,2.85,42.9,5900,22.4,60.1,6.16,3530
4,Antigua and Barbuda,10.3,45.5,6.03,58.9,19100,1.44,76.8,2.13,12200


In [3]:
# Displaying the information of the columns
country_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     167 non-null    object 
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64  
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB


In [4]:
# Displaying the numerical columns' information
country_data.describe()

,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
count,167.0,167.0,167.0,167.0,167.0,167.0,167.0,167.0,167.0
mean,38.27006,41.108976,6.815689,46.890215,17144.688623,7.781832,70.555689,2.947964,12964.155689
std,40.328931,27.41201,2.746837,24.209589,19278.067698,10.570704,8.893172,1.513848,18328.704809
min,2.6,0.109,1.81,0.0659,609.0,-4.21,32.1,1.15,231.0
25%,8.25,23.8,4.92,30.2,3355.0,1.81,65.3,1.795,1330.0
50%,19.3,35.0,6.32,43.3,9960.0,5.39,73.1,2.41,4660.0
75%,62.1,51.35,8.6,58.75,22800.0,10.75,76.8,3.88,14050.0
max,208.0,200.0,17.9,174.0,125000.0,104.0,82.8,7.49,105000.0


In [10]:
# Displaying the shape of the dataframe
country_data.shape

(167, 10)

# Cleaning The Dataset

In [11]:
# Calculating the missing percentage by column
round(country_data.isnull().sum()/country_data.shape[0]*100,2)

country       0.0
child_mort    0.0
exports       0.0
health        0.0
imports       0.0
income        0.0
inflation     0.0
life_expec    0.0
total_fer     0.0
gdpp          0.0
dtype: float64

- There are no missing values in the dataset, and every column already has an appropriate dtype, so no type conversion is needed. We can proceed to data transformation, because some columns are expressed as percentages rather than as absolute values.

# Data Transformation

- According to the data dictionary, the columns `exports`, `health` and `imports` are given as a percentage of GDP per capita, which makes them hard to compare across countries.
- This is because two countries can report the `same` percentage for exports, imports or health spending while having `different` GDP per capita, and therefore very different absolute values.
- To avoid this, we convert those percentages into absolute per-capita values by multiplying each of them by `gdpp` and dividing by 100.
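To make the point above concrete, here is a small sketch with two made-up countries (hypothetical figures, not rows from `Country-data.csv`). Both report the same export percentage, but the absolute per-capita exports differ by a factor of 50. The column names `exports_pct` and `exports_abs` are illustrative only.

```python
import pandas as pd

# Hypothetical figures for two made-up countries (not taken from the dataset)
toy = pd.DataFrame({
    "country": ["A", "B"],
    "exports_pct": [30.0, 30.0],  # same share of GDP per capita
    "gdpp": [1000, 50000],        # very different GDP per capita
})

# Absolute exports per capita = percentage * GDP per capita / 100
toy["exports_abs"] = toy["exports_pct"] * toy["gdpp"] / 100
print(toy)
```

Country A exports 300 per capita while country B exports 15000 per capita, even though the percentage column cannot tell them apart.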

In [12]:
# Converting exports, health and imports from % of GDP per capita to absolute per-capita values
country_data['exports'] = country_data['exports']*country_data['gdpp']/100
country_data['imports'] = country_data['imports']*country_data['gdpp']/100
country_data['health'] = country_data['health']*country_data['gdpp']/100

In [13]:
country_data.head()

,country,child_mort,exports,health,imports,income,inflation,life_expec,total_fer,gdpp
0,Afghanistan,90.2,55.3,41.9174,248.297,1610,9.44,56.2,5.82,553
1,Albania,16.6,1145.2,267.895,1987.74,9930,4.49,76.3,1.65,4090
2,Algeria,27.3,1712.64,185.982,1400.44,12900,16.1,76.5,2.89,4460
3,Angola,119.0,2199.19,100.605,1514.37,5900,22.4,60.1,6.16,3530
4,Antigua and Barbuda,10.3,5551.0,735.66,7185.8,19100,1.44,76.8,2.13,12200


- The `info()` and `describe()` outputs show that the columns have different units and very different ranges, so we need to scale the data before building the model.
- We can continue with exploratory data analysis before scaling the columns.
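For reference, the scaling step mentioned above might look roughly like the sketch below. It assumes `StandardScaler` (imported at the top of the notebook) is the scaler used, which this section does not confirm, and it runs on a small stand-in frame built from three rows of the `head()` output rather than on the full `country_data`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in frame: three rows and three columns copied from the head() output
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "child_mort": [90.2, 16.6, 27.3],
    "income": [1610, 9930, 12900],
    "gdpp": [553, 4090, 4460],
})

# Scale only the numeric columns; the country name is an identifier, not a feature
numeric_cols = sample.columns.drop("country")
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(sample[numeric_cols]),
    columns=numeric_cols,
    index=sample.index,
)
print(scaled.round(3))
```

After scaling, each column has zero mean and unit (population) standard deviation, so a large-valued column such as `income` no longer dominates distance computations.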

# Exploratory Data Analysis

In [15]:
# Let's plot the distribution plots of all the numeric columns to see the distribution of values.
def plot_hist():
    for i in range(0,3):
        

0       90.2
1       16.6
2       27.3
3      119.0
4       10.3
       ...  
162     29.2
163     17.1
164     23.3
165     56.3
166     83.1
Name: child_mort, Length: 167, dtype: float64
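The plotting cell above is cut off, so its original body is unknown. As one possible way to draw the distributions, the sketch below plots a histogram per numeric column on a grid. The helper name `plot_numeric_histograms`, the three-column grid and the bin count are assumptions, not the author's code, and the stand-in frame reuses a few values from the `head()` output.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_numeric_histograms(df, ncols=3, bins=20):
    """Draw one histogram per numeric column on a grid (hypothetical helper)."""
    numeric_cols = df.select_dtypes(include="number").columns
    nrows = -(-len(numeric_cols) // ncols)  # ceiling division
    fig, axes = plt.subplots(
        nrows, ncols, figsize=(5 * ncols, 4 * nrows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, numeric_cols):
        ax.hist(df[col], bins=bins)
        ax.set_title(col)
    # Hide any panels left over when the grid has more slots than columns
    for ax in flat_axes[len(numeric_cols):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Stand-in rows copied from the head() output; in the notebook, pass country_data
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "child_mort": [90.2, 16.6, 27.3],
    "income": [1610, 9930, 12900],
    "gdpp": [553, 4090, 4460],
})
fig = plot_numeric_histograms(sample)
```

Calling `plot_numeric_histograms(country_data)` in the notebook would produce a 3x3 grid covering the nine numeric columns.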