# Clustering without PCA

#### K-Means Clustering and Hierarchical Clustering

**Overview**<br>
Given a dataset of countries with socio-economic and health information, categorise the countries using the factors that determine a country's overall development. Also, suggest the countries that need the most attention.

Apply both K-means clustering and Hierarchical Clustering.

The steps are broadly:

1. Read and understand the data
2. Clean the data
3. Prepare the data for modelling
4. Modelling
5. Final analysis and recommendations

# 1. Read and Visualize the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime as dt

import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

##### Read the dataset

In [None]:
df_country = pd.read_csv("../input/unsupervised-learning-on-country-data/Country-data.csv", sep=",", encoding="ISO-8859-1", header=0)

In [None]:
df_country.head()

In [None]:
df_country.info()

In [None]:
df_country.columns

In [None]:
df_country.shape

# 2. Clean the data

- Check for missing values
- Check for outliers

In [None]:
# missing values
round(100*(df_country.isnull().sum())/len(df_country), 2)

- We can observe that there are no missing values in the dataset.
- The data seems largely clean, but it has many variables, so forming and visualising meaningful clusters will be difficult.
- Let's plot the correlation matrix and check whether the variables are indeed highly correlated.

# 3. Prepare the data for modelling

###### Health, imports and exports are given as percentage values; let's convert them to absolute values based on gdpp

In [None]:
#Converting exports,imports and health spending percentages to absolute values.
df_country['exports'] = df_country['exports']*df_country['gdpp']/100
df_country['imports'] = df_country['imports']*df_country['gdpp']/100
df_country['health'] = df_country['health']*df_country['gdpp']/100
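
As a quick sanity check of the conversion, here is a minimal sketch on a made-up two-row frame. The rows and the `exports_abs` column name are hypothetical; only the formula `percentage * gdpp / 100` comes from the cell above.

```python
import pandas as pd

# Hypothetical toy frame that reuses the dataset's column names
toy = pd.DataFrame({
    "country": ["A", "B"],
    "exports": [10.0, 50.0],     # exports as a percentage
    "gdpp": [1000.0, 40000.0],   # GDP per capita
})

# Same formula as in the notebook: percentage * gdpp / 100
toy["exports_abs"] = toy["exports"] * toy["gdpp"] / 100
print(toy)
```

A country exporting 10% with a gdpp of 1000 ends up with an absolute value of 100, so two countries with the same percentage can differ widely once gdpp is taken into account.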

In [None]:
df_country.head()

- Let's check the correlation matrix and heatmap

In [None]:
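# Drop the non-numeric 'country' column first; recent pandas versions raise on it
df_country.drop('country', axis=1).corr()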

In [None]:
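plt.figure(figsize = (20,10))
# Correlate only the numeric columns (the 'country' names cannot be correlated)
sns.heatmap(df_country.drop('country', axis=1).corr(), annot = True, cmap="YlGnBu")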

We observe the following correlations from the plot.

- gdpp and income are the most highly correlated pair, with a correlation of 0.9
- child_mort and life_expec are highly correlated, with a correlation of -0.89
- child_mort and total_fer are highly correlated, with a correlation of 0.85
- imports and exports are highly correlated, with a correlation of 0.74
- life_expec and total_fer are highly correlated, with a correlation of -0.76
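
Reading the strongest pairs off a large heatmap is error-prone, so they can also be extracted programmatically. The sketch below uses a hypothetical helper, `top_correlated_pairs`, and made-up toy data; it is one possible approach, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so that each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value but keep the sign in the result
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy data: 'b' is almost a multiple of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
top = top_correlated_pairs(toy, n=1)
print(top)
```

Applied to the numeric columns of the country data, the same helper would list the pairs in the order discussed above.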

In [None]:
#The final matrix would only contain the data columns. Hence let's drop the country column
data=df_country.drop(['country'],axis=1)
data.head()

- We observe that many of the variables differ by orders of magnitude.
- Let's standardise the values using `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
df_c = standard_scaler.fit_transform(data)

In [None]:
df_c.shape
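
To make explicit what the scaler does, the following sketch compares `StandardScaler` with a manual z-score on a made-up array. The numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 600.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit variance, so no single variable dominates the Euclidean distances used by the clustering algorithms.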

# EDA

Univariate Analysis

We need to choose the countries that are in the direst need of aid. Hence, we need to identify those countries using the socio-economic and health factors that determine a country's overall development.

We will look at the 10 worst-placed countries for each factor.

1. **Child Mortality Rate**: Deaths of children under 5 years of age per 1000 live births.
2. **Fertility Rate**: The number of children that would be born to each woman if the current age-fertility rates remain the same.
3. **Life Expectancy**: The average number of years a newborn child would live if the current mortality patterns remain the same.
4. **Health**: Total health spending, given as a percentage of the total GDP in the raw data (converted to absolute values above).
5. **GDP per capita**: Calculated as the total GDP divided by the total population.
6. **Per capita Income**: Net income per person.
7. **Inflation**: The measurement of the annual growth rate of the total GDP.
8. **Exports**: Exports of goods and services, given as a percentage of the total GDP in the raw data (converted to absolute values above).
9. **Imports**: Imports of goods and services, given as a percentage of the total GDP in the raw data (converted to absolute values above).

In [None]:
fig, axs = plt.subplots(3,3,figsize = (15,15))

# Child Mortality Rate : Death of children under 5 years of age per 1000 live births

top10_child_mort = df_country[['country','child_mort']].sort_values('child_mort', ascending = False).head(10)
plt1 = sns.barplot(x='country', y='child_mort', data= top10_child_mort, ax = axs[0,0])
plt1.set(xlabel = '', ylabel= 'Child Mortality Rate')

# Fertility Rate: The number of children that would be born to each woman if the current age-fertility rates remain the same
top10_total_fer = df_country[['country','total_fer']].sort_values('total_fer', ascending = False).head(10)
plt1 = sns.barplot(x='country', y='total_fer', data= top10_total_fer, ax = axs[0,1])
plt1.set(xlabel = '', ylabel= 'Fertility Rate')

# Life Expectancy: The average number of years a new born child would live if the current mortality patterns are to remain same

bottom10_life_expec = df_country[['country','life_expec']].sort_values('life_expec', ascending = True).head(10)
plt1 = sns.barplot(x='country', y='life_expec', data= bottom10_life_expec, ax = axs[0,2])
plt1.set(xlabel = '', ylabel= 'Life Expectancy')

# Health :Total health spending as %age of Total GDP.

bottom10_health = df_country[['country','health']].sort_values('health', ascending = True).head(10)
plt1 = sns.barplot(x='country', y='health', data= bottom10_health, ax = axs[1,0])
plt1.set(xlabel = '', ylabel= 'Health')

# The GDP per capita : Calculated as the Total GDP divided by the total population.

bottom10_gdpp = df_country[['country','gdpp']].sort_values('gdpp', ascending = True).head(10)
plt1 = sns.barplot(x='country', y='gdpp', data= bottom10_gdpp, ax = axs[1,1])
plt1.set(xlabel = '', ylabel= 'GDP per capita')

# Per capita Income : Net income per person

bottom10_income = df_country[['country','income']].sort_values('income', ascending = True).head(10)
plt1 = sns.barplot(x='country', y='income', data= bottom10_income, ax = axs[1,2])
plt1.set(xlabel = '', ylabel= 'Per capita Income')


# Inflation: The measurement of the annual growth rate of the Total GDP

top10_inflation = df_country[['country','inflation']].sort_values('inflation', ascending = False).head(10)
plt1 = sns.barplot(x='country', y='inflation', data= top10_inflation, ax = axs[2,0])
plt1.set(xlabel = '', ylabel= 'Inflation')


# Exports: Exports of goods and services. Given as %age of the Total GDP

bottom10_exports = df_country[['country','exports']].sort_values('exports', ascending = True).head(10)
plt1 = sns.barplot(x='country', y='exports', data= bottom10_exports, ax = axs[2,1])
plt1.set(xlabel = '', ylabel= 'Exports')


# Imports: Imports of goods and services. Given as %age of the Total GDP

bottom10_imports = df_country[['country','imports']].sort_values('imports', ascending = True).head(10)
plt1 = sns.barplot(x='country', y='imports', data= bottom10_imports, ax = axs[2,2])
plt1.set(xlabel = '', ylabel= 'Imports')

for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation = 90)
    
plt.tight_layout()
plt.savefig('eda')
plt.show()

**Observations**
1. Haiti has the highest child mortality rate and the lowest life expectancy.
2. Niger has the highest fertility rate.
3. Eritrea has the lowest spending on health.
4. Burundi has the lowest GDP per capita.
5. The Democratic Republic of the Congo has the lowest per capita income.
6. Nigeria has the highest inflation rate.
7. Myanmar has the lowest exports and imports.

### Outlier Analysis
We will use boxplots to see how the values in each column are distributed.

In [None]:
fig, axs = plt.subplots(3,3, figsize = (15,7.5))
# Pass each column explicitly as x so the call behaves the same across seaborn versions
plt1 = sns.boxplot(x=df_country['child_mort'], ax = axs[0,0])
plt2 = sns.boxplot(x=df_country['health'], ax = axs[0,1])
plt3 = sns.boxplot(x=df_country['life_expec'], ax = axs[0,2])
plt4 = sns.boxplot(x=df_country['total_fer'], ax = axs[1,0])
plt5 = sns.boxplot(x=df_country['income'], ax = axs[1,1])
plt6 = sns.boxplot(x=df_country['inflation'], ax = axs[1,2])
plt7 = sns.boxplot(x=df_country['gdpp'], ax = axs[2,0])
plt8 = sns.boxplot(x=df_country['imports'], ax = axs[2,1])
plt9 = sns.boxplot(x=df_country['exports'], ax = axs[2,2])


plt.tight_layout()

We observe the following about the outliers.

- All variables have outliers on the upper side (high values), except life_expec, which has outliers on the lower side: life expectancy is above 50 in all but 3 countries.

In [None]:
df_country.describe()

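- As we can see, there are a number of outliers in the data.
- Keep in mind that we need to identify underdeveloped countries based on socio-economic and health factors, so simply dropping extreme rows could remove exactly the countries we are looking for.
- We will therefore cap the outliers at the 5th and 95th percentiles for the analysis.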

In [None]:
data_c = ['child_mort', 'exports', 'health', 'imports', 'income',
       'inflation', 'life_expec', 'total_fer', 'gdpp']

# Cap each column at its 5th and 95th percentiles.
# clip() avoids chained assignment, which triggers SettingWithCopyWarning
# and may fail to modify the frame at all.
for i in data_c:
    lower, upper = data[i].quantile([0.05, 0.95]).values
    data[i] = data[i].clip(lower=lower, upper=upper)

In [None]:
data.head()
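
The effect of percentile capping is easiest to see on a small series. This is a minimal sketch on made-up values (the integers 1 to 100), assuming the same 5th/95th percentile limits as above.

```python
import pandas as pd

# Hypothetical series: the values 1.0, 2.0, ..., 100.0
s = pd.Series(range(1, 101), dtype=float)

# 5th and 95th percentiles (linear interpolation by default)
low, high = s.quantile([0.05, 0.95])

# Values below `low` are raised to it, values above `high` are lowered to it
capped = s.clip(lower=low, upper=high)

print(low, high)
print(capped.min(), capped.max())
```

No rows are dropped: the series keeps its length, and only the extreme values are pulled in to the limits.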

###### There are 2 types of outliers
1. Domain specific
2. Statistical

Let's treat statistical outliers

In [None]:
# Treating (statistical) outliers
grouped_df = df_country.groupby('country')['income'].sum()
grouped_df = grouped_df.reset_index()
grouped_df.head()

In [None]:
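# Note: Q1 and Q3 here are the 5th and 95th percentiles, not the usual quartiles,
# so "IQR" is much wider than the conventional interquartile range.
# grouped_df is not used by the later cells.
Q1 = grouped_df.income.quantile(0.05)
Q3 = grouped_df.income.quantile(0.95)
IQR = Q3 - Q1
grouped_df = grouped_df[(grouped_df.income >= Q1 - 1.5*IQR) & (grouped_df.income <= Q3 + 1.5*IQR)]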
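
For comparison, the conventional IQR rule uses the 25th and 75th percentiles rather than the 5th and 95th. The sketch below applies it to a made-up income series; the helper name `iqr_bounds` and the values are hypothetical.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences based on the interquartile range."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical incomes with one obvious extreme value
income = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

lo, hi = iqr_bounds(income)
outliers = income[(income < lo) | (income > hi)]
print(lo, hi)
print(outliers.tolist())
```

With quartiles the fences are much tighter than with the 5th and 95th percentiles, so more points are flagged. Which variant is appropriate depends on how aggressively extreme countries should be treated.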

###### Rescaling the data

In [None]:
standard_scaler = StandardScaler()
df_scaled = standard_scaler.fit_transform(data)
df_scaled.shape

# 4. Clustering

#### Hopkins Statistic
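
For reference, the statistic computed by the function below can be written as follows, in the usual formulation. Here $m$ is the number of sampled points, $u_j$ is the distance from the $j$-th uniformly generated random point to its nearest data point, and $w_j$ is the distance from the $j$-th sampled data point to its nearest other data point.

$$H=\frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j+\sum_{j=1}^{m} w_j}$$

With this convention, $H$ is around 0.5 for uniformly scattered data and approaches 1 when the data are strongly clustered.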

In [None]:
#Calculating the Hopkins statistic
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan
 
def hopkins(X):
    d = X.shape[1]
    #d = len(vars) # columns
    n = len(X) # rows
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
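    for j in range(0, m):
        # Distance from a uniformly random point to its nearest data point
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][0])
        # Distance from a sampled data point to its nearest *other* data point
        # (index 0 is the point itself, at distance 0)
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])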
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H

In [None]:
hopkins(df_country.drop('country', axis = 1))

- The Hopkins statistic is high, which implies that the dataset has a strong tendency to cluster.
- Let's proceed with K-means and hierarchical clustering.

###### K-means with some arbitrary K

In [None]:
# k-means with some arbitrary k
kmeans = KMeans(n_clusters=4, max_iter=50, random_state=100)
kmeans.fit(df_scaled)

In [None]:
kmeans.labels_

## Finding optimal number of clusters

#### Sum of squared Distance (SSD)

In [None]:
ssd = []
range_n_clusters = [2,3,4,5,6,7,8,9,10,11]

for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50, random_state=50)
    kmeans.fit(df_scaled)
    
    ssd.append(kmeans.inertia_)
    
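# plot the SSD against the number of clusters (not against the list index)
plt.plot(range_n_clusters, ssd, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSD (inertia)')
plt.show()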
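
The SSD plotted here is the `inertia_` attribute of the fitted model: the sum of squared distances from each point to its assigned cluster centre. The sketch below checks that on made-up data (two synthetic blobs), so the numbers are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs in 2D
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(5.0, 0.5, size=(30, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up the centre assigned to each point, then sum the squared distances
assigned_centres = km.cluster_centers_[km.labels_]
manual_ssd = ((X - assigned_centres) ** 2).sum()

print(manual_ssd, km.inertia_)
```

Inertia always decreases as the number of clusters grows, which is why the elbow (the point where the decrease flattens) is used rather than the minimum.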

### Silhouette Analysis

$$\text{silhouette score}=\frac{p-q}{\max(p,q)}$$

$p$ is the mean distance to the points in the nearest cluster that the data point is not a part of

$q$ is the mean intra-cluster distance to all the points in its own cluster.

* The silhouette score lies between -1 and 1.

* A score closer to 1 indicates that the data point is very similar to the other data points in its cluster.

* A score closer to -1 indicates that the data point is not similar to the data points in its cluster.
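
The definition above can be checked by hand for a single point. This sketch uses four made-up points in two clusters and compares the manual value with scikit-learn's `silhouette_samples`; `p` and `q` follow the notation used above.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two tight clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

i = 0  # compute the score for the first point
dists = np.linalg.norm(X - X[i], axis=1)

# q: mean distance to the other points in the same cluster (excluding the point itself)
own = labels == labels[i]
own[i] = False
q = dists[own].mean()

# p: mean distance to the points of the nearest other cluster
p = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])

s_manual = (p - q) / max(p, q)
s_sklearn = silhouette_samples(X, labels)[i]
print(s_manual, s_sklearn)
```

`silhouette_score`, used in the loop below, is the mean of these per-point values over the whole dataset.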

In [None]:
# silhouette analysis
range_n_clusters = [2, 3, 4, 5, 6, 7, 8,9,10,11]

for num_clusters in range_n_clusters:
    
    # initialise kmeans
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50, random_state=50)
    kmeans.fit(df_scaled)
    
    cluster_labels = kmeans.labels_
    
    # silhouette score
    silhouette_avg = silhouette_score(df_scaled, cluster_labels)
    print("For n_clusters={0}, the silhouette score is {1}".format(num_clusters, silhouette_avg))
    
    

### Model with k=3

In [None]:
# final model with k=3
kmeans = KMeans(n_clusters=3, max_iter=50, random_state=50)
kmeans.fit(df_scaled)

In [None]:
kmeans.labels_

In [None]:
# assign the label
data['cluster_id'] = kmeans.labels_
data.head()

In [None]:
#plot
sns.boxplot(x='cluster_id', y='income', data=data)

**Observations**
- Cluster 1 comprises the highest-income countries, cluster 0 the middle-income countries and cluster 2 the lowest-income countries.
- There is a huge gap between the highest-income and lowest-income countries.

In [None]:
#plot
sns.boxplot(x='cluster_id', y='gdpp', data=data)

In [None]:
#plot
sns.boxplot(x='cluster_id', y='child_mort', data=data)

In [None]:
fig, axs = plt.subplots(3,3, figsize = (15,7.5))
plt1 = sns.boxplot(x='cluster_id', y = 'child_mort', data=data, ax = axs[0,0])
plt2 = sns.boxplot(x='cluster_id', y = 'health', data=data, ax = axs[0,1])
plt3 = sns.boxplot(x='cluster_id', y = 'life_expec', data=data, ax = axs[0,2])
plt4 = sns.boxplot(x='cluster_id', y = 'total_fer', data=data, ax = axs[1,0])
plt5 = sns.boxplot(x='cluster_id', y = 'income', data=data, ax = axs[1,1])
plt6 = sns.boxplot(x='cluster_id', y = 'inflation', data=data, ax = axs[1,2])
plt7 = sns.boxplot(x='cluster_id', y = 'gdpp', data=data, ax = axs[2,0])
plt8 = sns.boxplot(x='cluster_id', y = 'imports', data=data, ax = axs[2,1])
plt9 = sns.boxplot(x='cluster_id', y = 'exports', data=data, ax = axs[2,2])


plt.tight_layout()

**Observations**
1. The boxplots above show a direct relation between the socio-economic factors.
2. The cluster with the highest income has higher health spending, the highest gdpp, higher life expectancy, higher imports and exports, the lowest child mortality, and lower inflation and fertility rates.
3. Conversely, the countries with lower gdpp, imports and exports have lower income and higher child mortality, inflation and fertility.

- Cluster 1 contains the better-placed countries in terms of socio-economic status, health and development.
- **Cluster 2 countries need help with development, as this group has the highest child mortality, the lowest gdpp and the lowest income.**
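
K-means cluster ids are arbitrary and can change with the random seed, so it can be safer to identify the neediest cluster from its profile than to hard-code an id. The sketch below does this on a made-up frame; the values are hypothetical and only the column names match the notebook.

```python
import pandas as pd

# Hypothetical toy result: six countries already assigned to three clusters
toy = pd.DataFrame({
    "cluster_id": [0, 0, 1, 1, 2, 2],
    "child_mort": [20.0, 30.0, 4.0, 6.0, 90.0, 110.0],
    "gdpp": [5000.0, 7000.0, 40000.0, 50000.0, 500.0, 700.0],
    "income": [9000.0, 11000.0, 38000.0, 45000.0, 1200.0, 1500.0],
})

# Mean of each indicator per cluster
profile = toy.groupby("cluster_id")[["child_mort", "gdpp", "income"]].mean()

# The cluster with the highest mean child mortality is the candidate for aid
neediest = profile["child_mort"].idxmax()
print(profile)
print(neediest)
```

On the real data, the same `groupby` on `data` gives a numeric summary of what the boxplots show.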

### Let's see which are those countries


In [None]:
df_country['cluster_id'] = kmeans.labels_
cluster_2_kmeans = df_country[['cluster_id','country', 'child_mort', 'gdpp', 'income' ]].loc[df_country['cluster_id'] == 2].reset_index()
cluster_2_kmeans.shape

**There are 48 countries in cluster 2 which need help for development as per K-means Clustering**

In [None]:
cluster_2_kmeans.sort_values(by = ['child_mort', 'gdpp', 'income'], ascending = [False, True, True]).head(5)

**Haiti has the greatest need for aid from the HELP foundation, followed by Sierra Leone, Chad, the Central African Republic and Mali, among other (mostly African) countries.**

# Hierarchical Clustering

In [None]:
df_scaled = pd.DataFrame(df_scaled)
df_scaled.head()

In [None]:
data.head()

###### Single Linkage

In [None]:
# single linkage
mergings = linkage(df_scaled, method="single", metric='euclidean')
dendrogram(mergings)
plt.show()

###### Complete Linkage

In [None]:
# complete linkage
mergings = linkage(df_scaled, method="complete", metric='euclidean')
dendrogram(mergings)
plt.show()

In [None]:
# 3 clusters
cluster_labels = cut_tree(mergings, n_clusters=3).reshape(-1, )
cluster_labels
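
SciPy offers a second way to cut a dendrogram into a fixed number of clusters, `fcluster` with `criterion="maxclust"`. The sketch below compares it with `cut_tree` on made-up one-dimensional data. Note that `fcluster` labels start at 1, so the two results are compared as partitions rather than by label value.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree, fcluster

# Hypothetical toy data: three obvious groups on a line
X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])

Z = linkage(X, method="complete", metric="euclidean")

labels_cut = cut_tree(Z, n_clusters=3).reshape(-1)     # as used in this notebook
labels_fc = fcluster(Z, t=3, criterion="maxclust")     # alternative; labels start at 1

# If both cuts yield the same partition, each cut_tree label pairs with exactly one fcluster label
pairs = set(zip(labels_cut.tolist(), labels_fc.tolist()))
print(labels_cut, labels_fc, len(pairs))
```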

In [None]:
# assign cluster labels
data['cluster_labels'] = cluster_labels
data.head()

In [None]:
# plots
fig, axs = plt.subplots(3,3, figsize = (15,7.5))
plt1 = sns.boxplot(x='cluster_labels', y = 'child_mort', data=data, ax = axs[0,0])
plt2 = sns.boxplot(x='cluster_labels', y = 'health', data=data, ax = axs[0,1])
plt3 = sns.boxplot(x='cluster_labels', y = 'life_expec', data=data, ax = axs[0,2])
plt4 = sns.boxplot(x='cluster_labels', y = 'total_fer', data=data, ax = axs[1,0])
plt5 = sns.boxplot(x='cluster_labels', y = 'income', data=data, ax = axs[1,1])
plt6 = sns.boxplot(x='cluster_labels', y = 'inflation', data=data, ax = axs[1,2])
plt7 = sns.boxplot(x='cluster_labels', y = 'gdpp', data=data, ax = axs[2,0])
plt8 = sns.boxplot(x='cluster_labels', y = 'imports', data=data, ax = axs[2,1])
plt9 = sns.boxplot(x='cluster_labels', y = 'exports', data=data, ax = axs[2,2])


plt.tight_layout()

**Observations**
- Here cluster 2 contains the better-placed countries in terms of socio-economic status, while cluster 0 is the group of countries that need help: it has the lowest gdpp and income and the highest child mortality rates.

**Below are the countries that need aid from the HELP Foundation and on which the CEO should focus.**

In [None]:
df_country['cluster_labels'] = cluster_labels
cluster_0_hc = df_country[['cluster_labels','country', 'child_mort', 'gdpp', 'income' ]].loc[df_country['cluster_labels'] == 0].reset_index()
cluster_0_hc.shape

In [None]:
cluster_0_hc.sort_values(by = ['child_mort', 'gdpp', 'income'], ascending = [False, True, True]).head(5)
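
One way to quantify how closely two clusterings agree is a cross-tabulation of the labels together with the adjusted Rand index, which ignores how the clusters happen to be numbered. The label lists below are made up for illustration. On the real data one would pass `df_country['cluster_id']` and `df_country['cluster_labels']` instead.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for seven countries from the two methods
kmeans_labels = [1, 1, 0, 0, 2, 2, 2]
hc_labels = [2, 2, 1, 1, 0, 0, 1]

# Rows: K-means clusters, columns: hierarchical clusters
agreement = pd.crosstab(
    pd.Series(kmeans_labels, name="kmeans"),
    pd.Series(hc_labels, name="hierarchical"),
)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(kmeans_labels, hc_labels)
print(agreement)
print(ari)
```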

### Both K-means and hierarchical clustering with K=3 gave similar results. They yield a group of countries that we can recommend to the CEO of HELP International as the top priority.

- There are 43-48 underdeveloped countries that the CEO of HELP International should focus on.

**Top 5 of those are:**
1. Haiti
2. Sierra Leone
3. Chad
4. Central African Republic
5. Mali

among others, chiefly African countries.

Thank you for going through this notebook. :)