<h2 style='text-align:center;font-size:40px;background-color:#5F4B8B;border:10px;color:white'>Clustering, Outlier Analysis and EDA</h2>
<h1 style='text-align:center;font-size:20px;background-color:#5F4B8B;border:10px;color:white'>This notebook is my attempt to perform clustering and EDA on the dataset provided by HELP International, to assist them in making an informed choice about their budget expenditure. The core aim of this notebook is to come up with the names of the countries that most deserve to be helped.</h1>

<h2 style='text-align:center;font-size:30px;background-color:#FE7176;border:20px;color:white'>OBJECTIVES</h2>
    
#### The main task is to cluster the countries on the basis of socio-economic factors and provide the NGO with a list of countries in need of help.
- Data inspection and EDA tasks suitable for this dataset - data cleaning, univariate analysis, bivariate analysis etc.
- Outlier Analysis: Performing the Outlier Analysis on the dataset.
- Build models using both K-means and Hierarchical clustering (both single and complete linkage) on this dataset to create the clusters.
- Analyse the clusters and identify the ones which are in dire need of aid. 
- Perform visualisations on the clusters that have been formed using the features selected for building the clustering model.

<h2 style='text-align:center;font-size:30px;background-color:#FE7176;border:20px;color:white'>DATA DICTIONARY</h2>
    
#### `Country-data.csv` contains the economic and health data of each country in the dataset
    
- **country** : Name of the country
- **child_mort** : Death of children under 5 years of age per 1000 live births
- **exports** : Exports of goods and services per capita, given as a percentage of the GDP per capita
- **health** : Total health spending per capita, given as a percentage of the GDP per capita
- **imports** : Imports of goods and services per capita, given as a percentage of the GDP per capita
- **income** : Net income per person
- **inflation** : The measurement of the annual growth rate of the Total GDP
- **life_expec** : The average number of years a new born child would live if the current mortality patterns are to remain the same
- **total_fer** : The number of children that would be born to each woman if the current age-fertility rates remain the same
- **gdpp** : The GDP per capita. Calculated as the Total GDP divided by the total population

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>IMPORTING LIBRARIES</h2>

In [None]:
# Data Analysis & Data wrangling
import numpy as np
import pandas as pd
import missingno as mn
from random import sample
from numpy.random import uniform
from math import isnan

# Static Visualization
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import matplotlib.cm as cm
%matplotlib inline

# Plotly Libraries
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as FF
from plotly import tools
from plotly.colors import n_colors
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
from IPython.display import display, HTML
init_notebook_mode(connected=True)

# ML Libraries 
# SKLearn
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.neighbors import NearestNeighbors

# SciPy
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [None]:
# Ignoring warnings
import warnings
warnings.filterwarnings('ignore')

# Setting up the view options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', False)

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>DATA GATHERING AND INSPECTION</h2>

In [None]:
# Reading the dataset
country = pd.read_csv('../input/help-international/Country-data.csv')

country.head(10)

In [None]:
# Checking the shape of the dataframe
country.shape

In [None]:
# Inspecting the distribution of numerical values
country.describe()

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>DATA CLEANING</h2>

In [None]:
# Checking for duplicate rows of data
country_duplicate = country.copy()
country_duplicate.drop_duplicates(subset=None, inplace=True)
country_duplicate.shape

# Null Value Visualization
mn.matrix(country)

# Checking Column wise null values
country.isnull().sum()

# Checking row wise null values
(country.isnull().sum(axis=1) * 100 / len(country)).value_counts(ascending=False)

<div class="alert alert-block alert-success">
    <span style='font-family:Georgia'>
        <b>Result: </b><br>
        - There are no missing or null values in the dataframe.<br>
        - There are no duplicate values in the dataset.
    </span>    
</div>

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>UNIVARIATE ANALYSIS</h2>

In [None]:
# Function for visualizing the distribution of numerical columns
def uni(df,col,v,hue =None):

    sns.set(style="darkgrid")
    
    if v == 0:
        fig, ax=plt.subplots(nrows =1,ncols=3,figsize=(20,8))
        ax[0].set_title("Distribution Plot")
        sns.distplot(df[col],ax=ax[0], color="#4FAAA7")
        plt.yscale('log')
        ax[1].set_title("Violin Plot")
        sns.violinplot(data =df, x=col,ax=ax[1], inner="quartile", color="#9DE4AC")
        plt.yscale('log')
        ax[2].set_title("Box Plot")
        sns.boxplot(data =df, x=col,ax=ax[2],orient='v', color="#CBFC53")
        plt.yscale('log')
    
    elif v == 1:
        temp = pd.Series(data = hue)
        fig, ax = plt.subplots()
        width = len(df[col].unique()) + 6 + 4*len(temp.unique())
        fig.set_size_inches(width , 7)
        ax = sns.countplot(data = df, x= col, color="#4CB391", order=df[col].value_counts().index,hue = hue) 
        if len(temp.unique()) > 0:
            for p in ax.patches:
                # Annotate each bar with its share of all rows in df
                ax.annotate('{:1.1f}%'.format((p.get_height()*100)/float(len(df))), (p.get_x()+0.05, p.get_height()+20))  
        else:
            for p in ax.patches:
                ax.annotate(p.get_height(), (p.get_x()+0.32, p.get_height()+20)) 
        del temp
        
    plt.show()
    

# distribution of 'child_mort' column
uni(df=country,col='child_mort',v=0)

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        Child Mortality values have some outliers.
    </span>    
</div>

In [None]:
# distribution of 'health' column

uni(df=country,col='health',v=0)

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        Health values have only a few outliers.
    </span>    
</div>

In [None]:
# distribution of 'income' column

uni(df=country,col='income',v=0)

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        Income values have several outliers.
    </span>    
</div>

In [None]:
# distribution of 'inflation' column

uni(df=country,col='inflation',v=0)

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        Inflation values have some outliers.
    </span>    
</div>

In [None]:
# distribution of 'life_expec' column
uni(df=country,col='life_expec',v=0)

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        Life expectancy values have outliers only below the first quartile.
    </span>    
</div>

In [None]:
# distribution of 'total_fer' column

uni(df=country,col='total_fer',v=0)

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        Total fertility values have only one outlier.
    </span>    
</div>

In [None]:
# distribution of 'gdpp' column

uni(df=country,col='gdpp',v=0)

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        GDP values have several outliers.
    </span>    
</div>

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <h1>
            Except for Total Fertility and Life Expectancy, all other parameters have some outliers.</h1>
    </span>    
</div>

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>BIVARIATE AND GEOGRAPHICAL ANALYSIS</h2>

In [None]:
# CHILD MORTALITY
temp = country[['country','child_mort']]
# 20 largest child_mort values   
temps = temp.nlargest(20, columns=['child_mort'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['child_mort'].values,
    marker_color=colors 
)])
fig.update_layout(title_text='20 Countries with the HIGHEST Child Mortality rate')
fig.show()

In [None]:
# Plotting World Map
df_fed = country.groupby('country')['child_mort'].sum().reset_index()

fig = px.choropleth(df_fed, locations="country",
                    color="child_mort",
                    locationmode = 'country names',
                    hover_name="country", 
                    color_continuous_scale="Reds",
                    title = 'Country wise Child Mortality Rate')
fig.show()

In [None]:
# GDP
temp = country[['country','gdpp']]
    
# smallest 20 values
temps = temp.nsmallest(20, columns=['gdpp'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['gdpp'].values,
    marker_color=colors
)])
fig.update_layout(title_text='20 Countries with the LOWEST GDP per capita')
fig.show()

In [None]:
#exports
temp = country[['country','exports']]
    

temps = temp.nsmallest(20, columns=['exports'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['exports'].values,
    marker_color=colors 
)])
fig.update_layout(title_text='20 Countries with the LOWEST export')
fig.show()

In [None]:
#health
temp = country[['country','health']]
    

temps = temp.nsmallest(20, columns=['health'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['health'].values,
    marker_color=colors
)])
fig.update_layout(title_text='20 Countries that spend the LEAST on Healthcare')
fig.show()

In [None]:
# Imports
temp = country[['country','imports']]
    

temps = temp.nsmallest(20, columns=['imports'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['imports'].values,
    marker_color=colors 
)])
fig.update_layout(title_text='20 Countries with the LOWEST import')
fig.show()

In [None]:
temp = country[['country','income']]
    

temps = temp.nsmallest(20, columns=['income'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['income'].values,
    marker_color=colors # marker color can be a single color value or an iterable
)])
fig.update_layout(title_text='20 Countries with the LOWEST income')
fig.show()

In [None]:
# inflation
temp = country[['country','inflation']]
    

temps = temp.nlargest(20, columns=['inflation'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['inflation'].values,
    marker_color=colors 
)])
fig.update_layout(title_text='20 Countries with the HIGHEST inflation')
fig.show()

In [None]:
temp = country[['country','life_expec']]
    

temps = temp.nsmallest(20, columns=['life_expec'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['life_expec'].values,
    marker_color=colors # marker color can be a single color value or an iterable
)])
fig.update_layout(title_text='20 Countries with the LOWEST Life Expectancy')
fig.show()

In [None]:
temp = country[['country','total_fer']]
    

temps = temp.nlargest(20, columns=['total_fer'])
temps.reset_index(drop=True, inplace=True)

colors = ['#FE7176'] * 20
colors[0] = 'crimson'
colors[1] = 'crimson'
colors[2] = 'crimson'
colors[10] = '#FF956A'
colors[11] = '#FF956A'
colors[12] = '#FF956A'
colors[13] = '#FF956A'
colors[14] = '#FF956A'
colors[15] = '#FF956A'
colors[16] = '#FF956A'
colors[17] = '#FF956A'
colors[18] = '#FF956A'
colors[19] = '#84D2C3'

fig = go.Figure(data=[go.Bar(
    x=temps['country'].values,
    y=temps['total_fer'].values,
    marker_color=colors 
)])
fig.update_layout(title_text='20 Countries with the HIGHEST fertility')
fig.show()

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <h1><b>Insights: </b><br>
            From the world map and the bar charts we can see that Central and West African countries sit at the extremes of many indicators.</h1>
    </span>    
</div>
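
The bar-chart cells above each rebuild the same 20-entry colour list by hand. A minimal sketch of a helper that produces that list is shown here. `rank_colors` is a hypothetical name that does not appear in the original notebook, and the band boundaries are read off the assignments above (ranks 0-2 `crimson`, 3-9 `#FE7176`, 10-18 `#FF956A`, 19 `#84D2C3`).

```python
def rank_colors(n=20):
    """Return one colour per bar, banded by rank (hypothetical helper)."""
    colors = ['#FE7176'] * n          # default band (ranks 3-9 when n == 20)
    for i in range(min(3, n)):        # the three most extreme ranks
        colors[i] = 'crimson'
    for i in range(10, min(19, n)):   # ranks 10-18
        colors[i] = '#FF956A'
    if n >= 20:
        colors[19] = '#84D2C3'        # rank 19
    return colors

palette = rank_colors()
print(palette[0], palette[5], palette[12], palette[19])
```

With such a helper, each chart could pass `marker_color=rank_colors()` instead of repeating the assignments.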

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>MULTIVARIATE ANALYSIS</h2>

In [None]:
sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5})

In [None]:
# plotting a pair plot
fig = FF.create_scatterplotmatrix(country.iloc[:,1:10], diag='box', size=2, height=1100, width=1100)
iplot(fig)

In [None]:
# Plotting a correlation matrix
plt.figure(figsize = (17, 13))
# Drop the non-numeric 'country' column before computing correlations
sns.heatmap(country.drop(columns='country').corr(), annot = True, cmap="Wistia")
plt.savefig('Correlation')
plt.show()

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insights: </b><br>
        <ul>
            <li> GDP and Income have a high positive correlation (0.9). This means countries where people have high income also have high GDP per capita. </li>
            <li> Life Expectancy and Child Mortality have a high negative correlation (-0.89).</li>
            <li> Total Fertility and Child Mortality have a high positive correlation. </li>
            <li> Fertility (number of children per woman) is also negatively correlated with life expectancy. </li>
            <li> Exports and Imports have a high correlation. </li>
            <li> Income has a positive correlation (0.61) with life expectancy and a negative correlation (-0.52) with child mortality. This suggests that countries with higher income can spend more on healthcare, which reduces child mortality and increases average life expectancy. </li>
            <li> GDP is negatively correlated with fertility (-0.45), suggesting that families in developed countries tend to have fewer children than those in underdeveloped countries. </li>            
        </ul>
    </span>    
</div>
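
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr`. As a minimal sketch of what that number measures, the dependency-free function below computes it for two lists. The function name `pearson` and the toy values are assumptions for illustration and are not taken from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values, not taken from the dataset: one rises exactly as the other falls
income = [1.0, 2.0, 3.0, 4.0]
child_mort = [80.0, 60.0, 40.0, 20.0]
r = pearson(income, child_mort)
print(r)
```

A value near -1 means the two features move in opposite directions almost linearly, near +1 means they move together, and near 0 means there is little linear relationship.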

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>OUTLIER TREATMENT</h2>

In [None]:
# Checking for outliers
features = country.columns[1:]
fig = make_subplots(rows=3, cols=3)
count = 0

for i in range(1,4):
    for j in range (1,4):
        col = features[count]
        count = count+1
        fig.add_trace(
            go.Violin(y=country[col],
                      box_visible=True, 
                      line_color='black',
                       meanline_visible=True,
                      fillcolor='#3AD44D', 
                      opacity=0.6,
                      x0=col
                     ),row=i, col=j)
fig.update_layout(height=800, width=800, title_text="Distribution of Numerical Columns")
fig.update_traces(showlegend=False)

fig.show()        

<div class="alert alert-block alert-danger">
    <span style='font-family:Georgia'>
        <b>Warning: </b><br>
        Removing outliers will change the ranking of a few countries with respect to their need for financial aid. If we treat the outliers by deleting rows based on IQR values, we will remove a few countries that genuinely deserve financial aid. If we do not treat the outliers at all, they can distort the clustering model, because the presence of outliers shifts the cluster centroids in K-Means. <br><br>
    </span>    
</div>

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Approach: </b><br>
        Instead of deleting the outliers, we will cap only the necessary features (at the upper or lower end, based on feature importance). Capping uses the 1st percentile for the lower values and the 99th percentile for the upper values. <br><br>
    </span>    
</div>
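
To make the capping idea concrete, here is a minimal dependency-free sketch, assuming linear interpolation between ranks (the default behaviour of pandas `quantile`). The helper names `quantile` and `cap_upper` and the toy values are illustrative assumptions, not part of the original notebook.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def cap_upper(values, q=0.99):
    """Replace every value above the q-th quantile with that quantile."""
    limit = quantile(values, q)
    return [min(v, limit) for v in values]

# Toy values, not taken from the dataset: 0..100 plus one extreme value
toy = list(range(101)) + [1000]
capped = cap_upper(toy)
print(max(toy), max(capped))
```

Capping keeps every row in the data, so no country is dropped, while the extreme value no longer dominates distance calculations.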

In [None]:
#Capping Values
ugdpp = country['gdpp'].quantile(0.99)
uincome = country['income'].quantile(0.99)
uhealth = country['health'].quantile(0.99)
uimport = country['imports'].quantile(0.99)
uexport = country['exports'].quantile(0.99)
uinflation = country['inflation'].quantile(0.99)

print('Total number of rows getting capped for gdpp : ', len(country[country['gdpp']>ugdpp]))
print('Total number of rows getting capped for income : ', len(country[country['income']>uincome]))
print('Total number of rows getting capped for health : ', len(country[country['health']>uhealth]))
print('Total number of rows getting capped for imports : ', len(country[country['imports']>uimport]))
print('Total number of rows getting capped for exports : ', len(country[country['exports']>uexport]))
print('Total number of rows getting capped for inflation : ', len(country[country['inflation']>uinflation]))


# Capping the upper values of all six features at their 99th percentile
# (clip avoids chained assignment, which pandas may apply to a copy)
country['gdpp'] = country['gdpp'].clip(upper=ugdpp)
country['income'] = country['income'].clip(upper=uincome)
country['health'] = country['health'].clip(upper=uhealth)
country['imports'] = country['imports'].clip(upper=uimport)
country['exports'] = country['exports'].clip(upper=uexport)
country['inflation'] = country['inflation'].clip(upper=uinflation)

In [None]:
# Checking for outliers after capping
features = country.columns[1:]
fig = make_subplots(rows=3, cols=3)
count = 0

for i in range(1,4):
    for j in range (1,4):
        col = features[count]
        count = count+1
        fig.add_trace(
            go.Violin(y=country[col],
                      box_visible=True, 
                      line_color='black',
                       meanline_visible=True,
                      fillcolor='#FEFD00', 
                      opacity=0.6,
                      x0=col
                     ),row=i, col=j)
fig.update_layout(height=800, width=800, title_text="Distribution of Numerical Columns")
fig.update_traces(showlegend=False)

fig.show()

In [None]:
# Checking the distribution after capping
plt.figure(figsize=(15, 15))
features = country.columns[1:]
for idx, feature in enumerate(features):
    ax = plt.subplot(3, 3, idx + 1)
    sns.distplot(country[feature], color='#3AD44D')
    plt.xticks(rotation=20)

<div class="alert alert-block alert-success">
    <span style='font-family:Georgia'>
        <b>Result: </b><br>
        Some outliers remain after capping, but they should not materially affect our model.<br><br>
    </span>    
</div>

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>SCALING</h2>

In [None]:
# creating a dataframe with only numerical columns
country_num = country.drop(['country'], axis=1)
features = country_num.columns
country_num.head(2)

In [None]:
scaler = StandardScaler()

# fit_transform
country_scaled = scaler.fit_transform(country_num)
country_scaled

In [None]:
#Checking the scaled data
country_scaled = pd.DataFrame(country_scaled)
country_scaled.columns = features
country_scaled.head()

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>HOPKINS STATISTICS</h2>

  ##### To check cluster tendency, we use the Hopkins test.
The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set. It acts as a statistical hypothesis test where the null hypothesis is that the data is generated by a Poisson point process and is thus uniformly randomly distributed. A value close to 1 tends to indicate that the data is highly clustered, random data tends to result in values around 0.5, and uniformly distributed data tends to result in values close to 0.

- If the value is between 0.01 and 0.3, the data is regularly spaced.

- If the value is around 0.5, the data is random.

- If the value is between 0.7 and 0.99, the data has a high tendency to cluster.
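
The thresholds above follow from how the statistic is formed. As a minimal sketch, assuming the same ratio this notebook uses (sum of distances from random points, divided by that sum plus the sum of distances from real data points), the hypothetical helper below computes the score from two lists of nearest-neighbour distances. The toy distances are made up for illustration.

```python
def hopkins_from_distances(u_dists, w_dists):
    """Hopkins statistic from two lists of nearest-neighbour distances.

    u_dists: distances from uniformly random points to the nearest data point
    w_dists: distances from sampled data points to their nearest other data point
    """
    total_u = sum(u_dists)
    total_w = sum(w_dists)
    return total_u / (total_u + total_w)

# Toy distances, not computed from the dataset
h_random = hopkins_from_distances([1.0, 1.0], [1.0, 1.0])     # both kinds of distance are alike
h_clustered = hopkins_from_distances([3.0, 3.0], [1.0, 1.0])  # data points sit much closer together
print(h_random, h_clustered)
```

When real points are packed into clusters, their nearest neighbours are close while random points land in empty space, so the ratio moves towards 1.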

In [None]:
def hopkins(X):
    d = X.shape[1]    # number of features (columns)
    n = len(X)        # number of observations (rows)
    m = int(0.1 * n)  # sample size: 10% of the rows
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
    for j in range(0, m):
        # Distance from a uniformly random point to its nearest real data point
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        # Distance from a sampled data point to its nearest neighbour other than itself
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H

In [None]:
# Hopkins Score for Scaled Features
hopkins(country_scaled)

In [None]:
# Hopkins Score for Unscaled Features
hopkins(country_num)

<div class="alert alert-block alert-success">
    <span style='font-family:Georgia'>
        <b>Result: </b><br>
        Our Hopkins Score is above 0.7 and closer to 1 which means our data has a high tendency to cluster.<br><br>
    </span>    
</div>

<h2 style='text-align:center;font-size:40px;background-color:crimson;border:20px;color:white'>K-MEANS CLUSTERING</h2>
    
   ##### K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

The algorithm works as follows:

First we initialize k points, called means, randomly. We categorize each item to its closest mean and we update the mean's coordinates, which are the averages of the items categorized in that mean so far. We repeat the process for a given number of iterations and at the end, we have our clusters.
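
The steps described above can be sketched in a few lines of plain Python. This is an illustrative batch variant with random initial means, not scikit-learn's implementation (the model built later in this notebook uses `init='k-means++'`). The function name `kmeans`, its parameters, and the toy points are assumptions for illustration.

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain batch K-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    means = rng.sample(points, k)              # random initial means
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest mean
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, means[c])),
            )
        # Update step: each mean becomes the average of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                        # keep the old mean if a cluster is empty
                means[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, means

# Toy 2-D points, not taken from the dataset: two well-separated groups
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, means = kmeans(toy, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only which points share a label is meaningful.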

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>SSD (ELBOW CURVE)</h2>
 
##### For the Optimal Number of Clusters we use Elbow Curve.
    
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
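
The quantity plotted on an elbow curve is the sum of squared distances (SSD) from each point to its assigned cluster centre, which scikit-learn exposes as `inertia_`. The sketch below, with a hypothetical helper `ssd` and made-up one-dimensional points, shows how the value drops when a second cluster is added.

```python
def ssd(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Toy 1-D example, not taken from the dataset
points = [(1.0,), (3.0,), (10.0,), (12.0,)]
one_cluster = ssd(points, [0, 0, 0, 0], [(6.5,)])
two_clusters = ssd(points, [0, 0, 1, 1], [(2.0,), (11.0,)])
print(one_cluster, two_clusters)
```

SSD always falls as k grows, so the elbow is the point after which extra clusters stop producing a sharp drop.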

In [None]:
# elbow-curve/SSD
ssd = []
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50)
    kmeans.fit(country_scaled)
    
    ssd.append(kmeans.inertia_)
    
# plot the SSDs for each n_clusters
ssddf = pd.DataFrame(ssd)
ssddf.columns = ['SSD']
fig = go.Figure(data=go.Scatter(x=range_n_clusters, y=ssddf['SSD']))
fig.update_layout(height=500, width=800, title_text="SSD/Elbow Curve", shapes=[
    dict(
      type= 'line',
      yref= 'paper', y0= 0, y1= 1,
      xref= 'x', x0= 4, x1= 4
    )
] )
fig.show()

<div class="alert alert-block alert-success">
    <span style='font-family:Georgia'>
        <b>Result: </b><br>
        Looking at the above elbow curve it looks good to proceed with 4 or 5 clusters.<br><br>
    </span>    
</div>

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>SILHOUETTE ANALYSIS</h2>

### Equation:
<span style="font-size:6mm" >
    <span style ='font-family:Georgia'>
        <font color = blue > Silhouette score =
            <span style="display: inline-block;vertical-align: middle;">
                <div style="text-align: center;border-bottom: 1px solid black;">p-q</div>
                <div style="text-align: center;">max(p,q)</div> 
            </span>
        </font>
    </span>
</span>
   
    
    
    
<p>
    <span style='font-family:Georgia'>
        <b>p</b>  is the mean distance to the points in the nearest cluster that the data point is not a part of <br>
        <b>q</b>  is the mean intra-cluster distance to all the points in its own cluster.
    </span>
</p>

<span style='font-family:Georgia'>
    <ul>
        <li>The silhouette score ranges from -1 to 1.</li>
        <li>A score closer to 1 indicates that the data point is very similar to other data points in the cluster</li>
        <li>A score closer to -1 indicates that the data point is not similar to the data points in its cluster</li>
    </ul>
</span>
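
As a minimal sketch of the formula above for a single data point, using the same p and q definitions, the hypothetical helper below works on plain tuples. `silhouette_score` from scikit-learn reports the mean of this value over all samples. The toy points are made up for illustration.

```python
from math import dist

def silhouette_of_point(point, own_cluster, other_clusters):
    """Silhouette score (p - q) / max(p, q) for a single point.

    own_cluster: the other members of the point's cluster (point itself excluded)
    other_clusters: list of clusters the point does not belong to
    """
    # q: mean intra-cluster distance
    q = sum(dist(point, o) for o in own_cluster) / len(own_cluster)
    # p: mean distance to the nearest cluster the point is not part of
    p = min(
        sum(dist(point, o) for o in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (p - q) / max(p, q)

# Toy 1-D example, not taken from the dataset
score = silhouette_of_point((0.0,), own_cluster=[(1.0,)], other_clusters=[[(9.0,), (11.0,)]])
print(score)
```

Here the point is 1 unit from its own cluster and 10 units on average from the other one, so the score is close to 1.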
    


In [None]:
sse_ = []
for k in range(2, 15):
    kmeans = KMeans(n_clusters=k).fit(country_scaled)
    sse_.append([k, silhouette_score(country_scaled, kmeans.labels_)])



# Column 0 holds k, column 1 holds the silhouette score
ssedf = pd.DataFrame(sse_)
fig = go.Figure(data=go.Scatter(x=ssedf[0], y=ssedf[1]))
fig.update_layout(height=500, width=800, title_text="Silhouette Analysis", shapes=[
    dict(
      type= 'line',
      yref= 'paper', y0= 0, y1= 1,
      xref= 'x', x0= 4, x1= 4
    )
])
fig.show()

<div class="alert alert-block alert-success">
    <span style='font-family:Georgia'>
        <b>Result: </b><br>
         From the Silhouette analysis we can see that 4 clusters are optimal for our model. 
    </span>    
</div>

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>MODEL BUILDING</h2>

In [None]:
kmeans_4 = KMeans(n_clusters=4, max_iter=500, init='k-means++', n_init=10, random_state= 350)
kmeans_4.fit(country_scaled)
kmeans_4.labels_

In [None]:
country['KCluster_4_Label'] = kmeans_4.labels_
country.head()

In [None]:
country['KCluster_4_Label'].value_counts()

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>CLUSTER ANALYSIS</h2>

In [None]:
im = country['KCluster_4_Label'].value_counts()
# Take the labels from value_counts itself so each label stays paired with its count
df = pd.DataFrame({'labels': im.index.astype(str),'values': im.values})
df.iplot(kind='pie',labels='labels',values='values', title='K-Means Clustering - Cluster Size Comparison', hole = 0.5, colors=['#63D7CF','#FD7B80', '#FCBF8A', '#F7EDCD'])

In [None]:
df_fed = country.groupby('country')['KCluster_4_Label'].sum().reset_index()

fig = px.choropleth(df_fed, locations='country',
                    color='KCluster_4_Label',
                    locationmode = 'country names',
                    hover_name='country', 
                    color_continuous_scale="picnic_r",
                    title = 'Cluster of Countries')
fig.show()

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        We can see that majority of African countries are in Cluster-1.
    </span>    
</div>

In [None]:
#Checking the spread and density of clusters
plt.figure(figsize=(18, 8))
plt.subplot(1, 3, 1)
sns.swarmplot(x='KCluster_4_Label', y='child_mort', data=country, palette="cool")
plt.subplot(1, 3, 2)
sns.swarmplot(x='KCluster_4_Label', y='gdpp', data=country, palette="cool")
plt.subplot(1, 3, 3)
sns.swarmplot(x='KCluster_4_Label', y='income', data=country, palette="cool")

plt.show()

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
    We can see that Cluster 1 has the highest child mortality rate.
    </span>    
</div>

In [None]:
df_fed = country.groupby('country')[['child_mort', 'income', 'gdpp', 'KCluster_4_Label']].sum().reset_index()
fig = px.scatter(df_fed, x="gdpp", y="child_mort", color='KCluster_4_Label', hover_data=['country'], color_continuous_scale='sunsetdark', title="GDP vs Child Mortality of Clusters")
fig.show()

In [None]:
df_fed = country.groupby('country')[['child_mort', 'income', 'gdpp', 'KCluster_4_Label']].sum().reset_index()
fig = px.scatter(df_fed, x="income", y="child_mort", color='KCluster_4_Label', hover_data=['country'],title="Income vs Child Mortality of Clusters", color_continuous_scale='sunsetdark')
fig.show()

In [None]:
df_fed = country.groupby('country')[['child_mort', 'income', 'gdpp', 'KCluster_4_Label']].sum().reset_index()
fig = px.scatter(df_fed, x="gdpp", y="income", color='KCluster_4_Label', size='child_mort', hover_data=['country'],title="GDP vs Income of Clusters sized on Child Mortality Rate", color_continuous_scale='sunsetdark')
fig.show()

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
    From the scatter plot we now know that we have to focus on cluster 1.
    </span>    
</div>

In [None]:
df_fed = country.groupby('country')[['child_mort', 'income', 'gdpp', 'KCluster_4_Label']].sum().reset_index()
fig = px.scatter_3d(df_fed, x="gdpp", y="income", color='KCluster_4_Label', z='child_mort', hover_data=['country'],title="GDP vs Income vs Child Mortality Rate of Clusters", color_continuous_scale='rainbow', opacity=0.6)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
    We can see from the 3D Scatter plot that cluster 1 has the highest child mortality rate as well as lowest income and GDP as compared to other clusters
    </span>    
</div>
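
The visual reading above can be cross-checked numerically by comparing per-cluster means. The snippet below is a minimal sketch on a made-up toy frame that only borrows the notebook's column names; in the notebook itself the same calls would be applied to `country`.

```python
import pandas as pd

# Toy data for illustration only (values are invented, not taken from the dataset)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "child_mort": [90.0, 110.0, 20.0, 15.0, 5.0, 4.0],
    "income": [1200, 900, 9000, 11000, 40000, 45000],
    "gdpp": [500, 400, 5000, 6000, 38000, 42000],
    "KCluster_4_Label": [1, 1, 0, 0, 2, 2],
})

# Mean of each indicator per cluster
means = toy.groupby("KCluster_4_Label")[["child_mort", "income", "gdpp"]].mean()

# Label of the cluster with the highest child mortality and the lowest income / GDP per capita
worst = {
    "highest_child_mort": means["child_mort"].idxmax(),
    "lowest_income": means["income"].idxmin(),
    "lowest_gdpp": means["gdpp"].idxmin(),
}
print(worst)
```

If all three entries point at the same label, the cluster picked out by eye in the plots is also the one the averages single out.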

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>MODEL INTERPRETATION</h2>

In [None]:
# Checking the countries under Cluster 1
country[country['KCluster_4_Label']==1]

In [None]:
# Group by analysis
k_analysis = country.groupby('KCluster_4_Label').mean(numeric_only=True)  # skip the non-numeric 'country' column
k_analysis

In [None]:
# count of countries in each cluster
k_analysis['Count']=country.groupby('KCluster_4_Label')['country'].count()
k_analysis

In [None]:
# Proportion out of 1
k_analysis['Proportion']=round(k_analysis['Count']/k_analysis['Count'].sum(),2)
k_analysis
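
The three cells above build the cluster profile step by step. As a rough sketch, the same summary can be assembled with `value_counts`, which returns both the counts and (with `normalize=True`) the proportions. The toy frame and the name `summary` are assumptions for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy data for illustration only (values are invented)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "gdpp": [400, 500, 600, 5000, 40000],
    "child_mort": [100.0, 90.0, 80.0, 20.0, 5.0],
    "KCluster_4_Label": [1, 1, 1, 0, 2],
})

labels = toy["KCluster_4_Label"]

# Per-cluster feature means; numeric_only drops the string 'country' column
summary = toy.groupby("KCluster_4_Label").mean(numeric_only=True)

# Cluster sizes and their share of all countries, aligned on the cluster label
summary["Count"] = labels.value_counts()
summary["Proportion"] = labels.value_counts(normalize=True).round(2)
print(summary)
```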

In [None]:
features = k_analysis.columns[:-2]  # drop the Count and Proportion columns added above
fig = make_subplots(rows=3, cols=3)
count = 0

for i in range(1,4):
    for j in range (1,4):
        col = features[count]
        count = count+1
        fig.add_trace(
            go.Bar(x=k_analysis.index,
                   y=k_analysis[col],
                   marker={'color': k_analysis.index,
                   'colorscale': 'oryel'},
                   text=round(k_analysis[col],2),
                   textposition = "outside",
                   ),row=i, col=j)
        fig.update_yaxes(title_text=col, row=i, col=j)

fig.update_layout(height=1500, width=1200, title_text="K-Means Cluster - Feature Mean Values")
fig.update_traces(showlegend=False)

fig.show()

<div class="alert alert-block alert-info">
    <span style='font-family:Georgia'>
        <b>Insight: </b><br>
        Based on the graphs above, Cluster 1 countries should be considered for NGO aid because the cluster has:
        <ul>
            <li>The highest child mortality</li>
            <li>The lowest income</li>
            <li>The lowest GDP</li>
            <li>The lowest health expenditure</li>
            <li>The highest inflation</li>
            <li>Comparatively low life expectancy</li>
            <li>The highest total fertility</li>
        </ul>
    </span>    
</div>

In [None]:
cluster_km=country[country['KCluster_4_Label']==1]
cluster_km.sort_values(['gdpp','income','child_mort','health','inflation','life_expec','total_fer','imports','exports'], 
                      ascending=[True,True,False,True,False,True,False,False,True]).head(5)

<div class="alert alert-block alert-success">
    <span style='font-family:Georgia'>
        <b>Outcome: </b><br>
        Based on the data above, the following 5 countries require NGO aid according to K-Means clustering:
        <ul>
            <li>Burundi</li>
            <li>Liberia</li>
            <li>Congo</li>
            <li>Niger</li>
            <li>Sierra Leone</li>
        </ul>
    </span>    
</div>
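
One caveat about the sort used to pick these five countries: `sort_values` with several keys orders rows lexicographically, so `income` and the later columns only break ties between rows with identical `gdpp`. In practice the selection is close to "the five lowest-`gdpp` countries in the cluster". If a ranking that weighs several indicators at once is wanted, one option is a composite score over standardized columns. The sketch below is an illustration under assumptions, not the method used in this notebook: the toy values, the equal weights, and the names `direction` and `need_score` are all hypothetical.

```python
import pandas as pd

# Toy data for illustration only (values are invented)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdpp": [300, 350, 400, 900],
    "income": [800, 700, 1500, 2000],
    "child_mort": [90.0, 160.0, 100.0, 60.0],
})

# +1 when a higher value signals more need, -1 when a lower value does
direction = {"gdpp": -1, "income": -1, "child_mort": 1}
cols = list(direction)

# Standardize each column so that no single unit dominates the score
z = (toy[cols] - toy[cols].mean()) / toy[cols].std()

# Equal-weight composite: larger score means greater need
toy["need_score"] = sum(direction[c] * z[c] for c in cols)

by_score = toy.sort_values("need_score", ascending=False).reset_index(drop=True)
by_gdpp = toy.sort_values(["gdpp", "income"], ascending=[True, True]).reset_index(drop=True)

print(by_score[["country", "need_score"]])
print(by_gdpp[["country", "gdpp"]])
```

In this toy example the two orderings differ at the top: the lexicographic sort puts the lowest-`gdpp` row first, while the composite score promotes the row with much higher child mortality.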

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>FINAL ANALYSIS</h2>

In [None]:
aid = cluster_km.sort_values(['gdpp','income','child_mort','health','inflation','life_expec','total_fer','imports','exports'], 
                      ascending=[True,True,False,True,False,True,False,False,True]).head(5)

In [None]:
aid

In [None]:
features = aid.columns[1:]
fig = make_subplots(rows=3, cols=3)
count = 0

for i in range(1,4):
    for j in range (1,4):
        col = features[count]
        count = count+1
        fig.add_trace(
            go.Violin(y=aid[col],
                      box_visible=True, 
                      line_color='black',
                      meanline_visible=True,
                      fillcolor='#EA4335', 
                      opacity=0.6,
                      x0=col
                     ),row=i, col=j)
        #fig.update_xaxes(title_text=col, row=i, col=j)
fig.update_layout(height=1000, width=1000, title_text="Data Distribution")
fig.update_traces(showlegend=False)

fig.show()

In [None]:
plt.figure(figsize=(18, 8))
plt.subplot(1, 3, 1)
sns.scatterplot(x='gdpp', y='child_mort', hue='country',
                data=aid, legend='full', palette="prism", s=300)
plt.subplot(1, 3, 2)
sns.scatterplot(x='gdpp', y='income', hue='country',
                data=aid, legend='full', palette="prism", s=300)
plt.subplot(1, 3, 3)
sns.scatterplot(x='income', y='child_mort', hue='country',
                data=aid, legend='full', palette="prism", s=300)
plt.show()

In [None]:
# `aid` already holds one row per country, so it is plotted directly; child mortality is on the z-axis
fig = px.scatter_3d(aid, x="gdpp", y="income", color='country', z='child_mort', hover_data=['country'],title="GDP vs Income vs Child Mortality of Selected Countries", color_discrete_sequence=["crimson", "#2AD7E7", "#FA6B16", "goldenrod", "#8E0067"], opacity=0.6)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()

In [None]:
print ("GDP Statistics of Selected Countries : ")
print ("Max GDP : ", max(aid.gdpp))
print ("Min GDP : ", min(aid.gdpp))
print ("Avg GDP : ", aid.gdpp.mean())
print('-'*50)
print ("Income Statistics of Selected Countries : ")
print ("Max Income : ", max(aid.income))
print ("Min Income : ", min(aid.income))
print ("Avg Income : ", aid.income.mean())
print('-'*50)
print ("Child Mortality Statistics of Selected Countries : ")
print ("Max Child Mortality : ", max(aid['child_mort']))
print ("Min Child Mortality : ", min(aid['child_mort']))
print ("Avg Child Mortality : ", round(aid['child_mort'].mean(),1))
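
The same maximum, minimum, and average figures can also be collected in a single table with `DataFrame.agg`. This is a minimal sketch on a made-up stand-in for `aid` (the name `toy_aid` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy stand-in for `aid`; values are invented
toy_aid = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdpp": [200, 300, 400],
    "income": [700, 800, 900],
    "child_mort": [90.0, 100.0, 110.0],
})

# One row per statistic, one column per indicator
stats = toy_aid[["gdpp", "income", "child_mort"]].agg(["max", "min", "mean"]).round(1)
print(stats)
```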

In [None]:
# Ranking of countries
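aid = aid.reset_index(drop=True)  # work on a fresh frame to avoid SettingWithCopyWarning below
aid['Rank'] = aid.index + 1       # rank 1 = first row of the sorted selection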
aid

<h2 style='text-align:center;font-size:40px;background-color:#CBFC53;border:20px;color:black'>CONCLUSION</h2>
    
<div class="alert alert-block alert-success">
    <span style='font-family:Georgia'>
        <b>Outcome: </b><br>
        The following 5 countries require NGO aid. They fall in the underdeveloped group (Cluster 1) and perform worst within it:
        <ul>
            <li>Burundi</li>
            <li>Liberia</li>
            <li>Congo</li>
            <li>Niger</li>
            <li>Sierra Leone</li>
        </ul>
    </span>   
    <br>
    <br>
    <span style='font-family:Georgia'>
        <b>Reasons for Aid: </b><br>
        <ul>
            <li>High child mortality</li>
            <li>Low income</li>
            <li>Low GDP</li>
            <li>Low health spending</li>
            <li>High inflation</li>
            <li>Low life expectancy</li>
            <li>High fertility rate (i.e. more children per woman)</li>
        </ul>
    </span>
    
</div>

<h2 style='text-align:center;font-size:30px;background-color:#63D7CF;border:20px;color:white'>MAP</h2>

In [None]:
# `aid` has one row per country, so it is passed directly; column names keep
# each country paired with its own Rank (groupby would re-sort the rows)
fig = px.choropleth(aid, locations='country',
                    color='Rank',
                    locationmode='country names',
                    hover_name='country',
                    color_continuous_scale="ylorrd_r",
                    title='Countries that require Aid')
fig.show()

<h2 style='text-align:center;font-size:40px;background-color:#5F4B8B;border:10px;color:white'>THANK YOU!</h2>