# Using ML to Allocate Funding for Development Aid

### Find here: Unsupervised Learning with K-Means Clustering, Cluster Analysis, Regression

- <a href='#intro'>1. Introduction</a>
- <a href='#2'>2. Libraries and datasets</a>
     - <a href='#21'>2.1 Import libraries and packages</a>
     - <a href='#22'>2.2 Import data</a>
- <a href='#3'>3. Data description and distribution</a>
    - <a href='#31'>3.1. Data description</a> 
    - <a href='#32'>3.2. Data distribution</a>
- <a href='#4'>4. Data evaluation and reduction</a>
    - <a href='#41'>4.1. Correlation</a>
    - <a href='#42'>4.2. Scaling</a> 
    - <a href='#43'>4.3. PCA: Principal Component Analysis</a> 
- <a href='#5'>5. Model: K-Means Clustering</a>
    - <a href='#51'>5.1. Model set up</a>
    - <a href='#52'>5.2. Optimal number of clusters: Elbow Method</a>
    - <a href='#53'>5.3. Optimal number of clusters: Silhouette Method</a>
- <a href='#6'>6. Cluster analysis</a>
    - <a href='#61'>6.1. Cluster plotting and visualisation</a>
    - <a href='#62'>6.2. Cluster characteristics</a>
    - <a href='#63'>6.3. Cluster descriptions</a>
    - <a href='#64'>6.4. Clusters and their location in the world</a>   
- <a href='#7'>7. Further analysis to complement clustering </a>
    - <a href='#71'>7.1. Dropping features with high correlation</a>   
    - <a href='#72'>7.2. Further analysis of clusters</a>  
    - <a href='#73'>7.3. Linear and Multivariate regression</a>
        - <a href='#73a'>7.3.a. Linear regression</a>
        - <a href='#73b'>7.3.b. Multivariate regression</a>
    - <a href='#74'>7.4. Further clustering of clusters</a>   
- <a href='#8'>8. Answer to the question and learnings</a>
- <a href='#9'>9. References</a>

## <a id='intro'>1. Introduction</a>

**About this Notebook**

This notebook is a deliverable from my experience being a part of the **BIPOC programme** here at Kaggle. I applied to the Kaggle BIPOC programme because I want to develop skills to create impactful use of data to help society and its members thrive and progress their quality of life. This particular case was very interesting to me because having worked in the NfP/education/D&I sectors before, I came across similar cases where this application of ML would have been *very* useful to support my stakeholders.

This is a comprehensive notebook in which I explore many alternatives at every step of the analysis. This notebook provides a range of approaches and my interpretation of them, as opposed to a clear-cut, *straight to the point* notebook with an efficient solution. This has helped me learn about more tools, practice coding, and develop my analytical thinking. I added notes about my **Findings** throughout the notebook and links to the sources that have helped towards its development. 

I'd also like to give a special thanks to my mentor Frank for his guidance in this programme and for his help in this and my other notebooks and all the learnings that came with them. 



**Background**

According to the International Monetary Fund (IMF), *development aid* is aid given by governments and other agencies to support the economic, environmental, social, and political development of developing countries.


**Problem Statement (taken from Dataset)**

HELP International have been able to raise around 10 million dollars. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. 

So, the CEO has to decide which countries are in the direst need of aid. 

Hence, the goal is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most.

### Which countries should receive funding and why?

## <a id='2'>2. Libraries and datasets</a>

### <a id='21'>2.1. Import libraries and packages</a>

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# scaling 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# kmeans clustering 
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer

# multivariate linear regression
from sklearn import linear_model

# geo data
import geopandas as gpd
from geopandas import GeoDataFrame as gdf
import plotly.express as px

### <a id='22'>2.2. Import data</a>

In [None]:
data_path = '../input/unsupervised-learning-on-country-data'

In [None]:
data = pd.read_csv(
    f'{data_path}/Country-data.csv')

## <a id='3'>3. Data description and distribution</a>

### <a id='31'>3.1. Data description</a>

**Feature Description** 

* country:      Name of the country

* child_mort:   Death of children under 5 years of age per 1000 live births

* exports:      Exports of goods and services per capita. Given as %age of the GDP per capita

* health:       Total health spending per capita. Given as %age of GDP per capita

* imports:      Imports of goods and services per capita. Given as %age of the GDP per capita

* income:       Net income per person

* inflation:    The measurement of the annual growth rate of the Total GDP

* life_expec:   The average number of years a new born child would live if the current mortality patterns are to remain the same

* total_fer:    The number of children that would be born to each woman if the current age-fertility rates remain the same

* gdpp:         The GDP per capita. Calculated as the Total GDP divided by the total population

In [None]:
# quick view of columns and values
data.head()

In [None]:
# how many columns and rows in dataframe
data.shape

In [None]:
# are there any missing values?
data.isnull().sum()

In [None]:
# are there duplicate rows? (count of rows that repeat an earlier row)
data.duplicated().sum()

In [None]:
# standard statistical measures
data.describe(percentiles = [.25, .5, .75, .90 ,.95, .99])

**Findings**

* small dataset
* no missing values
* no duplicate values
* some outliers and skewed distribution

### <a id='32'>3.2. Data distribution</a>

In [None]:
plt.figure(figsize=(12,5))
plt.title("Child Mortality: Death of children under 5 years of age per 1000 live births")
ax = sns.histplot(data["child_mort"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Exports: Exports of goods and services per capita. Given as %age of the GDP per capita")
ax = sns.histplot(data["exports"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Imports: Imports of goods and services per capita. Given as %age of the GDP per capita")
ax = sns.histplot(data["imports"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Health: Total health spending per capita. Given as %age of GDP per capita")
ax = sns.histplot(data["health"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Income: Net income per person")
ax = sns.histplot(data["income"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Inflation: The measurement of the annual growth rate of the Total GDP")
ax = sns.histplot(data["inflation"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Life expectancy: The average number of years a new born child would live if the current mortality patterns are to remain the same")
ax = sns.histplot(data["life_expec"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Total fertility: The number of children that would be born to each woman if the current age-fertility rates remain the same.")
ax = sns.histplot(data["total_fer"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("GDP: The GDP per capita. Calculated as the Total GDP divided by the total population.")
ax = sns.histplot(data["gdpp"])

**Findings**

Looking at the data distribution we can see that there are some features that do indeed have outliers.

For the purpose of this analysis, outliers will not be removed since they could be considered very informative in that they could point out countries that are in critical condition and in need of help.

For example, Child Mortality is a strong indicator of poverty and necessity, so the outliers in this feature show that there are countries with a higher than normal/critical number in child mortality.
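To make "outlier" concrete, one common convention is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles. This is an assumption for illustration only: the notebook itself judges outliers from the histograms. The sketch below uses made-up child mortality figures and a hypothetical helper `iqr_outlier_mask`.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical child mortality figures (deaths per 1000 live births), not taken from the dataset
child_mort_example = np.array([4.0, 6.0, 10.0, 15.0, 18.0, 22.0, 30.0, 160.0])
mask = iqr_outlier_mask(child_mort_example)
print(child_mort_example[mask])
```

Applied to the notebook's data, something like `data.loc[iqr_outlier_mask(data['child_mort']), 'country']` would list the flagged countries, which are exactly the ones this analysis wants to keep.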
 

## <a id='4'>4. Data evaluation and reduction</a>

### <a id='41'>4.1. Correlation</a>

In [None]:
# pearson (the text column 'country' is dropped first, as correlation is only defined for numeric columns)
plt.figure(figsize=(15,10))
sns.heatmap(data.drop(columns='country').corr(method='pearson'),annot=True)

In [None]:
# kendall
plt.figure(figsize=(15,10))
sns.heatmap(data.drop(columns='country').corr(method='kendall'),annot=True)

In [None]:
# spearman
plt.figure(figsize=(15,10))
sns.heatmap(data.drop(columns='country').corr(method='spearman'),annot=True)

**Findings** 

Are there feature(s) that we could do without due to having high correlation with another feature?

After looking at Pearson, Kendall and Spearman correlation, we can see that there are a few features that might be considered for elimination due to high correlation.

- life_expec, due to high correlation with child_mort
- total_fer, due to high correlation with child_mort
- income, due to high correlation with gdpp
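Reading candidate pairs off three heatmaps by eye is error-prone, so the same check can be sketched programmatically. The helper `high_corr_pairs`, the 0.8 threshold, and the toy values below are all assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8, method="pearson"):
    """List feature pairs whose absolute correlation is at or above `threshold`."""
    corr = df.corr(method=method).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Toy frame with made-up values: 'b' is an exact multiple of 'a', 'c' is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [3.0, -1.0, 4.0, -1.0, 5.0],
})
pairs = high_corr_pairs(toy, threshold=0.8)
print(pairs)
```

In this notebook the call would look like `high_corr_pairs(data.drop(columns='country'))`, and the `method` argument can be switched to `'kendall'` or `'spearman'` to mirror the three heatmaps.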


### <a id='42'>4.2. Scaling</a>

Why scale the data in this case? 

* the features have incomparable units (percentages, dollar values, counts)
* the ranges of the features also differ widely (one runs from roughly 0 to 200, another from 0 to 100,000), so a change of 50 is quite significant in one feature and almost unnoticeable in another
* K-Means is based on measuring distances between points, so features with larger ranges would carry more weight in the distance calculation and dominate the clustering
* by scaling we remove the bias that the model would otherwise have towards features with larger magnitudes
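The points above can be checked with a small worked example. The three "countries" and their values are made up; the standardisation line uses the population standard deviation, which is what `StandardScaler` does.

```python
import numpy as np

# Made-up values for three countries: [child_mort (per 1000), income (dollars)]
X = np.array([
    [10.0, 40000.0],   # country A
    [150.0, 39000.0],  # country B: very different child mortality, similar income
    [12.0, 30000.0],   # country C: similar child mortality, different income
])

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Unscaled: income differences swamp child mortality differences
raw_ab = euclidean(X[0], X[1])
raw_ac = euclidean(X[0], X[2])

# Standardise each column to mean 0 and standard deviation 1
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_ab = euclidean(Z[0], Z[1])
std_ac = euclidean(Z[0], Z[2])

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {std_ab:.2f}, A-C = {std_ac:.2f}")
```

Before scaling, A looks about ten times closer to B than to C, even though B's child mortality is fifteen times higher. After scaling, both differences contribute on a comparable footing.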


In [None]:
# eliminate the column that contains the country information, as only numeric values should be used in this case for unsupervised learning
dataset = data.drop(['country'], axis =1)
dataset.head()

#### Scale the data: MinMaxScaler (normalised)

In [None]:
# columns argument ==> we'll use this later to create a new dataframe with the rescaled data 
columns = dataset.columns

# the scaler to use will be 
scaler = MinMaxScaler()

# 'scaler' is for the rescaling technique, 'fit' function is to find the x_min and the x_max, 'transform' function applies formula to all elements of data
rescaled_dataset_minmax = scaler.fit_transform(dataset)
rescaled_dataset_minmax

#### Scale the data: StandardScaler (standardised)

In [None]:
# in standardisation, each feature is rescaled to have mean=0 and standard deviation=1 (the shape of its distribution does not change)
# 
# columns argument ==> we'll use this later to create a new dataframe with the rescaled data 
columns = dataset.columns

# the scaler to use will be 
scaler = StandardScaler()

# 'scaler' is for the rescaling technique, 'fit' function is to find the mean and standard deviation of each feature, 'transform' function applies formula to all elements of data
rescaled_dataset_standard = scaler.fit_transform(dataset)
rescaled_dataset_standard

#### Scaled dataframes

In [None]:
# minmax
# we need to create a new dataframe with the column labels and the rescaled values 
df_minmax = pd.DataFrame(data= rescaled_dataset_minmax , columns = columns )
df_minmax

In [None]:
# standardisation
# we need to create a new dataframe with the column labels and the rescaled values 
df_standard = pd.DataFrame(data= rescaled_dataset_standard , columns = columns)
df_standard

#### Comparing scaling methods

In [None]:
# standardised data
plt.scatter(df_standard['gdpp'], df_standard['child_mort'],color = 'black')

plt.xlabel('GDP per Person')
plt.ylabel('Child Mortality')

In [None]:
# normalised data
plt.scatter(df_minmax['gdpp'], df_minmax['child_mort'],color = 'black')

plt.xlabel('GDP per Person')
plt.ylabel('Child Mortality')

### <a id='43'>4.3. PCA: Principal Component Analysis</a>

#### PCA with data scaled with StandardScaler

In [None]:
# import PCA 
from sklearn.decomposition import PCA

# fit and transform
pca = PCA()
pca.fit(df_standard)
pca_data_standard = pca.transform(df_standard)

# percentage variation 
per_var = np.round(pca.explained_variance_ratio_*100, decimals =1)
labels = ['PC' + str(x) for x in range (1, len(per_var)+1)]

# plot the percentage of explained variance by principal component
plt.bar(x=range(1,len(per_var)+1), height=per_var, tick_label = labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()

# plot pca
pca_df_standard = pd.DataFrame(pca_data_standard, columns = labels)
plt.scatter(pca_df_standard.PC1, pca_df_standard.PC2)
plt.title('PCA')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))

#### PCA with data scaled with MinMaxScaler

In [None]:
# import PCA 
from sklearn.decomposition import PCA

# fit and transform
pca = PCA()
pca.fit(df_minmax)
pca_data_minmax = pca.transform(df_minmax)

# percentage variation 
per_var = np.round(pca.explained_variance_ratio_*100, decimals =1)
labels = ['PC' + str(x) for x in range (1, len(per_var)+1)]

# plot the percentage of explained variance by principal component
plt.bar(x=range(1,len(per_var)+1), height=per_var, tick_label = labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()

# plot pca

pca_df_minmax = pd.DataFrame(pca_data_minmax, columns = labels)
plt.scatter(pca_df_minmax.PC1, pca_df_minmax.PC2)
plt.title('PCA')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))

In [None]:
# dataframe with PC1, PC2, PC3, PC4 (from the standardised data)
data2 = pca_df_standard.drop(['PC5','PC6','PC7','PC8','PC9'], axis = 1)
data2

**Findings**

After doing PCA with both the standardised and normalised versions of the original dataset, we can see that 4 principal components explain about 90% of the variance in the original data.
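The number of components needed for a given share of variance can also be computed rather than read off the scree plot. The sketch below runs on synthetic data and uses a plain SVD in NumPy; the helper name `n_components_for` and the 90% target are assumptions for illustration.

```python
import numpy as np

def n_components_for(X, target=0.90):
    """Smallest number of principal components whose cumulative explained variance reaches `target`."""
    Xc = X - X.mean(axis=0)  # PCA works on centred data
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, target) + 1), cumulative

# Synthetic data (not the country dataset): 3 informative directions plus 3 near-constant ones
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 3.0, 0.1, 0.1, 0.1])
k, cumulative = n_components_for(X, target=0.90)
print(k, np.round(cumulative, 3))
```

With a fitted scikit-learn `PCA` object, `np.cumsum(pca.explained_variance_ratio_)` gives the same cumulative curve.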


## <a id='5'>5. Model: K-Means Clustering</a>

### <a id='51'>5.1. Model set up</a>

In [None]:
km = KMeans (
    n_clusters = 3, # number of clusters/centroids to create
    init = 'random', # 'random': choose n_clusters observations (rows) at random from the data as the initial centroids (the scikit-learn default is 'k-means++')
    n_init = 10, # number of times the k-means algorithm is run with different centroid seeds; the run with the lowest inertia is kept
    max_iter = 300, # this is the default value. This is the maximum number of iterations of the k-means algorithm for a single run.
    tol = 1e-4, # this is the default value. This is the relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
    random_state = 0 # the default is None. Determines random number generation for centroid initialization; passing an int makes the randomness deterministic.
)
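To show what these parameters control, here is a minimal NumPy sketch of the basic k-means loop (Lloyd's algorithm) with random initial centroids. It is a simplified illustration, not scikit-learn's implementation: it performs a single run (no `n_init` restarts), and it compares `tol` directly with the centroid shift, whereas scikit-learn scales the tolerance by the variance of the data. The function name `kmeans_sketch` and the test blobs are made up.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Plain Lloyd's algorithm with random initial centroids (one run only)."""
    rng = np.random.default_rng(seed)
    # init='random': pick n_clusters distinct rows as the starting centroids
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: each centroid moves to the mean of its points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        shift = np.linalg.norm(new_centroids - centroids)  # Frobenius norm of the change
        centroids = new_centroids
        if shift < tol:
            break
    # final assignment so that labels match the returned centroids
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    return labels, centroids

# Two well-separated synthetic blobs of 20 points each
data_rng = np.random.default_rng(42)
blobs = np.vstack([
    data_rng.normal(0.0, 0.5, size=(20, 2)),
    data_rng.normal(10.0, 0.5, size=(20, 2)),
])
labels, centroids = kmeans_sketch(blobs, n_clusters=2)
print(labels)
```

Because the result of a single run depends on the starting centroids, `n_init` restarts and a fixed `random_state` are what make the scikit-learn results above stable and reproducible.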

#### Run model with different versions of the dataset

In [None]:
# normalised dataset
# method to compute the clusters and assign the labels
y_predicted_minmax = km.fit_predict(df_minmax) # fit_predict --> Compute cluster centers and predict cluster index for each sample.
y_predicted_minmax

In [None]:
# standardised dataset
# method to compute the clusters and assign the labels
y_predicted_standard = km.fit_predict(df_standard) # fit_predict --> Compute cluster centers and predict cluster index for each sample.
y_predicted_standard

In [None]:
# data2 is the original dataset with standard scaling and 4 principal components found with PCA
# method to compute the clusters and assign the labels
y_predicted_data2 = km.fit_predict(data2) # fit_predict --> Compute cluster centers and predict cluster index for each sample.
y_predicted_data2

In [None]:
# add the cluster column to the dataframe 
df_minmax['cluster'] = y_predicted_minmax
df_minmax.head()

In [None]:
# add the cluster column to the dataframe 
df_standard['cluster'] = y_predicted_standard
df_standard.head()

In [None]:
# add the cluster column to the dataframe (dataset does not include feature 'country')
dataset['cluster'] = y_predicted_data2
dataset.head()

### <a id='52'>5.2. Optimal number of clusters: Elbow Method</a>

In [None]:
# calculate Sum of Squared Errors (SSE), also called distortion, for a range of numbers of clusters - with df scaled with StandardScaler
# the 'cluster' column added above is a label, not a feature, so it is dropped before fitting

sse = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(df_standard.drop(columns='cluster'))
    sse.append(km.inertia_)

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

In [None]:
# calculate Sum of Squared Errors (SSE), also called distortion, for a range of numbers of clusters - with df scaled with MinMax
# the 'cluster' column added above is a label, not a feature, so it is dropped before fitting

sse = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(df_minmax.drop(columns='cluster'))
    sse.append(km.inertia_)

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

In [None]:
# calculate Sum of Squared Errors (SSE), also called distortion, for a range of numbers of clusters - with df scaled with StandardScaler + PCA
# data2 holds the 4 principal components; 'dataset' is the unscaled data plus the cluster label
sse = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(data2)
    sse.append(km.inertia_)

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

**Findings**

After running the K-Means model with a normalised dataset, a standardised dataset, and the 4 principal components from PCA (with standardised scaling), we can see that the optimal number of clusters is 3 in each case, with different levels of inertia. Two clusters could also be considered based on the results for the dataset after PCA.
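The quantity plotted in the elbow charts (`km.inertia_`) is the sum of squared distances from each sample to the centroid of its cluster. A tiny hand-checkable sketch, using made-up one-dimensional points and a hypothetical helper `sse_for`:

```python
import numpy as np

def sse_for(X, labels, centroids):
    """Sum of squared distances from each sample to the centroid of its assigned cluster."""
    return float(np.sum((X - centroids[labels]) ** 2))

# Tiny made-up example: two obvious groups on a line
X = np.array([[0.0], [2.0], [10.0], [12.0]])

# k = 1: a single centroid at the overall mean
labels_1 = np.zeros(4, dtype=int)
centroids_1 = np.array([[6.0]])

# k = 2: one centroid per group
labels_2 = np.array([0, 0, 1, 1])
centroids_2 = np.array([[1.0], [11.0]])

sse_1 = sse_for(X, labels_1, centroids_1)  # 36 + 16 + 16 + 36 = 104
sse_2 = sse_for(X, labels_2, centroids_2)  # 1 + 1 + 1 + 1 = 4
print(sse_1, sse_2)
```

SSE can only go down as clusters are added (it reaches 0 when every point is its own cluster), so the minimum is not informative. The elbow method looks instead for the point where the drop flattens out, here between k = 2 and k = 3.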

### <a id='53'>5.3. Optimal number of clusters: Silhouette Method</a>

#### With standardised data



In [None]:
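# calculate Silhouette Score - standardised, 3 clusters
# refit first: `km` was last fitted inside the elbow loop (10 clusters, different data)
X_standard = df_standard.drop(columns='cluster')
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, tol=1e-4, random_state=0)
score = silhouette_score(X_standard, km.fit_predict(X_standard), metric='euclidean')
print('Silhouette Score: %.3f' % score)

# A value near 0 represents overlapping clusters with samples very close to the decision boundary of the neighboring clusters. 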

In [None]:
fig,ax = plt.subplots(2,2, figsize = (15,8))
for i in [2,3,4,5]:

    # create kmeans instance for different numbers of clusters
    km = KMeans(n_clusters=i, init= 'random', n_init =10, max_iter = 300, random_state = 0)
    q, mod = divmod(i,2)
    
    #create visualiser
    visualizer = SilhouetteVisualizer(km, colors = 'yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(df_standard.drop(columns='cluster'))  # fit on features only, not on the cluster label

#### With normalised data

In [None]:
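# calculate Silhouette Score - normalised, 3 clusters
# refit first: `km` was last fitted on other data with a different number of clusters
X_minmax = df_minmax.drop(columns='cluster')
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, tol=1e-4, random_state=0)
score = silhouette_score(X_minmax, km.fit_predict(X_minmax), metric='euclidean')
print('Silhouette Score: %.3f' % score)

# A value near 0 represents overlapping clusters with samples very close to the decision boundary of the neighboring clusters. 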

In [None]:
fig,ax = plt.subplots(2,2, figsize = (15,8))
for i in [2,3,4,5]:

    # create kmeans instance for different numbers of clusters
    km = KMeans(n_clusters=i, init= 'random', n_init =10, max_iter = 300, random_state = 0)
    q, mod = divmod(i,2)
    
    #create visualiser
    visualizer = SilhouetteVisualizer(km, colors = 'yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(df_minmax.drop(columns='cluster'))  # fit on features only, not on the cluster label

#### With standardised data + PCA

In [None]:
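# calculate Silhouette Score - standardised + PCA, 3 clusters
# data2 holds the 4 principal components; refit first, as `km` was last fitted on other data
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, tol=1e-4, random_state=0)
score = silhouette_score(data2, km.fit_predict(data2), metric='euclidean')
print('Silhouette Score: %.3f' % score)

# A value near 0 represents overlapping clusters with samples very close to the decision boundary of the neighboring clusters. 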

In [None]:
fig,ax = plt.subplots(2,2, figsize = (15,8))
for i in [2,3,4,5]:

    # create kmeans instance for different numbers of clusters
    km = KMeans(n_clusters=i, init= 'random', n_init =10, max_iter = 300, random_state = 0)
    q, mod = divmod(i,2)
    
    #create visualiser
    visualizer = SilhouetteVisualizer(km, colors = 'yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(data2)  # the 4 principal components, not the unscaled 'dataset'

**Findings**

Silhouette scores are very close to 0, indicating that the clusters overlap. With more clusters (5, for example), some samples have negative silhouette values, meaning that they are on average closer to a neighbouring cluster than to their own and may have been assigned to the wrong cluster.
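The per-sample silhouette value behind these plots is s = (b - a) / max(a, b), where a is the mean distance to the other members of the sample's own cluster and b is the mean distance to the nearest other cluster. A minimal sketch with made-up points, one of which is deliberately mislabelled (the helper name is hypothetical, and at least two clusters are assumed):

```python
import numpy as np

def silhouette_samples_sketch(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # singleton cluster: score left at 0 by convention
        a = dist[i, same].mean()  # mean distance within its own cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Made-up 1-D points; the third point sits next to cluster 1 but is labelled 0
X = np.array([[0.0], [1.0], [9.0], [10.0], [11.0]])
labels = np.array([0, 0, 0, 1, 1])
scores = silhouette_samples_sketch(X, labels)
print(np.round(scores, 3))
```

The mislabelled point gets a negative score because it is much closer to the other cluster than to its own. `silhouette_score` reports the mean of these per-sample values.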

## <a id='6'>6. Cluster analysis</a>

### <a id='61'>6.1. Cluster plotting and visualisation</a>

#### Visualise clusters by feature, scaled data with StandardScaler (standardisation)

In [None]:
# plot every pair of features, coloured by cluster
sns.pairplot(df_standard, hue="cluster")

# title
plt.suptitle('Pair Plot of Clusters by Feature', 
             size = 20);

#### Visualise clusters by feature, scaled data with MinMaxScaler (normalisation)

In [None]:
# plot every pair of features, coloured by cluster
sns.pairplot(df_minmax, hue="cluster")

# title
plt.suptitle('Pair Plot of Clusters by Feature', 
             size = 20);

#### Visualise clusters by feature, scaled data with StandardScaler and with reduction of features with PCA

In [None]:
# plot every pair of features, coloured by the clusters found on the PCA components
sns.pairplot(dataset, hue="cluster")

# title
plt.suptitle('Pair Plot of Clusters by Feature', 
             size = 20);

**Findings**

After running the model with 2 types of scaling and using PCA, we can see there tends to be overlapping between clusters.
Cluster 2 is more spread out and clusters 0 and 1 tend to overlap.

### <a id='62'>6.2. Cluster characteristics</a>

In [None]:
# add cluster column to original dataset with countries and non-scaled values
data['cluster'] = y_predicted_standard.tolist()
data

#### Visualise clusters by feature, original data with no scaling

In [None]:
# plot every pair of features, coloured by cluster
sns.pairplot(data, hue="cluster")

# title
plt.suptitle('Pair Plot of Clusters by Feature', 
             size = 20);

### <a id='63'>6.3. Cluster descriptions</a>

In [None]:
# table of clusters showing mean values per cluster and per feature (numeric columns only, so 'country' is skipped)
clusters_table = data.groupby('cluster').mean(numeric_only=True)
clusters_table

In [None]:
# cluster 0 
cluster_0 = data.loc[data['cluster'] == 0]

# list of countries in this cluster
cluster_0.country.unique()

**Cluster 0: This cluster is characterised by showing average values for all features when comparing with other clusters**

- child mortality,    avg
- exports,            avg
- gdpp,               avg
- health,             same as cluster 1
- imports,            avg
- income,             avg
- inflation,          avg
- life_expec,         +70 years
- total_fer,          avg, 2 children per woman (number of children that would be born to each woman if the current age-fertility rates remain the same)

In [None]:
# cluster 1 
cluster_1 = data.loc[data['cluster'] == 1]

# list of countries in this cluster
cluster_1.country.unique()

**Cluster 1: This cluster is characterised by having the most negative values: high child mortality, lowest economic development, low gdpp, exports and imports, lowest life expectancy**

- child mortality, highest
- exports, lowest
- gdpp, lowest
- health, same as cluster 0
- imports, lowest
- income, significantly lower than other clusters
- inflation, highest
- life_expec, +50 years
- total_fer, highest, 5 children per woman (number of children that would be born to each woman if the current age-fertility rates remain the same)

In [None]:
# cluster 2 
cluster_2 = data.loc[data['cluster'] == 2]

# list of countries in this cluster
cluster_2.country.unique()

**Cluster 2: This cluster is characterised by showing really strong or positive values such as good economic development, high life expectancy, low child mortality**


- child mortality, lowest
- exports,  highest
- gdpp, highest by a lot
- health, higher than both other clusters
- imports, highest
- income, significantly higher than other clusters
- inflation, lowest
- life_expec, +80 years
- total_fer, lowest age-fertility rate, 1 child per woman (number of children that would be born to each woman if the current age-fertility rates remain the same)


### <a id='64'>6.4. Clusters and their location in the world</a>

In [None]:
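# load example world map data bundled with geopandas
# note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0, so this call needs an older geopandas version
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head()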

In [None]:
# look at country names from gpd list
print(sorted(world['name'].unique()))

# look at country names for analysis list
print(sorted(data['country'].unique()))

In [None]:
# compare the 2 lists to identify country names that need to be adjusted
# (the lists are built directly from the dataframes rather than pasted from the printed output above)
world_list = sorted(world['name'].unique())
data_list = sorted(data['country'].unique())

# names used in the geopandas data that do not appear in the analysis data
list_difference = [item for item in world_list if item not in data_list]

print(list_difference)

In [None]:
# update names in the world geopandas (gpd) dataset to match the names used in data
world['name'] = world['name'].replace(
    ['Bosnia and Herz.', 'Central African Rep.', 'Congo', "Côte d'Ivoire", 'Dem. Rep. Congo', 'Dominican Rep.', 'Eq. Guinea', 'Macedonia', 'Myanmar', 'N. Cyprus', 'S. Sudan', 'Slovakia', 'Solomon Is.', 'United States of America'],
    ['Bosnia and Herzegovina','Central African Republic','Congo, Rep.',"Cote d'Ivoire",'Congo, Dem. Rep.','Dominican Republic','Equatorial Guinea','Macedonia, FYR','Myanmar','Cyprus','Sudan','Slovak Republic','Solomon Islands','United States'])

# check output
world.name.unique()

In [None]:
# change column name
world_copy = world.copy()
world_copy.rename(columns = {'name' : 'country'}, inplace = True)
world_copy.head()

# merge the cluster data with the geodataframe on country name (an inner merge keeps only names found in both)
world_data = pd.merge(
        data,
        world_copy,
        on='country',
        how= 'inner'
)

# convert df into geodf
world_data = gdf(world_data)

# plot 
import geoplot
import mapclassify
cluster = world_data['cluster']


# Note: this code sample requires geoplot>=0.4.0.
geoplot.choropleth(
    world_data, 
    hue=cluster,
    cmap='Greys', 
    figsize=(16, 8),
    legend = True
)

**Findings**

* Countries in Cluster 2 (characterised by showing really strong or positive values such as good economic development, high life expectancy, low child mortality) are located in North America, Europe, Oceania and a couple in Asia. 
* Countries in Cluster 1 (characterised by having the most negative values: high child mortality, lowest economic development) are located across Africa and Asia.
* Countries in Cluster 0 (characterised by showing average values for all features when comparing with other clusters) are located across South America, parts of Africa, Europe and Asia.

Blank spaces (like Mexico) are countries with no available data: the inner merge keeps only countries whose names appear in both tables.
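A blank area on the map can mean either that the country is missing from the dataset or that its name is spelled differently in the two tables. One way to tell these apart is a left merge with pandas' `indicator=True` option, sketched here on toy tables with made-up cluster values:

```python
import pandas as pd

# Toy stand-ins for the two tables (names chosen to mimic the mismatch, values made up)
data_toy = pd.DataFrame({
    "country": ["Czech Republic", "Kenya", "United States"],
    "cluster": [0, 1, 2],
})
world_toy = pd.DataFrame({
    "country": ["Czechia", "Kenya", "United States", "Mexico"],
})

# A left merge from the map side keeps every map country;
# the `_merge` column says whether a match was found in the data
check = world_toy.merge(data_toy, on="country", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

In this toy case "Czechia" is a naming mismatch that a rename would fix, while "Mexico" is simply absent from the data. Running the same check on `world_copy` and `data` would show which case applies to each blank country.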

## <a id='7'>7. Further analysis to complement clustering </a>




We've evaluated the results of the clustering by: 

  a) plotting the relationship of features by cluster in Cluster plotting

  b) comparing average values of each feature in Cluster characteristics


Based on an initial assessment of the average values of each cluster, *Cluster 1* could be the focus for further analysis. However, when we plot the clusters and look at the graphs, we see that some clusters overlap and others are spread out.

Utilising PCA as an alternative did not result in a significant difference.

We've been able to identify some patterns in the data and group countries into 3 clusters. However, we should not rely solely on this result to recommend which countries should receive funding. There are a few alternatives to explore before we can make this recommendation. 

The clustering model in this case did not reveal patterns that we might not have found otherwise; in a way, it only confirmed general knowledge or intuition about this topic. The clustering can be considered a preprocessing step, and further analysis is required. Here are some alternatives to explore:


### <a id='71'>7.1. Dropping features with high correlation</a>



In [None]:
# df without these features 
dataset_reduced = data.drop(['country','life_expec','total_fer','income'], axis =1)
dataset_reduced.head()

In [None]:
# scale with standard scaling
columns = dataset_reduced.columns

# the scaler to use will be StandardScaler
scaler = StandardScaler()

rescaled_dataset_reduced = scaler.fit_transform(dataset_reduced)
rescaled_dataset_reduced

In [None]:
# standardisation
# we need to create a new dataframe with the column labels and the rescaled values 
df_reduced = pd.DataFrame(data= rescaled_dataset_reduced , columns = columns)
df_reduced


In [None]:
# run the model with the standardised reduced dataset
# method to compute the clusters and assign the labels
y_predicted_reduced = km.fit_predict(df_reduced) 
y_predicted_reduced

In [None]:
# add the cluster column to the dataframe 
df_reduced['cluster'] = y_predicted_reduced
df_reduced.head()

In [None]:
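# calculate Sum of Squared Errors (SSE), also called distortions, for a range of number of clusters - with the reduced df scaled with StandardScaler
# the 'cluster' label column added above is excluded so that it does not influence the distances
features_reduced = df_reduced.drop(columns='cluster', errors='ignore')
sse = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(features_reduced)
    sse.append(km.inertia_)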

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

**Findings** 

Dropping the features identified earlier in this notebook as highly correlated has resulted in 2 clusters with high inertia. There are no significant changes compared to what we found in previous steps. 
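
One way to make the choice of features to drop reproducible is to list every feature pair whose absolute Pearson correlation exceeds a threshold. The sketch below is illustrative only: the `high_corr_pairs` helper, the 0.8 threshold and the toy values are assumptions, not part of the analysis above.

```python
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| above the threshold."""
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# toy data (made-up values): 'b' is an exact linear function of 'a', 'c' is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Applied to the real dataframe, each reported pair is a candidate for keeping only one of its two features; which one to keep remains a judgment call.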

### <a id='72'>7.2. Further analysis of clusters</a>

Because we've decided to analyse clusters 0 and 1 further in 7.4. Further clustering of clusters, we'll dedicate this step to cluster 2, to understand how it is composed and whether any countries in this cluster should be reconsidered for the final funding recommendation. 

In [None]:
# df cluster 2 
df3 = data[data.cluster == 2]

# pair plot of the features for the countries in cluster 2
sns.pairplot(df3, hue='cluster')

# title
plt.suptitle('Cluster 2 by Feature', 
             size = 20);

**Findings** 

The outliers found in this cluster are generally on the positive side and distant from the values seen in clusters 0 and 1. Countries in this cluster are not going to be considered for funding. 

### <a id='73'>7.3. Linear and Multivariate regression</a>

We identified 2 or 3 clusters that overlap heavily, so we need further analysis that can help us answer the question at hand with more supporting evidence. 

For this we'll incorporate a new feature called **Multidimensional Poverty Index (MPI)** from the [Multidimensional Poverty Measures](http://www.kaggle.com/ophi/mpi) dataset. 

*The global Multidimensional Poverty Index (MPI) is an international measure of acute multidimensional poverty covering over 100 developing countries. 
It complements traditional monetary poverty measures by capturing the acute deprivations in health, education, and living standards that a person faces simultaneously. Read more about the MPI [here](http://ophi.org.uk/multidimensional-poverty-index/).*

The MPI will act as the Dependent Variable (DV) and the data we have already been working on will act as the Independent Variables (IVs). We'll use Multiple Linear Regression as a way to quantify the relationship between several IVs and a DV. 
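
Multiple linear regression models the DV as an intercept plus a weighted sum of the IVs, with the weights chosen to minimise the squared error. As a minimal sketch of that idea (using NumPy's least-squares solver on made-up, noise-free toy data rather than the notebook's dataframes), the fitted coefficients recover the ones used to generate the data:

```python
import numpy as np

# toy data (made-up): y = 1.0 + 2.0 * x1 - 0.5 * x2, with no noise
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 4.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

# add a column of ones so that the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# ordinary least squares: minimise the sum of squared residuals
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, b1, b2 = coef
print(np.round(coef, 3))
```

Each coefficient reads as the expected change in the DV for a one-unit change in that IV while the other IVs are held fixed.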

In [None]:
# import data
mpi_data = pd.read_csv("/kaggle/input/mpi/MPI_national.csv")
mpi_data

**Feature Description** 

* ISO: Unique ID for country
* Country: country name
* MPI Urban: Multi-dimensional poverty index for urban areas within the country
* Headcount Ratio Urban: Poverty headcount ratio (% of population listed as poor) within urban areas within the country
* Intensity of Deprivation Urban: Average distance below the poverty line of those listed as poor in urban areas
* MPI Rural: Multi-dimensional poverty index for rural areas within the country
* Headcount Ratio Rural: Poverty headcount ratio (% of population listed as poor) within rural areas within the country
* Intensity of Deprivation Rural: Average distance below the poverty line of those listed as poor in rural areas


For the purpose of this analysis we will focus on **MPI Urban** and **MPI Rural**. This is because the MPI measure reflects both:

a) the incidence of poverty (the percentage of the population who are poor) and, 

b) the intensity of poverty (the percentage of deprivations suffered by each person or household on average). The MPI value (also written M0) is calculated by multiplying the incidence (H) by the intensity (A): M0 = H x A.
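
As a worked example of the M0 = H x A formula, the sketch below uses a hypothetical country and a hypothetical `mpi` helper; both inputs are assumed to be fractions between 0 and 1 rather than percentages.

```python
def mpi(headcount_ratio, intensity):
    """M0 = H x A, with both inputs expressed as fractions between 0 and 1."""
    return headcount_ratio * intensity


# hypothetical country: 40% of people are poor (H = 0.40) and, on average,
# they are deprived in 50% of the weighted indicators (A = 0.50)
m0 = mpi(0.40, 0.50)
print(round(m0, 3))  # 0.2
```

Two countries with the same headcount ratio can therefore have different MPI values if poverty is more intense in one of them.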

In [None]:
# drop columns
mpi_data_short =  mpi_data.drop(['ISO','Headcount Ratio Urban','Intensity of Deprivation Urban','Headcount Ratio Rural','Intensity of Deprivation Rural'], axis=1)

# rename column
mpi_data_short.rename(
    columns = {'Country':'country',
               'MPI Urban':'mpi_urban',
               'MPI Rural':'mpi_rural'
              },
    inplace = True)

# head
mpi_data_short.head(3)

In [None]:
plt.figure(figsize=(12,5))
plt.title('MPI Urban')
ax = sns.histplot(mpi_data_short['mpi_urban'])

In [None]:
plt.figure(figsize=(12,5))
plt.title('MPI Rural')
ax = sns.histplot(mpi_data_short['mpi_rural'])

In [None]:
# merge the data df with the mpi_data df on country (the inner join keeps only countries present in both)
combined = pd.merge(
    data,
    mpi_data_short,
    on='country',
    how='inner'
)

# check
combined.head()
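
An inner join silently drops any country whose name is spelled differently in the two sources. A quick way to see what would be lost is an outer merge with `indicator=True`, sketched here on made-up placeholder frames (the names and values are not from the real datasets):

```python
import pandas as pd

# toy frames: the third country is spelled differently in each source
left = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "child_mort": [10.0, 20.0, 30.0],
})
right = pd.DataFrame({
    "country": ["Country A", "Country B", "Republic of C"],
    "mpi_urban": [0.1, 0.2, 0.3],
})

# the '_merge' column records whether a row came from the left, the right, or both
check = pd.merge(left, right, on="country", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["country", "_merge"]]
print(unmatched)
```

Rows marked `left_only` or `right_only` are candidates for a manual name fix before the inner merge.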

In [None]:
# pearson (numeric columns only, as 'country' holds text)
plt.figure(figsize=(15,10))
sns.heatmap(combined.select_dtypes('number').corr(method='pearson', min_periods=1),annot=True)

**Findings**

MPI urban and MPI rural are highly correlated, so we will use MPI urban as the DV.
The features most highly correlated with MPI urban are child_mort, income, life_expec, total_fer and gdpp.
We'll keep child_mort and leave out life_expec and total_fer, as the three are highly correlated with one another and could introduce *multicollinearity*. 
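
Multicollinearity can also be quantified with the variance inflation factor (VIF): each predictor is regressed on the others, and VIF = 1 / (1 - R²) of that regression. The sketch below is a NumPy-only illustration on synthetic data; the `vif` helper is hypothetical, and the common rule of thumb that values above roughly 5 to 10 signal a problem is a convention, not a hard limit.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        # regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# synthetic data: x2 is almost a copy of x1, x3 is independent of both
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two near-duplicate columns get large VIFs while the independent one stays close to 1.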


### <a id='73a'>7.3.a. Linear regression</a>

We'll run a simple linear regression model for each feature (IV) from the original dataset, one at a time, and use MPI urban as the DV.
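
To compare the one-predictor fits at a glance, the R-squared of each can be collected in a loop. This is a minimal sketch on made-up toy data, using `np.polyfit` instead of the statsmodels `ols` calls used in the cells that follow; `simple_r2` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def simple_r2(x, y):
    """R-squared of a one-predictor least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()


# toy data (made-up values): 'x_strong' tracks the target closely, 'x_weak' does not
toy = pd.DataFrame({
    "x_strong": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x_weak": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
    "target": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
})

# one simple regression per feature, ranked by R-squared
ranking = pd.Series({
    col: simple_r2(toy[col].to_numpy(), toy["target"].to_numpy())
    for col in ["x_strong", "x_weak"]
}).sort_values(ascending=False)
print(ranking.round(3))
```

A ranking like this only summarises fit quality; the residual plots are still needed to judge whether the linear model is appropriate.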

In [None]:
# create linear regression class object 
reg = linear_model.LinearRegression()

# libraries for plotting of residual plots
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
#fit simple linear regression model
model = ols('mpi_urban ~ child_mort', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'child_mort', fig=fig)

In [None]:
# fit simple linear regression model
model = ols('mpi_urban ~ exports', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'exports', fig=fig)

In [None]:
# fit simple linear regression model
model = ols('mpi_urban ~ health', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'health', fig=fig)

In [None]:
# fit simple linear regression model
model = ols('mpi_urban ~ imports', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'imports', fig=fig)

In [None]:
#fit simple linear regression model
model = ols('mpi_urban ~ income', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'income', fig=fig)

In [None]:
#fit simple linear regression model
model = ols('mpi_urban ~ inflation', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'inflation', fig=fig)

In [None]:
#fit simple linear regression model
model = ols('mpi_urban ~ life_expec', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'life_expec', fig=fig)

In [None]:
#fit simple linear regression model
model = ols('mpi_urban ~ total_fer', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'total_fer', fig=fig)

In [None]:
#fit simple linear regression model
model = ols('mpi_urban ~ gdpp', data=combined).fit()

#view model summary
print(model.summary())

#define figure size
fig = plt.figure(figsize=(12,8))

#produce regression plots
fig = sm.graphics.plot_regress_exog(model, 'gdpp', fig=fig)

**Findings**

* Multicollinearity: income and GDP per person (gdpp) are both strongly correlated with MPI urban and with each other, so using them together as predictors risks multicollinearity.

* Heteroscedasticity: based on the 'Residuals vs Feature' plots, these are the features that might show heteroscedasticity: child_mort, income, total_fer, life_expec (the residuals form a "cone" shape as opposed to being randomly scattered).
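
The visual "cone" check can be backed by a crude number: fit the line, sort the residuals by the feature, and compare their variance in the upper half with the lower half. The sketch below is loosely inspired by the Goldfeld-Quandt idea but is not the formal test; `residual_spread_ratio` and the synthetic data are assumptions for illustration.

```python
import numpy as np


def residual_spread_ratio(x, y):
    """Residual variance in the upper half of x divided by that in the lower half."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    order = np.argsort(x)
    half = len(x) // 2
    low = residuals[order[:half]]
    high = residuals[order[-half:]]
    return high.var() / low.var()


rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
y_const = 2.0 * x + rng.normal(scale=1.0, size=x.size)      # constant noise
y_cone = 2.0 * x + rng.normal(scale=1.0, size=x.size) * x   # noise grows with x
ratio_const = residual_spread_ratio(x, y_const)
ratio_cone = residual_spread_ratio(x, y_cone)
print(round(ratio_const, 2), round(ratio_cone, 2))
```

A ratio far from 1 suggests that the residual spread changes with the feature, which is what the cone shape shows visually.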




### <a id='73b'>7.3.b. Multivariate regression</a>

#### With all features of the original dataset

In [None]:
# train model
features_all = ['child_mort','exports','health','imports','income','inflation','life_expec','total_fer','gdpp']
reg.fit(combined[features_all], combined.mpi_urban)

In [None]:
# accuracy assessment
# R-squared: indicates the proportion of variance in y (mpi_urban), explained by x (other features selected)
reg.score(combined[features_all], combined.mpi_urban)

In [None]:
# accuracy assessment
# adjusted R-squared: The adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a regression model.
n_obs = len(combined.mpi_urban)
n_features = len(features_all)
1 - (1 - reg.score(combined[features_all], combined.mpi_urban)) * (n_obs - 1) / (n_obs - n_features - 1)

#### Without features with multicollinearity and heteroscedasticity

In [None]:
# train model
features_filtered = ['exports','health','imports','inflation','gdpp']
reg.fit(combined[features_filtered], combined.mpi_urban)

In [None]:
# accuracy assessment
# adjusted R-squared: The adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a regression model.
n_obs = len(combined.mpi_urban)
n_features = len(features_filtered)
1 - (1 - reg.score(combined[features_filtered], combined.mpi_urban)) * (n_obs - 1) / (n_obs - n_features - 1)

#### With features with highest R-squared value found on linear regression

In [None]:
# train model
features_top = ['child_mort','total_fer']
reg.fit(combined[features_top], combined.mpi_urban)

# accuracy assessment
# adjusted R-squared: The adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a regression model.
n_obs = len(combined.mpi_urban)
n_features = len(features_top)
1 - (1 - reg.score(combined[features_top], combined.mpi_urban)) * (n_obs - 1) / (n_obs - n_features - 1)

**Findings**

* With all features of the original dataset: the adjusted R-squared value is >80%, which is considered a good result, but the model includes features that show multicollinearity and heteroscedasticity, so it might not be well specified. Every time you add a predictor to a model, the plain R-squared increases (or at least never decreases), even if due to chance alone. Consequently, a model with more terms may appear to have a better fit simply because it has more terms, but this does not necessarily mean it is the best selection of features.

* Without features with multicollinearity and heteroscedasticity: the adjusted R-squared value significantly decreases to 39% if we only select the features that did not have multicollinearity and heteroscedasticity when running simple linear regression with them. 

* With features with highest R-squared value found on simple linear regression: the adjusted R-squared value is 79%. Based on the initial correlation analysis of these features, they have a high positive correlation.
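
The adjustment used in the cells above can be wrapped in a small helper to show how the penalty grows with the number of predictors. This is a sketch with hypothetical numbers (the `adjusted_r2` name and the inputs are assumptions, not results from the notebook):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# hypothetical case: the same raw R-squared reached with 2 and with 9 predictors
adj_small = adjusted_r2(0.80, n_samples=50, n_features=2)
adj_large = adjusted_r2(0.80, n_samples=50, n_features=9)
print(round(adj_small, 3), round(adj_large, 3))  # 0.791 0.755
```

For the same raw R-squared, the model with more predictors receives the lower adjusted score, which is why the adjusted value is the one compared across the three feature sets.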

### <a id='74'>7.4. Further clustering of clusters</a>

We can use the findings so far to narrow down our features and run a new clustering model.

We'll combine the countries in clusters 0 and 1 into a new dataset. The features that we will use to cluster this new dataset are:

* child_mort: child mortality is a strong indicator of need for development aid
* gdpp: to include a monetary measure more related to traditional measures of poverty/development 
* MPI urban: captures not only the proportion of the population in poverty but also the intensity of these deprivations

In [None]:
mpi_data_short.head()

In [None]:
# combine clusters 0 and 1 (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
sub_cluster = pd.concat([cluster_0, cluster_1], ignore_index = True)
sub_cluster.head()


In [None]:
# append df with cluster 0 and 1 with mpi_data df
sub_cluster_mpi = pd.merge(
    sub_cluster,
    mpi_data_short,
    on='country',
    how='inner'
)

sub_cluster_mpi.head()

In [None]:
# keep only the three clustering features (child_mort, gdpp, mpi_urban): country, cluster and the remaining columns are dropped, as only the selected numeric values should be used for unsupervised learning
sub_cluster_data = sub_cluster_mpi.drop(['country','cluster','exports','health','imports','income','inflation','life_expec','total_fer','mpi_rural'], axis =1)


# columns argument ==> we'll use this later to create a new dataframe with the rescaled data 
columns = sub_cluster_data.columns

# the scaler to use will be StandardScaler
scaler = StandardScaler()

# 'fit' computes the mean and standard deviation of each column, 'transform' then standardises every value as (x - mean) / std
rescaled_sub_cluster_data = scaler.fit_transform(sub_cluster_data)
rescaled_sub_cluster_data

# we need to create a new dataframe with the column labels and the rescaled values 
df_sub_cluster = pd.DataFrame(data= rescaled_sub_cluster_data , columns = columns)
df_sub_cluster
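
Standard scaling is the z-score z = (x - mean) / std computed per column, where StandardScaler uses the population standard deviation (ddof = 0). The NumPy-only sketch below reproduces the transformation on made-up values; the array and its column meanings are illustrative assumptions.

```python
import numpy as np

# toy matrix (made-up values): rows are countries, columns are features on very different scales
X = np.array([
    [90.0, 500.0, 0.30],
    [60.0, 1500.0, 0.15],
    [30.0, 4000.0, 0.05],
    [20.0, 9000.0, 0.02],
])

# per-column mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=0)

# z-score: every column ends up with mean 0 and standard deviation 1
Z = (X - mean) / std
print(np.round(Z, 2))
```

Without this step the feature with the largest raw scale (here the second column) would dominate the Euclidean distances that k-means relies on.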

In [None]:
km2 = KMeans(
    n_clusters = 3, # number of clusters/centroids to create
    init = 'random', # 'random': choose n_clusters observations (rows) at random from data for the initial centroids (the default is 'k-means++')
    n_init = 10, # the number of times the k-means algorithm will be run with different centroid seeds (10 was the default in older scikit-learn releases; newer ones default to 'auto')
    max_iter = 300, # this is the default value. This is the maximum number of iterations of the k-means algorithm for a single run.
    tol = 1e-4, # this is the default value. This is the relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
    random_state = 0 # not the default (which is None). Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
)

In [None]:
# method to compute the clusters and assign the labels
y_predicted_sub = km2.fit_predict(df_sub_cluster)
y_predicted_sub

In [None]:
# add the cluster column to the dataframe 
df_sub_cluster['sub_clusters'] = y_predicted_sub
df_sub_cluster.head()

In [None]:
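# calculate Sum of Squared Errors (SSE), also called distortions, for a range of number of clusters - with df scaled with StandardScaler
# the 'sub_clusters' label column added above is excluded so that it does not influence the distances
features_sub = df_sub_cluster.drop(columns='sub_clusters', errors='ignore')

sse = []
for i in range(1, 11):
    km2 = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km2.fit(features_sub)
    sse.append(km2.inertia_)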

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()
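
Because the elbow can be hard to read by eye, the silhouette score from section 5.3 offers a cross-check for the number of sub-clusters. The sketch below runs on synthetic, well-separated blobs rather than the notebook's data, so it only illustrates the mechanics; the centres, spread and range of k are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: three tight, well-separated groups of 30 points each
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# mean silhouette score for each candidate number of clusters (needs k >= 2)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real, overlapping data the scores are usually much closer together, so the result is best read alongside the elbow plot rather than instead of it.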

In [None]:
# bring dataset with real values
# add cluster column to original dataset with countries and non-scaled values
sub_cluster_mpi['sub_clusters'] = y_predicted_sub.tolist()
sub_cluster_mpi

In [None]:
# narrow the dataset with real (non-scaled) values down to country, the three clustering features and the sub-cluster label
sub_cluster_narrow = sub_cluster_mpi.drop(['exports','health','cluster','imports','income','inflation','life_expec','total_fer','mpi_rural'], axis =1)
sub_cluster_narrow.head()

In [None]:
# pair plot of the clustering features, coloured by sub-cluster
sns.pairplot(sub_cluster_narrow, hue="sub_clusters")

# title
plt.suptitle('Pair Plot of Sub Clusters by Feature', 
             size = 20);

In [None]:
# change column name
world_copy = world.copy()
world_copy.rename(columns = {'name' : 'country'}, inplace = True)
world_copy.head()

# append geodataframe data with data_combined data
world_data = pd.merge(
        sub_cluster_narrow,
        world_copy,
        on='country',
        how= 'inner'
)

# convert df into geodf
world_data = gdf(world_data)

# plot 
import geoplot
import mapclassify
clusters = world_data['sub_clusters']


# Note: this code sample requires geoplot>=0.4.0.
geoplot.choropleth(
    world_data, 
    hue=clusters,
    cmap='Greys', 
    figsize=(12, 6),
    legend = True
)

In [None]:
# sub_cluster 0 
sub_cluster_0 = sub_cluster_narrow.loc[sub_cluster_narrow['sub_clusters'] == 0]

# list of countries in this sub-cluster
sub_cluster_0.country.unique()

#### Look for additional countries from other clusters

In [None]:
# Declaring the figure size as (width, height)
plt.figure(figsize=[14, 10])

# Passing the parameters to the barh function, which creates the horizontal bar plot
# ('bar' would create a vertical one)
plt.barh(sub_cluster_0['country'], sub_cluster_0['mpi_urban'], label = "MPI urban", color = 'r')
# Creating the legend of the bars in the plot
plt.legend()
# Naming the x and y axes
plt.xlabel('MPI')
plt.ylabel('Countries')
# Giving the title for the plot
plt.title('MPI by Country')

# Displaying the bar plot
plt.show()

In [None]:
# Declaring the figure size as (width, height)
plt.figure(figsize=[14, 10])

# Passing the parameters to the barh function, which creates the horizontal bar plot
# ('bar' would create a vertical one)
plt.barh(sub_cluster_0['country'], sub_cluster_0['child_mort'], label = "Child Mortality", color = 'g')
# Creating the legend of the bars in the plot
plt.legend()
# Naming the x and y axes
plt.xlabel('Child Mortality')
plt.ylabel('Countries')
# Giving the title for the plot
plt.title('Child Mortality by Country')

# Displaying the bar plot
plt.show()

In [None]:
# sub_cluster 1 
sub_cluster_1 = sub_cluster_narrow.loc[sub_cluster_narrow['sub_clusters'] == 1]

# list of countries in this sub-cluster
sub_cluster_1.country.unique()

In [None]:
# Declaring the figure size as (width, height)
plt.figure(figsize=[14, 10])

# Passing the parameters to the barh function, which creates the horizontal bar plot
# ('bar' would create a vertical one)
plt.barh(sub_cluster_1['country'], sub_cluster_1['mpi_urban'], label = "MPI urban", color = 'r')
# Creating the legend of the bars in the plot
plt.legend()
# Naming the x and y axes
plt.xlabel('MPI')
plt.ylabel('Countries')
# Giving the title for the plot
plt.title('MPI by Country')

# Displaying the bar plot
plt.show()

In [None]:
# Declaring the figure size as (width, height)
plt.figure(figsize=[14, 10])

# Passing the parameters to the barh function, which creates the horizontal bar plot
# ('bar' would create a vertical one)
plt.barh(sub_cluster_1['country'], sub_cluster_1['child_mort'], label = "Child Mortality", color = 'g')
# Creating the legend of the bars in the plot
plt.legend()
# Naming the x and y axes
plt.xlabel('Child Mortality')
plt.ylabel('Countries')
# Giving the title for the plot
plt.title('Child Mortality by Country')

# Displaying the bar plot
plt.show()

In [None]:
# sub_cluster 2 
sub_cluster_2 = sub_cluster_narrow.loc[sub_cluster_narrow['sub_clusters'] == 2]

# list of countries in this sub-cluster
sub_cluster_2.country.unique()

**Findings**

After completing the clustering-of-clusters exercise, we can see that the countries in sub_cluster 2 have the most critical values of child_mort, MPI and gdpp.
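
The bar charts above are read by eye to spot countries in the other sub-clusters that look as critical as those in sub_cluster 2. One possible way to make that step explicit is a threshold rule, sketched below on made-up placeholder data; the rule itself (reaching the lowest value seen inside the critical sub-cluster) is an assumption and not the method used in this notebook.

```python
import pandas as pd

# toy table (made-up values and placeholder country names)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "mpi_urban": [0.30, 0.25, 0.05, 0.28, 0.04],
    "child_mort": [110.0, 95.0, 40.0, 60.0, 100.0],
    "sub_clusters": [2, 2, 0, 1, 0],
})

critical = toy[toy["sub_clusters"] == 2]
others = toy[toy["sub_clusters"] != 2]

# flag countries outside the critical sub-cluster whose value reaches
# the lowest value observed inside it, one metric at a time
extra_mpi = others.loc[others["mpi_urban"] >= critical["mpi_urban"].min(), "country"].tolist()
extra_child = others.loc[others["child_mort"] >= critical["child_mort"].min(), "country"].tolist()
print(extra_mpi, extra_child)
```

A stricter or looser threshold (for example the median of the critical sub-cluster) would change which countries are flagged, so the rule should be stated alongside the recommendation.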


## <a id='8'>8. Answer to the question and learnings</a>

**Answer to the question**

Recommended countries to allocate funding for development aid: 

* Countries in sub-cluster 2, with the most critical results based on MPI, child mortality and GDP per person: Benin, Burkina Faso, Burundi, Central African Republic, Chad, Cote d'Ivoire, Gambia, Guinea, Guinea-Bissau, Haiti, Mali, Mozambique, Niger, Nigeria, Sierra Leone.

* Countries in sub-clusters 0 and 1 with critical results for MPI (Liberia and Timor-Leste) and for child mortality (Cameroon).


**Learnings**

* The clustering method alone was not sufficient to provide a final recommendation; however, it did help guide further analysis and explore the data in more detail. 

* Further analysis could add more features related to the context and constraints that the recommended countries might be facing, or to systemic challenges that could hinder the value of funding. Issues like corruption, political or civil-society crises, natural disasters and other risks could expand this analysis and help develop more suitable funding criteria that reflect a country's current context beyond these macro indicators.


## <a id='9'>9. References</a>

**Libraries and Code**

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://seaborn.pydata.org/generated/seaborn.load_dataset.html

https://seaborn.pydata.org/generated/seaborn.pairplot.html

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

PCA in Python:  https://www.youtube.com/watch?v=Lsue2gEM9D0 , https://www.youtube.com/watch?v=SBYdqlLgbGk

https://geopandas.org/docs/user_guide/mapping.html

**PCA**

https://builtin.com/data-science/step-step-explanation-principal-component-analysis

https://online.stat.psu.edu/stat505/lesson/11/11.4


**Similar Cases**

https://upzoning.berkeley.edu/download/Classifying_Neighborhoods_Methodology.pdf


**K-Means Model**

https://www.youtube.com/watch?v=EItlUEPCIzM&list=LL&index=1

https://towardsdatascience.com/k-means-clustering-with-scikit-learn-6b47a369a83c

https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7

https://developer.squareup.com/blog/so-you-have-some-clusters-now-what/


**Visualisations**

https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166

https://towardsdatascience.com/mastering-the-bar-plot-in-python-4c987b459053


**Silhouette Score**

https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html



**Linear Regression**

https://www.statology.org/adjusted-r-squared-in-python/

https://statisticsbyjim.com/regression/interpret-r-squared-regression/

https://www.statology.org/residual-plot-python/

https://www.statology.org/heteroscedasticity-regression/

https://towardsdatascience.com/how-do-you-check-the-quality-of-your-regression-model-in-python-fa61759ff685








