# Socioeconomic Country Clustering

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during disasters and natural calamities. It regularly runs operational projects, along with advocacy drives to raise awareness and funds.

After its recent funding programmes, the NGO has raised around $10 million. The CEO now needs to decide how to use this money strategically and effectively. The main difficulty in making this decision is **choosing the countries that are in the direst need of aid.**

And this is where you come in as a data analyst. Your job is to **categorise the countries** using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most.

<img src="../images/countries.png" style="width: 500px"/>

## Data preparation

For each country, the following attributes are available: 
* country: name of the country
* child_mort: Deaths of children under 5 years of age per 1000 live births
* exports: Exports of goods and services.* 
* health: Total health spending.* 
* imports: Imports of goods and services.*  
* income: Net income per person
* inflation: The measurement of the annual growth rate of the total GDP.
* life_expec: The average number of years a newborn child would live.
* total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same
* gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.


\* _Given as percentage of the Total GDP_

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('../data/country-data.csv')  # read_csv already returns a DataFrame
data.head()

In [None]:
X = data.drop(['country'], axis=1)
y = data['country']

In [None]:
X.shape

The data is already preprocessed and contains no missing values.
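
If you want to confirm this yourself, the sketch below shows one way to count missing values and spot non-numeric columns. It runs on a hypothetical three-row frame called `toy` with made-up values, not on the real dataset; the same calls should carry over to `data`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real table (values are made up).
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "child_mort": [90.2, 16.6, 4.8],
    "income": [1610, 9930, 41400],
})

# Count missing values per column; all zeros means nothing needs imputing.
print(toy.isna().sum())

# List the non-numeric columns, which k-means cannot use directly.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```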

**Questions** 
1. How many features do you have? 
2. Do you have any categorical variables? Would that pose a problem for k-means? 
3. What does _y_ mean in this case?

### Exercise 1
The ranges vary between the different features. Scale your data and create `X_scaled`.
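
As a hint, here is a minimal sketch of standardisation with scikit-learn's `StandardScaler`, which is one common choice (the exercise does not prescribe a scaler). The array `X_toy` is made up for illustration and is not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales (think of a rate next to an income).
X_toy = np.array([[90.2, 1610.0],
                  [16.6, 9930.0],
                  [4.8, 41400.0]])

# Standardise each column to zero mean and unit variance.
scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)

print(X_toy_scaled.mean(axis=0))  # close to 0 for every column
print(X_toy_scaled.std(axis=0))   # close to 1 for every column
```

Without scaling, the column with the largest numeric range would dominate the Euclidean distances that k-means relies on.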

In [None]:
# %load ../answers/kmeans_scaling.py

## Number of clusters

A downside of k-means is that you must decide in advance how many clusters you expect to find in the data. You are going to use the Elbow method to get a first indication of what an appropriate number of clusters would be.

### Exercise 2
Implement the Elbow method here. When instantiating your KMeans object, use `n_init=10` as one of its parameters.
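
The general shape of the Elbow method is sketched below on synthetic data from `make_blobs`. The names `X_demo`, `ks` and `inertias` are illustrative assumptions; in the exercise you would fit on `X_scaled` instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled data: 150 points around 3 centres.
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Fit k-means for a range of k and record the inertia
# (within-cluster sum of squared distances) of each fit.
ks = range(1, 9)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` and looking for the point where the curve flattens gives the "elbow".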

In [None]:
# %load ../answers/kmeans_elbow.py

**Question** 
Does the Elbow method give a clear answer? What would you say possible choices are? 

### Exercise 3

Next up is the Silhouette score. Compute the silhouette score for a range of cluster counts here.
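
A minimal sketch of the idea, again on synthetic `make_blobs` data rather than the country data. The names `X_demo`, `scores` and `best_k` are assumptions made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data.
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# The silhouette score needs at least 2 clusters, so start at k=2.
scores = {}
for k in range(2, 9):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)

# The k with the highest average silhouette is the candidate to consider.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Scores lie between -1 and 1, and higher values indicate better separated clusters.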

In [None]:
# %load ../answers/kmeans_silhouette.py

**Question** What is your final answer for the number of clusters appropriate for this data problem? 

### Exercise 4
You have prepared your data and scaled it, and determined the number of clusters to use. Let's get ready to use k-means! Implement k-means here. Don't forget to also set `n_init` when instantiating your object.
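
For reference, fitting a `KMeans` model and reading its results could look roughly like this. The sketch uses synthetic data and a hypothetical object named `kmeans_demo` so that it does not clash with the `kmeans` object the notebook expects you to create.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled data.
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_clusters=3 only suits this toy data; use the value you settled on.
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_demo.fit(X_demo)

print(kmeans_demo.labels_[:10])            # cluster index of the first 10 points
print(kmeans_demo.cluster_centers_.shape)  # one centroid row per cluster
```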

In [None]:
# %load ../answers/kmeans_kmeans.py

### Exercise 5

Explain why it's best to set `n_init` to a number (much) higher than 1. What does this parameter do?

<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
K-means is pretty sensitive to initialisation. To reduce the risk of getting stuck in a local minimum, you can run k-means multiple times, each from a different initialisation, and keep the run whose centroids give the lowest inertia. The `n_init` parameter sets the number of such runs.

</details>
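
The effect can be illustrated by hand. The sketch below runs k-means ten times with a single random initialisation each and keeps the best run, which mirrors the idea behind `n_init=10`. The synthetic data and the names `runs` and `best_run` are assumptions for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 5 centres, where poor initialisations are more likely.
X_demo, _ = make_blobs(n_samples=200, centers=5, random_state=1)

# Run k-means ten times, each from a single random initialisation.
runs = [
    KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X_demo)
    for seed in range(10)
]
inertias = [run.inertia_ for run in runs]

# Keep the run with the lowest inertia, as n_init does internally.
best_run = min(runs, key=lambda run: run.inertia_)
print(sorted(round(i, 1) for i in inertias))
print(round(best_run.inertia_, 1))
```

If the sorted inertias differ between runs, some initialisations ended in a worse local minimum than others.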

## Analyse result

Let's convert your results to a pandas dataframe for easy data wrangling. Let's see how many points were assigned to each cluster.

In [None]:
labels = pd.DataFrame(kmeans.labels_, columns = ['labels'])
labels.value_counts() 

Hmmmh, there is something interesting going on here, wouldn't you say? Based on the number of countries assigned to each cluster, do you still think the previous choice of number of clusters based on the Elbow method and the Silhouette score was the right choice? Or is there a different value you would like to try out now? 

### Exercise 6
Retry k-means with a different number of clusters. 

In [None]:
# %load ../answers/kmeans_retry.py

In [None]:
labels = pd.DataFrame(kmeans.labels_, columns = ['labels'])
labels.value_counts() 

This seems more like it! As there is no ground truth to compare your labels against, you cannot simply verify that the clustering was done correctly. However, you can investigate your features for the various labels and see if you can find some differentiating factors.

In [None]:
data_kmeans = pd.concat([data.copy(), labels], axis=1)
data_kmeans.head(10)

In [None]:
columns = data_kmeans.drop(['country', 'labels'], axis=1).columns

for column in columns:
    plt.figure()
    sns.boxplot(x='labels', y=column, data=data_kmeans)

**Questions**
1. For which features do there seem to be notable differences between the clusters? 
2. For which features do the differences seem **not** that notable? 
3. How would you characterise the different labels in terms of their features? 
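
Besides boxplots, a table of per-cluster averages can help with these questions. The sketch below uses a made-up frame called `toy` that stands in for `data_kmeans`; the values are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for data_kmeans.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "child_mort": [90.0, 80.0, 5.0, 7.0],
    "income": [1500, 2500, 40000, 36000],
    "labels": [0, 0, 1, 1],
})

# Average every numeric feature within each cluster.
profile = toy.drop(columns="country").groupby("labels").mean()
print(profile)
```

Features whose averages differ strongly between rows of `profile` are the ones that separate the clusters most.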

### Conclusion 

What clusters have you found in the data and how would you characterise these clusters? Explore the data with their corresponding labels. Do these findings match your own expectations? 

### Bonus Exercise
Investigate which countries the clustering went awry for on the first try. What is special about these countries that caused them to be assigned to different clusters? 
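
One possible starting point, assuming you stored the labels of both runs side by side, is to cross-tabulate them and list countries that ended up in very small clusters. The frame `runs` and its columns `labels_first` and `labels_retry` are hypothetical, and the values are made up.

```python
import pandas as pd

# Made-up labels from two clustering runs with different numbers of clusters.
runs = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "labels_first": [0, 0, 0, 1, 0],
    "labels_retry": [0, 0, 1, 2, 1],
})

# Rows: label in the first run; columns: label in the retry.
print(pd.crosstab(runs["labels_first"], runs["labels_retry"]))

# Countries that sit alone in their first-run cluster.
sizes = runs.groupby("labels_first")["country"].transform("count")
lonely = runs.loc[sizes == 1, "country"].tolist()
print(lonely)
```

Looking up the feature values of such countries in the original data may show which attributes set them apart.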