# Apply clustering to form 3 groups based on SPI score


## Scope
- [x] Classify countries into different social progress categories based on their SPI scores (e.g., low, medium, high).
- [x] Analyze the characteristics of each cluster.

## Summary
- 57 of the 168 countries fall into the `High SPI score` category.
- 61% of High SPI-scoring countries are in Europe.
- Africa has only one country in the High SPI category: `Mauritius`.

## Imports

In [45]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [46]:
# Read csv
df = pd.read_csv('C:/Users/Tanish/Desktop/SPI-Analysis-Project/Data/new_spi.csv')

In [47]:
# Preview
df.head()

Unnamed: 0,spi_rank,country,spi_score,basic_human_needs,wellbeing,opportunity,basic_nutri_med_care,water_sanitation,shelter,personal_safety,access_basic_knowledge,access_info_comm,health_wellness,env_quality,personal_rights,personal_freedom_choice,inclusiveness,access_adv_edu,continent
0,1,Norway,92.63,95.29,93.3,89.3,98.81,98.33,93.75,90.29,98.66,95.8,89.32,89.44,96.34,91.16,83.77,85.92,Europe
1,2,Finland,92.26,95.62,93.09,88.07,98.99,99.26,96.48,87.75,96.32,95.14,85.73,95.15,96.13,88.1,82.81,85.23,Europe
2,3,Denmark,92.15,95.3,92.74,88.41,98.62,98.21,94.92,89.46,97.44,98.18,85.15,90.2,97.08,90.03,81.64,84.89,Europe
3,4,Iceland,91.78,96.66,93.65,85.04,98.99,98.82,93.16,95.66,99.51,93.12,91.02,90.93,95.14,88.01,77.63,79.39,Europe
4,5,Switzerland,91.78,95.25,93.8,86.28,98.72,98.96,92.97,90.35,98.6,95.07,91.5,90.05,96.69,90.65,74.81,82.99,Europe


In [48]:
# Feature columns for clustering
feature_columns = ['spi_score', 'basic_human_needs', 'wellbeing', 'opportunity', 'basic_nutri_med_care',
                   'water_sanitation', 'shelter', 'personal_safety', 'access_basic_knowledge', 'access_info_comm',
                   'health_wellness', 'env_quality', 'personal_rights', 'personal_freedom_choice', 'inclusiveness',
                   'access_adv_edu']

In [49]:
# Standardize the features (mean 0, standard deviation 1)
scaler = StandardScaler()
df[feature_columns] = scaler.fit_transform(df[feature_columns])

Standardizing the features with `StandardScaler` is a common preprocessing step in machine learning. It rescales each numerical feature to a mean of 0 and a standard deviation of 1. K-means relies on Euclidean distance, so without this step features with larger numeric ranges would dominate the clustering; after scaling, every feature is on a comparable scale.
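
As a minimal sketch of what this scaling does, the snippet below applies the same z-score formula by hand with NumPy. The array `X` holds illustrative values that are not taken from the SPI dataset, and the manual formula assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Toy data (illustrative values, not from the SPI dataset):
# two features on very different scales.
X = np.array([[50.0, 0.1],
              [60.0, 0.2],
              [70.0, 0.3],
              [80.0, 0.4]])

# z-score: subtract the column mean, divide by the column standard deviation.
# np.std defaults to ddof=0 (population standard deviation).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns carry identical values even though the raw features differed by orders of magnitude, which is why neither can dominate a distance calculation.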

In [50]:
# Applying K-means clustering with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(df[feature_columns])

In [51]:
# Check all the values are standardised
df.head()

Unnamed: 0,spi_rank,country,spi_score,basic_human_needs,wellbeing,opportunity,basic_nutri_med_care,water_sanitation,shelter,personal_safety,access_basic_knowledge,access_info_comm,health_wellness,env_quality,personal_rights,personal_freedom_choice,inclusiveness,access_adv_edu,continent,cluster
0,1,Norway,1.677613,1.177449,1.656728,1.955517,0.978581,0.947313,0.886934,1.639119,1.227206,1.422788,1.682828,1.549878,1.23847,1.87342,2.172321,1.712494,Europe,2
1,2,Finland,1.652964,1.197755,1.643087,1.877694,0.991069,0.987051,1.032072,1.462713,1.106977,1.390405,1.458922,1.948888,1.228713,1.670476,2.115866,1.675308,Europe,2
2,3,Denmark,1.645637,1.178064,1.620353,1.899206,0.965399,0.942186,0.949136,1.581475,1.164522,1.539563,1.422748,1.602986,1.272852,1.798477,2.047062,1.656984,Europe,2
3,4,Iceland,1.620988,1.261749,1.679462,1.685986,0.991069,0.96825,0.855567,2.012072,1.270878,1.291293,1.788856,1.653998,1.182717,1.664507,1.811246,1.360573,Europe,2
4,5,Switzerland,1.620988,1.174988,1.689206,1.764441,0.972337,0.974232,0.845466,1.643286,1.224123,1.38697,1.818793,1.592504,1.254732,1.839596,1.645411,1.554588,Europe,2


In [52]:
# Check the value counts for each cluster
df.cluster.value_counts()

cluster
0    75
1    49
2    44
Name: count, dtype: int64

**Note** - At this stage, we have created three clusters using the K-means algorithm, but we don't know which cluster represents the high, low, or medium SPI score category. The cluster numbers (0, 1, and 2) are arbitrary and do not carry any inherent meaning regarding the SPI score categories.

To determine which cluster represents the `high, low, or medium` SPI score category, we need to analyze the characteristics of each cluster and identify which cluster has countries with higher SPI scores, which has countries with lower SPI scores, and which falls in between.

In [53]:
# Analyzing the characteristics of each cluster
cluster_characteristics = df.groupby('cluster').mean(numeric_only=True)
cluster_characteristics

cluster,spi_rank,spi_score,basic_human_needs,wellbeing,opportunity,basic_nutri_med_care,water_sanitation,shelter,personal_safety,access_basic_knowledge,access_info_comm,health_wellness,env_quality,personal_rights,personal_freedom_choice,inclusiveness,access_adv_edu
0,82.506667,0.051918,0.308042,0.020503,-0.188762,0.364994,0.365508,0.459458,-0.169002,0.238319,0.121544,0.012137,-0.422639,-0.24587,-0.043748,-0.350206,-0.001475
1,143.22449,-1.221333,-1.374674,-1.161219,-0.935287,-1.368508,-1.372787,-1.394006,-0.783906,-1.229737,-1.184521,-1.103642,-0.400939,-0.586709,-1.062497,-0.547331,-1.141073
2,22.5,1.271623,1.005815,1.258227,1.363322,0.901872,0.90576,0.769248,1.161058,0.963254,1.111948,1.208367,1.166907,1.072476,1.257806,1.20647,1.273254


One common approach is to calculate the mean SPI score for each cluster and compare those means. The cluster with the highest mean SPI score would represent the high SPI score category, the one with the lowest mean SPI score would represent the low SPI score category, and the remaining cluster would represent the medium SPI score category.
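
A minimal sketch of that comparison, using the mean `spi_score` values from the table above. The name `cluster_to_category` is hypothetical; the notebook itself goes on to assign categories with thresholds rather than with this mapping.

```python
# Mean standardized spi_score per cluster, copied from the table above.
cluster_means = {0: 0.051918, 1: -1.221333, 2: 1.271623}

# Sort cluster ids from lowest to highest mean and pair them with category names.
ordered = sorted(cluster_means, key=cluster_means.get)
cluster_to_category = dict(zip(ordered, ['Low', 'Medium', 'High']))

print(cluster_to_category)  # {1: 'Low', 0: 'Medium', 2: 'High'}
```

Under this reading, cluster 2 is the high group, cluster 0 the medium group, and cluster 1 the low group.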

**Note** - The negative mean value for the spi_score in cluster 1 is a result of the standardization step that was performed before applying K-means clustering: after scaling, a negative value means the score is below the dataset average.

In [54]:
# Determine SPI score categories based on cluster means
low_spi_threshold = cluster_characteristics['spi_score'].min() + (cluster_characteristics['spi_score'].max() - cluster_characteristics['spi_score'].min()) / 3
high_spi_threshold = cluster_characteristics['spi_score'].max() - (cluster_characteristics['spi_score'].max() - cluster_characteristics['spi_score'].min()) / 3

In [55]:
# Print low threshold value
low_spi_threshold

np.float64(-0.3903472319196173)

In [56]:
# Print high threshold value
high_spi_threshold

np.float64(0.44063806084159984)

To divide the range into three equal parts, we add one-third of the range to the minimum value to get the low_spi_threshold, and subtract one-third of the range from the maximum value to get the high_spi_threshold.

For instance, take the numbers from `1 to 100`:

the range is max - min -> 100 - 1 -> `99`,

one third of the range is 99 / 3 -> `33`,

low_threshold -> min + one third of the range -> 1 + 33 -> `34`

high_threshold -> max - one third of the range -> 100 - 33 -> `67`
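
The same arithmetic can be checked against the notebook's own numbers. The sketch below uses the three cluster means of `spi_score` as displayed (rounded to six decimals), so the results match the printed thresholds only approximately. Note that the range here spans the lowest and highest cluster means, not the lowest and highest individual scores.

```python
# Cluster means of the standardized spi_score, rounded as displayed in the notebook output.
means = [0.051918, -1.221333, 1.271623]

lo, hi = min(means), max(means)
third = (hi - lo) / 3  # one third of the range between the extreme cluster means

low_spi_threshold = lo + third
high_spi_threshold = hi - third

print(round(low_spi_threshold, 4), round(high_spi_threshold, 4))  # -0.3903 0.4406
```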

In [57]:
# Function that assigns an spi_score to one of the three categories
def classify_spi_category(spi_score):
    if spi_score < low_spi_threshold:
        return 'Low'
    elif spi_score > high_spi_threshold:
        return 'High'
    else:
        return 'Medium'

In [58]:
# Add the SPI score categories to the original dataframe
df['spi_category'] = df['spi_score'].apply(classify_spi_category)
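
As an alternative sketch, the same rule can be written in vectorized form with `np.select`, which avoids calling a Python function once per row. The thresholds below are rounded from the notebook output, and the `scores` array is toy data: the first and last values resemble scores shown in this notebook, and the rest are made up.

```python
import numpy as np

# Threshold values rounded from the notebook output.
low_spi_threshold = -0.3903
high_spi_threshold = 0.4406

# Toy standardized scores (assumption: illustrative only).
scores = np.array([1.677613, 0.10, -0.39, -1.50, 0.656363])

# Conditions are checked in order; values matching neither fall back to 'Medium'.
categories = np.select(
    [scores < low_spi_threshold, scores > high_spi_threshold],
    ['Low', 'High'],
    default='Medium',
)

print(categories.tolist())  # ['High', 'Medium', 'Medium', 'Low', 'High']
```

Because the comparisons are the same strict `<` and `>` tests, values exactly on a threshold land in `Medium`, just as in `classify_spi_category`.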

In [59]:
# Preview
df.head()

Unnamed: 0,spi_rank,country,spi_score,basic_human_needs,wellbeing,opportunity,basic_nutri_med_care,water_sanitation,shelter,personal_safety,...,access_info_comm,health_wellness,env_quality,personal_rights,personal_freedom_choice,inclusiveness,access_adv_edu,continent,cluster,spi_category
0,1,Norway,1.677613,1.177449,1.656728,1.955517,0.978581,0.947313,0.886934,1.639119,...,1.422788,1.682828,1.549878,1.23847,1.87342,2.172321,1.712494,Europe,2,High
1,2,Finland,1.652964,1.197755,1.643087,1.877694,0.991069,0.987051,1.032072,1.462713,...,1.390405,1.458922,1.948888,1.228713,1.670476,2.115866,1.675308,Europe,2,High
2,3,Denmark,1.645637,1.178064,1.620353,1.899206,0.965399,0.942186,0.949136,1.581475,...,1.539563,1.422748,1.602986,1.272852,1.798477,2.047062,1.656984,Europe,2,High
3,4,Iceland,1.620988,1.261749,1.679462,1.685986,0.991069,0.96825,0.855567,2.012072,...,1.291293,1.788856,1.653998,1.182717,1.664507,1.811246,1.360573,Europe,2,High
4,5,Switzerland,1.620988,1.174988,1.689206,1.764441,0.972337,0.974232,0.845466,1.643286,...,1.38697,1.818793,1.592504,1.254732,1.839596,1.645411,1.554588,Europe,2,High


In [60]:
# Get the count of all 3 categories
df.spi_category.value_counts()

spi_category
Low       60
High      57
Medium    51
Name: count, dtype: int64

In [61]:
# Display all the countries with high spi score
high_df = df[df['spi_category']== 'High']
high_df

Unnamed: 0,spi_rank,country,spi_score,basic_human_needs,wellbeing,opportunity,basic_nutri_med_care,water_sanitation,shelter,personal_safety,...,access_info_comm,health_wellness,env_quality,personal_rights,personal_freedom_choice,inclusiveness,access_adv_edu,continent,cluster,spi_category
0,1,Norway,1.677613,1.177449,1.656728,1.955517,0.978581,0.947313,0.886934,1.639119,...,1.422788,1.682828,1.549878,1.23847,1.87342,2.172321,1.712494,Europe,2,High
1,2,Finland,1.652964,1.197755,1.643087,1.877694,0.991069,0.987051,1.032072,1.462713,...,1.390405,1.458922,1.948888,1.228713,1.670476,2.115866,1.675308,Europe,2,High
2,3,Denmark,1.645637,1.178064,1.620353,1.899206,0.965399,0.942186,0.949136,1.581475,...,1.539563,1.422748,1.602986,1.272852,1.798477,2.047062,1.656984,Europe,2,High
3,4,Iceland,1.620988,1.261749,1.679462,1.685986,0.991069,0.96825,0.855567,2.012072,...,1.291293,1.788856,1.653998,1.182717,1.664507,1.811246,1.360573,Europe,2,High
4,5,Switzerland,1.620988,1.174988,1.689206,1.764441,0.972337,0.974232,0.845466,1.643286,...,1.38697,1.818793,1.592504,1.254732,1.839596,1.645411,1.554588,Europe,2,High
5,6,Canada,1.596339,1.142375,1.626199,1.789749,0.945974,0.912704,0.731163,1.772466,...,1.306994,1.561208,1.772793,1.210593,1.563699,1.923567,1.65914,North America,2,High
6,7,Sweden,1.58235,1.128838,1.562543,1.826445,0.984825,0.948168,0.783264,1.547444,...,1.309938,1.528776,1.811226,1.296547,1.755368,1.941209,1.514168,Europe,2,High
7,8,Netherlands,1.54038,1.10361,1.52292,1.770768,0.962624,0.960987,0.668961,1.584253,...,1.446339,1.533765,1.429685,1.275175,1.753378,1.6213,1.645128,Europe,2,High
8,9,Japan,1.53172,1.27344,1.632695,1.463907,0.864109,0.897749,1.052806,2.048187,...,1.310428,1.856214,1.424095,1.204553,1.44233,0.977364,1.522791,Asia,2,High
9,10,Germany,1.523726,1.134376,1.400155,1.811893,0.97095,0.955432,0.772631,1.58842,...,1.105336,1.246243,1.450649,1.302123,1.799803,1.69128,1.651056,Europe,2,High


In [62]:
# Distribution among continents
high_df.continent.value_counts()

continent
Europe           35
Asia              9
North America     7
South America     3
Oceania           2
Africa            1
Name: count, dtype: int64
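
The percentage quoted in the Summary follows directly from these counts. A small sketch with the counts copied from the output above; in pandas, `value_counts(normalize=True)` would give the same shares as fractions.

```python
# Counts copied from the value_counts() output above.
high_counts = {
    'Europe': 35, 'Asia': 9, 'North America': 7,
    'South America': 3, 'Oceania': 2, 'Africa': 1,
}

total = sum(high_counts.values())  # number of countries in the High category
shares = {continent: round(100 * n / total, 1) for continent, n in high_counts.items()}

print(total)             # 57
print(shares['Europe'])  # 61.4
```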

In [63]:
# African country in High SPI score
high_df[high_df['continent'] == 'Africa']

Unnamed: 0,spi_rank,country,spi_score,basic_human_needs,wellbeing,opportunity,basic_nutri_med_care,water_sanitation,shelter,personal_safety,...,access_info_comm,health_wellness,env_quality,personal_rights,personal_freedom_choice,inclusiveness,access_adv_edu,continent,cluster,spi_category
44,45,Mauritius,0.656363,0.871016,0.591461,0.399069,0.533875,0.879803,0.758277,0.977249,...,0.388492,0.332532,0.699446,0.803126,0.332772,0.279326,-0.09831,Africa,0,High
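
Mauritius sits in cluster 0 yet is labelled `High`. The category comes from thresholds on `spi_score` alone, so it does not have to coincide with the K-means cluster; this is also why the High category holds 57 countries while cluster 2 holds 44. One way to inspect the overlap is a cross-tabulation, sketched here on a toy frame in which only the Norway and Mauritius rows mirror the notebook output and the other rows are made up.

```python
import pandas as pd

# Toy frame: Norway and Mauritius mirror the notebook output;
# 'Country A' and 'Country B' are hypothetical rows for illustration.
toy = pd.DataFrame({
    'country': ['Norway', 'Mauritius', 'Country A', 'Country B'],
    'cluster': [2, 0, 0, 1],
    'spi_category': ['High', 'High', 'Medium', 'Low'],
})

# Rows are K-means cluster ids, columns are threshold-based categories.
comparison = pd.crosstab(toy['cluster'], toy['spi_category'])
print(comparison)
```

Running `pd.crosstab(df['cluster'], df['spi_category'])` on the full dataframe would show how many countries fall into each cluster/category combination.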
