# Introduction 

In [None]:
local_crs = 3414

place = "singapore"

# Objectives

My main task is to cluster the countries by the factors mentioned above and then present the solution. The following approach is suggested:

- Start with the data inspection and EDA tasks suitable for this dataset: data cleaning, univariate analysis, bivariate analysis, etc.

- **Outlier Analysis:** We must perform outlier analysis on the dataset. However, we have the flexibility of not removing the outliers if that suits the business needs or if too many countries would be removed. Hence, all we need to do is find the outliers in the dataset and then choose whether to keep or remove them depending on the results we get.

- Try both K-means and Hierarchical clustering (with both single and complete linkage) on this dataset to create the clusters. [Note that the two methods may not produce identical results, and we might have to choose one of them for the final list of countries.]

- Analyse the clusters and identify the ones which are in dire need of aid. We can analyse the clusters by comparing how these three variables [**gdpp, child_mort and income**] vary for each cluster of countries, to recognise and differentiate the clusters of developed countries from the clusters of under-developed countries.

- We also need to visualise the clusters that have been formed. We can do this by choosing any two of the three variables mentioned above for the X-Y axes, plotting a scatter plot of all the countries, and differentiating the clusters. We should create visualisations for all three pairs. We can also choose other types of plots, such as boxplots.

- K-means and Hierarchical clustering may give different results because of the preceding analysis (whether we chose to keep or remove the outliers, how many clusters we chose, etc.). Hence, there might be some subjectivity in the final number of countries that we think should be reported back to the CEO, since it depends on the preceding analysis as well. We should report back at least 5 countries which are in direst need of aid, based on the analysis that we perform.

# Data Collected / Received

The datasets containing those socio-economic factors and the corresponding data dictionary are provided.

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
import json

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Visualisation
from matplotlib.pyplot import xticks
%matplotlib inline

In [None]:
# Data display customization
pd.set_option('display.max_columns', None)
# Use None (not -1) for unlimited column width; -1 is rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)

In [None]:
# To perform Hierarchical clustering
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

In [None]:
# import all libraries and dependencies for machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.decomposition import IncrementalPCA
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan
from bokeh.plotting import figure, show

In [None]:
from clustergram import Clustergram
import os

# Data Preparation

## Data Loading

In [None]:
import geopandas as gpd

In [None]:
tessellation_raw = gpd.read_parquet(f"./out/{place}/tessellation_stats.pq")

tessellation = tessellation_raw.drop(columns=['geometry'])

In [None]:
tessellation.head()

## Data Dictionary

## Duplicate Check

In [None]:
# Compare the shape before and after dropping duplicate rows
tessellation.shape, tessellation.drop_duplicates().shape

The shape after running the drop-duplicates command is the same as that of the original dataframe.

Hence we can conclude that there were no duplicate rows in the dataset.

## Data Inspection

In [None]:
tessellation.shape

In [None]:
tessellation.info()

In [None]:
tessellation.describe()

## Data Cleaning

### Deal with null values


In [None]:
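# Fill the neighbour-distance columns first; filling everything with 0 beforehand
# would leave no missing values for the 5000 sentinel to replace
dist_cols = ["building_neighbour_dist_25", "building_neighbour_dist_50", "building_neighbour_dist_75"]
tessellation[dist_cols] = tessellation[dist_cols].fillna(5000)
# Fill all remaining missing values with 0
tessellation = tessellation.fillna(0)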

### Null Percentage: Columns

In [None]:
(tessellation.isnull().sum() * 100 / len(tessellation)).value_counts(ascending=False)

### Null Count: Columns

In [None]:
tessellation.isnull().sum().value_counts(ascending=False)

### Null Percentage: Rows

In [None]:
(tessellation.isnull().sum(axis=1) * 100 / len(tessellation)).value_counts(ascending=False)

In [None]:
tessellation

### Null Count: Rows

In [None]:
tessellation.isnull().sum(axis=1).value_counts(ascending=False)

There are no missing / null values in either columns or rows.

In [None]:
plt.figure(figsize = (30, 30))
sns.heatmap(tessellation.corr(), annot = True, cmap="rainbow")
plt.savefig('Correlation')
plt.show()

## Data Preparation

In [None]:
tessellation_drop = tessellation.copy()
uID = tessellation_drop.pop('uID')
tessellation_drop.head()

## Rescaling the Features

Most software packages use SVD to compute the principal components and assume that the data is scaled and centred, so it is important to do standardisation/normalisation. There are two common ways of rescaling:

- Min-Max scaling
- Standardisation (mean-0, sigma-1)


Here, we will use Standardisation Scaling.
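As a rough illustration of the two options, the sketch below applies both to a small made-up array using plain NumPy. The array `X` and its values are hypothetical and not taken from the dataset; the standardisation step divides by the population standard deviation, which is also what `StandardScaler` does.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Min-max scaling: map each column to the range [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardisation: give each column mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

Min-max scaling is sensitive to extreme values because the minimum and maximum define the range, which is one reason standardisation is often preferred before PCA.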

In [None]:
tessellation_drop

In [None]:
# Standardisation technique for scaling
scaler = StandardScaler()
tessellation_scaled = scaler.fit_transform(tessellation_drop)

In [None]:
tessellation_scaled

## PCA Application

We apply PCA to remove redundancies in the data and to find the most important directions along which the data varies. A somewhat similar heuristic is also used by the United Nations to calculate the Human Development Index (HDI) to rank countries on the basis of their development.

Principal component analysis (PCA) is one of the most commonly used dimensionality reduction techniques in the industry. By converting large data sets into smaller ones containing fewer variables, it helps in improving model performance, visualising complex data sets, and in many more areas.

Let's use PCA for dimensionality reduction as from the heatmap it is evident that correlation exists between the attributes.
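To make the idea concrete, here is a minimal sketch of PCA via SVD on synthetic data, using NumPy only. The data, the variable names, and the 95% threshold are assumptions chosen for illustration; two of the four features are near-copies of the other two, so most of the variance should sit in the first two components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 4 features, two of which are near-copies of the others
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 0.05 * rng.normal(size=200),
                     base[:, 1] + 0.05 * rng.normal(size=200)])

# Standardise, then take the SVD of the centred data
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
_, singular_values, components = np.linalg.svd(X_scaled, full_matrices=False)

# The variance explained by each component is proportional to its squared singular value
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative explained variance reaches 95%
num_components = int(np.argmax(cumulative >= 0.95)) + 1
print(explained_variance_ratio.round(3), num_components)
```

The rows of `components` play the same role as `pca.components_` in scikit-learn: each row is a direction in the original feature space.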

In [None]:
pca = PCA(svd_solver='randomized', random_state=50)


In [None]:
# Lets apply PCA on the scaled data

pca.fit(tessellation_scaled)

In [None]:
# PCA components created 

pca.components_

In [None]:
# Variance Ratio

pca.explained_variance_ratio_

In [None]:
# Variance Ratio bar plot for each PCA components.
plt.figure(figsize = (10, 5))
ax = plt.bar(range(1,len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_)
plt.xlabel("PCA Components",fontweight = 'bold')
plt.ylabel("Variance Ratio",fontweight = 'bold')

plt.show()

In [None]:
# calculate the cumulative sum of explained variance ratios
cumulative_sum = np.cumsum(pca.explained_variance_ratio_)

org_col = list(tessellation.drop(['uID'],axis=1).columns)

num_pc = np.argmax(cumulative_sum >= 0.95) + 1

pc_dict = {'Attribute': org_col}

pc_dict.update({f'PC_{i+1}':pca.components_[i] for i in range(num_pc)})

attributes_pca = pd.DataFrame(pc_dict)

In [None]:
# Scree plot to visualize the Cumulative variance against the Number of components

fig = plt.figure(figsize = (12,5))
# Plot against 1-based component counts so the curve lines up with the num_pc marker
plt.plot(range(1, len(cumulative_sum) + 1), cumulative_sum)
plt.vlines(x=num_pc, ymax=1, ymin=0, colors="r", linestyles="--")
plt.xlabel('Number of PCA components')
plt.ylabel('Cumulative Explained Variance')
plt.show()

In [None]:
attributes_pca

In [None]:
# # Plotting the above dataframe for better visualization with PC1 and PC2

# sns.pairplot(data=attributes_pca, x_vars=["PC_1"], y_vars=["PC_2"], hue = "Attribute" ,height=10)
# plt.xlabel("Principal Component 1",fontweight = 'bold')
# plt.ylabel("Principal Component 2",fontweight = 'bold')

# for i,txt in enumerate(attributes_pca.Attribute):
#     plt.annotate(txt, (attributes_pca.PC_1[i],attributes_pca.PC_2[i]))

In [None]:
# # Plotting the above dataframe with PC1 and PC3 to understand the components which explains inflation.

# sns.pairplot(data=attributes_pca, x_vars=["PC_1"], y_vars=["PC_3"], hue = "Attribute" ,height=8)
# plt.xlabel("Principal Component 1",fontweight = 'bold')
# plt.ylabel("Principal Component 3",fontweight = 'bold')

# for i,txt in enumerate(attributes_pca.Attribute):
#     plt.annotate(txt, (attributes_pca.PC_1[i],attributes_pca.PC_3[i]))

In [None]:
# Building the dataframe using Incremental PCA for better efficiency.

inc_pca = IncrementalPCA(n_components=num_pc)

In [None]:
# Fitting the scaled df on incremental pca

df_inc_pca = inc_pca.fit_transform(tessellation_scaled)
df_inc_pca

In [None]:
# Creating new dataframe with Principal components


df_pca = pd.DataFrame(df_inc_pca, columns=[f"PC_{i+1}" for i in range(num_pc)])
df_pca_final = pd.concat([uID, df_pca], axis=1)
df_pca_final.head()

In [None]:
# # Plotting Heatmap to check is there still dependency in the dataset.

# plt.figure(figsize = (30,30))        
# ax = sns.heatmap(df_pca.corr(),annot = True,cmap='winter')

As the heatmap of the principal components shows, the correlation among them is almost 0, so we can proceed with this dataframe.

In [None]:
# # Scatter Plot to visualize the spread of data across PCA components

# sns.pairplot(data=df_pca, kind="hist")

## Outlier Analysis

We visualise each column using a violin plot.

Five reasons why we use a violin plot rather than a boxplot:
- A violin plot is like a box plot, but shows the full shape of the distribution
- A violin plot is like a density plot, but also shows summary statistics
- A violin plot is visually intuitive and attractive
- A violin plot is non-parametric
- There are many ways to use violin plots

In [None]:
len(df_pca_final)

In [None]:
df_pca_final_minus_outliers = df_pca_final

In [None]:
# Loop over all retained principal components (PC_1 to PC_{num_pc})
for i in range(1, num_pc+1):
    col_name = f'PC_{i}'
    # Calculate the 0.5th and 99.5th percentiles for the current PC component
    lower = df_pca_final_minus_outliers[col_name].quantile(0.005)
    upper = df_pca_final_minus_outliers[col_name].quantile(0.995)
    # Keep only the rows that fall within these bounds for the current PC component
    df_pca_final_minus_outliers = df_pca_final_minus_outliers[df_pca_final_minus_outliers[col_name].between(lower, upper)]

In [None]:
# # Plot after Outlier removal 

# outliers = [f"PC_{i+1}" for i in range(num_pc)]
# plt.rcParams['figure.figsize'] = [20,5]
# sns.violinplot(data = df_pca_final[outliers], orient="v", palette="Set2" )
# plt.title("Outliers Variable Distribution", fontsize = 14, fontweight = 'bold')
# plt.ylabel("Range", fontweight = 'bold')
# plt.xlabel("PC Components", fontweight = 'bold')
# plt.show()

In [None]:
len(df_pca_final_minus_outliers)

In [None]:
# Reindexing the df after outlier removal
df_minus_outliers = df_pca_final_minus_outliers.reset_index(drop=True)
df_pca_final_minus_outliers = df_minus_outliers
df_pca_final_minus_outliers = df_pca_final_minus_outliers.drop(['uID'],axis=1)
df_pca_final_minus_outliers.head()

In [None]:

df_pca_final = df_pca_final.reset_index(drop=True)
df_pca_final_data = df_pca_final.drop(['uID'],axis=1)
df_pca_final_data.head()

## Hopkins Statistics Test

The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set. It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test where the null hypothesis is that the data is generated by a Poisson point process and is thus uniformly randomly distributed. A value close to 1 tends to indicate that the data is highly clustered, random data will tend to result in values around 0.5, and uniformly distributed data will tend to result in values close to 0.

- If the value is between {0.01, ..., 0.3}, the data is regularly spaced.

- If the value is around 0.5, it is random.

- If the value is between {0.7, ..., 0.99}, it has a high tendency to cluster.
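For reference, the quantity computed by the `hopkins` function in this notebook can be written as follows. The symbols are introduced here for illustration: $m$ is the number of sampled points, $u_j$ is the distance from the $j$-th uniformly drawn random point to its nearest data point (the `ujd` list), and $w_j$ is the distance from the $j$-th sampled data point to its nearest other data point (the `wjd` list). Some formulations raise each distance to the power of the data dimension $d$; the version below matches the code, which does not.

```latex
H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}
```

When the data is clustered, the $w_j$ are small relative to the $u_j$, which pushes $H$ towards 1.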

In [None]:
# Calculating Hopkins score to know whether the data is good for clustering or not.

def hopkins(X):
    d = X.shape[1]
    n = len(X)
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
    for j in range(0, m):
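        # Distance from a uniformly drawn random point to its nearest data point
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        # Distance from a sampled data point to its nearest other data point (index 0 is the point itself)
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])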
 
    HS = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(HS):
        print(ujd, wjd)
        HS = 0
 
    return HS


In [None]:
# Hopkins score, computed on the principal components only (the uID identifier is not a feature)
Hopkins_score=round(hopkins(df_pca_final_data),2)

In [None]:
print(Hopkins_score)


# Model Building

## Clustergram

In [None]:
len(df_pca_final)

In [None]:
data = df_pca_final_data

In [None]:
data

In [None]:
cgram = Clustergram(range(1, 19), n_init=6, method='gmm', bic=True, covariance_type='diag')
cgram.fit(data)

In [None]:
score = cgram.silhouette_score()

In [None]:
fig, axs = plt.subplots(figsize=(10, 10), sharex=True)
score.plot(xlabel="Number of clusters (k)", ylabel="Silhouette score", ax=axs)

In [None]:
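# Pick the k with the highest silhouette score, skipping the first 13 entries of the score series
num_clusters = score.iloc[13:].idxmax()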

In [None]:
reduced_array = np.mean(cgram.cluster_centers[num_clusters], axis=1)

In [None]:
weighted_difference_between_clusters = {i: k for i, k in enumerate(reduced_array)}

In [None]:
def scale_dict(d):
    # Extract values and convert them to a numpy array
    values = np.array(list(d.values()))

    # Normalize values to [0,1]
    normalized_values = (values - np.min(values)) / (np.max(values) - np.min(values))

    # Rescale values to the range [-10, 10]
    scaled_values = (normalized_values * 20) - 10

    # Create a new dictionary with the scaled values
    scaled_dict = {key: value for key, value in zip(d.keys(), scaled_values)}

    return scaled_dict

In [None]:
weighted_difference_between_clusters = scale_dict(weighted_difference_between_clusters)

In [None]:
# fig, axs = plt.subplots(3, figsize=(10, 10), sharex=True)
# cgram.silhouette_score().plot(xlabel="Number of clusters (k)", ylabel="Silhouette score", ax=axs[0])
# cgram.calinski_harabasz_score().plot(xlabel="Number of clusters (k)", ylabel="Calinski-Harabasz score", ax=axs[1])
# cgram.davies_bouldin_score().plot(xlabel="Number of clusters (k)", ylabel=".davies_bouldin_score", ax=axs[2])
# sns.despine(offset=10)

## K-means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

The algorithm works as follows:

First, we randomly initialize k points, called means. We assign each item to its closest mean and update each mean's coordinates to the average of the items assigned to it so far. We repeat the process for a given number of iterations and, at the end, we have our clusters.
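The steps above can be sketched in a few lines of NumPy. This is a simplified batch variant written for illustration, not the scikit-learn implementation; the function name `kmeans_sketch`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain Lloyd-style K-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the k means with k distinct, randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every point to its closest mean
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each mean to the average of the points assigned to it
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop early once the means no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return new_centers, labels

# Hypothetical toy data: two well-separated blobs of 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(10, 0.5, size=(20, 2))])
centers, labels = kmeans_sketch(X, k=2)
print(centers.round(2), labels)
```

Because the result depends on the random initialisation, library implementations typically run the algorithm several times and keep the best run.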

# Finding the Optimal Number of Clusters

### Elbow Curve to get the right number of Clusters

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
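As a small, self-contained illustration of the Elbow Method, the sketch below fits `KMeans` for several values of k on synthetic data and records the sum of squared distances (`inertia_`). The three-blob data set and the range of k are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated blobs of 30 points each
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in blob_centers])

# Fit K-means for k = 1..6 and record the sum of squared distances (inertia)
k_values = list(range(1, 7))
ssd = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    ssd.append(model.inertia_)

for k, value in zip(k_values, ssd):
    print(k, round(value, 1))
```

Plotting `ssd` against `k_values` gives the elbow curve: the value of k after which the curve flattens is the candidate number of clusters.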

In [None]:
# # Elbow curve method to find the ideal number of clusters.
# ssd = []
# for num_clusters in list(range(1, num_pc)):
#     model_clus = KMeans(n_clusters = num_clusters, max_iter=150,random_state= 50)
#     model_clus.fit(df_pca_final_data)
#     ssd.append(model_clus.inertia_)

# plt.plot(ssd)

Judging by the elbow curve above, it looks reasonable to proceed with 1 cluster.

## Silhouette Analysis

$$\text{silhouette score} = \frac{p - q}{\max(p, q)}$$

**p** is the mean distance to the points in the nearest cluster that the data point is not a part of.

**q** is the mean intra-cluster distance to all the points in its own cluster.

The silhouette score lies between -1 and 1.

A score closer to 1 indicates that the data point is very similar to the other data points in its cluster.

A score closer to -1 indicates that the data point is not similar to the data points in its cluster.
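The definition above can be checked by hand on a tiny example. The sketch below computes the score of every point with NumPy; the helper name `silhouette_per_point` and the four 1-D points are hypothetical, and a point that is alone in its cluster is given a score of 0 by convention.

```python
import numpy as np

def silhouette_per_point(X, labels):
    """Return the silhouette score (p - q) / max(p, q) of every point."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False  # exclude the point itself
        if not own.any():
            continue  # a point alone in its cluster keeps the score 0
        # q: mean distance to the other points of the same cluster
        q = dist[i, own].mean()
        # p: mean distance to the points of the nearest other cluster
        p = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (p - q) / max(p, q)
    return scores

# Hypothetical toy data: two 1-D clusters
X = [[0.0], [1.0], [5.0], [6.0]]
labels = [0, 0, 1, 1]
scores = silhouette_per_point(X, labels)
print(scores.round(3), scores.mean().round(3))
```

For the point at 0, q = 1 and p = 5.5, giving (5.5 - 1) / 5.5 ≈ 0.818. The overall silhouette score of a clustering is the mean of these per-point values.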

In [None]:
# # Silhouette score analysis to find the ideal number of clusters for K-means clustering

# range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

# for num_clusters in range_n_clusters:
    
#     # intialise kmeans
#     kmeans = KMeans(n_clusters=num_clusters, max_iter=50,random_state= 100)
#     kmeans.fit(df_pca_final_data)
    
#     cluster_labels = kmeans.labels_
    
#     # silhouette score
#     silhouette_avg = silhouette_score(df_pca_final_data, cluster_labels)
#     print("For n_clusters={0}, the silhouette score is {1}".format(num_clusters, silhouette_avg))

In [None]:
# #K-means with k=7 clusters

# cluster7 = KMeans(n_clusters=7, max_iter=150, random_state= 50)
# cluster7.fit(df_pca_final_data)

In [None]:
# # Cluster labels

# cluster7.labels_

In [None]:

# # Assign the label

# df_pca_final['Cluster_Id'] = cluster7.labels_
# df_pca_final.head()

In [None]:
# # Number of countries in each cluster

# df_pca_final['Cluster_Id'].value_counts()

There seems to be a good number of countries in each cluster.

In [None]:
# # Scatter plot on Principal components to visualize the spread of the data

# fig, axes = plt.subplots(1,2, figsize=(15,5))

# sns.scatterplot(x='PC_1',y='PC_2',hue='Cluster_Id',legend='full',palette="Set1",data=df_pca_final,ax=axes[0])
# sns.scatterplot(x='PC_1',y='PC_3',hue='Cluster_Id',legend='full',palette="Set2",data=df_pca_final,ax=axes[1])
# plt.show()

In [None]:
# df_pca_final

We have visualized the data on the principal components and saw that some good clusters were formed, but some were not so good. Let's now visualize the data on the original attributes.

In [None]:
# # Merging the df with PCA with original df

# df_merge = pd.merge(tessellation,df_pca_final,on='uID')
# list = tessellation.columns.tolist()
# list.append("Cluster_Id")
# df_merge_col = df_merge[list]

# df = []

# for column_name in tessellation.columns.tolist()[1:]:
#     df.append(pd.DataFrame(df_merge_col.groupby(["Cluster_Id"])[column_name].mean()))
    
# df_concat = pd.concat([pd.Series(range(7))] + df, axis=1)    
# df_concat.columns = ["Cluster_Id"] + tessellation.columns.tolist()[1:]
# df_concat.head()

In [None]:
# df_merge_col.head()

From the business understanding, we have learnt that **Child_Mortality, Income, Gdpp** are some of the important factors which decide the development of any uID. We have also cross-checked with the principal components and found that these variables have good scores in the PCA. Hence, we will proceed with analyzing these 3 components to build some meaningful clusters.

In [None]:
# # assuming you have a DataFrame called df_merge_col containing all the data,
# # and the x column is named 'Cluster_Id'

# # Get a list of all column names except for the x column
# y_columns = [col_name for col_name in df_merge_col.columns if col_name != 'Cluster_Id']

# # Set up the figure with subplots
# num_cols = 2
# num_rows = len(y_columns) // num_cols + (len(y_columns) % num_cols > 0)
# fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, sharey=True)

# # Flatten the axes array to simplify indexing
# axes = axes.flatten()

# # Loop over all column names except for the x column
# for i, col_name in enumerate(y_columns):
#     # Create a violin plot for the current column
#     sns.violinplot(x='Cluster_Id', y=col_name, data=df_merge_col, ax=axes[i])
#     # Set the title for the current plot
#     axes[i].set_title(col_name)

# # Remove empty plots
# for i in range(len(y_columns), num_rows*num_cols):
#     fig.delaxes(axes[i])

# # Adjust spacing between subplots
# fig.tight_layout()

# # Show the plot
# plt.show()

## Hierarchical Clustering

Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the hard disk are organized in a hierarchy. There are two types of hierarchical clustering:

- Divisive
- Agglomerative

In [None]:
# df_pca_final_data.head()

### Single Linkage

In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points, one from each cluster. For example, the distance between clusters “r” and “s” is equal to the distance between their two closest points.

In [None]:
# import sys
# sys.setrecursionlimit(10000)

In [None]:
# df_pca_final_data

In [None]:
# # Single linkage

# mergings = linkage(df_pca_final_data, method='single',metric='euclidean')
# dendrogram(mergings)
# plt.show()

### Complete Linkage

In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points, one from each cluster. For example, the distance between clusters “r” and “s” is equal to the distance between their two furthest points.
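The difference between the two linkage criteria can be seen on a tiny example. The sketch below uses two made-up clusters named `r` and `s` after the text; the coordinates are hypothetical.

```python
import numpy as np

# Hypothetical toy clusters "r" and "s", each holding two 2-D points
r = np.array([[0.0, 0.0], [1.0, 0.0]])
s = np.array([[3.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between a point of r and a point of s
pairwise = np.linalg.norm(r[:, None, :] - s[None, :, :], axis=2)

single_linkage = pairwise.min()    # distance between the two closest points
complete_linkage = pairwise.max()  # distance between the two furthest points
print(single_linkage, complete_linkage)  # 2.0 6.0
```

Single linkage tends to produce long, chained clusters because one close pair of points is enough to merge two clusters, whereas complete linkage favours compact clusters.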

In [None]:
# # Complete Linkage

# mergings = linkage(df_pca_final_data, method='complete',metric='euclidean')
# dendrogram(mergings)
# plt.show()

In [None]:
# df_pca_hc = df_pca_final.copy()
# df_pca_hc = df_pca_hc.drop('Cluster_Id',axis=1)
# df_pca_hc.head()

In [None]:
# # Let cut the tree at height of approx 3 to get 4 clusters and see if it get any better cluster formation.

# clusterCut = pd.Series(cut_tree(mergings, n_clusters = 4).reshape(-1,))
# df_hc = pd.concat([df_pca_hc, clusterCut], axis=1)
# df_hc.columns = ['uID'] + ['PC_' + str(i) for i in range(1, num_pc+1)] + ['Cluster_Id']

In [None]:
# df_hc.head()

In [None]:
# # Scatter plot on Principal components to visualize the spread of the data

# fig, axes = plt.subplots(1,2, figsize=(15,5))

# sns.scatterplot(x='PC_1',y='PC_2',hue='Cluster_Id',legend='full',palette="Set1",data=df_hc,ax=axes[0])
# sns.scatterplot(x='PC_1',y='PC_3',hue='Cluster_Id',legend='full',palette="Set1",data=df_hc,ax=axes[1])
# plt.show()

**We have analyzed both K-means and Hierarchical clustering and found that the clusters formed are not identical. The clusters formed in both cases are not that great, but they are better in K-means than in Hierarchical clustering. So, we will proceed with the clusters formed by K-means and, based on the information provided by the final clusters, deduce the final list of countries which are in need of aid.**

In [None]:
cgram.labels[num_clusters].values

In [None]:
df_pca_final["cluster_ID"] = cgram.labels[num_clusters].values

In [None]:
df_pca_final["cluster_ID"]

In [None]:
df_pca_final[["uID", "cluster_ID"]]

In [None]:
merged_df = tessellation_raw.join(df_pca_final[["uID", "cluster_ID"]].set_index('uID'), on="uID")

In [None]:
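# Rows without a cluster label get the placeholder ID 999 (assign back instead of chained inplace fill)
merged_df["cluster_ID"] = merged_df["cluster_ID"].fillna(999)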

In [None]:
# Map for comparison
f, ax = plt.subplots(figsize=(100, 100))
merged_df.plot(ax=ax, column="cluster_ID", categorical=True, legend=True, cmap='Pastel1')
ax.set_axis_off()

In [None]:
# Serialize the dictionary to a JSON string
json_data = json.dumps(weighted_difference_between_clusters)

# Create the directory if it does not exist
directory = f"./out/{place}/final/"
os.makedirs(directory, exist_ok=True)

# Write the JSON data to a file
with open(os.path.join(directory, 'weighted_difference_between_clusters.json'), 'w') as f:
    f.write(json_data)

In [None]:
merged_df.to_parquet(f"./out/{place}/tessellation_stats_clusters.pq")