# K-Means Notebook

<font color='royalblue'>This notebook tests our K-Means clustering algorithm, implemented in the Python script `KMeans.py`. </font><br><br>

<font color='royalblue'> To test our model we will use the <a href="https://www.kaggle.com/datasets/fernandol/countries-of-the-world"> World Countries Dataset </a> collected by the US Government in 2013. Keep in mind that this data is not recent. Our goal in this notebook is to use our KMeans class to group countries. We expect countries with equivalent levels of development to fall in the same cluster. Each year the United Nations publishes a <a href="https://www.un.org/development/desa/dpad/wp-content/uploads/sites/45/WESP2020_Annex.pdf">global country classification</a> in which every state is assigned to one of the following groups: developed economies, developing economies and economies in transition. </font><br>

<font color='royalblue'> Let's see whether the groups created by running our KMeans clustering on the world countries data with K=3 clusters reflect the levels of development stated by the UN. </font>
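
Once every country has a cluster index, one simple way to compare the clusters with the UN groups is a contingency table. The sketch below uses `pandas.crosstab` on a few placeholder rows. The country names, cluster indices and UN labels are hypothetical values chosen for illustration, not real data or results from this notebook.

```python
import pandas as pd

# Hypothetical placeholder rows (NOT real data): one cluster index and
# one UN group per country.
comparison = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "cluster": [0, 0, 1, 1, 2, 2],
    "un_group": ["developing", "developing", "in transition",
                 "developing", "developed", "developed"],
})

# Rows are clusters, columns are UN groups, cells count countries.
table = pd.crosstab(comparison["cluster"], comparison["un_group"])
print(table)
```

A cluster that matches a UN group well has most of its countries concentrated in a single column of its row.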

<font color='gray'><br>The libraries you need are `numpy`, `pandas`, `scipy` and `matplotlib`. With those installed you should be able to execute the whole notebook 👍 </font>

## Imports

In [1]:
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
from scipy import stats

# star import exposes the KMeans class defined in KMeans.py
from KMeans import *

In [2]:
# load the dataset; the CSV uses ',' as its decimal separator
dataset_path = "../datasets/world_countries.csv"

df = pd.read_csv(dataset_path, index_col=False, decimal=',')

## Dataset description

In [3]:
df.describe()

| | Population | Area (sq. mi.) | Pop. Density (per sq. mi.) | Coastline (coast/area ratio) | Net migration | Infant mortality (per 1000 births) | GDP ($ per capita) | Literacy (%) | Phones (per 1000) | Arable (%) | Crops (%) | Other (%) | Climate | Birthrate | Deathrate | Agriculture | Industry | Service |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 227.0 | 227.0 | 227.0 | 227.0 | 224.0 | 224.0 | 226.0 | 209.0 | 223.0 | 225.0 | 225.0 | 225.0 | 205.0 | 224.0 | 223.0 | 212.0 | 211.0 | 212.0 |
| mean | 28740280.0 | 598227.0 | 379.047137 | 21.16533 | 0.038125 | 35.506964 | 9689.823009 | 82.838278 | 236.061435 | 13.797111 | 4.564222 | 81.638311 | 2.139024 | 22.114732 | 9.241345 | 0.150844 | 0.282711 | 0.565283 |
| std | 117891300.0 | 1790282.0 | 1660.185825 | 72.286863 | 4.889269 | 35.389899 | 10049.138513 | 19.722173 | 227.991829 | 13.040402 | 8.36147 | 16.140835 | 0.699397 | 11.176716 | 4.990026 | 0.146798 | 0.138272 | 0.165841 |
| min | 7026.0 | 2.0 | 0.0 | 0.0 | -20.99 | 2.29 | 500.0 | 17.6 | 0.2 | 0.0 | 0.0 | 33.33 | 1.0 | 7.29 | 2.29 | 0.0 | 0.02 | 0.062 |
| 25% | 437624.0 | 4647.5 | 29.15 | 0.1 | -0.9275 | 8.15 | 1900.0 | 70.6 | 37.8 | 3.22 | 0.19 | 71.65 | 2.0 | 12.6725 | 5.91 | 0.03775 | 0.193 | 0.42925 |
| 50% | 4786994.0 | 86600.0 | 78.8 | 0.73 | 0.0 | 21.0 | 5550.0 | 92.5 | 176.2 | 10.42 | 1.03 | 85.7 | 2.0 | 18.79 | 7.84 | 0.099 | 0.272 | 0.571 |
| 75% | 17497770.0 | 441811.0 | 190.15 | 10.345 | 0.9975 | 55.705 | 15700.0 | 98.0 | 389.65 | 20.0 | 4.44 | 95.44 | 3.0 | 29.82 | 10.605 | 0.221 | 0.341 | 0.6785 |
| max | 1313974000.0 | 17075200.0 | 16271.5 | 870.66 | 23.06 | 191.19 | 55100.0 | 100.0 | 1035.6 | 62.11 | 50.68 | 100.0 | 4.0 | 50.73 | 29.74 | 0.769 | 0.906 | 0.954 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country                             227 non-null    object 
 1   Region                              227 non-null    object 
 2   Population                          227 non-null    int64  
 3   Area (sq. mi.)                      227 non-null    int64  
 4   Pop. Density (per sq. mi.)          227 non-null    float64
 5   Coastline (coast/area ratio)        227 non-null    float64
 6   Net migration                       224 non-null    float64
 7   Infant mortality (per 1000 births)  224 non-null    float64
 8   GDP ($ per capita)                  226 non-null    float64
 9   Literacy (%)                        209 non-null    float64
 10  Phones (per 1000)                   223 non-null    float64
 11  Arable (%)                          225 non-n

## Dataset cleaning

<font color='royalblue'> K-Means is very sensitive to outliers, because each centroid is the mean of its cluster. So before fitting the model to our data we will (i) delete countries with NaN values and (ii) remove outliers, i.e. countries for which at least one feature lies far from its distribution over the other countries.</font><br><br>
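
To illustrate why outliers matter, here is a minimal sketch on a hypothetical toy feature (the values are made up, not taken from the dataset). A single extreme value pulls the mean far from the bulk of the data, and a z-score filter with a threshold of 3 removes it.

```python
import numpy as np
from scipy import stats

# Hypothetical toy feature: 19 ordinary values and one extreme value.
values = np.array([10.0] * 10 + [12.0] * 9 + [500.0])

# Absolute z-score: distance from the mean in units of standard deviation.
z = np.abs(stats.zscore(values))
keep = z < 3

print("mean with the extreme value:   ", values.mean())
print("mean without the extreme value:", values[keep].mean())
```

Note that the z-score itself is computed with the outlier included, so on very small samples a single extreme value may not reach the threshold of 3.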

In [5]:
# features selection
df = df[['Country', 'Population', 'Net migration', 'Infant mortality (per 1000 births)',
         'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Birthrate', 'Deathrate',
         'Coastline (coast/area ratio)', 'Agriculture', 'Industry', 'Service']]

In [6]:
df = df.dropna()

In [7]:
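# absolute z-score of every numeric feature (all columns except 'Country')
z_scores = np.abs(scipy.stats.zscore(df.iloc[:, 1:]))

# a country is an outlier if at least one feature is 3 or more standard deviations from the mean
is_outlier = (z_scores >= 3).any(axis=1)

outliers = df[is_outlier]
df = df[~is_outlier]

print(df)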

                Country  Population  Net migration  \
1              Albania      3581655          -4.93   
2              Algeria     32930091          -0.39   
6             Anguilla        13477          10.76   
7    Antigua & Barbuda        69108          -6.15   
8            Argentina     39921833           0.61   
..                  ...         ...            ...   
218          Venezuela     25730435          -0.04   
219            Vietnam     84402966          -0.45   
224              Yemen     21456188           0.00   
225             Zambia     11502010           0.00   
226           Zimbabwe     12236805           0.00   

     Infant mortality (per 1000 births)  GDP ($ per capita)  Literacy (%)  \
1                                 21.52              4500.0          86.5   
2                                 31.00              6000.0          70.0   
6                                 21.03              8600.0          95.0   
7                                 19.46    

In [8]:
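# convert to a NumPy array; column 0 holds the country names
dataset = np.array(df)

# keep only the numeric features and cast them from object to float
X_train = dataset[:, 1:].astype(float)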

## Countries clustering

In [9]:
model = KMeans()

In [10]:
clusters = model.fit(X_train, K=3, n_epochs_max=50)

End of clustering
	7 epochs done 
	0 changes on last epochs 
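
The internals of `KMeans.py` are not shown in this notebook. As a rough illustration of what a call such as `model.fit(X_train, K=3, n_epochs_max=50)` is commonly expected to do, here is a minimal sketch of the standard K-Means loop (Lloyd's algorithm), which stops as soon as an epoch changes no assignment. The function name `kmeans_sketch`, its return values and the `seed` parameter are assumptions made for illustration, not the actual API of our class.

```python
import numpy as np

def kmeans_sketch(X, k=3, n_epochs_max=50, seed=0):
    """Illustrative Lloyd's algorithm; NOT the code in KMeans.py."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Initialise the centroids with k distinct rows picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(n_epochs_max):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)

        # Stop when no point changed cluster during this epoch.
        n_changes = int((new_labels != labels).sum())
        labels = new_labels
        if n_changes == 0:
            break

        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return labels, centroids

# Usage on hypothetical toy data: two well-separated groups on a line.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0],
                  [100.0], [101.0], [102.0], [103.0], [104.0]])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_labels)
```

Because the initial centroids are random, the cluster indices themselves are arbitrary: only the grouping is meaningful.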


In [11]:
# the last column of `clusters` is used as the cluster index of each country
category = np.reshape(clusters[:, -1], (clusters.shape[0], 1))

# append the cluster index as an extra column next to the original data
dataset_ = np.hstack([dataset, category])

# print the country names of each cluster
for c in np.unique(category):
    print(f"CATEGORY {int(c)}\n")
    print(dataset_[dataset_[:, -1] == c, 0])
    print("\n------------------------------\n")

CATEGORY 0

['Bangladesh ' 'Benin ' 'Bhutan ' 'Burkina Faso ' 'Burma ' 'Burundi '
 'Cambodia ' 'Cameroon ' 'Central African Rep. ' 'Chad ' 'Comoros '
 'Congo, Dem. Rep. ' 'Congo, Repub. of the ' "Cote d'Ivoire " 'Djibouti '
 'Eritrea ' 'Ethiopia ' 'Gabon ' 'Gambia, The ' 'Ghana ' 'Guinea '
 'Haiti ' 'Kenya ' 'Laos ' 'Madagascar ' 'Malawi ' 'Mali ' 'Mauritania '
 'Mozambique ' 'Nepal ' 'Nigeria ' 'Pakistan ' 'Papua New Guinea '
 'Rwanda ' 'Senegal ' 'Sierra Leone ' 'Sudan ' 'Tajikistan ' 'Tanzania '
 'Togo ' 'Uganda ' 'Uzbekistan ' 'Vanuatu ' 'Yemen ' 'Zambia ' 'Zimbabwe ']

------------------------------

CATEGORY 1

['Albania ' 'Algeria ' 'Argentina ' 'Armenia ' 'Azerbaijan ' 'Belarus '
 'Belize ' 'Bolivia ' 'Brazil ' 'Brunei ' 'Bulgaria ' 'Cape Verde '
 'Chile ' 'Colombia ' 'Costa Rica ' 'Cuba ' 'Dominica '
 'Dominican Republic ' 'Ecuador ' 'Egypt ' 'El Salvador ' 'Fiji '
 'Georgia ' 'Grenada ' 'Guatemala ' 'Guyana ' 'Honduras ' 'Indonesia '
 'Iran ' 'Iraq ' 'Jamaica ' 'Jordan ' 'Kaz