In this dataset, our goal is to find the least developed countries. We can use that as an opportunity to dive into the technique known as clustering. For this purpose, we'll use one of the most popular clustering algorithms, K-Means.

We'll first see how it works using scikit-learn's implementation, and then we'll try to implement it ourselves.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from sklearn.base import BaseEstimator
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('../input/unsupervised-learning-on-country-data/Country-data.csv')
data

# Data analysis

Let's see how many null values there are.

In [None]:
data.info()

We can see that there are no null values in this set.
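
`data.info()` reports non-null counts per column, so missing values have to be read off by comparing them with the row count. A more direct check is to count them explicitly. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show the idea:

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative numbers only)
toy = pd.DataFrame({
    'income': [1200.0, np.nan, 5400.0],
    'gdpp': [550.0, 4100.0, 9000.0],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same two calls on `data` should give zeros for every column if the set really has no null values.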

Let's now inspect the statistical properties of the data.

In [None]:
data.describe()

We can see that the data definitely needs scaling, because the columns have very different ranges. Also, it seems that there are a couple of potential outliers in the dataset.

One thing we don't need for clustering is the country name column, so we will drop it (keeping a copy of the names for later).

In [None]:
country_names = data['country']
data = data.drop(['country'], axis='columns')

Let's inspect the correlations between the columns.

In [None]:
correlation_matrix = data.corr()
sns.heatmap(correlation_matrix, annot=True)

We can see a few columns have strong correlations:
 - child_mort - total_fer 
 - child_mort - life_expec
 - exports - imports
 - income - gdpp
 - life_expec - total_fer
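
Pairs like these can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. This is only a sketch: the helper name `strongly_correlated_pairs` and the threshold values are assumptions made for illustration, and the toy columns are made up.

```python
import pandas as pd

def strongly_correlated_pairs(df, threshold=0.8):
    """Return (column_a, column_b, r) for every column pair with |r| >= threshold."""
    corr = df.corr()
    columns = list(corr.columns)
    pairs = []
    for i, col_a in enumerate(columns):
        # Only look at columns after col_a, so each pair is reported once
        for col_b in columns[i + 1:]:
            r = corr.loc[col_a, col_b]
            if abs(r) >= threshold:
                pairs.append((col_a, col_b, float(r)))
    return pairs

# Toy data (illustrative only)
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with 'a'
    'c': [3.0, 1.0, 4.0, 1.0, 5.0],   # only weakly related to 'a' and 'b'
})

pairs = strongly_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

Using the absolute value of the coefficient means strong negative correlations (such as child_mort and life_expec) are reported as well.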

In [None]:
# sns.pairplot(data)

# Data preprocessing

In [None]:
scikit_copy = data.copy()

scaler = StandardScaler()
scaled_data = scaler.fit_transform(scikit_copy)
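
To make clear what the scaling step does: `StandardScaler` subtracts each column's mean and divides by its standard deviation, so every column ends up with mean 0 and standard deviation 1. A minimal sketch on made-up numbers, comparing it with the manual computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on very different scales (illustrative values only)
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual standardization: (x - column mean) / column standard deviation
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
```

Without this step, columns with large values such as income or gdpp would dominate the Euclidean distances that K-Means relies on.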

# Clustering

We're going to use the K-Means Clustering algorithm. Since we aren't given a specific number of clusters, we will choose it using the elbow method.

In K-Means, we can define a cost function as the sum of squared distances between each point and the center of the cluster it's assigned to. This cost is often called *inertia*. We want this cost to be as low as possible, but adding clusters always lowers it (it reaches zero when every point is its own cluster), and we don't want too many clusters, because that wouldn't be very informative.
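
As a small worked example of this cost: scikit-learn's `inertia_` is the sum of squared distances of the samples to their closest cluster center, and the same quantity can be computed by hand. The function name `inertia` and the toy points below are assumptions made for illustration.

```python
import numpy as np

def inertia(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    # centers[labels] picks, for every point, the center of its cluster
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))

# Four toy points split into two clusters (illustrative values only)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 2.0]])

print(inertia(points, labels, centers))  # 1 + 1 + 4 + 4 = 10.0
```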

The elbow method means that we plot the inertia as a function of the number of clusters. For the first few values, the cost should decrease drastically with each new cluster. At some point, though, the curve will begin to "flatten". This point is exactly the value we're looking for.

The plot will look a bit like a bent arm, and the point in question resembles the place where the elbow would be. Hence the name of this method.

In [None]:
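cost_values = []
k_values = range(1, 15)
for k in k_values:
    model = KMeans(n_clusters=k)
    model.fit(scaled_data)
    cost_values.append(model.inertia_)

# Plot against the actual k values; otherwise the x-axis would start at 0 and be off by one
plt.plot(k_values, cost_values)
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia')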

We can see that the curve begins to flatten somewhere around k = 3, so that will be our k.

In [None]:
k = 3
model = KMeans(n_clusters=k)
clusters = model.fit_predict(scaled_data)

In [None]:
scikit_copy['country'] = country_names
scikit_copy['cluster'] = clusters
scikit_copy

In [None]:
scikit_copy['cluster'].value_counts()

Let's try to visualize the clusters.

In [None]:
sns.pairplot(scikit_copy.drop(['country'], axis='columns'), hue='cluster')

From these plots we can see that:
 - cluster 0:
     - in most categories, countries fall somewhere in between cluster 1 and cluster 2
 - cluster 1:
     - lowest child mortality
     - highest income
     - highest life expectancy
     - lowest total fertility
     - highest GDPP
 - cluster 2:
     - highest child mortality
     - lowest income
     - lowest life expectancy
     - highest total fertility
     - lowest GDPP
     
     
This analysis leads me to believe that we can name these clusters as:
    - cluster 2 - least developed countries
    - cluster 0 - moderately developed countries
    - cluster 1 - highly developed countries
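
The reading of the pair plot can be backed up with per-cluster averages. The sketch below uses a made-up frame with the same column names (the numbers are illustrative, not results from the dataset) and ranks the clusters by mean GDPP:

```python
import pandas as pd

# Toy data: two countries per cluster (illustrative values only)
toy = pd.DataFrame({
    'child_mort': [90.0, 110.0, 20.0, 30.0, 4.0, 6.0],
    'gdpp': [500.0, 700.0, 6000.0, 8000.0, 40000.0, 50000.0],
    'cluster': [2, 2, 0, 0, 1, 1],
})

# Mean of every feature within each cluster
cluster_means = toy.groupby('cluster').mean()

# Order the clusters from lowest to highest mean GDPP
ranked = cluster_means.sort_values('gdpp')
print(ranked)
```

Applied to `scikit_copy` (after dropping the `country` column), the same two calls would show which cluster number corresponds to which development level in a given run.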

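Let's see which countries fall into each cluster. Keep in mind that K-Means numbers the clusters arbitrarily, so the numbering may differ between runs.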

In [None]:
print('Least developed countries:')
scikit_copy[scikit_copy['cluster'] == 2]

In [None]:
print('Moderately developed countries:')
scikit_copy[scikit_copy['cluster'] == 0]

In [None]:
print('Highly developed countries:')
scikit_copy[scikit_copy['cluster'] == 1]

# K-Means Algorithm - My Own Implementation

The algorithm runs as follows:
    1. as a starting point, initialize each centroid to a random sample's coordinates
    2. repeat until the centroids stop changing or max_iter is reached:
        2a. assign each sample to the closest centroid
        2b. calculate new centroids' positions based on the newly assigned points
            (centroid position is calculated as a mean of coordinates of all the points assigned to it)
            
The algorithm is fairly simple, but there is a risk of it not finding the best fit and instead falling into what's called a local minimum of the cost function (in this case inertia). We can reduce this risk by running the algorithm multiple times, each time with different random samples as initial centroids, and then choosing the centroids with the lowest cost.

For simplicity, we will omit error checking and focus on the algorithm implementation.

In [None]:
class CustomKMeans(BaseEstimator):
    '''
    Class implementing the K-Means Clustering algorithm.
    Inherits from scikit-learn's BaseEstimator, which enables us to use it in conjunction
    with other scikit-learn tools such as Pipeline.
    '''
    def __init__(self, n_clusters=3, n_init=10, max_iter=200):
        self.n_clusters = n_clusters
        self.n_init = n_init
        self.max_iter = max_iter
        
        
    def fit(self, X):
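        '''
        Performs the K-Means Clustering algorithm.
        Args:
            X - data to be clustered (NumPy array of shape (n_samples, n_features))

        Uses the constructor parameters:
            n_clusters - number of clusters
            n_init - number of runs of the algorithm, each with different random initial centroids
            max_iter - maximum number of iterations within a single run
        '''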
        self.cost_history = []
        self.centroids_history = []
    
        for act_run in range(self.n_init):
            self.centroids = self.get_initial_centroids(X, self.n_clusters)
        
            for i in range(self.max_iter):
                # Assign each data point to the closest centroid.
                # sample_assignments[j] is the index of the centroid assigned to the j-th row of X
                sample_assignments = self.find_closest_centroids(X, self.centroids)
       
                old_centroids = self.centroids.copy()
                # Compute the new centroids based on the newly assigned samples
                self.update_centroids(X, sample_assignments, self.centroids)
            
                # if the centroids stayed the same, they won't change anymore, so we break the loop
                if np.all(old_centroids == self.centroids):
                    break
    
            self.cost_history.append(self.cost_function(X, sample_assignments, self.centroids))
            self.centroids_history.append(self.centroids)
            
        self.centroids = self.find_best_params()
        # Return the fitted estimator, following the scikit-learn convention
        return self
        
        
    def predict(self, X):
        '''Assigns the samples from the dataset to clusters'''
        return self.find_closest_centroids(X, self.centroids)
    
    
    def fit_predict(self, X):
        self.fit(X)
        return self.predict(X)

            
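    def get_initial_centroids(self, X, n_clusters):
        '''Chooses n_clusters distinct random samples from the data as the initial centroids'''
        # replace=False prevents picking the same sample twice, which would leave a cluster empty
        random_indexes = np.random.choice(X.shape[0], n_clusters, replace=False)
        return X[random_indexes, :]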


    def find_closest_centroids(self, X, centroids):
        '''Assign each sample to its closest centroid'''
        sample_assignments = []
        for sample in X:
            distances = np.linalg.norm(sample - centroids, axis=1)
            min_index = np.argmin(distances)
            sample_assignments.append(min_index)
        return np.array(sample_assignments)


    def update_centroids(self, X, sample_assignments, centroids):
        '''Computes new coordinates for each centroid based on the assigned samples'''
        for k in range(centroids.shape[0]):
            samples_assigned_to_centroid = (sample_assignments == k)
            # Leave the centroid unchanged if no samples were assigned to it (the mean would be NaN)
            if samples_assigned_to_centroid.any():
                centroids[k,:] = np.mean(X[samples_assigned_to_centroid], axis=0)
        
        
    def cost_function(self, X, sample_assignments, centroids):
        '''Calculates the inertia (sum of squared distances to assigned centroids) of the model'''
        cost = 0
        for i in range(X.shape[0]):
            sample_centroid = sample_assignments[i]
            # squared Euclidean distance between the sample and its centroid
            cost += np.sum((X[i,:] - centroids[sample_centroid,:]) ** 2)
        return cost


    def find_best_params(self):
        best_index = np.argmin(self.cost_history)
        return self.centroids_history[best_index]
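
The assignment step in `find_closest_centroids` loops over the samples one by one, which keeps the logic easy to follow. As a possible alternative, the same step can be written without the Python loop by using NumPy broadcasting. This is only a sketch: the standalone function name `assign_to_closest` and the toy points are assumptions made for illustration.

```python
import numpy as np

def assign_to_closest(X, centroids):
    """Return, for each row of X, the index of its nearest centroid."""
    # (n_samples, 1, n_features) - (1, n_clusters, n_features)
    # broadcasts to (n_samples, n_clusters, n_features)
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Euclidean distance from every sample to every centroid
    distances = np.linalg.norm(diffs, axis=2)
    return np.argmin(distances, axis=1)

# Toy data: two points near each centroid (illustrative values only)
X = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 10.0]])
centroids = np.array([[0.5, 0.0], [9.5, 9.5]])

assignments = assign_to_closest(X, centroids)
print(assignments)
```

The trade-off is memory: the intermediate array holds one entry per sample, centroid and feature, which is not a concern for a dataset of this size.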

In [None]:
model = CustomKMeans(n_clusters=3)

custom_copy = data.copy()
X_custom_scaled = scaler.transform(custom_copy)

custom_clusters = model.fit_predict(X_custom_scaled)

In [None]:
custom_copy['country'] = country_names
custom_copy['cluster'] = custom_clusters
custom_copy

In [None]:
print(custom_copy['cluster'].value_counts())
print()
print('Centroid coordinates:')
print(model.centroids)

As we can see, the samples were divided into clusters the same way as by scikit-learn's implementation (the cluster numbers may differ, since the numbering is arbitrary).

We can also look at the cluster of the least developed countries for comparison.

In [None]:
print('Least developed countries:')
custom_copy[custom_copy['cluster'] == 1]

OK, so we can see that this cluster is the same as the cluster of least developed countries from the scikit-learn implementation.
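
Because cluster numbers are arbitrary, comparing two clusterings by eye only goes so far. One way to check that two label arrays describe the same grouping up to renumbering is a contingency table: if every row and every column has exactly one non-zero cell, the partitions are identical. The helper name `same_partition` and the toy labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def same_partition(labels_a, labels_b):
    """True if both labelings group the samples identically, ignoring label numbers."""
    # Rows: clusters in labels_a, columns: clusters in labels_b, cells: shared sample counts
    table = pd.crosstab(np.asarray(labels_a), np.asarray(labels_b))
    nonzero = table.to_numpy() > 0
    # A one-to-one mapping means exactly one non-zero cell per row and per column
    return bool((nonzero.sum(axis=0) == 1).all() and (nonzero.sum(axis=1) == 1).all())

# Toy labels (illustrative only)
first = [0, 0, 1, 1, 2, 2]
renumbered = [2, 2, 0, 0, 1, 1]   # same groups, different numbers
different = [0, 1, 1, 1, 2, 2]    # one sample moved to another group

print(same_partition(first, renumbered))
print(same_partition(first, different))
```

In this notebook the check would be `same_partition(clusters, custom_clusters)`.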