The objective is to learn the various clustering algorithms available in scikit-learn (sklearn).

The data used is the World Happiness Report from Kaggle. Ref: ***https://www.kaggle.com/unsdsn/world-happiness/data***

Clustering techniques reference: ***http://scikit-learn.org/stable/modules/clustering.html#clustering***

Clustering techniques used:

1 **K-Means** - The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the within-cluster sum-of-squares.

2 **Mean Shift** - This clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates and form the final set of centroids.

3 **Mini Batch K-Means** - Similar to K-Means, but each update uses a small random batch of samples to reduce computation time.

4 **Spectral clustering** - SpectralClustering does a low-dimension embedding of the affinity matrix between samples, followed by a KMeans in the low-dimensional space. It is especially efficient if the affinity matrix is sparse. SpectralClustering requires the number of clusters to be specified. It works well for a small number of clusters but is not advised when using many clusters.

5 **DBSCAN** - The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped.

6 **Affinity Propagation** - Creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs.

7 **Birch** - Birch builds a tree called the Clustering Feature Tree (CFT) for the given data, and clustering is performed on the nodes of the tree.

8 **Gaussian Mixture modeling** - It treats each dense region as if it were generated by a Gaussian distribution and then estimates the parameters of each distribution.
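
The within-cluster sum-of-squares that K-Means minimizes can be recomputed by hand and compared with the value scikit-learn reports. The following is a minimal sketch, assuming two synthetic Gaussian blobs (`points`) rather than the happiness data; the name `km_demo` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumption: two synthetic 2-D blobs stand in for real data
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(5, 0.5, (50, 2))])

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Sum of squared distances from every sample to the center of its own cluster
wcss = sum(
    ((points[km_demo.labels_ == k] - center) ** 2).sum()
    for k, center in enumerate(km_demo.cluster_centers_)
)

print("manual within-cluster sum-of-squares:", wcss)
print("KMeans.inertia_:                     ", km_demo.inertia_)
```

The two printed values should agree up to floating-point error, since `inertia_` is this same criterion.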

**Clustering Analysis**

In [None]:
#Call libraries
import time                   # To time processes
import warnings               # To suppress warnings
import numpy as np            # Data manipulation
import pandas as pd           # Dataframe manipulation
import matplotlib.pyplot as plt  # For graphics
import os                     # For os related operations
import sys                    # For data size

from sklearn import cluster, mixture              # For clustering
from sklearn.preprocessing import StandardScaler  # For scaling dataset
#%matplotlib inline            # To display plots inline
warnings.filterwarnings('ignore','UsageError')

Read and normalize data

In [None]:
os.chdir("../input")
df= pd.read_csv("2017.csv")

# Take a 10% sample for analysis
X = df.sample(frac=0.1)

# Explore and scale dataset
X.columns.values
X.shape                 # The full dataset is 155 X 12; the sample keeps about 10% of the rows
X = X.iloc[:, 2: ]      # Ignore Country and Happiness_Rank columns
X.head(2)
X.dtypes

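# Normalization of dataset for easier parameter selection
ss = StandardScaler() #Instantiate scaler object
# fit_transform returns a new array, so assign the result back to X
X = pd.DataFrame(ss.fit_transform(X), columns=X.columns, index=X.index)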
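
`StandardScaler.fit_transform` returns a new NumPy array and leaves its input untouched, so the result has to be kept. A small sketch of the round trip, assuming a made-up two-column frame (`toy`, with invented values) in place of the happiness data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; the values are invented for illustration only
toy = pd.DataFrame({"score": [7.5, 5.0, 3.0, 6.1],
                    "gdp":   [1.6, 1.0, 0.3, 1.2]})

scaled = StandardScaler().fit_transform(toy)   # NumPy array, toy is unchanged
toy_scaled = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)

print(toy_scaled.mean())        # column means, close to 0
print(toy_scaled.std(ddof=0))   # population standard deviations, close to 1
```

Wrapping the array back into a DataFrame keeps the column names and allows `.iloc` indexing later on.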

Create and set Parameters used in different clustering

In [None]:
n_clusters = 2   # for K-Means and Mini Batch K-Means (also reused below for spectral clustering, Birch and GMM): number of clusters
bandwidth = 0.1  # for Mean Shift: size of the region searched around each candidate centroid
eps = 0.3        # for DBSCAN: maximum distance between two samples for one to count as a neighbor of the other
damping = 0.9; preference = -200  # for Affinity Propagation: preference controls how many exemplars are used
# damping factor: damps the responsibility and availability messages to avoid numerical oscillations when updating these messages
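
A hand-picked bandwidth can be hard to justify, and scikit-learn offers `estimate_bandwidth` as a data-driven starting point. The sketch below assumes synthetic blobs (`demo`) instead of the happiness data, and `quantile=0.3` is an arbitrary illustrative choice, not a value taken from this notebook.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Assumption: two well-separated synthetic blobs stand in for the scaled data
rng = np.random.RandomState(0)
demo = np.vstack([rng.normal(-3, 0.5, (40, 2)),
                  rng.normal(3, 0.5, (40, 2))])

# Estimate a bandwidth from nearest-neighbor distances in the data
bw = estimate_bandwidth(demo, quantile=0.3)
ms_demo = MeanShift(bandwidth=bw).fit(demo)

print("estimated bandwidth:", bw)
print("clusters found:", len(np.unique(ms_demo.labels_)))
```

Comparing the estimate with a fixed value such as 0.1 gives a rough sense of whether the fixed value is likely to fragment the data into many tiny clusters.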


Create cluster objects

In [None]:
# Fit each algorithm on X; fit_predict also returns the cluster labels
km = cluster.KMeans(n_clusters=n_clusters)
km_result = km.fit_predict(X)
ms = cluster.MeanShift(bandwidth=bandwidth)
ms_result = ms.fit_predict(X)
two_means = cluster.MiniBatchKMeans(n_clusters=n_clusters)
two_means_result = two_means.fit_predict(X)
spectral = cluster.SpectralClustering(n_clusters=n_clusters)
sp_result = spectral.fit_predict(X)
dbscan = cluster.DBSCAN(eps=eps)
db_result = dbscan.fit_predict(X)
affinity_propagation = cluster.AffinityPropagation(damping=damping, preference=preference)
affinity_propagation.fit(X)
birch = cluster.Birch(n_clusters=n_clusters)
birch_result = birch.fit_predict(X)
gmm = mixture.GaussianMixture(n_components=n_clusters, covariance_type='full')
gmm.fit(X)
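
Several of these algorithms need the number of clusters up front. One common way to sanity-check that choice is the silhouette score. The sketch below is an illustration only: it assumes three synthetic blobs with hand-chosen centers (`blobs`) rather than the happiness data, and scans a few candidate values of `k`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: three synthetic, well-separated blobs with hand-chosen centers
blobs, _ = make_blobs(n_samples=150,
                      centers=[[0, 0], [6, 6], [-6, 6]],
                      cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(blobs)
    scores[k] = silhouette_score(blobs, labels)   # higher means better separated

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

On real data the scores are usually much closer together, so the result is better read as a hint than as a decision.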

Collect the clustering algorithms in a tuple

In [None]:
clustering_algorithms = (
        ('KMeans', km),
        ('MeanShift', ms),
        ('MiniBatchKMeans', two_means),
        ('SpectralClustering', spectral),
        ('DBSCAN', dbscan),
        ('AffinityPropagation', affinity_propagation),
        ('Birch', birch),
        ('GaussianMixture', gmm)
    )


Plot the cluster assignments of each algorithm in a for loop

In [None]:
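plt.figure(figsize=(10, 16))
plot_num = 1 #for iteration
for name, algorithm in clustering_algorithms:
    # Fitted clusterers expose labels_; GaussianMixture only offers predict()
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_
    else:
        y_pred = algorithm.predict(X)
    plt.subplot(4, 2, plot_num)
    plt.scatter(X.iloc[:, 4], X.iloc[:, 5], c=y_pred)
    plt.title(name, size=12)
    plot_num += 1
plt.show()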
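
The `time` module is imported at the top "to time processes", but the loop does not use it. A sketch of how each fit could be timed, assuming synthetic blobs (`demo`) and a shortened, hypothetical list of algorithms (`demo_algorithms`) with illustrative parameters:

```python
import time
import numpy as np
from sklearn import cluster, mixture

# Assumption: synthetic blobs stand in for the happiness data
rng = np.random.RandomState(0)
demo = np.vstack([rng.normal(-3, 0.5, (100, 2)),
                  rng.normal(3, 0.5, (100, 2))])

# Hypothetical, shortened version of the clustering_algorithms tuple
demo_algorithms = (
    ('KMeans', cluster.KMeans(n_clusters=2, n_init=10, random_state=0)),
    ('DBSCAN', cluster.DBSCAN(eps=0.5)),
    ('GaussianMixture', mixture.GaussianMixture(n_components=2, random_state=0)),
)

timings = {}
for name, algorithm in demo_algorithms:
    start = time.perf_counter()
    algorithm.fit(demo)                       # time only the fitting step
    timings[name] = time.perf_counter() - start

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

On a dataset this small the differences are negligible; the pattern becomes useful when comparing K-Means against Mini Batch K-Means on larger inputs.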

**Plot the world map with K-Means cluster**

In [None]:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True) 

# Read data
whdata=pd.read_csv("2017.csv")
whdata = whdata.iloc[:, 2: ] 

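# Instantiate scaler object
ss = StandardScaler()
# Use it now to 'fit' & 'transform'; keep the result, since fit_transform does not modify its input
whdata = pd.DataFrame(ss.fit_transform(whdata), columns=whdata.columns)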

n_clusters = 2
km = cluster.KMeans(n_clusters =n_clusters )
km_result = km.fit_predict(whdata)

# Make a copy of the dataset for the map and add the country names
whdata_map = whdata.copy()
whdata_map.insert(0, 'Country', df.iloc[:, 0])
whdata_map.head(2)
out = km_result

plt.subplot(4, 2, 1)
plt.scatter(whdata.iloc[:, 4], whdata.iloc[:, 5], c=km_result)

whdata_map['clusters'] = out
data = dict(type = 'choropleth', 
           locations = whdata_map['Country'],
           locationmode = 'country names',
           z =  whdata_map['clusters'],
           text = whdata_map['Country'],
           colorbar = {'title':'Cluster'})
layout = dict(title = 'World Happiness Using K Means Clustering Method', 
             geo = dict(showframe = False, 
                       projection = {'type': 'mercator'}))
choromap3 = go.Figure(data = [data], layout=layout)
iplot(choromap3)
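
Once every country carries a cluster label, a per-cluster summary can make the groups on the map easier to interpret. A minimal sketch, assuming a hypothetical miniature frame (`toy_map`) with invented values and a made-up `score` column in place of `whdata_map`:

```python
import pandas as pd

# Hypothetical miniature of whdata_map; all values are invented for illustration
toy_map = pd.DataFrame({
    "Country":  ["A", "B", "C", "D"],
    "score":    [7.0, 6.5, 3.5, 4.0],
    "clusters": [0, 0, 1, 1],
})

# Number of countries and average score in each cluster
profile = toy_map.groupby("clusters")["score"].agg(["count", "mean"])
print(profile)
```

The same `groupby` call could be applied to `whdata_map` with any of its numeric columns.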