# WEP24-MLB: Clustering

In this lecture, we look at several clustering techniques and at how to evaluate and compare them.

## Clustering Techniques

### K-Means

The scikit-learn library has an implementation of the k-means algorithm. Let's apply it to a set of randomly generated blobs. The generated dataset is labeled, but we will ignore the labels because clustering is an unsupervised technique.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
'''
We generate a sample dataset using the sklearn library
'''
from sklearn.datasets import make_blobs
X,y = make_blobs(centers=4, n_samples=200, cluster_std=0.7)

In [None]:
type(X)

In [None]:
'''
TODO: display the values of the dataset
'''
print(X[:10])

Now we plot these points, but without coloring the points using the labels:

In [None]:
'''
TODO: use scatter plot to plot the data in X (y represents the labels only)
'''
plt.scatter(X[:,0], X[:,1])

We can see four clusters in the data set. Let's see if the k-means algorithm can recognize these clusters. First we create the instance of the k-means model and pass the number of clusters (4) as a parameter.

More information about using KMeans can be found in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

In [None]:
'''
TODO: import KMeans from sklearn.cluster and create a model with 4 clusters
'''
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
'''
TODO: train the model to find the clusters
'''
model = model.fit(X)


In [None]:
'''
Now, we can print the centers of the clusters
'''
print(model.cluster_centers_)

In [None]:
'''
We can also plot the data (using a scatter plot), color-coded by cluster
'''
plt.scatter(X[:,0], X[:,1], c = model.labels_);
plt.scatter(model.cluster_centers_[:,0], model.cluster_centers_[:,1], 
            s=100, color="red"); # Show the centres

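The clustering looks good: the clusters found by k-means should match the four blobs visible in the scatter plot, with a red center in the middle of each.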
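
To make concrete what `fit` does, here is a minimal sketch of the classic k-means iteration (Lloyd's algorithm) in plain NumPy. It is an illustration only, not scikit-learn's actual implementation, which by default uses the smarter k-means++ initialisation. The helper names `nearest_center` and `lloyd_kmeans`, their parameters, and the fixed seeds are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs


def nearest_center(X, centers):
    """Return, for every point, the index of its closest center (hypothetical helper)."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations (hypothetical helper, not the scikit-learn code)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        labels = nearest_center(X, centers)
        # Update step: each center moves to the mean of its points
        # (a center that lost all its points stays where it is)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the centers stopped moving
        centers = new_centers
    return nearest_center(X, centers), centers


X_demo, _ = make_blobs(centers=4, n_samples=200, cluster_std=0.7, random_state=0)
labels_demo, centers_demo = lloyd_kmeans(X_demo, k=4)
print(centers_demo)
```

Because the result depends on the random initial centers, a single run of this sketch can end in a poor local optimum; rerunning with a different `seed` and keeping the best run is the usual remedy.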

#### Another Example

The k-means algorithm can have difficulties when the clusters are not convex in shape:

In [None]:
from sklearn.datasets import make_moons
X,y = make_moons(200, noise=0.05)

In [None]:
plt.scatter(X[:,0], X[:,1]);

In [None]:
'''
TODO: use k-means to cluster the data into two clusters
'''
mdl = KMeans(n_clusters = 2)
mdl.fit(X)
plt.scatter(X[:,0], X[:,1], c = mdl.labels_);
plt.scatter(mdl.cluster_centers_[:,0], mdl.cluster_centers_[:,1], s=100, color="red"); # Show the centres

The clustering does not work well here. K-means assigns each point to its nearest center, so with two centers the boundary between the clusters is a straight line, and the two moons cannot be separated by a line. We could embed this dataset into a higher-dimensional space where the separation is possible and then apply k-means clustering there.

Alternatively, we can use a different type of clustering algorithm for this case. The *DBSCAN algorithm* is density-based: it groups points that lie in dense regions, and it works well when the clusters have roughly uniform density.
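
Before moving on, the k-means mismatch on the moons can also be expressed as a number rather than judged from the plot. The sketch below compares the k-means clusters with the true moon labels using the adjusted Rand index, where 1.0 means the two partitions are identical. The fixed `random_state` values and the variable names are assumptions added for reproducibility; they are not part of the cells above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Same kind of data as above, but with a fixed seed so the result is repeatable
X_m, y_m = make_moons(200, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_m)

# Compare the clusters found with the true moon membership
ari_moons = adjusted_rand_score(y_m, km.labels_)
print('Adjusted Rand index on the moons: %.2f' % ari_moons)
```

A score well below 1.0 here confirms what the plot shows: each k-means cluster mixes points from both moons.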

### DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
'''
TODO: generate another instance of the blobs dataset and plot it as a scatter plot
'''
X,y = make_blobs(centers=4, n_samples=200, cluster_std=0.7)
plt.scatter(X[:,0], X[:,1])

In [None]:
'''
TODO: use DBSCAN to cluster the blobs dataset 
    (play with the algorithm parameters and check the quality of the clusters)
'''
mdl = DBSCAN(eps = 0.8, min_samples = 8)
mdl.fit(X)
plt.scatter(X[:,0], X[:,1], c = mdl.labels_);

In [None]:
'''
TODO: generate another instance of the two moons dataset and plot it as a scatter plot
'''
X,y = make_moons(200, noise=0.05)
plt.scatter(X[:,0], X[:,1])

In [None]:
'''
TODO: use DBSCAN to cluster the two moons dataset 
    (play with the algorithm parameters and check the quality of the clusters)
'''
mdl = DBSCAN(eps = 0.3, min_samples = 8)
mdl.fit(X)
plt.scatter(X[:,0], X[:,1], c = mdl.labels_);
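
Unlike k-means, DBSCAN is not told how many clusters to find, and it labels points that belong to no dense region as noise with the value `-1`. When playing with `eps` and `min_samples`, it helps to print both counts instead of relying on the plot alone. The following is a small sketch of that check; the fixed `random_state` and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_d, _ = make_moons(200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=8).fit(X_d)
labels_d = db.labels_

# Noise points carry the label -1 and do not count as a cluster
n_noise = int(np.sum(labels_d == -1))
n_clusters_found = len(set(labels_d.tolist())) - (1 if n_noise > 0 else 0)

print('clusters found:', n_clusters_found)
print('noise points:', n_noise)
```

A smaller `eps` or a larger `min_samples` makes the density requirement stricter, which tends to split clusters and increase the number of noise points.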

### Hierarchical Clustering


In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
'''
TODO: generate another instance of the blobs dataset and plot it as a scatter plot
'''
X,y = make_blobs(centers=4, n_samples=200, cluster_std=0.7)
plt.scatter(X[:,0], X[:,1])

In [None]:
'''
TODO: use Hierarchical clustering to cluster the blobs dataset 
'''
hier_mdl = AgglomerativeClustering(n_clusters = 4)
hier_mdl.fit(X)
plt.scatter(X[:,0], X[:,1], c=hier_mdl.labels_);

In [None]:
'''
TODO: use Hierarchical clustering to cluster the two moons dataset 
'''
X,y = make_moons(200, noise=0.05)
hier_mdl = AgglomerativeClustering(n_clusters = 2)
hier_mdl.fit(X)
plt.scatter(X[:,0], X[:,1], c=hier_mdl.labels_);

### Spectral Clustering

In [None]:
from sklearn.cluster import SpectralClustering

In [None]:
'''
TODO: use Spectral clustering to cluster the blobs dataset 
'''
X,y = make_blobs(centers=4, n_samples=200, cluster_std=0.7)

mdl = SpectralClustering(n_clusters=4,
                         affinity="nearest_neighbors")
mdl.fit(X)
plt.scatter(X[:,0], X[:,1], c = mdl.labels_);

In [None]:
'''
TODO: use Spectral clustering to cluster the two moons dataset 
'''
X,y = make_moons(200, noise=0.05)
mdl = SpectralClustering(n_clusters=2,
                         affinity="nearest_neighbors")
mdl.fit(X)
plt.scatter(X[:,0], X[:,1], c = mdl.labels_);

## Clustering Evaluation

In [None]:
'''
TODO: use a clustering technique to cluster the blobs dataset 
'''
X,y = make_blobs(centers=4, n_samples=200, cluster_std=0.7)
clf = KMeans(4)

In [None]:
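from sklearn import metrics

clf.fit(X)  # run the k-means clustering

print('Final evaluation of the clustering:')

# Inertia: sum of squared distances of the samples to their closest cluster center
print('Inertia: %.2f' % clf.inertia_)

# The next four scores compare the clustering with the true labels y
# Adjusted Rand index: agreement corrected for chance; 1.0 means identical partitions
print('Adjusted_rand_score %.2f' % metrics.adjusted_rand_score(y, clf.labels_))

# Homogeneity: 1.0 when each cluster contains members of a single class only
print('Homogeneity %.2f' % metrics.homogeneity_score(y, clf.labels_))

# Completeness: 1.0 when all members of a class are assigned to the same cluster
print('Completeness %.2f' % metrics.completeness_score(y, clf.labels_))

# V-measure: harmonic mean of homogeneity and completeness
print('V_measure %.2f' % metrics.v_measure_score(y, clf.labels_))

# Silhouette: needs no true labels; ranges from -1 to 1, higher means better-separated clusters
print('Silhouette %.2f' % metrics.silhouette_score(X, clf.labels_, metric='euclidean'))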
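
The same metrics can be used to compare the techniques from this lecture with each other. The sketch below runs all four algorithms on one two moons dataset and reports the adjusted Rand index for each. It is only one possible way to set up such a comparison: the fixed `random_state` values, the dictionary layout, and the reuse of the parameter values from the earlier cells are choices made for this example, and the true labels are available only because the data is synthetic.

```python
from sklearn import metrics
from sklearn.cluster import (AgglomerativeClustering, DBSCAN, KMeans,
                             SpectralClustering)
from sklearn.datasets import make_moons

X_c, y_c = make_moons(200, noise=0.05, random_state=0)

# One instance of each technique, with the parameters used earlier
models = {
    'KMeans': KMeans(n_clusters=2, n_init=10, random_state=0),
    'DBSCAN': DBSCAN(eps=0.3, min_samples=8),
    'Agglomerative': AgglomerativeClustering(n_clusters=2),
    'Spectral': SpectralClustering(n_clusters=2,
                                   affinity='nearest_neighbors',
                                   random_state=0),
}

ari_scores = {}
for name, mdl in models.items():
    labels = mdl.fit_predict(X_c)  # fit the model and return the cluster labels
    ari_scores[name] = metrics.adjusted_rand_score(y_c, labels)
    print('%-14s ARI = %.2f' % (name, ari_scores[name]))
```

On real data without ground-truth labels, an internal score such as the silhouette can take the place of the adjusted Rand index, keeping in mind that the silhouette favors compact, convex clusters.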