<a href="https://colab.research.google.com/github/univ-3360-vu-smartcities/clustering-demo-2/blob/master/Clustering_Demo_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering Algorithms

In this example, we will look at how to implement hierarchical and density-based clustering using scikit-learn.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

In [0]:
def plot_cluster(data, labels, plot_title):
    """Scatter-plot 2-D data, coloring each point by its cluster label."""
    plt.figure(figsize=(10, 5))
    colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']

    for k in set(labels):
        # Boolean mask selecting the points assigned to cluster k
        class_member_mask = (labels == k)
        xy = data[class_member_mask]
        # Colors cycle every 7 labels; DBSCAN's noise label (-1) maps to black ('k')
        plt.scatter(xy[:, 0], xy[:, 1], color=colors[k % 7])

    plt.title(plot_title)
    plt.show()

# Data Import

In this lesson, we will be working with the familiar moon dataset. Recall from the last demonstration that K-Means was unable to cluster the moon data properly. Today, we will see what happens when we use hierarchical and density-based clustering.

In [0]:
X_moon_data, _ = datasets.make_moons(n_samples=1500, noise=0.05, random_state=1)

plt.scatter(X_moon_data[:,0], X_moon_data[:,1], s=6)
plt.title("Moon Dataset")
plt.show()
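As a quick reminder of the K-Means result mentioned above, the sketch below reruns K-Means on the same data. It keeps the ground-truth labels that `make_moons` returns (the cell above discards them) and scores the result with the adjusted Rand index. Both choices, along with the `n_init` and `random_state` values, are assumptions made for illustration and are not part of the original demonstration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Same data as above, but keep the true moon membership for scoring (assumption)
X_moon_data, y_true = datasets.make_moons(n_samples=1500, noise=0.05, random_state=1)

# n_init and random_state are illustrative choices, fixed for reproducibility
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_moon_data)

# 1.0 means perfect agreement with the true moons; values near 0 mean chance-level agreement
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
print(f"K-Means adjusted Rand index: {kmeans_ari:.2f}")
```

A score well below 1.0 here is the numerical counterpart of the split moons seen in the earlier plot.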

# Agglomerative Hierarchical Clustering

First we will look at hierarchical clustering, specifically agglomerative hierarchical clustering, which scikit-learn provides as `AgglomerativeClustering`.

In [0]:
from sklearn.cluster import AgglomerativeClustering

Hierarchical clustering is flexible in that it can either produce a known number of clusters or, given a parameter called the distance threshold, determine the number of clusters automatically. In this case, we know that there are two clusters, so we can pass that number to the algorithm.
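For reference, here is a minimal sketch of the distance-threshold mode. In scikit-learn, `n_clusters` has to be set to `None` when `distance_threshold` is given. The threshold value below is an arbitrary illustrative guess rather than a tuned setting, and the linkage is left at the library default.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

X_moon_data, _ = datasets.make_moons(n_samples=1500, noise=0.05, random_state=1)

# n_clusters must be None when distance_threshold is set.
# The value 5.0 is an illustrative guess, not a recommended setting.
hier_auto = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
hier_auto.fit(X_moon_data)

# n_clusters_ holds the number of clusters the algorithm ended up with
print("Clusters found:", hier_auto.n_clusters_)
```

Raising the threshold allows more merges and so yields fewer clusters; lowering it yields more.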

Recall also that hierarchical clustering requires a parameter that determines which type of linkage to use. For this example, let's first try average linkage.

In [0]:
hier = AgglomerativeClustering(n_clusters=2, linkage='average')
hier.fit(X_moon_data)

Now we can use the function defined earlier to plot the results of the clustering, using colors to represent the different clusters.

In [0]:
plot_cluster(X_moon_data, hier.labels_, "Hierarchical Clustering with Average Linkage")

We can see that the algorithm did not do exactly what we wanted, but it did produce a slightly more reasonable result than K-Means. Let's now vary the linkage parameter to see what effect it has on the resulting clustering.

In [0]:
hier = AgglomerativeClustering(n_clusters=2, linkage='single')
hier.fit(X_moon_data)
plot_cluster(X_moon_data, hier.labels_, "Hierarchical Clustering with Single Linkage")

And we can see that this worked exactly as expected! This is a good demonstration of why careful hyperparameter selection is so important in data science. For hierarchical clustering, different types of linkage will often give very different results, so it is best to try several types and compare their performance.
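One way to compare several linkage types at once is sketched below. It loops over the four linkage options that `AgglomerativeClustering` accepts and scores each result against the true moon membership with the adjusted Rand index. Keeping the true labels from `make_moons` and using this particular score are assumptions made for illustration; with real, unlabeled data you would compare the plots or use an internal measure instead.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Keep the true labels this time so the clusterings can be scored (assumption)
X_moon_data, y_true = datasets.make_moons(n_samples=1500, noise=0.05, random_state=1)

linkage_scores = {}
for linkage in ['ward', 'complete', 'average', 'single']:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X_moon_data)
    # 1.0 means the clustering matches the two moons exactly
    linkage_scores[linkage] = adjusted_rand_score(y_true, labels)

for linkage, score in linkage_scores.items():
    print(f"{linkage:>8}: ARI = {score:.2f}")
```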

# Density-Based Clustering

Now we will look at density-based clustering. For this example, we will specifically be using DBSCAN, which can be easily implemented with scikit-learn.

In [0]:
from sklearn.cluster import DBSCAN

As a first test, let's try running DBSCAN with the default parameters.

In [0]:
db = DBSCAN()
db.fit(X_moon_data)
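To see what the default parameters are, the estimator can be asked for its settings. The two that matter most for DBSCAN are `eps`, the neighborhood radius, and `min_samples`, the number of points (including the point itself) that must lie within that radius for a point to count as a core point. The variable name `default_params` is only used for this illustration.

```python
from sklearn.cluster import DBSCAN

# get_params() returns the constructor arguments of the estimator as a dict
default_params = DBSCAN().get_params()
print("eps:", default_params["eps"])
print("min_samples:", default_params["min_samples"])
```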

And once again, we can plot the results to evaluate the performance.

In [0]:
plot_cluster(X_moon_data, db.labels_, "DBSCAN Clustering with eps=0.5")

We can see that right out of the box, DBSCAN is not working very well. But it is important to remember that hyperparameter selection can make a very significant difference in performance, so let's try modifying the epsilon parameter.

In [0]:
db = DBSCAN(eps=0.25)
db.fit(X_moon_data)
plot_cluster(X_moon_data, db.labels_, "DBSCAN Clustering with eps=0.25")

And we can see again that with proper hyperparameter selection, DBSCAN is very effective at separating the clusters. In general, you may have to run an algorithm several times with different hyperparameters before you find a good result.
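Running the algorithm several times can be automated with a small loop. The sketch below tries a few `eps` values and reports how many clusters and how many noise points DBSCAN finds for each. The list of candidate values is an arbitrary choice for illustration, and `min_samples` is left at its default.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

X_moon_data, _ = datasets.make_moons(n_samples=1500, noise=0.05, random_state=1)

eps_results = {}
for eps in [0.05, 0.1, 0.25, 0.5]:
    labels = DBSCAN(eps=eps).fit_predict(X_moon_data)
    # DBSCAN marks noise points with the label -1, so exclude it from the cluster count
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    eps_results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

Since we expect two moons here, an `eps` value that yields two clusters and few noise points is a reasonable candidate to inspect with `plot_cluster`.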