In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib nbagg

<h1>Clustering Techniques</h1>

<p>In this notebook we compare different clustering techniques, once again using the iris data set because it is easy to visualize.</p>

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
data = iris.data

In [4]:
plt.scatter(data[:,1],data[:,2])

[Figure: scatter plot of data[:,1] versus data[:,2], all points in the same color]

<p>In the figure above I haven't colored the points by iris type: we assume we don't know the labels and try different clustering techniques.</p>
<p>Note, however, that the scatter plot shows only two clearly separated groups of points, so it is unlikely that any technique will recover three types of flowers from these two features alone.</p>

<h3>KMeans</h3>

In [5]:
from sklearn.cluster import KMeans
kms = KMeans(n_clusters=2)
#kms.fit(data[:,1:3:1])
#y_pred = kms.predict(data[:,1:3:1])
### The previous two lines can be replaced by a single command :
y_pred = kms.fit_predict(data[:,1:3:1])

In [6]:
plt.figure()
plt.scatter(data[:,1],data[:,2],c=y_pred,cmap='brg')

[Figure: scatter plot of data[:,1] versus data[:,2], colored by KMeans cluster label]

In [7]:
# Once the clusters have been found we can ask for the classification of any new points :
kms.predict([[3.0,2.0],[4.0,5.0]])

array([0, 1], dtype=int32)
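<p>To make the prediction step above concrete: KMeans assigns a new point to the cluster whose center is closest. The sketch below checks this by hand. The names kms_demo, new_points and nearest are illustrative, and n_init and random_state are set only for reproducibility (an assumption, not part of the cells above).</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data[:, 1:3]  # same two columns as in the cells above

# n_init and random_state are fixed here only to make the run reproducible (assumption)
kms_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[3.0, 2.0], [4.0, 5.0]])

# Euclidean distance from every new point to every cluster center, shape (2, 2)
diff = new_points[:, None, :] - kms_demo.cluster_centers_[None, :, :]
dists = np.linalg.norm(diff, axis=2)

# Index of the closest center for each new point
nearest = dists.argmin(axis=1)

print("cluster centers:\n", kms_demo.cluster_centers_)
print("nearest center:", nearest)
print("predict():     ", kms_demo.predict(new_points))
```

<p>The two printed label arrays should agree. The label numbers themselves (0 or 1) are arbitrary and can change from one fit to another.</p>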

<h3>MeanShift</h3>

<p>Because MeanShift is based on the density distribution of the points, the bandwidth parameter strongly influences the number of clusters the method finds.</p>
<p>If we assume Gaussian density distributions, the algorithm is trying to find how many Gaussians are needed to reproduce the density we observe. If the bandwidth is too large, a single Gaussian is enough and we find a single cluster; if the bandwidth is too small, we need as many Gaussians as we have points (number of clusters = number of points).</p>
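<p>The effect of the bandwidth can be checked with a small sweep. This is only a sketch: the bandwidth values below are illustrative choices, not recommendations, and n_found is a hypothetical helper list.</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import MeanShift

X = load_iris().data[:, 1:3]  # same two columns as in the cells above

bandwidths = [0.2, 0.6, 1.5, 10.0]  # illustrative values (assumption)
n_found = []
for bw in bandwidths:
    labels = MeanShift(bandwidth=bw).fit_predict(X)
    # Number of distinct labels = number of clusters found
    n_found.append(np.unique(labels).size)
    print(f"bandwidth={bw:5.1f} -> {n_found[-1]} cluster(s)")
```

<p>The largest bandwidth exceeds the extent of the data, so every point ends up in one cluster, while the smallest one splits the data into many small clusters.</p>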

In [8]:
from sklearn.cluster import MeanShift
msh = MeanShift(bandwidth=0.6)
y_pred = msh.fit_predict(data[:,1:3:1])

In [9]:
plt.figure()
plt.scatter(data[:,1],data[:,2],c=y_pred,cmap='brg')

[Figure: scatter plot of data[:,1] versus data[:,2], colored by MeanShift cluster label]

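<p>With a bandwidth of 0.6 we find three clusters here: in the Gaussian picture described above, the algorithm needs three Gaussians to represent our density distribution. Note that scikit-learn's MeanShift actually uses a flat kernel and has no kernel parameter, so the Gaussian description should be read as an analogy.</p>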
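<p>Instead of choosing the bandwidth by hand, scikit-learn provides estimate_bandwidth, which derives a value from the pairwise distances in the data. A minimal sketch, assuming the default quantile is a reasonable starting point for this data set:</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import MeanShift, estimate_bandwidth

X = load_iris().data[:, 1:3]  # same two columns as in the cells above

# quantile=0.3 is the library default, written out here for clarity
bw = estimate_bandwidth(X, quantile=0.3)

labels = MeanShift(bandwidth=bw).fit_predict(X)
print("estimated bandwidth:", bw)
print("number of clusters: ", np.unique(labels).size)
```

<p>The estimate is only a starting point. A smaller quantile gives a smaller bandwidth and therefore usually more clusters.</p>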

<h3>Affinity Propagation</h3>

In [10]:
from sklearn.cluster import AffinityPropagation

afp = AffinityPropagation(damping=0.95)
y_pred = afp.fit_predict(data[:,1:3:1])

In [11]:
plt.figure()
plt.scatter(data[:,1],data[:,2],c=y_pred,cmap='brg')

[Figure: scatter plot of data[:,1] versus data[:,2], colored by AffinityPropagation cluster label]

In [12]:
### One way to find the number of clusters is to use the following combination of functions :
np.unique(afp.labels_).size

4
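<p>Another way to look at the result: Affinity Propagation picks its cluster centers ("exemplars") among the data points themselves, and the fitted estimator stores their indices in cluster_centers_indices_. The sketch below prints them. The name afp_demo and the larger max_iter are assumptions made here to give the strongly damped iteration more time to converge.</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import AffinityPropagation

X = load_iris().data[:, 1:3]  # same two columns as in the cells above

# max_iter is raised from its default as a precaution (assumption)
afp_demo = AffinityPropagation(damping=0.95, max_iter=1000).fit(X)

# Indices (rows of X) of the points chosen as exemplars
exemplar_idx = afp_demo.cluster_centers_indices_

print("number of exemplars:", len(exemplar_idx))
print("exemplar indices:   ", exemplar_idx)
print("exemplar coordinates:\n", afp_demo.cluster_centers_)
```

<p>If the algorithm converged, the number of exemplars equals the number of distinct labels computed in the cell above.</p>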

<h3>Agglomerative Clustering</h3>

<p>In addition to the expected number of clusters, this algorithm takes the affinity and the linkage as parameters.<br>
The <b>affinity</b> describes how we measure distances between points, while the <b>linkage</b> describes how we measure distances between clusters of points.</p>
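<p>To see how much the linkage matters, the sketch below runs the same two-cluster problem with several linkage criteria and prints the resulting cluster sizes. The point-to-point distance is left at its default (Euclidean) so that the code does not depend on the name of that argument. The dictionary sizes is a hypothetical helper.</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

X = load_iris().data[:, 1:3]  # same two columns as in the cells above

sizes = {}
for linkage in ["ward", "complete", "average", "single"]:
    # Distance between points: library default (Euclidean)
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    # Number of points in cluster 0 and in cluster 1
    sizes[linkage] = np.bincount(labels)
    print(f"{linkage:>8}: cluster sizes = {sizes[linkage]}")
```

<p>Different linkages can cut the same data in different places, so comparing the cluster sizes is a quick sanity check before plotting.</p>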

In [13]:
from sklearn.cluster import AgglomerativeClustering
# Note: in scikit-learn 1.2 and later the 'affinity' argument is named 'metric'
agl = AgglomerativeClustering(n_clusters=2,affinity='euclidean',linkage='average')
y_pred = agl.fit_predict(data[:,1:3:1])

In [14]:
plt.figure()
plt.scatter(data[:,1],data[:,2],c=y_pred,cmap='brg')

[Figure: scatter plot of data[:,1] versus data[:,2], colored by AgglomerativeClustering cluster label]

<h3>DBSCAN</h3>

<p>One of the big advantages of DBSCAN is its ability to deal with outliers, as we can see in the following example (blue points, labeled -1).</p>

In [15]:
from sklearn.cluster import DBSCAN
dbs = DBSCAN(eps=0.5,min_samples=5)
y_pred = dbs.fit_predict(data[:,1:3:1])

In [16]:
plt.figure()
plt.scatter(data[:,1],data[:,2],c=y_pred,cmap='brg')

[Figure: scatter plot of data[:,1] versus data[:,2], colored by DBSCAN label; outliers (label -1) in blue]
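<p>The outliers can also be counted directly from the labels: DBSCAN marks noise points with the label -1, and the fitted estimator lists its core points in core_sample_indices_. A minimal sketch with the same eps and min_samples as above (the names dbs_demo, n_noise, n_clusters and n_core are illustrative):</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

X = load_iris().data[:, 1:3]  # same two columns as in the cells above

dbs_demo = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = dbs_demo.labels_

# Points labeled -1 are considered noise (outliers)
n_noise = int(np.sum(labels == -1))
# Clusters are the distinct labels other than -1
n_clusters = np.unique(labels[labels != -1]).size
# Core points have at least min_samples neighbors within eps
n_core = len(dbs_demo.core_sample_indices_)

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("core points:", n_core)
```

<p>Increasing eps or lowering min_samples generally reduces the number of points flagged as noise.</p>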

<h3>BONUS Example</h3>

<p>In the following I apply the clustering methods to all 4 dimensions and visualize the result in 2 dimensions.
To see the result for a given method, uncomment the corresponding line and compare with the true classification (right column of the plot).</p>

In [18]:
#clfr = KMeans(n_clusters=3)
clfr = MeanShift(bandwidth=0.85)
#clfr = AffinityPropagation(damping=0.95)
#clfr = AgglomerativeClustering(n_clusters=3,affinity='euclidean',linkage='average') 
#clfr = DBSCAN(eps=0.8,min_samples=4)

y_pred = clfr.fit_predict(data)

In [19]:
plt.figure(figsize=(11,20))

c=1
for i in [[0,1],[0,2],[0,3],[1,2],[1,3],[2,3]]:
    plt.subplot(6,2,c)
    plt.scatter(data[:,i[0]],data[:,i[1]],c=y_pred,cmap='brg')
    plt.subplot(6,2,c+1)
    plt.scatter(data[:,i[0]],data[:,i[1]],c=iris.target,cmap='Spectral_r')
    c+=2
    
plt.subplot(621)
plt.title('Classifier Prediction')
plt.subplot(622)
plt.title('True Classification')

[Figure: 6x2 grid of scatter plots, one row per pair of features; left column "Classifier Prediction", right column "True Classification"]
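<p>The visual comparison with the true classification can be complemented by a number. One possible choice (an addition here, not part of the cells above) is the adjusted Rand index from sklearn.metrics, which does not depend on how the clusters are numbered: it is 1.0 for a perfect match and close to 0 for a random labeling. The sketch uses KMeans with three clusters as an example. The names clfr_demo and y_pred_demo are illustrative, and n_init and random_state are fixed only for reproducibility.</p>

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

# Any of the clustering estimators above could be used here instead
clfr_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred_demo = clfr_demo.fit_predict(iris.data)  # all 4 features

# Agreement between predicted clusters and true species, invariant to label numbering
ari = adjusted_rand_score(iris.target, y_pred_demo)
print("adjusted Rand index:", ari)
```

<p>Running the same two lines for each estimator gives a quick ranking of the methods on this data set.</p>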