## Clustering with multiple algorithms
- centroid based
    - k-means clustering
- hierarchical
    - connectivity-based clustering, built on the idea that points are more closely related to nearby points than to points farther away. Examples: agglomerative and BIRCH clustering
- distribution based
    - objects in a cluster are the ones most likely to belong to the same distribution.
    - tends to be complex and prone to overfitting.
    - example: Gaussian mixture models
- density based
    - creates clusters from areas that have a higher density of data points
    - examples: DBSCAN and mean-shift clustering
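
The distribution-based family is the only one in the list above that is not run later in this notebook, so here is a minimal sketch of it. It assumes synthetic blob data from `make_blobs` (not the iris file) and uses `GaussianMixture` from `sklearn.mixture`; the centers, sample count and variable names are illustrative choices only.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most likely component for each point
soft_labels = gmm.predict_proba(X)  # probability of every component for each point

print(hard_labels[:5])
print(soft_labels[:2].round(3))
```

Unlike the other algorithms here, the mixture model gives a probability per cluster for every point, which is what "belongs most likely to the same distribution" means in practice.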
    

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.cluster import MeanShift
from sklearn.cluster import Birch
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import MiniBatchKMeans

In [2]:
iris_df = pd.read_csv("./datasets/iris.csv",skiprows=1,names=['sepal-length',
                                                             'sepal-width',
                                                             'petal-length',
                                                             'petal-width',
                                                             'class'])
iris_df.head()

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [3]:
from sklearn import preprocessing
label_encoding = preprocessing.LabelEncoder()
iris_df['class'] = label_encoding.fit_transform(iris_df['class'].astype(str))
iris_df.head()

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [4]:
iris_features = iris_df.drop('class',axis=1)
iris_labels = iris_df['class']

In [5]:
def build_model(clustering_fn,data,labels):
    # Fit the clustering function on the data, then score its cluster assignments.
    # Every score except silhouette compares the clusters against the true labels;
    # silhouette uses only the features and the assigned clusters.
    model = clustering_fn(data)
    print("Homegeneity score: ", metrics.homogeneity_score(labels,model.labels_))
    print("Completeness score: ", metrics.completeness_score(labels,model.labels_))
    print("v_measure score: ", metrics.v_measure_score(labels,model.labels_))
    print("adjusted rand score: ", metrics.adjusted_rand_score(labels,model.labels_))
    print("adjusted mutual info score: ", metrics.adjusted_mutual_info_score(labels,model.labels_))
    print("silhouette score: ", metrics.silhouette_score(data,model.labels_))
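
To make the first three scores easier to read, here is a small sketch on hand-made labels (the label lists are made up for illustration, not taken from iris). It shows a clustering that is perfectly homogeneous but not complete.

```python
from sklearn import metrics

true_classes = [0, 0, 1, 1]
# Every cluster holds a single class, but each class is split across two clusters.
over_split = [0, 1, 2, 3]

homogeneity = metrics.homogeneity_score(true_classes, over_split)
completeness = metrics.completeness_score(true_classes, over_split)
v_measure = metrics.v_measure_score(true_classes, over_split)

print(homogeneity, completeness, v_measure)
```

Homogeneity comes out at 1 because no cluster mixes classes, completeness at one half because each class is spread over two clusters, and the V-measure is the harmonic mean of the two. A high homogeneity with a low completeness is therefore a sign that the algorithm produced more clusters than there are classes.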

## KMeans Algorithm

In [6]:
def kmeans(data,n_clusters=3,max_iter=1000):
    model = KMeans(n_clusters=n_clusters,max_iter=max_iter).fit(data)
    return model

In [7]:
build_model(kmeans,iris_features,iris_labels)

Homegeneity score:  0.7514854021988338
Completeness score:  0.7649861514489815
v_measure score:  0.7581756800057784
adjusted rand score:  0.7302382722834697
adjusted mutual info score:  0.7483723933229484
silhouette score:  0.5525919445499757


## Agglomerative clustering

In [8]:
def agglomerative_fn(data,n_clusters=3):
    model = AgglomerativeClustering(n_clusters=n_clusters).fit(data)
    return model

In [12]:
build_model(agglomerative_fn,iris_features,iris_labels)

Homegeneity score:  0.7608008469718723
Completeness score:  0.7795958005591144
v_measure score:  0.7700836616487869
adjusted rand score:  0.7311985567707745
adjusted mutual info score:  0.7578034225092115
silhouette score:  0.5540972908150553


## DBSCAN Algorithm
Suitable for large datasets that need to be separated into a moderate number of clusters. It is a density-based method: it forms clusters from regions where points are densely packed and labels points in sparse regions as noise.  

Parameters:
- eps : maximum distance between two points for them to be considered neighbors. Points closer than this distance are neighbors
- min_samples : minimum number of points in a neighborhood for it to count as a dense region

A low min_samples is a reasonable starting point for a small dataset such as iris; larger or noisier datasets usually need a higher value.
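
A minimal sketch of how `eps` and `min_samples` interact, using seven hand-picked points rather than the iris data (the coordinates and parameter values are assumptions chosen so the result is easy to check by eye):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.0, 0.3], [0.3, 0.0],   # first dense group
    [5.0, 5.0], [5.0, 5.3], [5.3, 5.0],   # second dense group
    [10.0, 0.0],                          # isolated point
])

# Within each group every pairwise distance is below eps=0.5, so each point
# has 3 points (itself included) in its neighborhood and counts as dense.
model = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(model.labels_)
```

The two groups get labels 0 and 1, and the isolated point gets the label -1, which is how scikit-learn marks noise. Note that `build_model` passes these labels straight to the scoring functions, so any noise points are scored as if -1 were one more cluster.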

In [13]:
def dbscan_fn(data,eps=0.45,min_samples=4):
    model = DBSCAN(eps=eps,min_samples=min_samples).fit(data)
    return model

In [15]:
build_model(dbscan_fn,iris_features,iris_labels)

Homegeneity score:  0.5773205947971476
Completeness score:  0.6093983666695363
v_measure score:  0.5929259393972258
adjusted rand score:  0.5084974632998323
adjusted mutual info score:  0.5686010878114507
silhouette score:  0.3720825002964342


## Mean shift clustering
- start from each point in the set of points
- define a 'neighborhood' around every single point
- calculate a function (kernel) for each point based on all points in its neighborhood
- flat kernel : sum of all points in a neighborhood, each counted equally
- Gaussian (RBF) kernel : probability-weighted sum of points
    - defined by a mean and a standard deviation
    - mean = center point
    - standard deviation = bandwidth
- after applying the kernel to every point, imagine all points color coded in order of their magnitude
- now shift all the points towards the points with higher magnitude
- the algorithm stops when the points stop moving
- parameters : 
    - bandwidth : a lower value means the kernel is tall and skinny, giving a smaller neighborhood
    - bandwidth : a higher value means the kernel is flat, giving a bigger neighborhood
- there is no need to specify the number of clusters upfront
- the algorithm can handle complex non-linear data
- the choice of bandwidth strongly affects the result, so tuning it is crucial
- it is computationally more intensive than k-means, because every shift step looks at the neighbors of every point
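
The shifting procedure described above can be sketched in a few lines of NumPy. This is a simplified flat-kernel version written for illustration, not scikit-learn's `MeanShift` implementation; the function name `mean_shift_flat`, the synthetic blobs and the bandwidth value are all assumptions.

```python
import numpy as np

def mean_shift_flat(points, bandwidth, max_iter=100, tol=1e-4):
    """Move every point to the mean of the original points within `bandwidth`."""
    shifted = points.astype(float).copy()
    for _ in range(max_iter):
        previous = shifted.copy()
        for i, p in enumerate(previous):
            # Flat kernel: every original point inside the neighborhood counts equally.
            in_range = np.linalg.norm(points - p, axis=1) <= bandwidth
            shifted[i] = points[in_range].mean(axis=0)
        # Stop once no point moved by more than `tol`.
        if np.abs(shifted - previous).max() < tol:
            break
    return shifted

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.2, size=(30, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.2, size=(30, 2)),
])

modes = mean_shift_flat(points, bandwidth=2.0)
print(np.unique(modes.round(2), axis=0))
```

Points that end up at the same location belong to the same cluster, so the number of clusters falls out of the bandwidth instead of being given upfront. The nested loop over all points in every iteration is also where the computational cost comes from.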


In [17]:
def mean_shift_fn(data,bandwidth=0.85):
    model = MeanShift(bandwidth=bandwidth).fit(data)
    return model

In [18]:
build_model(mean_shift_fn,iris_features,iris_labels)

Homegeneity score:  0.7603645798041669
Completeness score:  0.7717917344958113
v_measure score:  0.7660355440487252
adjusted rand score:  0.7436826319432357
adjusted mutual info score:  0.7573632678282255
silhouette score:  0.5509296349732906


## Birch Clustering
Suitable when both the dataset and the number of clusters are large.
- it is a hierarchical clustering algorithm
- builds a tree representation of the dataset
- it can detect and remove outliers, so it is effective at handling noise
- it processes data incrementally, which suits data arriving in a stream (an online-learning algorithm)
- efficient in both memory and time
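
As a rough sketch of the incremental behaviour, scikit-learn's `Birch` exposes `partial_fit`, which updates the tree with one chunk of data at a time. The example assumes synthetic blobs split into three chunks to stand in for a stream; the data and the chunk count are illustrative.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

model = Birch(n_clusters=3)

# Feed the data chunk by chunk, as if it were arriving over time.
for chunk in np.array_split(X, 3):
    model.partial_fit(chunk)

labels = model.predict(X)
print(np.bincount(labels))  # number of points assigned to each cluster
```

Only the tree summary is kept between calls, not the chunks themselves, which is why this approach stays memory efficient on large datasets.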

In [19]:
def birch_fn(data,n_clusters=3):
    model = Birch(n_clusters = n_clusters).fit(data)
    return model

In [20]:
build_model(birch_fn,iris_features,iris_labels)

Homegeneity score:  0.6747055693979638
Completeness score:  0.7383596460504097
v_measure score:  0.7050989012575005
adjusted rand score:  0.6096252514698314
adjusted mutual info score:  0.6706105390642346
silhouette score:  0.5016992571068448


## Affinity propagation clustering
- works well for small datasets that need to be split into many clusters
- makes no assumption about the internal structure of the data
- accepts graph distances (e.g. nearest-neighbor graphs)
- attempts to find exemplars (data points that are the most representative of other points)
- no need to specify the number of clusters up front
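
A small sketch of what "finding exemplars" looks like in scikit-learn, assuming synthetic blob data instead of iris. The fitted model exposes the indices of the chosen exemplars in `cluster_centers_indices_`, so the number of clusters can be read off after fitting.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

model = AffinityPropagation(damping=0.6, max_iter=1000, random_state=0).fit(X)

# Exemplars are actual rows of X, not averaged centroids.
exemplar_indices = model.cluster_centers_indices_
print("clusters found:", len(exemplar_indices))
print("exemplars:\n", X[exemplar_indices])
```

With the default `preference` the algorithm can settle on more clusters than there are true groups, which shows up in the scores as high homogeneity combined with low completeness.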

In [22]:
# damping : acts like a learning rate for the algorithm; the extent to which the current value
# is maintained relative to the incoming values
def affinity_propagation(data,damping=0.6,max_iter=1000):
    model = AffinityPropagation(damping=damping,max_iter=max_iter).fit(data)
    return model

In [23]:
build_model(affinity_propagation,iris_features,iris_labels)

Homegeneity score:  0.8512533506223854
Completeness score:  0.49170090756246776
v_measure score:  0.6233451996084364
adjusted rand score:  0.4373692389986675
adjusted mutual info score:  0.4802182847375622
silhouette score:  0.348833613127065


## Mini batch k-means clustering
- good when we want a moderate number of clusters in a very large dataset
- centroid-based algorithm
- performs k-means on randomly sampled subsets (mini batches) of the data
- updates the centroids iteratively, one batch at a time
- faster than full k-means, typically at the cost of slightly lower cluster quality
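
The batch-by-batch update can be made explicit with `partial_fit`, which `MiniBatchKMeans` provides for data that does not fit in memory at once. This is a sketch on synthetic blobs; the batch count and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

model = MiniBatchKMeans(n_clusters=3, random_state=0)

# Update the centroids with one small batch (20 rows) at a time.
for batch in np.array_split(X, 30):
    model.partial_fit(batch)

labels = model.predict(X)
print(model.cluster_centers_.round(2))
```

Calling `fit` as in the cell below does the sampling internally; `partial_fit` is the variant to reach for when the batches come from disk or from a stream.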

In [24]:
def mini_batch_kmeans_fn(data,n_clusters=3,max_iter=1000):
    model = MiniBatchKMeans(n_clusters=n_clusters,max_iter=max_iter,batch_size=20).fit(data)
    return model

In [25]:
build_model(mini_batch_kmeans_fn,iris_features,iris_labels)

Homegeneity score:  0.7364192881252849
Completeness score:  0.7474865805095324
v_measure score:  0.7419116631817836
adjusted rand score:  0.7163421126838475
adjusted mutual info score:  0.7331180735280008
silhouette score:  0.5509643746707443
