#### Clustering Algorithms

Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.

Types:
```
1. K-Means (Centroid Models)
2. Gaussian Mixture Models (Distribution Models)
3. Density Models (DBSCAN)
4. Hierarchical Models (Connectivity Models)
5. Spectral Clustering (Special Models)
```

**Evaluation Metrics** (these require ground-truth class labels)

*Homogeneity => each cluster contains only observations from a single class.*

*Completeness => all members of a given class are assigned to the same cluster.*

*V-Measure => the harmonic mean of homogeneity and completeness (with the default `beta=1.0`, both are weighted equally).*

```
sklearn.metrics.homogeneity_score(labels_true, labels_pred)
sklearn.metrics.completeness_score(labels_true, labels_pred)
sklearn.metrics.v_measure_score(labels_true, labels_pred, beta=1.0)
```
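
To make the difference between the three scores concrete, here is a minimal sketch on a made-up four-sample example. The toy label lists (`toy_true`, `toy_split`, `toy_merged`) are assumptions for illustration only and are not part of the datasets used later.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

# Hypothetical ground truth: two classes with two samples each.
toy_true = [0, 0, 1, 1]

# Every sample in its own cluster: each cluster is pure (homogeneous),
# but every class is split across clusters (not complete).
toy_split = [0, 1, 2, 3]

# Everything in one cluster: each class stays together (complete),
# but the single cluster mixes both classes (not homogeneous).
toy_merged = [0, 0, 0, 0]

for name, pred in [("split", toy_split), ("merged", toy_merged)]:
    h = homogeneity_score(toy_true, pred)
    c = completeness_score(toy_true, pred)
    v = v_measure_score(toy_true, pred)
    print("%s: homogeneity=%0.3f completeness=%0.3f v-measure=%0.3f" % (name, h, c, v))
```

Over-splitting keeps homogeneity high while hurting completeness, and merging everything does the opposite, which is why the V-measure combines both.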

In [1]:
from sklearn.datasets import load_digits
from sklearn.datasets import load_iris
from sklearn.datasets import make_s_curve
from sklearn.datasets import make_swiss_roll
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.cluster import KMeans
from sklearn import decomposition
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score
from sklearn.metrics import homogeneity_score
from sklearn.metrics import completeness_score
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering

#### Prepare Datasets

In [2]:
# Dataset1
digits = load_digits()
X_digits_features = digits.data
y_digits = digits.target

# Dataset2
iris = load_iris()
X_iris_features = iris.data
y_iris = iris.target

# Dataset3
centers = [[1, 1], [-1, -1], [1, -1]]
X_blobs, y_blobs = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

print("Dataset1: %s" % (X_digits_features.shape, ))
print("Dataset2: %s" % (X_iris_features.shape, ))
print("Dataset3: %s" % (X_blobs.shape, ))

Dataset1: (1797, 64)
Dataset2: (150, 4)
Dataset3: (750, 2)


#### Visualization Functions

Documentation Details: https://plot.ly/python/
    

In [3]:
## Using plotly
def plot_3d_interactive(X_dataset1, y_dataset1, X_dataset2, y_dataset2):
    # Initialize figure with subplots
    fig = make_subplots(
        rows=2, cols=2,
        specs=[[{"type": "scatter3d", "colspan": 2}, None],
               [{"type": "scatter3d", "colspan": 2}, None]],
        subplot_titles=("Dataset 1", "Dataset 2"))
    

    trace0 = go.Scatter3d(x=X_dataset1[:, 0], y=X_dataset1[:, 1], z=X_dataset1[:, 2], mode='markers', marker_color=y_dataset1)
    trace1 = go.Scatter3d(x=X_dataset2[:, 0], y=X_dataset2[:, 1], z=X_dataset2[:, 2], mode='markers', marker_color=y_dataset2)
    
    # add_trace with row/col replaces the deprecated append_trace
    fig.add_trace(trace0, row=1, col=1)
    fig.add_trace(trace1, row=2, col=1)
    
    fig.update_layout(
        title_text='3D subplots with different colorscales',
        height=1000,
        width=900,
        margin=dict(l=0, r=0, b=0, t=0),
        showlegend=False
    )
    
    fig.show()

In [4]:
## Using plotly
def plot_2d_interactive(X_dataset1, y_dataset1, X_dataset2, y_dataset2):
    # Initialize figure with subplots
    fig = make_subplots(
        rows=2, cols=2,
        specs=[[{"type": "scatter", "colspan": 2}, None],
               [{"type": "scatter", "colspan": 2}, None]],
        subplot_titles=("Dataset 1", "Dataset 2"))

    trace0 = go.Scatter(x=X_dataset1[:, 0], y=X_dataset1[:, 1], mode='markers', marker_color=y_dataset1)
    trace1 = go.Scatter(x=X_dataset2[:, 0], y=X_dataset2[:, 1], mode='markers', marker_color=y_dataset2)
    
    # add_trace with row/col replaces the deprecated append_trace
    fig.add_trace(trace0, row=1, col=1)
    fig.add_trace(trace1, row=2, col=1)
    
    fig.update_layout(
        title_text='2D subplots with different colorscales',
        height=1000,
        width=900,
        margin=dict(l=0, r=0, b=0, t=0),
        showlegend=False
    )
    
    fig.show()

#### K Means

**n_clusters: the only parameter you normally need to set**

1. The number of clusters to form, which is also the number of centroids to generate.

**init**

1. k-means++: selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See the Notes section in k_init for more details.
2. random: chooses k observations (rows) at random from the data as the initial centroids.

**n_init, default=10**

1. Number of times the k-means algorithm is run with different centroid seeds. The final result is the best of the n_init runs in terms of inertia (the within-cluster sum of squared distances).
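
The effect of `n_init` can be sketched by hand: run k-means once per seed and keep the run with the lowest inertia. This is only an illustration under assumed settings (a synthetic blob dataset, 4 clusters, 5 seeds, and a hypothetical helper `manual_inertia`), not a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic demo data (an assumption for illustration only).
X_kmeans_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)


def manual_inertia(X, model):
    """Sum of squared distances from each sample to its assigned centroid."""
    assigned_centers = model.cluster_centers_[model.labels_]
    return float(((X - assigned_centers) ** 2).sum())


# Emulate n_init manually: one single-start run per seed.
single_runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X_kmeans_demo)
    for seed in range(5)
]
inertias = [run.inertia_ for run in single_runs]

# Keep the run with the lowest inertia, as n_init would.
best_run = single_runs[int(np.argmin(inertias))]

print("Inertia per seed:", ["%0.2f" % value for value in inertias])
print("Best inertia: %0.2f" % best_run.inertia_)
print("Recomputed by hand: %0.2f" % manual_inertia(X_kmeans_demo, best_run))
```

Setting `n_init=5` on a single `KMeans` object performs the same kind of selection without the explicit loop.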



In [None]:
pca_digits = decomposition.PCA(n_components=15)
# Fit the model with X and apply the dimensionality reduction on X.
X_digits = pca_digits.fit_transform(X_digits_features) 

pca_iris = decomposition.PCA(n_components=4)
# Fit the model with X and apply the dimensionality reduction on X.
X_iris = pca_iris.fit_transform(X_iris_features) 

# init -> selects initial cluster centers for k-means clustering ('k-means++', 'random')
kmeans_digits = KMeans(n_clusters=10, init="k-means++", n_init=10)
y_digits_pred = kmeans_digits.fit_predict(X_digits)

# Iris has 3 classes, so use 3 clusters
kmeans_iris = KMeans(n_clusters=3, init="k-means++", n_init=10)
y_iris_pred = kmeans_iris.fit_predict(X_iris)

In [None]:
print("Homogeneity (Digits): %0.3f" % homogeneity_score(y_digits, y_digits_pred))
print("Completeness (Digits): %0.3f" % completeness_score(y_digits, y_digits_pred))
print("V-measure (Digits): %0.3f" % v_measure_score(y_digits, y_digits_pred))

In [None]:
plot_2d_interactive(X_digits, y_digits_pred, X_iris, y_iris_pred)

#### Gaussian Mixture Models

A Gaussian mixture is a function composed of several Gaussians, each identified by $k \in \{1, 2, \dots, K\}$, where $K$ is the number of clusters in dataset $D$.

A Gaussian mixture is fitted using the Expectation-Maximization (EM) algorithm:

1. Choose initial parameters $\theta^{old}$.
2. Expectation step: $q(z) = P(z \mid X, \theta^{old})$
3. Maximization step: $\theta^{new} = \arg\max_{\theta} \sum_{z} q(z) \log P(X, z \mid \theta)$
4. Set $\theta^{old} \leftarrow \theta^{new}$ and iterate until convergence.
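
The EM steps above can be sketched for a one-dimensional mixture in plain NumPy. This is a minimal illustration, not how `GaussianMixture` is implemented: the function name `em_gmm_1d`, the quantile-based initialization, the fixed iteration count, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def em_gmm_1d(x, n_components=2, n_iter=50):
    """Fit a 1D Gaussian mixture with plain EM (illustrative sketch)."""
    n = x.shape[0]
    # Step 1: initial theta (assumed scheme: evenly spaced quantiles as means).
    quantiles = (np.arange(n_components) + 0.5) / n_components
    means = np.quantile(x, quantiles)
    variances = np.full(n_components, x.var())
    weights = np.full(n_components, 1.0 / n_components)
    log_likelihoods = []

    for _ in range(n_iter):
        # E-step: responsibilities q(z) = P(z | x, theta_old), computed in log space.
        diff = x[:, None] - means[None, :]
        log_pdf = -0.5 * (np.log(2.0 * np.pi * variances) + diff ** 2 / variances)
        log_joint = np.log(weights) + log_pdf            # log P(x_i, z=k | theta_old)
        log_evidence = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_evidence[:, None])
        log_likelihoods.append(log_evidence.sum())       # log-likelihood under theta_old

        # M-step: closed-form maximizers of sum_z q(z) log P(x, z | theta).
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means[None, :]) ** 2).sum(axis=0) / nk

    return weights, means, variances, np.array(log_likelihoods)


# Synthetic data drawn from two Gaussians (assumed for the demo).
rng = np.random.default_rng(0)
x_demo = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 200)])

em_weights, em_means, em_variances, em_log_liks = em_gmm_1d(x_demo)
print("weights:", np.round(em_weights, 3))
print("means:", np.round(em_means, 3))
print("variances:", np.round(em_variances, 3))
```

The recorded log-likelihood should not decrease from one iteration to the next, which is a handy sanity check when writing EM by hand.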


In [None]:
# For Visualization purpose 
pca_digits = decomposition.PCA(n_components=10)
# Fit the model with X and apply the dimensionality reduction on X.
X_digits = pca_digits.fit_transform(X_digits_features) 

pca_iris = decomposition.PCA(n_components=3)
# Fit the model with X and apply the dimensionality reduction on X.
X_iris = pca_iris.fit_transform(X_iris_features) 

gaussian_digits = GaussianMixture(n_components=10, covariance_type='full')
y_digits_pred = gaussian_digits.fit_predict(X_digits)

gaussian_iris = GaussianMixture(n_components=3, covariance_type='full')
y_iris_pred = gaussian_iris.fit_predict(X_iris)

In [None]:
print("Homogeneity (Digits): %0.3f" % homogeneity_score(y_digits, y_digits_pred))
print("Completeness (Digits): %0.3f" % completeness_score(y_digits, y_digits_pred))
print("V-measure (Digits): %0.3f" % v_measure_score(y_digits, y_digits_pred))

In [None]:
plot_2d_interactive(X_digits, y_digits_pred, X_iris, y_iris_pred)

#### Density Based Clustering

Clusters are defined as areas of higher density than the remainder of the dataset.

Points are classified into 3 types:
1. Core Points => have at least m points (the minimum-points threshold) within distance $\epsilon$.
2. Border Points => still part of a cluster because they are within $\epsilon$ of a core point, but they do not meet the minimum-points criterion themselves.
3. Noise Points => not assigned to any cluster.
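
The three point types can be read off a fitted `DBSCAN` model. The sketch below uses a hand-made six-point dataset (an assumption for illustration) with `eps=0.15` and `min_samples=4`: four points form a dense square, one point sits next to it, and one point is isolated.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data, chosen so that each point type appears.
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense square
    [0.2, 0.1],  # near the square, but with only 3 points in its own neighborhood
    [5.0, 5.0],  # far away from everything
])

toy_dbscan = DBSCAN(eps=0.15, min_samples=4).fit(toy_points)

# Core points are listed explicitly by the fitted model.
is_core = np.zeros(len(toy_points), dtype=bool)
is_core[toy_dbscan.core_sample_indices_] = True

# Noise points get the label -1; border points are clustered but not core.
is_noise = toy_dbscan.labels_ == -1
is_border = ~is_core & ~is_noise

print("core indices:  ", np.where(is_core)[0].tolist())
print("border indices:", np.where(is_border)[0].tolist())
print("noise indices: ", np.where(is_noise)[0].tolist())
```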

```
    Important Parameters:
    eps => The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.

    min_samples => The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

    algorithm => 
    The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
```
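
Because `eps` matters so much, it can help to sweep a few values and watch the number of clusters and noise points. The following is a rough sketch on blobs generated with the same settings as Dataset3; the candidate `eps` values and the fixed `min_samples=10` are arbitrary choices for the demo, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Same generator settings as Dataset3, regenerated so the sketch is self-contained.
demo_centers = [[1, 1], [-1, -1], [1, -1]]
X_eps_demo, _ = make_blobs(n_samples=750, centers=demo_centers, cluster_std=0.4,
                           random_state=0)

eps_values = [0.1, 0.2, 0.3, 0.5]  # assumed candidate values
cluster_counts = []
noise_counts = []

for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X_eps_demo)
    n_noise = int(np.sum(labels == -1))
    # The label -1 marks noise, so it is not counted as a cluster.
    n_clusters = len(set(labels.tolist()) - {-1})
    cluster_counts.append(n_clusters)
    noise_counts.append(n_noise)
    print("eps=%0.1f -> clusters=%d, noise points=%d" % (eps, n_clusters, n_noise))
```

With `min_samples` held fixed, a larger `eps` can only turn noise points into cluster members, never the reverse, so the noise count shrinks as `eps` grows.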

In [None]:
dbscan = DBSCAN(eps=0.3, min_samples=10, algorithm="ball_tree")
y_pred_blobs = dbscan.fit_predict(X_blobs)
print("Homogeneity (Blobs): %0.3f" % homogeneity_score(y_blobs, y_pred_blobs))
print("Completeness (Blobs): %0.3f" % completeness_score(y_blobs, y_pred_blobs))
print("V-measure (Blobs): %0.3f" % v_measure_score(y_blobs, y_pred_blobs))

In [None]:
# For Visualization purpose 
pca_digits = decomposition.PCA(n_components=15)
# Fit the model with X and apply the dimensionality reduction on X.
X_digits = pca_digits.fit_transform(X_digits_features) 

dbscan = DBSCAN(eps=0.6, min_samples=10, algorithm="ball_tree")
y_pred_digits = dbscan.fit_predict(X_digits)
print("Homogeneity (Digits): %0.3f" % homogeneity_score(y_digits, y_pred_digits))
print("Completeness (Digits): %0.3f" % completeness_score(y_digits, y_pred_digits))
print("V-measure (Digits): %0.3f" % v_measure_score(y_digits, y_pred_digits))

#### Hierarchical Clustering

Agglomerative clustering starts with each item in its own cluster and iteratively merges the closest pair of clusters until all items belong to one cluster.

Single Linkage:
 $D(C_1, C_2) = \min_{x_1 \in C_1,\ x_2 \in C_2} D(x_1, x_2)$

Complete Linkage:
 $D(C_1, C_2) = \max_{x_1 \in C_1,\ x_2 \in C_2} D(x_1, x_2)$
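
The two linkage definitions above can be checked numerically on a tiny example. The two clusters and the helper name `pairwise_distances_between` are made up for illustration, and Euclidean distance is assumed for $D(x_1, x_2)$.

```python
import numpy as np


def pairwise_distances_between(c1, c2):
    """Euclidean distance between every point in c1 and every point in c2."""
    return np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)


# Hypothetical clusters lying on the x-axis.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

distances = pairwise_distances_between(cluster_a, cluster_b)

single_linkage = distances.min()    # closest pair: (1, 0) and (3, 0)
complete_linkage = distances.max()  # farthest pair: (0, 0) and (5, 0)

print("Single linkage distance:   %0.1f" % single_linkage)
print("Complete linkage distance: %0.1f" % complete_linkage)
```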


In [None]:
agglomerative_digits = AgglomerativeClustering(n_clusters=10, linkage='complete')
y_digits_pred = agglomerative_digits.fit_predict(X_digits_features)

agglomerative_iris = AgglomerativeClustering(n_clusters=3, linkage='complete')
y_iris_pred = agglomerative_iris.fit_predict(X_iris_features)

agglomerative_blobs = AgglomerativeClustering(n_clusters=3, linkage='complete')
y_blobs_pred = agglomerative_blobs.fit_predict(X_blobs)
y_blobs_pred

In [None]:
print("Homogeneity (Digits): %0.3f" % homogeneity_score(y_digits, y_digits_pred))
print("Completeness (Digits): %0.3f" % completeness_score(y_digits, y_digits_pred))
print("V-measure (Digits): %0.3f" % v_measure_score(y_digits, y_digits_pred))

In [None]:
print("Homogeneity (Iris): %0.3f" % homogeneity_score(y_iris, y_iris_pred))
print("Completeness (Iris): %0.3f" % completeness_score(y_iris, y_iris_pred))
print("V-measure (Iris): %0.3f" % v_measure_score(y_iris, y_iris_pred))

In [None]:
print("Homogeneity (Blobs): %0.3f" % homogeneity_score(y_blobs, y_blobs_pred))
print("Completeness (Blobs): %0.3f" % completeness_score(y_blobs, y_blobs_pred))
print("V-measure (Blobs): %0.3f" % v_measure_score(y_blobs, y_blobs_pred))

#### Spectral Clustering

In [None]:
# For Visualization purpose 
pca_digits = decomposition.PCA(n_components=10)
# Fit the model with X and apply the dimensionality reduction on X.
X_digits = pca_digits.fit_transform(X_digits_features) 



In [None]:
# Make sure graph is fully connected. 
spectral_digits = SpectralClustering(n_clusters=10, n_components=3, affinity='rbf', assign_labels='kmeans', n_init=10)
y_pred_digits = spectral_digits.fit_predict(X_digits)
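
One way to check the "fully connected" condition is to count the connected components of a k-nearest-neighbors graph, which is the kind of graph used when `affinity='nearest_neighbors'` (the `'rbf'` affinity builds a dense similarity matrix instead). The helper name `count_graph_components`, the synthetic two-blob data, and the neighbor counts below are assumptions for illustration.

```python
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph


def count_graph_components(X, n_neighbors):
    """Number of connected components in the kNN graph of X."""
    knn = kneighbors_graph(X, n_neighbors=n_neighbors, include_self=False)
    # directed=False treats an edge in either direction as a connection.
    n_components, _ = connected_components(knn, directed=False)
    return n_components


# Two blobs placed far apart (assumed demo data).
X_far, _ = make_blobs(n_samples=100, centers=[[0, 0], [100, 100]],
                      cluster_std=1.0, random_state=0)

few_neighbors = count_graph_components(X_far, n_neighbors=5)
all_neighbors = count_graph_components(X_far, n_neighbors=len(X_far) - 1)

print("Components with 5 neighbors: %d" % few_neighbors)
print("Components with all neighbors: %d" % all_neighbors)
```

A result greater than 1 means the graph is split into separate pieces; increasing `n_neighbors` is one way to reconnect it.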

In [None]:
print("Homogeneity (Digits): %0.3f" % homogeneity_score(y_digits, y_pred_digits))
print("Completeness (Digits): %0.3f" % completeness_score(y_digits, y_pred_digits))
print("V-measure (Digits): %0.3f" % v_measure_score(y_digits, y_pred_digits))