# Resources

1. Book Chapter 9.
2. The web links and pdfs.

# [Clustering Analysis][1]

Clustering is the process of dividing a dataset into groups (also known as clusters) based on the patterns in the data.

In clustering, there is no target to predict. We look at the data and try to group similar observations together. Hence, it is an unsupervised learning problem.

## Properties of Clusters

<b> Example: </b> Assume we have a bank's customer data on income and debt:

<img style="float:center" src="./images/Debt-bank.png" alt="drawing" height="500" width="500"/>

<b> Property 1 </b>: All the data points in a cluster should be near or similar to each other. <b> Intra-cluster cohesion (compactness) </b> <br/>
<b> Property 2 </b>: The data points from different clusters should be as far apart or as different as possible.  <b> Inter-cluster separation (isolation) </b>
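
The two properties can be made concrete with a small numeric sketch. The points below are hypothetical (they are not the bank data in the figure), and the two measures used here, mean distance to the own centroid for cohesion and distance between centroids for separation, are only one simple choice among several.

```python
import numpy as np

# Hypothetical 2-D points, already grouped into two clusters.
cluster_a = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
cluster_b = np.array([[8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])

def mean_distance_to_centroid(points):
    """Cohesion proxy: average Euclidean distance of points to their centroid."""
    centroid = points.mean(axis=0)
    return np.linalg.norm(points - centroid, axis=1).mean()

cohesion_a = mean_distance_to_centroid(cluster_a)
cohesion_b = mean_distance_to_centroid(cluster_b)

# Separation proxy: Euclidean distance between the two centroids.
separation = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(cohesion_a, cohesion_b, separation)
```

A good clustering has small cohesion values (compact clusters) relative to the separation between clusters.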

### Question: Which clustering case is better?

<img style="float:center" src="./images/Clustering.png" alt="drawing" height="600" width="600"/>


[1]:https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/

# [Types of clustering algorithms][1]


See the [link][2] for more details:

1. Centroid-based clustering
<img style="float:center" src="./images/centroid.png" alt="drawing" height="200" width="300"/>

2. Density-based clustering
<img style="float:center" src="./images/Density.png" alt="drawing" height="200" width="300"/>


3. Distribution-based clustering
<img style="float:center" src="./images/Distribution.png" alt="drawing" height="200" width="300"/>

[1]:https://developers.google.com/machine-learning/clustering/clustering-algorithms
[2]:https://link.springer.com/content/pdf/10.1007%2Fs40745-015-0040-1.pdf

## [What do we need for clustering][1]


1. Proximity measure, either:
  - Similarity measure $s(x_{i},x_{k})$: <b>large</b> if $x_{i}$ and $x_{k}$ are similar
  - Dissimilarity measure $d(x_{i},x_{k})$: <b>small</b> if $x_{i}$ and $x_{k}$ are similar
2. Criterion function to evaluate intra/inter-cluster relationships
3. Algorithm to compute clustering: 
  - We need to optimize the criterion function.
  - We need to evaluate hyper-parameters.
4. Clustering algorithm evaluation:
  - [External evaluation (with class)][2]
     - [Purity][3]: Purity measures the extent to which each cluster contains a single class. It takes values between 0 and 1.
       - For each cluster, count the number of data points from the most common class in said cluster. 
       - Sum over all clusters and divide by the total number of data points.
          - Given some set of clusters $M$ and some set of classes $D$, both partitioning $N$ data points, purity is $\frac{1}{N}\sum_{m \in M}{\max_{d \in D}|m \cap d|}$
        - <b> Example: </b> Assume you have three clusters and three classes in the below figure: <br/>
          <img style="float:center" src="./images/clusters.png" alt="drawing" height="200" width="300"/>
         
           - purity = $\frac{1}{17}(5+4+3) = \frac{12}{17} \approx 0.71$ <br/>   
        
  - [Internal evaluation (without class)][3]
     - [Silhouette][4]: The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). 
         - Calculations: See the example in the [link][5] 
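
The purity computation described above can be sketched in a few lines. The cluster compositions below are an assumption: the figure is not reproduced here, so the labels are chosen only so that the majority counts are 5, 4 and 3 out of 17 points, matching the arithmetic in the example. The function name `purity` and the class names `'x'`, `'o'`, `'d'` are illustrative.

```python
from collections import Counter

def purity(cluster_ids, class_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    members = {}
    for cluster, label in zip(cluster_ids, class_labels):
        members.setdefault(cluster, []).append(label)
    # For each cluster, count the points of its most common class.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(class_labels)

# Assumed composition: cluster 1 = 5 x + 1 o, cluster 2 = 1 x + 4 o + 1 d,
# cluster 3 = 2 x + 3 d (17 points in total).
cluster_ids = [1] * 6 + [2] * 6 + [3] * 5
class_labels = ['x'] * 5 + ['o'] + ['x'] + ['o'] * 4 + ['d'] + ['x'] * 2 + ['d'] * 3

print(round(purity(cluster_ids, class_labels), 2))  # 0.71
```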

   


[1]:http://www.mit.edu/~9.54/fall14/slides/Class13.pdf
[2]:https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
[3]:https://en.wikipedia.org/wiki/Cluster_analysis
[4]: https://en.wikipedia.org/wiki/Silhouette_(clustering)
[5]:https://stackoverflow.com/questions/23387275/how-do-you-manually-compute-for-silhouette-cohesion-and-separation-of-cluster
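
As a rough illustration of the silhouette calculation, the sketch below computes it by hand with NumPy. For each point $i$, $a(i)$ is the mean distance to the other points of its own cluster, $b(i)$ is the smallest mean distance to the points of any other cluster, and $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$. The data and the function name `silhouette_values` are hypothetical, Euclidean distance is assumed, and the code assumes at least two clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette values using Euclidean distance (needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: silhouette is 0 by convention
        # a(i): mean distance to the other members (dist[i, i] is 0).
        a = dist[i, own].sum() / (own.sum() - 1)
        # b(i): smallest mean distance to the members of another cluster.
        b = min(
            dist[i, labels == other].mean()
            for other in np.unique(labels) if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

X = [[1, 1], [1, 2], [8, 8], [8, 9]]
labels = [0, 0, 1, 1]
s = silhouette_values(X, labels)
print(s, s.mean())
```

Values close to 1 indicate points that sit well inside their cluster; values near 0 or below indicate points on or across a cluster boundary.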


# [Proximity Functions][1]



## Examples: [norm similarity:][2]

<img style="float:center" src="./images/Proximity functions.png" alt="drawing" height="300" width="500"/>
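
As a small illustration of norm-based dissimilarity, the sketch below assumes the usual Minkowski ($p$-norm) family, $d_p(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$, which gives the Manhattan distance for $p=1$, the Euclidean distance for $p=2$, and the maximum (Chebyshev) distance as $p \to \infty$. The function name `minkowski_distance` is illustrative.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski (p-norm) distance between two vectors; p may be np.inf."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x, y = [0, 0], [3, 4]
print(minkowski_distance(x, y, 1))       # 7.0 (Manhattan)
print(minkowski_distance(x, y, 2))       # 5.0 (Euclidean)
print(minkowski_distance(x, y, np.inf))  # 4.0 (maximum)
```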


[1]:https://www.ims.uni-stuttgart.de/institut/mitarbeiter/schulte/theses/phd/algorithm.pdf
[2]:http://www.mit.edu/~9.54/fall14/slides/Class13.pdf

# [K-means clustering][1]

Given k (the number of clusters, each represented by its centroid), the k-means algorithm works as [follows][2]:

1. Initialization: Choose k random data points (seeds) as the initial centroids (cluster centers).
2. Assignment: Assign each data point to the closest centroid.
3. Recalculation: Re-compute the centroids using the current cluster memberships.
4. Repeat steps 2 and 3 until one of the following convergence criteria is met:
    - no (or minimum) re-assignments of data points to different clusters, or
    - no (or minimum) change of centroids, or
    - minimum decrease in the sum of squared error (SSE):
        - $SSE = \sum_{j=1}^{k}\sum_{x \in C_{j}}d(x,m_{j})^2$; where
            - $C_{j}$ is the jth cluster.
            - $m_{j}$ is the centroid of cluster $C_{j}$  (the mean vector of all the data points in $C_{j}$)
            - $d(x,m_{j})$  is the (Euclidean) distance between data point $x$ and centroid $m_{j}$ 
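
The four steps can be sketched compactly with NumPy, using the decrease in SSE as the stopping criterion. This is a minimal sketch rather than a reference implementation: the function name `kmeans`, its parameters (`max_iter`, `tol`, `seed`), the empty-cluster handling and the toy data are all assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-9, seed=0):
    """Plain k-means; returns (labels, centroids, sse)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_sse = np.inf
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # SSE of the current assignment.
        sse = float((dist.min(axis=1) ** 2).sum())
        # Step 4: stop when the SSE decreases by no more than tol.
        if prev_sse - sse <= tol:
            break
        prev_sse = sse
        # Step 3: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids, sse

# Hypothetical toy data with two well-separated groups.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
labels, centroids, sse = kmeans(X, k=2)
print(labels, centroids, sse)
```

Because each assignment step and each recalculation step can only lower (or keep) the SSE, the loop terminates; which local minimum it reaches depends on the initial seeds.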


[1]:http://benalexkeen.com/k-means-clustering-in-python/
[2]:http://www.mit.edu/~9.54/fall14/slides/Class13.pdf

# [Let us practice][1]

## K=3

[1]:http://benalexkeen.com/k-means-clustering-in-python/

In [None]:
## Initialisation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})


np.random.seed(200)
k = 3
# centroids[i] = [x, y]
centroids = {
    i+1: [np.random.randint(0, 80), np.random.randint(0, 80)]
    for i in range(k)
}
    
fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

In [None]:
## Assignment Stage

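def assignment(df, centroids):
    # Euclidean distance from every point to each centroid
    for i in centroids.keys():
        # sqrt((x1 - x2)^2 + (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (df['x'] - centroids[i][0]) ** 2
                + (df['y'] - centroids[i][1]) ** 2
            )
        )
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    # Name of the column holding the smallest distance, e.g. 'distance_from_2'
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    # Slice off the prefix to recover the centroid id; str.lstrip() removes a
    # set of characters rather than a prefix, so it is avoided here
    df['closest'] = df['closest'].map(lambda x: int(x[len('distance_from_'):]))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df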

df = assignment(df, centroids)
print(df.head())

fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

In [None]:
## Update Stage

import copy

old_centroids = copy.deepcopy(centroids)

def update(centroids):
    # Move each centroid to the mean of the points currently assigned to it
    for i in centroids.keys():
        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
    return centroids

centroids = update(centroids)
    
fig = plt.figure(figsize=(5, 5))
ax = plt.axes()
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
for i in old_centroids.keys():
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75
    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])
plt.show()

In [None]:
## Repeat Assignment Stage

df = assignment(df, centroids)

# Plot results
fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

Note that one of the reds is now green and one of the blues is now red.

We now repeat until there are no changes to any of the clusters.

In [None]:
# Continue until all assigned categories don't change any more
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    if closest_centroids.equals(df['closest']):
        break

fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

# Using scikit-learn

In [None]:
df = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(df)

In [None]:
labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_

In [None]:
fig = plt.figure(figsize=(5, 5))

colors = map(lambda x: colmap[x+1], labels)

plt.scatter(df['x'], df['y'], color=list(colors), alpha=0.5, edgecolor='k')
for idx, centroid in enumerate(centroids):
    plt.scatter(*centroid, color=colmap[idx+1])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

# Optimal K selection

## [Rule of thumb][2]

$K=\sqrt{n/2}$; where $n$ is the number of points
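
As a quick arithmetic check, the rule can be applied to the 19-point dataset used in the K=3 practice above. This is only a starting guess for $K$, not a substitute for inspecting the data.

```python
import math

n = 19  # number of points in the earlier practice dataset
k_estimate = math.sqrt(n / 2)
print(round(k_estimate, 2), round(k_estimate))  # 3.08 3
```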


## [Elbow method (clustering)][1]

1. Plot the number of clusters against one of:
  - <b> Distortion: </b> The average of the squared distances from each sample to the center of its cluster (the average squared error). Typically, the Euclidean distance metric is used.
  - <b> Inertia: </b> The sum of squared distances of samples to their closest cluster center ($SSE$).
2. Select the value of k after which the curve plateaus, i.e. where there is no further steep decrease.
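
With the definitions above, distortion is simply inertia divided by the number of samples, so both curves have the same shape and bend at the same k. The sketch below checks this on a hypothetical set of points and cluster centers (both made up for illustration, not produced by a fitted model).

```python
import numpy as np

# Hypothetical points and cluster centers.
X = np.array([[1.0, 1.0], [1.0, 3.0], [7.0, 8.0], [9.0, 8.0]])
centers = np.array([[1.0, 2.0], [8.0, 8.0]])

# Squared Euclidean distance from every point to every center.
sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
closest_sq = sq_dist.min(axis=1)

inertia = closest_sq.sum()      # sum of squared distances (SSE): 4.0
distortion = closest_sq.mean()  # average squared distance: 1.0

print(inertia, distortion, inertia / len(X))
```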

[1]: https://en.wikipedia.org/wiki/Elbow_method_(clustering)
[2]: https://stats.stackexchange.com/questions/277007/rule-of-thumb-on-the-best-k-in-k-means-clustering


# Let us practice:

In [None]:
from sklearn.cluster import KMeans 
from sklearn import metrics 
from scipy.spatial.distance import cdist 
import numpy as np 
import matplotlib.pyplot as plt  

In [None]:
#Creating the data 
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8]) 
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3]) 
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2) 
  
#Visualizing the data 
plt.plot() 
plt.xlim([0, 10]) 
plt.ylim([0, 10]) 
plt.title('Dataset') 
plt.scatter(x1, x2) 
plt.show() 

In [None]:
distortions = [] 
inertias = [] 
mapping1 = {} 
mapping2 = {} 
K = range(1,10) 
  
for k in K: 
    # Build and fit the model (fit() returns the fitted estimator, 
    # so a second call to fit() is not needed) 
    kmeanModel = KMeans(n_clusters=k).fit(X) 
      
    # Distortion: mean squared Euclidean distance to the closest center 
    distortion = np.mean(np.min(cdist(X, kmeanModel.cluster_centers_, 
                                      'euclidean'), axis=1) ** 2) 
    distortions.append(distortion) 
    inertias.append(kmeanModel.inertia_) 
  
    mapping1[k] = distortion 
    mapping2[k] = kmeanModel.inertia_ 

In [None]:
for key,val in mapping1.items(): 
    print(str(key)+' : '+str(val)) 

plt.plot(K, distortions, 'bx-') 
plt.xlabel('Values of K') 
plt.ylabel('Distortion') 
plt.title('The Elbow Method using Distortion') 
plt.show() 

In [None]:
for key,val in mapping2.items(): 
    print(str(key)+' : '+str(val))
plt.plot(K, inertias, 'bx-') 
plt.xlabel('Values of K') 
plt.ylabel('Inertia') 
plt.title('The Elbow Method using Inertia') 
plt.show() 

### Strengths and Weaknesses

#### Strengths
1. Efficient 
2. Easy to implement
3. A point can change cluster after updating centroids

#### Weaknesses
1. Need to specify k in advance (may be difficult if you don’t have labels)
2. Sensitive to outliers (noise) in the data
3. Not good at finding clusters with odd shapes (prefers spherical clusters)
4. Can get stuck in local minima (depends on where the centroids are initialized)
5. Sensitive to the scale of the features
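
The last weakness can be illustrated with a small sketch. The three points below are hypothetical (income in currency units, debt ratio between 0 and 1). On the raw data the Euclidean distance is dominated by the income column; after z-score standardization both features contribute comparably. Standardizing is one common remedy, not the only one.

```python
import numpy as np

# Hypothetical rows: [income, debt ratio].
X = np.array([[30000.0, 0.10],
              [31000.0, 0.90],
              [60000.0, 0.12]])

def euclidean(a, b):
    return np.linalg.norm(a - b)

# Raw distances: the income column dominates.
raw_01 = euclidean(X[0], X[1])  # similar income, very different debt ratio
raw_02 = euclidean(X[0], X[2])  # very different income, similar debt ratio

# Standardize each column to zero mean and unit standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
z_01 = euclidean(Z[0], Z[1])
z_02 = euclidean(Z[0], Z[2])

print(raw_01, raw_02)
print(z_01, z_02)
```

On the raw data, point 0 looks about 30 times closer to point 1 than to point 2, purely because of the units; after standardization the two distances are of similar size.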

