## Calculate the Silhouette Coefficient for each of the three clusters and the whole clustering shown below

![sil1](img/sil1.png)

**IMPORTANT:** For the Silhouette Coefficient in task 9.1, calculate b as the average distance to the points of the nearest cluster (as in the book), not as the average distance to the points of all other clusters (as in the slides). This has been confirmed as correct. Also calculate the silhouette coefficient of the whole clustering as the average over all point silhouette values, as shown here. The slides average the cluster silhouette coefficients instead, which does not take different cluster sizes into account.


**Steps:**

1. Calculate the distance matrix
2. To calculate the silhouette coefficient of a data point x:
    - Calculate a, the average distance to the other points in its own cluster
    - Calculate b, the average distance to the points in the nearest cluster
    - Sil coeff (x) = (b - a) / max(a, b) **OR** {1 - (a/b) if a < b, (b/a) - 1 if a >= b}. **Use either formula; both are equivalent**
3. Repeat for all points
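The equivalence of the two formulas in step 2 can be checked numerically. The sketch below is only an illustration: the function names `silhouette_direct` and `silhouette_piecewise` are made up for this example, and it assumes max(a, b) > 0 (no division by zero). The two (a, b) pairs are taken from the tables later in this exercise.

```python
def silhouette_direct(a: float, b: float) -> float:
    """Silhouette value as (b - a) / max(a, b)."""
    return (b - a) / max(a, b)


def silhouette_piecewise(a: float, b: float) -> float:
    """Silhouette value as 1 - a/b if a < b, otherwise b/a - 1."""
    if a < b:
        return 1 - a / b
    return b / a - 1


# One well-placed point (a < b) and one misplaced point (a > b).
for a, b in [(1.207, 3.818), (2.777, 1.962)]:
    print(a, b, round(silhouette_direct(a, b), 3), round(silhouette_piecewise(a, b), 3))
```

When a < b the denominator max(a, b) is b, so (b - a) / b = 1 - a/b; when a >= b it is a, so (b - a) / a = b/a - 1.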

**For a visual understanding of Silhouette Coefficient for a data point x, see the image below**

a = average of the blue lines (distances to the points in the same cluster)

b = average of the green lines (distances to the points in the nearest cluster)

Silhouette Coefficient(x) = (b - a) / max(a, b), which compares a with b. The smaller the distances to the points in the same cluster are relative to the distances to the points in the nearest cluster, the closer the silhouette coefficient is to 1.

![sil1](img/sil2.png)

In [1]:
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances
import numpy as np
import pandas as pd
d = {'x': [1, 2, 2, 3, 4, 4, 5, 6, 6, 3, 4, 4, 5, 5, 6],
     'y': [4, 4, 5, 3, 4, 5, 5, 4, 6, 4, 1, 3, 1, 2, 1],
     'cluster': ['c1', 'c1', 'c1', 'c3', 'c3', 'c3', 'c3', 'c3', 'c3', 'c2', 'c2', 'c2', 'c2', 'c2', 'c2']}
df = pd.DataFrame(d)
df

Unnamed: 0,x,y,cluster
0,1,4,c1
1,2,4,c1
2,2,5,c1
3,3,3,c3
4,4,4,c3
5,4,5,c3
6,5,5,c3
7,6,4,c3
8,6,6,c3
9,3,4,c2


### Step 1. Calculate the distance matrix

In [2]:
dist = pdist(df[['x', 'y']], metric = 'euclidean')
dist_matrix = pd.DataFrame(squareform(dist))
dist_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0.0,1.0,1.414214,2.236068,3.0,3.162278,4.123106,5.0,5.385165,2.0,4.242641,3.162278,5.0,4.472136,5.830952
1,1.0,0.0,1.0,1.414214,2.0,2.236068,3.162278,4.0,4.472136,1.0,3.605551,2.236068,4.242641,3.605551,5.0
2,1.414214,1.0,0.0,2.236068,2.236068,2.0,3.0,4.123106,4.123106,1.414214,4.472136,2.828427,5.0,4.242641,5.656854
3,2.236068,1.414214,2.236068,0.0,1.414214,2.236068,2.828427,3.162278,4.242641,1.0,2.236068,1.0,2.828427,2.236068,3.605551
4,3.0,2.0,2.236068,1.414214,0.0,1.0,1.414214,2.0,2.828427,1.0,3.0,1.0,3.162278,2.236068,3.605551
5,3.162278,2.236068,2.0,2.236068,1.0,0.0,1.0,2.236068,2.236068,1.414214,4.0,2.0,4.123106,3.162278,4.472136
6,4.123106,3.162278,3.0,2.828427,1.414214,1.0,0.0,1.414214,1.414214,2.236068,4.123106,2.236068,4.0,3.0,4.123106
7,5.0,4.0,4.123106,3.162278,2.0,2.236068,1.414214,0.0,2.0,3.0,3.605551,2.236068,3.162278,2.236068,3.0
8,5.385165,4.472136,4.123106,4.242641,2.828427,2.236068,1.414214,2.0,0.0,3.605551,5.385165,3.605551,5.09902,4.123106,5.0
9,2.0,1.0,1.414214,1.0,1.0,1.414214,2.236068,3.0,3.605551,0.0,3.162278,1.414214,3.605551,2.828427,4.242641
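
For readers who want to see what `pdist` followed by `squareform` produces, the same matrix can be rebuilt with plain NumPy broadcasting. This is a sketch, not part of the original solution; the names `points`, `diff` and `dist_np` are introduced here for illustration only.

```python
import numpy as np

# The 15 points from the exercise, in the same row order as `df`.
points = np.array([[1, 4], [2, 4], [2, 5], [3, 3], [4, 4], [4, 5], [5, 5], [6, 4],
                   [6, 6], [3, 4], [4, 1], [4, 3], [5, 1], [5, 2], [6, 1]], dtype=float)

# diff[i, j] holds the coordinate differences between point i and point j.
diff = points[:, None, :] - points[None, :, :]
# The Euclidean norm over the coordinate axis gives the 15 x 15 distance matrix.
dist_np = np.sqrt((diff ** 2).sum(axis=-1))

print(dist_np.shape)
print(np.round(dist_np[0, :4], 6))  # distances from (1, 4) to the first four points
```

The matrix is symmetric with zeros on the diagonal, so row i contains every distance needed for point i in the steps that follow.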


### Step 2 a. Calculate a, average distance to points in its own cluster

Example: 

For the point at (1, 4), read the distances to the other points in its cluster from the distance matrix. **The distance to (2, 4) is 1, and the distance to (2, 5) is 1.414. Average = (1 + 1.414) / 2 = 1.207.** 

So for point (1, 4), we set a = 1.207. 

Do this for all points.
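Rather than reading the values off by hand, a can also be computed for every point at once. The following is a minimal sketch under two assumptions: the arrays `points` and `labels` restate the data from `df`, and every cluster has at least two points (otherwise the division below fails). The names `D`, `same` and `a` are chosen for this example.

```python
import numpy as np

# Points and cluster labels in the same row order as `df`.
points = np.array([[1, 4], [2, 4], [2, 5], [3, 3], [4, 4], [4, 5], [5, 5], [6, 4],
                   [6, 6], [3, 4], [4, 1], [4, 3], [5, 1], [5, 2], [6, 1]], dtype=float)
labels = np.array(['c1'] * 3 + ['c3'] * 6 + ['c2'] * 6)

# Pairwise Euclidean distances (same values as the distance matrix above).
D = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

# same[i, j] is True when points i and j belong to the same cluster.
same = labels[:, None] == labels[None, :]
# The self-distance D[i, i] = 0 adds nothing to the sum,
# so the average is taken over (cluster size - 1) other points.
a = (D * same).sum(axis=1) / (same.sum(axis=1) - 1)

print(np.round(a, 3))
```

Up to rounding in the last digit, the result should agree with the hand-entered values in the next cell.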

In [3]:
sil_dict = {'a':[1.207, 1, 1.207, 
                 2.777, 1.731, 1.742, 1.614, 2.163, 2.544, 
                 3.051, 1.915, 1.979, 1.768, 1.614, 2.297]}
sil = pd.DataFrame(sil_dict)
sil = pd.concat([df, sil], axis=1)
sil

Unnamed: 0,x,y,cluster,a
0,1,4,c1,1.207
1,2,4,c1,1.0
2,2,5,c1,1.207
3,3,3,c3,2.777
4,4,4,c3,1.731
5,4,5,c3,1.742
6,5,5,c3,1.614
7,6,4,c3,2.163
8,6,6,c3,2.544
9,3,4,c2,3.051


### Step 2 b. Calculate b, average distance to points in the nearest cluster

To find b for a point x, calculate the average distance from x to the points of each other cluster, and take the minimum of these averages.

Example: 

For the point at (1, 4), get the distances to the points in the cross cluster (c3). 

The distances are 2.236068, 3.000000, 3.162278, 4.123106, 5.000000, 5.385165. Average = 3.818. 

Next, get the distances to the points in the triangle cluster (c2). The distances are 2.000000, 4.242641, 3.162278, 5.000000, 4.472136, 5.830952. Average = 4.118. 

The cross cluster has the smaller average, so it is the nearest cluster and we set b = 3.818.
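The same search for the nearest cluster can be sketched for all points at once. The arrays `points` and `labels` restate the exercise data, and `mean_to`, `own`, `b` and `nearest` are names introduced for this illustration, not part of the original solution.

```python
import numpy as np

# Points and cluster labels in the same row order as `df`.
points = np.array([[1, 4], [2, 4], [2, 5], [3, 3], [4, 4], [4, 5], [5, 5], [6, 4],
                   [6, 6], [3, 4], [4, 1], [4, 3], [5, 1], [5, 2], [6, 1]], dtype=float)
labels = np.array(['c1'] * 3 + ['c3'] * 6 + ['c2'] * 6)

D = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
clusters = np.unique(labels)  # sorted: c1, c2, c3

# mean_to[i, k] = average distance from point i to all points of cluster k.
mean_to = np.stack([D[:, labels == c].mean(axis=1) for c in clusters], axis=1)

# Mask out each point's own cluster, then take the smallest remaining average.
own = labels[:, None] == clusters[None, :]
masked = np.where(own, np.inf, mean_to)
b = masked.min(axis=1)
nearest = clusters[masked.argmin(axis=1)]

print(np.round(b, 3))
print(nearest)
```

For the point at (1, 4) this should select c3 (the cross cluster) with b of about 3.818, matching the worked example.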

In [4]:
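# dist1, dist2: average distance from each point to the two clusters it does not belong to
# (rows 0-2: c3 and c2; rows 3-8: c1 and c2; rows 9-14: c1 and c3)
sil['dist1'] = pd.Series([3.818, 2.881, 2.953, 1.962, 2.412, 2.466, 3.428, 4.374, 4.66, 1.47, 4.11, 2.74, 4.74, 4.107, 5.496])
sil['dist2'] = pd.Series([4.118, 3.281, 3.936, 2.15, 2.33, 3.195, 3.286, 2.873, 4.469, 2.043, 3.725, 2.013, 3.729, 2.832, 3.968])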

sil['b'] = sil.apply(lambda x : np.min([x.dist1, x.dist2]), axis=1)
sil

Unnamed: 0,x,y,cluster,a,dist1,dist2,b
0,1,4,c1,1.207,3.818,4.118,3.818
1,2,4,c1,1.0,2.881,3.281,2.881
2,2,5,c1,1.207,2.953,3.936,2.953
3,3,3,c3,2.777,1.962,2.15,1.962
4,4,4,c3,1.731,2.412,2.33,2.33
5,4,5,c3,1.742,2.466,3.195,2.466
6,5,5,c3,1.614,3.428,3.286,3.286
7,6,4,c3,2.163,4.374,2.873,2.873
8,6,6,c3,2.544,4.66,4.469,4.469
9,3,4,c2,3.051,1.47,2.043,1.47


### Finally, use the formula to get the silhouette coefficient for each point

In [5]:
sil['sil'] = sil.apply(lambda x : (x.b - x.a) / np.max([x.a, x.b]), axis=1)
sil

Unnamed: 0,x,y,cluster,a,dist1,dist2,b,sil
0,1,4,c1,1.207,3.818,4.118,3.818,0.683866
1,2,4,c1,1.0,2.881,3.281,2.881,0.652898
2,2,5,c1,1.207,2.953,3.936,2.953,0.591263
3,3,3,c3,2.777,1.962,2.15,1.962,-0.293482
4,4,4,c3,1.731,2.412,2.33,2.33,0.257082
5,4,5,c3,1.742,2.466,3.195,2.466,0.293593
6,5,5,c3,1.614,3.428,3.286,3.286,0.508825
7,6,4,c3,2.163,4.374,2.873,2.873,0.247128
8,6,6,c3,2.544,4.66,4.469,4.469,0.430745
9,3,4,c2,3.051,1.47,2.043,1.47,-0.518191


### Silhouette Coefficient of a cluster = Average silhouette coefficient of the points in the cluster

Example: 

For the circle cluster, the sil coeffs of the points are 0.684, 0.653, 0.591. Average = 0.643

In [6]:
cluster_sils = pd.DataFrame(sil.groupby('cluster').sil.agg('mean'))
cluster_sils

cluster,sil
c1,0.642676
c2,0.226948
c3,0.240649


### Silhouette Coefficient of the whole clustering = Average silhouette coefficient of the points in the data

In [7]:
print('Silhouette Coefficient:', np.round(sil['sil'].mean(), 3))

Silhouette Coefficient: 0.316
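
As a cross-check, scikit-learn's `silhouette_samples` and `silhouette_score` use the same conventions as this solution: b is the mean distance to the nearest other cluster, and the overall score is the mean over all points. The sketch below assumes scikit-learn is installed. The integer labels are stand-ins for c1, c3 and c2, and `per_point`, `per_cluster` and `overall` are names chosen for this example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Points in the same row order as `df`.
points = np.array([[1, 4], [2, 4], [2, 5], [3, 3], [4, 4], [4, 5], [5, 5], [6, 4],
                   [6, 6], [3, 4], [4, 1], [4, 3], [5, 1], [5, 2], [6, 1]], dtype=float)
# Integer stand-ins for the labels c1 (1), c3 (3) and c2 (2).
labels = np.array([1] * 3 + [3] * 6 + [2] * 6)

# Silhouette value of every point, and the mean over all points.
per_point = silhouette_samples(points, labels, metric='euclidean')
overall = silhouette_score(points, labels, metric='euclidean')

# Silhouette coefficient of a cluster = mean of its points' values.
per_cluster = {int(c): float(per_point[labels == c].mean()) for c in np.unique(labels)}

print(np.round(per_point, 3))
print({c: round(v, 3) for c, v in per_cluster.items()})
print(round(float(overall), 3))
```

Small differences in the last digit compared with the tables above are expected, because the manual calculation works with a and b rounded to three decimals.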
