## Choosing the number of clusters for KMeans

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

import numpy as np
import pandas as pd

from tqdm.notebook import tqdm
from sklearn.cluster import KMeans
import plotly.express as px

In [None]:
train = pd.read_csv('../input/lish-moa/train_features.csv')
test = pd.read_csv('../input/lish-moa/test_features.csv')

GENES = [col for col in train.columns if 'g-' in col]
CELLS = [col for col in train.columns if 'c-' in col]

# GENES

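First, let's try the elbow method, which is the usual starting point. We compute the SSE (the sum of squared distances from each sample to its nearest cluster centre, which scikit-learn exposes as `inertia_`) for a range of cluster counts and plot it.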
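As a quick sanity check of what SSE means here, the sketch below recomputes it by hand and compares it with `inertia_`. It uses synthetic data from `make_blobs` as a stand-in (an assumption for illustration, not the MoA features).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption: NOT the competition features).
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=77)

kmeans = KMeans(n_clusters=4, n_init=5, random_state=77).fit(X)

# SSE: squared Euclidean distance from each point to its assigned centroid, summed.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_sse = float(np.sum((X - assigned_centroids) ** 2))

print(manual_sse, kmeans.inertia_)
```

The two printed numbers should agree up to floating-point error. SSE always decreases as clusters are added, which is why we look for a bend in the curve rather than a minimum.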

In [None]:
# Cluster on train and test together; build the matrix once, outside the loop.
k_data = pd.concat([train[GENES], test[GENES]], axis=0)

g_SSE = []
for i in tqdm(range(5, 250, 5)):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=5, max_iter=50, tol=1e-04, random_state=77)
    kmeans.fit(k_data)
    g_SSE.append(kmeans.inertia_)

In [None]:
df_g = pd.DataFrame({"gene_SSE":g_SSE,
                    'num':list(range(5,250,5))})

fig = px.line(df_g, x='num', y="gene_SSE",
              title="Genes: SSE vs. number of clusters")
fig.show()

As you can see, the elbow method does not clearly indicate how many clusters to use for KMeans.

Let's introduce the **Silhouette Coefficient**.

# Silhouette Coefficient

The Silhouette Coefficient, or silhouette score, is a metric for judging how good a clustering is.

Its value ranges from -1 to 1, with 1 being the best and -1 being the worst.

<img src="https://miro.medium.com/max/700/1*cUcY9jSBHFMqCmX-fp8BvQ.jpeg"></img>

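**Silhouette Score = ${\dfrac{b-a}{\max(a,b)}}$**

where, for a single sample,

**a** = mean intra-cluster distance, i.e. the average distance from the sample to the other points in its own cluster.

**b** = mean nearest-cluster distance, i.e. the average distance from the sample to the points of the nearest cluster it does not belong to.

The score reported for a whole clustering is the mean of this value over all samples.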

[Source](https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c)
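To make the formula concrete, here is a minimal sketch that computes the per-sample score by hand on a tiny 1-D dataset and compares it with scikit-learn. The data values and the helper `silhouette_for_point` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Tiny hand-made 1-D dataset (illustrative values, not from the competition data).
X = np.array([[0.0], [1.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 1, 1, 1])

def silhouette_for_point(i):
    """Silhouette of point i: (b - a) / max(a, b). Hypothetical helper."""
    dists = np.abs(X[:, 0] - X[i, 0])   # distance from point i to every point
    same = labels == labels[i]
    same[i] = False                     # a excludes the point itself
    a = dists[same].mean()              # mean distance within its own cluster
    # b: smallest mean distance to the points of any other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_for_point(i) for i in range(len(X))])

print(manual)
print(silhouette_samples(X, labels))
print(manual.mean(), silhouette_score(X, labels))
```

The hand-computed values should match `silhouette_samples`, and their mean should match `silhouette_score`. Note that this simple helper does not handle clusters with a single point.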

Thankfully, we don't need to implement this ourselves: scikit-learn provides a utility to calculate it.

In [None]:
from sklearn.metrics import silhouette_score

# Cluster on train and test together; build the matrix once, outside the loop.
k_data = pd.concat([train[GENES], test[GENES]], axis=0)

sil_scores = []
for i in tqdm(range(5, 250, 5)):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=5, max_iter=50, tol=1e-04, random_state=77)
    # fit_predict fits the model and returns the training labels in one step.
    labels = kmeans.fit_predict(k_data)
    sil_scores.append(silhouette_score(k_data, labels))

In [None]:
df_g = pd.DataFrame({"silhouette scores":sil_scores,
                    'num':list(range(5,250,5))})

fig = px.line(df_g, x='num', y="silhouette scores",
              title="Genes: silhouette score vs. number of clusters")
fig.show()

The silhouette score graph shows peaks at **20** and **30** clusters, so these are reasonable choices for the number of clusters if you are using the cluster centroids as features.
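This notebook does not show how the centroid features are built, so the sketch below is only one possible reading: append each row's distance to every centroid, using `KMeans.transform`. The helper `add_cluster_features`, the column prefix, and the toy frames are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def add_cluster_features(train_df, test_df, cols, n_clusters, prefix, seed=77):
    """Hypothetical helper: append distance-to-centroid columns computed from `cols`."""
    # Fit on train and test together, as in the loops in this notebook.
    combined = pd.concat([train_df[cols], test_df[cols]], axis=0)
    kmeans = KMeans(n_clusters=n_clusters, n_init=5, random_state=seed).fit(combined)
    names = [f"{prefix}_dist_{k}" for k in range(n_clusters)]
    out = []
    for df in (train_df, test_df):
        # transform() returns the distance from each row to each cluster centre.
        dists = pd.DataFrame(kmeans.transform(df[cols]), columns=names, index=df.index)
        out.append(pd.concat([df, dists], axis=1))
    return out[0], out[1]

# Toy frames standing in for train/test (random values, for illustration only).
rng = np.random.default_rng(0)
cols = [f"g-{i}" for i in range(6)]
toy_train = pd.DataFrame(rng.normal(size=(40, 6)), columns=cols)
toy_test = pd.DataFrame(rng.normal(size=(20, 6)), columns=cols)

new_train, new_test = add_cluster_features(toy_train, toy_test, cols, n_clusters=3, prefix="g")
print(new_train.columns.tolist())
```

With the real data, the call would look something like `add_cluster_features(train, test, GENES, n_clusters=20, prefix="g")`. One-hot encoding the cluster labels is another common option.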


# CELLS

In [None]:
# Cluster on train and test together; build the matrix once, outside the loop.
k_data = pd.concat([train[CELLS], test[CELLS]], axis=0)

c_SSE = []
for i in tqdm(range(2, 100, 2)):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=5, max_iter=50, tol=1e-04, random_state=77)
    kmeans.fit(k_data)
    c_SSE.append(kmeans.inertia_)

In [None]:
df_c = pd.DataFrame({"cell_SSE":c_SSE,
                    'num':list(range(2, 100, 2))})

fig = px.line(df_c, x='num', y="cell_SSE",
              title="Cells: SSE vs. number of clusters")
fig.show()

You can see that the elbow in this case is at around 8-16 clusters; we can take any value in this range.

Let's check out the silhouette scores.

In [None]:
from sklearn.metrics import silhouette_score

# Cluster on train and test together; build the matrix once, outside the loop.
k_data = pd.concat([train[CELLS], test[CELLS]], axis=0)

sil_scores = []
for i in tqdm(range(2, 100, 2)):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=5, max_iter=50, tol=1e-04, random_state=77)
    # fit_predict fits the model and returns the training labels in one step.
    labels = kmeans.fit_predict(k_data)
    sil_scores.append(silhouette_score(k_data, labels))

In [None]:
df_c = pd.DataFrame({"silhouette scores":sil_scores,
                    'num':list(range(2, 100, 2))})

fig = px.line(df_c, x='num', y="silhouette scores",
              title="Cells: silhouette score vs. number of clusters")
fig.show()

The 8-16 range that we worked out before doesn't look as good now, because a 2-cluster separation scores very well.

But since the goal is to add features to the data, I think we can settle for a somewhat lower silhouette score.
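One way to act on that trade-off is to restrict the search to a range of cluster counts that is useful for feature building and take the best silhouette score inside it. The sketch below assumes a frame shaped like `df_c` (columns `num` and `silhouette scores`). The helper `best_k_in_range` is hypothetical and the scores are made up for illustration.

```python
import pandas as pd

def best_k_in_range(scores: pd.DataFrame, k_min: int, k_max: int) -> int:
    """Hypothetical helper: k with the highest silhouette score within [k_min, k_max]."""
    window = scores[(scores["num"] >= k_min) & (scores["num"] <= k_max)]
    return int(window.loc[window["silhouette scores"].idxmax(), "num"])

# Made-up scores for illustration only (NOT results from this notebook).
toy = pd.DataFrame({
    "num": [2, 4, 6, 8, 10, 12, 14, 16],
    "silhouette scores": [0.40, 0.22, 0.15, 0.12, 0.14, 0.11, 0.10, 0.09],
})

chosen = best_k_in_range(toy, 8, 16)
print(chosen)
```

On the real scores, the call would be `best_k_in_range(df_c, 8, 16)`, using the elbow range found above as the window.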

## Note 

Please note that I am a beginner; if you find any problem or error in this, feel completely free to point it out to me.

I am always looking for suggestions to improve, and if you find the kernel useful, **consider upvoting**.