



<div class="table-of-contents" style="background-color:#007fff; color:white; padding: 20px; margin: 10px; font-size: 150%; border-radius: 25px; box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.25);">
  <h1 style="color: white;">Table Of Contents</h1>
  <ol>
    <li><a href="#1" style="color: white;"> Imports</a></li>
      <li><a href="#2" style="color: white;"> Load Data</a></li>
    <li><a href="#3" style="color: white;"> EDA (Exploratory Data Analysis)</a></li>
    <li><a href="#4" style="color: white;"> K-Means Model Implementation</a></li>
    <li><a href="#5" style="color: white;"> Clustering Evaluation</a></li>
    <li><a href="#6" style="color: white;"> Thank You</a></li>  
  </ol>

</div>

Code by: [Tanvir Anjom Siddique](https://tanvirsweb.github.io/)

- [Click to visit my LinkedIn profile](https://www.linkedin.com/in/tanvir-anjom-siddique/)
- [Click to visit my Github profile](https://github.com/tanvirsweb/)
- [Click to visit my Portfolio](https://tanvirsweb.github.io/)


In [46]:
!python --version
!echo "CLUSTERING WITH K-MEANS : Tanvir Anjom Siddique"

Python 3.13.5
"CLUSTERING WITH K-MEANS : Tanvir Anjom Siddique"


<a id="1"></a>

<h1 style='background:#007fff;border:0; color:white;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.25);
    transform: rotateX(10deg); border-radius: 25px;
    '><center>Imports & Library Versions</center></h1>

# Imports


In [47]:
# !pip install plotly

In [48]:
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import plotly.graph_objects as go
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans


<h2>Explanation of Some Confusing Terms in Python</h2>

<table border="1" cellpadding="5" cellspacing="0">
  <thead>
    <tr>
      <th>Term</th>
      <th>Meaning</th>
      <th>Example</th>
      <th>Analogy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><b>Environment</b></td>
      <td>Overall runtime setup</td>
      <td>Python 3.11 + Jupyter + Libraries</td>
      <td>The lab room</td>
    </tr>
    <tr>
      <td><b>Library / Package</b></td>
      <td>Reusable code modules</td>
      <td><code>numpy</code>, <code>pandas</code>, <code>sklearn</code></td>
      <td>Tools in the lab</td>
    </tr>
    <tr>
      <td><b>Dependency</b></td>
      <td>What your project requires</td>
      <td>Code depends on <code>numpy</code></td>
      <td>The tools you <em>must</em> have</td>
    </tr>
    <tr>
      <td><b>Version</b></td>
      <td>Specific release of any component</td>
      <td><code>numpy 1.26.4</code></td>
      <td>Model number of each tool</td>
    </tr>
  </tbody>
</table>

## Library/Package Versions


In [49]:
import plotly
import sys
import sklearn
from sklearn import __version__ as sklearn_version

print(f"Python Version: {sys.version.split()[0]}")
print(f"Numpy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"Plotly Version: {plotly.__version__}")
print(f"Seaborn Version: {sns.__version__}")
print(f"Scikit-learn Version: {sklearn.__version__}")
print(f"Scikit-learn Version: {sklearn_version}")


Python Version: 3.13.5
Numpy Version: 2.1.3
Pandas Version: 2.2.3
Plotly Version: 5.24.1
Seaborn Version: 0.13.2
Scikit-learn Version: 1.6.1
Scikit-learn Version: 1.6.1


<a id="2"></a>

<h1 style='background:#007fff;border:0; color:white;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.25);
    transform: rotateX(10deg); border-radius: 10px;
    '><center>Loading The Data</center></h1>

# Loading The Data


In [50]:
# iris = pd.read_csv("../input/iris/Iris.csv") #Load Data
iris = pd.read_csv("Iris.csv") #Load Data
iris.drop('Id',inplace=True,axis=1) #Drop Id column

In [51]:
X = iris.iloc[:,:-1].values #Set our training data (all rows and cols except last col)

y = iris.iloc[:,-1].values #We'll use this just for visualization as clustering doesn't require labels (in all rows value in last column)

X.shape, y.shape

((150, 4), (150,))

## Printing DataFrame Head with Colors

**1. `iris.head()`**

- Displays the first five rows of the `iris` DataFrame.
- Useful for a quick preview of the dataset’s structure, columns, and sample values.

**2. `.style`**

- Activates Pandas **Styler**, allowing formatting and styling (color gradients, fonts, bar charts, etc.) in a Jupyter Notebook.
- **Does not modify the data**, only changes how it is visually presented.

**3. `.background_gradient(cmap=...)`**

- Applies a **color gradient** to the background of numerical cells based on their values.
- Low values get lighter colors, high values get darker (or vice versa, depending on the colormap).
- Makes it easier to **visually compare numeric values** across rows and columns.

**4. `sns.cubehelix_palette(as_cmap=True)`**

- `sns` refers to the Seaborn library.
- `cubehelix_palette()` generates a **smooth, perceptually uniform color gradient**.
- `as_cmap=True` returns it as a **matplotlib colormap object**, which `.background_gradient()` accepts for its `cmap` argument.
- The Cubehelix color scheme is ideal because it **looks good in both color and grayscale**, which is helpful for academic reports.


In [52]:
iris.head().style.background_gradient(cmap =sns.cubehelix_palette(as_cmap=True))

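|     | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species     |
| --- | ------------- | ------------ | ------------- | ------------ | ----------- |
| 0   | 5.1           | 3.5          | 1.4           | 0.2          | Iris-setosa |
| 1   | 4.9           | 3.0          | 1.4           | 0.2          | Iris-setosa |
| 2   | 4.7           | 3.2          | 1.3           | 0.2          | Iris-setosa |
| 3   | 4.6           | 3.1          | 1.5           | 0.2          | Iris-setosa |
| 4   | 5.0           | 3.6          | 1.4           | 0.2          | Iris-setosa |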


<a id="3"></a>

<h1 style='background:#007fff;border:0; color:white;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.25);
    transform: rotateX(10deg); border-radius: 10px;
    '><center>EDA</center></h1>

# **EDA (Exploratory Data Analysis):**

- EDA is the process of analyzing and visualizing data to **understand patterns, distributions, and relationships**.
- Pie charts like this help visualize the distribution of categories, which is crucial before applying any machine learning models.


## Data Distribution using Pie Chart

**Code Purpose:**

- Visualize the distribution of Iris species using a pie chart.
- Helps understand the balance or imbalance in the dataset.

**Explanation of Code:**

1. `px.pie()`

   - Creates a pie chart from a DataFrame.
   - Automatically counts how many rows carry each unique value in the specified column to determine slice sizes.

2. `iris`

   - The DataFrame containing the Iris dataset.

3. `'Species'`

   - Column whose values determine the pie slices.

4. `color_discrete_sequence`

   - Customizes the colors of each slice.

5. `title='Data Distribution'`

   - Sets the chart title.

6. `template='plotly'`

   - Uses Plotly’s default style template for a clean appearance.

7. `fig.show()`
   - Displays the interactive chart in the notebook.


In [53]:
import plotly.express as px

fig = px.pie(
    iris,                               # DataFrame
    'Species',                           # Column to count values for pie slices
    color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],  # Custom colors
    title='Data Distribution',           # Chart title
    template='plotly'                    # Plotly template for style
)

fig.show()                               # Render the pie chart


## From this plot we conclude that:

**The data is perfectly balanced: each of the three species accounts for 50 of the 150 samples**


## Sepal-Length


In [54]:
fig = px.box(
    data_frame=iris,                   # DataFrame containing the data
    x='Species',                        # Categories on x-axis
    y='SepalLengthCm',                  # Values to plot on y-axis
    color='Species',                     # Color by species
    color_discrete_sequence=['#29066B','#7D3AC1','#EB548C'],  # Custom colors
    orientation='v'                      # Vertical boxes
)
fig.show()                               # Render the box plot


### Histogram and EDA

**Code Purpose:**

- Visualize the distribution of `SepalLengthCm` across Iris species using a histogram.
- Helps understand data spread, frequency, and potential overlaps between species.

**Explanation of Code:**

1. `px.histogram()`

   - Creates a histogram showing frequency counts for a numerical variable.

2. `data_frame=iris`

   - Specifies the DataFrame containing the data.

3. `x='SepalLengthCm'`

   - Numerical variable to plot on the x-axis.

4. `color='Species'`

   - Separates histogram bars by species for comparison.

5. `color_discrete_sequence=[...]`

   - Custom colors for each species.

6. `nbins=50`

   - Number of bins to divide the data range.

7. `fig.show()`
   - Displays the interactive chart.

**EDA (Exploratory Data Analysis):**

- Histograms show the **frequency distribution** of a feature.
- Useful for spotting **patterns, skewness, and overlaps** in the data before clustering or modeling.


In [55]:
fig = px.histogram(
    data_frame=iris,                        # DataFrame containing the data
    x='SepalLengthCm',                       # Column for which to plot the histogram
    color='Species',                         # Separate histograms by species
    color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],  # Custom colors
    nbins=50                                 # Number of bins for the histogram
)
fig.show()                                   # Render the interactive histogram


### From these plots we conclude that:

- **Setosa has a much smaller SepalLength than the other 2 classes**

- **Virginica has the highest SepalLength; however, it seems hard to distinguish between Virginica and Versicolor using SepalLength, as the difference is less clear**

- **We can see that Virginica contains an outlier**


---


## SepalWidth


In [56]:
fig = px.box(data_frame=iris, x='Species',y='SepalWidthCm',color='Species',color_discrete_sequence=['#29066B','#7D3AC1','#EB548C'],orientation='v')
fig.show()

In [57]:
fig = px.histogram(data_frame=iris, x='SepalWidthCm',color='Species',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],nbins=30)
fig.show()

### From these plots we conclude that:

- **Setosa has a larger SepalWidth than the other 2 classes**

- **Versicolor has a smaller SepalWidth than the other 2 classes**

- **Overall, all classes have relatively close SepalWidth values, which indicates that it might not be a very useful feature**


---


## Petal-Length


In [58]:
fig = px.box(data_frame=iris, x='Species',y='PetalLengthCm',color='Species',color_discrete_sequence=['#29066B','#7D3AC1','#EB548C'],orientation='v')
fig.show()

In [59]:
fig = px.histogram(data_frame=iris, x='PetalLengthCm',color='Species',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],nbins=30)
fig.show()

### From these plots we conclude that:

- **Setosa has a much smaller PetalLength than the other 2 classes**

- **This difference is less clear between Virginica and Versicolor**

- **Overall, PetalLength seems like an interesting feature**


---


## Petal-Width


In [60]:
fig = px.box(data_frame=iris, x='Species',y='PetalWidthCm',color='Species',color_discrete_sequence=['#29066B','#7D3AC1','#EB548C'],orientation='v')
fig.show()

In [61]:
fig = px.histogram(data_frame=iris, x='PetalWidthCm',color='Species',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],nbins=30)
fig.show()

### From these plots we conclude that:

- **Setosa has a much smaller PetalWidth than the other 2 classes**

- **This difference is less clear between Virginica and Versicolor**

- **Overall, PetalWidth seems like an interesting feature**


---


## Scatter Plot and EDA

**Code Purpose:**

- Visualize relationships between sepal length, sepal width, and petal length across Iris species.
- Helps identify patterns, correlations, and differences between species.

**Explanation of Code:**

1. `px.scatter()`

   - Creates a scatter plot showing relationships between two numerical variables.

2. `data_frame=iris`

   - Specifies the DataFrame containing the Iris dataset.

3. `x='SepalLengthCm'`

   - Sets the x-axis values.

4. `y='SepalWidthCm'`

   - Sets the y-axis values.

5. `color='Species'`

   - Differentiates points by species using color.

6. `size='PetalLengthCm'`

   - Represents petal length as the size of points.

7. `template='seaborn'`

   - Applies a Seaborn-like visual theme.

8. `color_discrete_sequence=[...]`

   - Custom colors for species.

9. `fig.update_layout()`

   - Customizes figure size and axis colors.

10. `fig.show()`
    - Displays the interactive scatter plot.

**EDA (Exploratory Data Analysis):**

- Scatter plots help **identify correlations, trends, and clusters**.
- In this case, it shows how sepal length, sepal width, and petal length differ among Iris species.


In [62]:
fig = px.scatter(
    data_frame=iris,                    # DataFrame containing the data
    x='SepalLengthCm',                  # Values for x-axis
    y='SepalWidthCm',                   # Values for y-axis
    color='Species',                     # Color points based on species
    size='PetalLengthCm',                # Size of points represents petal length
    template='seaborn',                  # Plotly theme style
    color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C']  # Custom colors for species
)
fig.update_layout(
    width=800, height=600,               # Set figure size
    xaxis=dict(color="#BF40BF"),         # Color of x-axis line and ticks
    yaxis=dict(color="#BF40BF")          # Color of y-axis line and ticks
)
fig.show()                               # Display the interactive scatter plot


In [63]:
fig = px.scatter(data_frame=iris, x='PetalLengthCm',y='PetalWidthCm'
           ,color='Species',size='SepalLengthCm',template='seaborn',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],)

fig.update_layout(width=800, height=600,
                  xaxis=dict(color="#BF40BF"),
                 yaxis=dict(color="#BF40BF"))
fig.show()

<a id="4"></a>

<h1 style='background:#007fff;border:0; color:white;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.25);
    transform: rotateX(10deg); border-radius: 10px;
    '><center>KMeans</center></h1>


# K-Means Clustering Algorithm & Definition

<b>Definition:</b><br>
K-Means is an <b>unsupervised machine learning algorithm</b> used to partition <code>n</code> data points into <code>k</code> clusters. Each cluster has a centroid (the mean of its points), and the algorithm aims to <b>minimize the sum of squared distances between points and their respective centroids</b>.

## <b>Steps of K-Means Algorithm</b>

1. <b>Choose the number of clusters (k)</b>.
2. <b>Initialize centroids</b> (randomly or using a heuristic).
3. <b>Assign each data point to the nearest centroid</b> using a distance metric (usually Euclidean distance):

<p style="text-align:center;">
Cluster assignment: <b>C<sub>i</sub> = argmin<sub>j</sub> || x<sub>i</sub> - μ<sub>j</sub> ||²</b>
</p>

Where:

- <b>x<sub>i</sub></b> = data point
- <b>μ<sub>j</sub></b> = centroid of cluster j
- <b>C<sub>i</sub></b> = cluster assignment of x<sub>i</sub>

4. <b>Update centroids</b> by computing the mean of all points assigned to each cluster:

<p style="text-align:center;">
μ<sub>j</sub> = (1 / |C<sub>j</sub>|) Σ<sub>x<sub>i</sub> ∈ C<sub>j</sub></sub> x<sub>i</sub>
</p>

5. <b>Repeat steps 3-4</b> until centroids do not change significantly or a maximum number of iterations is reached.

## <b>Mathematical Example</b>

Suppose we have 5 data points along a 1D axis:

<code>X = [2, 4, 5, 8, 10]</code>

We want to cluster them into <b>k = 2 clusters</b>.

1. <b>Initialize centroids:</b> randomly pick μ<sub>1</sub> = 2, μ<sub>2</sub> = 8.

2. <b>Assign points to nearest centroid:</b>

| Point x<sub>i</sub> | Distance to μ<sub>1</sub>=2 | Distance to μ<sub>2</sub>=8 | Cluster  |
| ------------------- | --------------------------- | --------------------------- | -------- |
| 2                   | 0                           | 6                           | 1        |
| 4                   | 2                           | 4                           | 1        |
| 5                   | 3                           | 3                           | 1 (or 2) |
| 8                   | 6                           | 0                           | 2        |
| 10                  | 8                           | 2                           | 2        |

3. <b>Update centroids:</b>

<p style="text-align:center;">
μ<sub>1</sub> = mean([2,4,5]) ≈ 3.67, &nbsp;&nbsp; μ<sub>2</sub> = mean([8,10]) = 9
</p>

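4. <b>Repeat assignment:</b>

- Recompute distances with the new centroids (μ<sub>1</sub> ≈ 3.67, μ<sub>2</sub> = 9). For example, point 5 is now about 1.33 from μ<sub>1</sub> and 4 from μ<sub>2</sub>, so it stays in cluster 1; point 8 is about 4.33 from μ<sub>1</sub> and 1 from μ<sub>2</sub>, so it stays in cluster 2.
- No assignment changes, so the centroids stay where they are and the algorithm has converged.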

<b>Result:</b>

- Cluster 1: [2,4,5] → centroid ≈ 3.67
- Cluster 2: [8,10] → centroid = 9

This simple example illustrates how K-Means iteratively minimizes the <b>sum of squared distances</b> within clusters.
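The worked example above can be reproduced with a short script. This is a minimal sketch in plain Python rather than scikit-learn, assuming 1D data and the same starting centroids. The function name `kmeans_1d` is made up for illustration, and ties are resolved toward the first centroid, which corresponds to placing point 5 in cluster 1. Labels 0 and 1 in the output stand for clusters 1 and 2 in the table.

```python
def kmeans_1d(points, init_centroids, max_iter=100):
    """Illustrative 1D K-Means (Lloyd's algorithm); not scikit-learn's implementation."""
    centroids = list(init_centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        # min() returns the first minimum, so ties go to the lower index.
        labels = [
            min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [x for x, lab in zip(points, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else old)
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, labels


centroids, labels = kmeans_1d([2, 4, 5, 8, 10], init_centroids=[2, 8])
print([round(c, 2) for c in centroids])  # [3.67, 9.0]
print(labels)                            # [0, 0, 0, 1, 1]
```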

---


## Using the Elbow Method to Find the Optimal Number of Clusters for K-Means

**Purpose:**

- Determine the optimal number of clusters (`k`) for KMeans using **Sum of Squared Errors (SSE)**.

### **Elbow Method Concept**

- Plot SSE vs number of clusters.
- The **elbow point** is where SSE stops decreasing sharply.
- This point is considered the **optimal number of clusters** for KMeans.

### **What is SSE (Sum of Squared Errors)?**

- SSE measures the **total squared distance** between each point and its assigned cluster centroid:

<div style="text-align: center;">
SSE = Σ<sub>i=1</sub><sup>n</sup> || x<sub>i</sub> - μ<sub>C<sub>i</sub></sub> ||²
</div>

Where:

- <b>x<sub>i</sub></b> = data point
- <b>μ<sub>C<sub>i</sub></sub></b> = centroid of the cluster assigned to x<sub>i</sub>
- <b>n</b> = total number of points

- **Lower SSE** → points are closer to their centroids → tighter clusters.
- SSE **always decreases** as the number of clusters increases, so we look for the **“elbow point”** where the decrease slows down.
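To make the formula concrete, here is a small sketch that evaluates SSE on the 1D toy data `[2, 4, 5, 8, 10]` for one and for two clusters. The helper `sse` is a hypothetical name; it measures each point against its nearest centroid, which is the same as its assigned centroid once K-Means has converged.

```python
def sse(points, centroids):
    """Sum of squared distances from each 1D point to its nearest centroid."""
    return sum(min((x - c) ** 2 for c in centroids) for x in points)


points = [2, 4, 5, 8, 10]
sse_k1 = sse(points, [sum(points) / len(points)])  # one cluster: centroid is the overall mean
sse_k2 = sse(points, [11 / 3, 9])                  # two clusters: [2, 4, 5] and [8, 10]
print(round(sse_k1, 2), round(sse_k2, 2))  # 40.8 6.67
```

Going from one cluster to two cuts SSE from 40.8 to about 6.67, which is the kind of sharp drop the elbow plot looks for.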

### **Explanation of Code**

1. `sse = []` → Initialize a list to store **SSE values** for different numbers of clusters.
2. `for i in range(1, 9):` → Loop through 1 to 8 clusters.
3. `KMeans(n_clusters=i, n_init=10, max_iter=300)` → Initialize KMeans for the current number of clusters, with 10 restarts per value of `k`.
4. `kmeans.fit(X)` → Fit KMeans to the dataset `X`.
5. `sse.append(kmeans.inertia_)` → Record **SSE (sum of squared distances)**.
6. `px.line(...)` → Plot SSE against the number of clusters `k` (1 to 8) with the `seaborn` template and the title 'Elbow Method'.
7. `fig.update_layout()` → Customize figure layout.
8. `fig.show()` → Display the interactive plot.


In [64]:

import os
os.environ["OMP_NUM_THREADS"] = "1"
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn.cluster._kmeans")

from sklearn.cluster import KMeans

sse = []                                # To store sum of squared errors for each k

for i in range(1, 9):                   # Try 1 to 8 clusters
    kmeans = KMeans(n_clusters=i,n_init=10, max_iter=300)  # Initialize KMeans
    kmeans.fit(X)                        # Fit the model on data X
    sse.append(kmeans.inertia_)          # Record inertia (sum of squared distances)

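# Plot SSE vs number of clusters.
# x is passed explicitly; otherwise the axis shows list indices 0-7 instead of k = 1-8.
fig = px.line(
    x=list(range(1, 9)),
    y=sse,
    template="seaborn",
    title='Elbow Method'
)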
fig.update_layout(
    width=800,
    height=600,
    title_font_color="#BF40BF",
    xaxis=dict(color="#BF40BF", title="Clusters"),
    yaxis=dict(color="#BF40BF", title="SSE")
)
fig.show()


**As expected, the elbow appears at 3 clusters, so let's implement the model using 3 clusters**

---


### KMeans Memory Warning on Windows

**Warning Message:**

```
c:\ProgramData\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1419: UserWarning:
KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
```

**Explanation:**

- This warning occurs when using **KMeans on Windows** with **MKL (Intel Math Kernel Library)**.
- KMeans uses **multi-threading** to speed up computation.
- On some Windows setups, if the number of **data chunks < available threads**, memory can accumulate (memory leak).
- **Small datasets** (like Iris) are usually not affected in practice, so the warning can be safely ignored.

**How to Fix / Suppress the Warning:**

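1. **Set the environment variable in Python, before NumPy and scikit-learn are first imported** (for example, in the first notebook cell after a kernel restart), because the threading libraries typically read it when they load: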

```py
    import os
    os.environ["OMP_NUM_THREADS"] = "1"
```

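2. **Explicitly set `n_init` in KMeans (optional).** This does not affect the memory-leak warning; it silences the separate `FutureWarning` about the changing `n_init` default that scikit-learn 1.2 and 1.3 emit: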

```py
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=3, max_iter=300, n_init=10)
```

**Recommendation:**

- For **small datasets**, you can safely ignore this warning.
- For **larger datasets**, use `OMP_NUM_THREADS=1` to prevent potential memory issues.

### Understanding `n_init` in KMeans

**Definition:**

- `n_init` specifies **how many times KMeans will run with different initial cluster centers**.

**Why it Matters:**

- KMeans is sensitive to **initial placement of centroids**.
- Poor initialization can lead to **suboptimal clusters** or convergence to a local minimum.
- Running KMeans multiple times with different initial centroids improves the chance of finding the **best clustering solution**.

**Default Behavior:**

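- In scikit-learn ≥1.4, the default is `n_init='auto'`, which runs **1** initialization when `init='k-means++'` (the default) and **10** when `init='random'`.
- Versions before 1.4 defaulted to `n_init=10`, so set it explicitly if you want the same behavior across versions.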

**Example:**

```py
    from sklearn.cluster import KMeans

    # Run KMeans 20 times with different random initial centroids
    kmeans = KMeans(n_clusters=3, n_init=20, max_iter=300)
    kmeans.fit(X)
```

- KMeans runs **20 separate initializations** of centroids.
- Computes **SSE (inertia)** for each run.
- Keeps the clustering with the **lowest SSE** (best solution).

**Analogy:**

- Imagine throwing darts blindfolded to hit the center of a target.
- **One throw** → might miss.
- **Multiple throws** → keep the closest, more reliable result.
- That’s what `n_init` does — multiple tries to find the best cluster centers.
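The effect of the starting centroids can be seen on a tiny example. The sketch below is plain Python, not scikit-learn's code; `kmeans_1d` and the nine data points are made up for illustration. It shows one start that gets stuck in a local minimum, one that does not, and then mimics `n_init` by keeping the best of several random starts.

```python
import random


def kmeans_1d(points, init_centroids, max_iter=100):
    """Illustrative 1D K-Means; returns (centroids, sse)."""
    centroids = list(init_centroids)
    for _ in range(max_iter):
        # Assignment step: put each point in the group of its nearest centroid.
        groups = [[] for _ in centroids]
        for x in points:
            j = min(range(len(centroids)), key=lambda c: (x - centroids[c]) ** 2)
            groups[j].append(x)
        # Update step: group mean (an empty group keeps its old centroid).
        new_centroids = [
            sum(g) / len(g) if g else centroids[j] for j, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min((x - c) ** 2 for c in centroids) for x in points)
    return centroids, sse


points = [1, 2, 3, 11, 12, 13, 21, 22, 23]  # three obvious groups

# A poor start (all centroids inside the first group) gets stuck in a local minimum.
_, sse_bad = kmeans_1d(points, [1, 2, 3])
# A good start (one centroid per group) reaches the best partition.
_, sse_good = kmeans_1d(points, [2, 12, 22])
print(sse_bad, sse_good)  # 154.5 6

# What n_init does, in miniature: several random starts, keep the lowest SSE.
rng = random.Random(0)
runs = [kmeans_1d(points, rng.sample(points, 3)) for _ in range(10)]
best_centroids, best_sse = min(runs, key=lambda run: run[1])
print(best_sse)
```

A single unlucky start can leave SSE far above the best achievable value, which is why keeping the lowest-SSE run across restarts is worthwhile.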

---


## KMeans Clustering for 3 clusters

```python
    kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
    clusters = kmeans.fit_predict(X)
```

### Explanation:

1. **`KMeans(n_clusters=3, ...)`**

   - Creates a KMeans clustering object.
   - **`n_clusters=3`**: Divide the dataset into 3 clusters (e.g., the 3 Iris species).
   - **`init='k-means++'`**: Uses an improved method to initialize centroids for faster convergence and better clustering.
   - **`max_iter=300`**: Maximum number of iterations the algorithm will run per initialization.
   - **`n_init=10`**: Runs KMeans 10 times with different initial centroids and selects the best clustering based on lowest inertia (SSE).
   - **`random_state=0`**: Ensures reproducibility; the results will be the same each run.

2. **`fit_predict(X)`**

   - **`fit(X)`**: Computes the centroids by iteratively minimizing the distance between points and their assigned cluster centroid.
   - **`predict(X)`**: Assigns each data point in `X` to the nearest centroid.
   - Returns an array `clusters` containing **cluster labels** for each data point (e.g., `[0, 0, 2, 1, 0, ...]`).

### Summary:

- Finds the **best cluster centroids** in the dataset.
- Assigns **each point to a cluster**.
- Output `clusters` can be used for **visualization** or **evaluation metrics** like Silhouette Score or Davies-Bouldin Index.


In [65]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
import plotly.graph_objects as go

# --- KMeans Clustering ---
kmeans = KMeans(
    n_clusters=3,
    init='k-means++',
    max_iter=300,
    n_init=10,
    random_state=0
)
clusters = kmeans.fit_predict(X)



## KMeans Cluster Visualization

**Purpose:**

- Visualize the clustering results after running KMeans on the Iris dataset.
- Shows how data points are grouped and where the cluster centroids are located.

**Explanation of the Code:**

1. **Create a Figure**
   ```py
       fig = go.Figure()
   ```

- Initializes an empty Plotly figure for adding traces.

2. **Add Data Points for Each Cluster**

   ```py
       fig.add_trace(go.Scatter(
           x=X[clusters == 0, 0], y=X[clusters == 0, 1],
           mode='markers', marker_color='#DB4CB2', name='Cluster 0'
       ))
   ```

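- `clusters == 0` selects points belonging to cluster 0. Columns 0 and 1 of `X` are sepal length and sepal width, so the plot shows only two of the four features used for clustering.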
- `mode='markers'` plots points as scatter markers.
- `marker_color` sets the color for this cluster.
- `name` labels the cluster in the legend.
- Repeat for clusters 1 and 2 with different colors.

3. **Add Cluster Centroids**

   ```py
       fig.add_trace(go.Scatter(
           x=kmeans.cluster_centers_[:, 0], y=kmeans.cluster_centers_[:,1],
           mode='markers', marker_color='#CAC9CD', marker_symbol=4, marker_size=13, name='Centroids'
       ))
   ```

- `kmeans.cluster_centers_` contains the coordinates of cluster centers.
- `marker_symbol=4` gives the centroids a different shape (Plotly's `x` marker).
- `marker_size=13` makes centroids larger than data points.

4. **Customize Layout**

   ```py
       fig.update_layout(
           template='plotly_dark',
           width=1000, height=500,
           title='KMeans Clustering Results'
       )
   ```

- `template='plotly_dark'` gives a dark theme for the plot.
- Sets figure width, height, and title.

**Result:**

- Colored points represent data points assigned to each cluster.
- Centroids are marked with larger, distinct markers.
- Helps visually evaluate the effectiveness of KMeans clustering.



In [66]:
# --- Cluster Visualization ---
fig = go.Figure()

# Cluster 0
fig.add_trace(go.Scatter(
    x=X[clusters == 0, 0], y=X[clusters == 0, 1],
    mode='markers', marker_color='#DB4CB2', name='Cluster 0'
))

# Cluster 1
fig.add_trace(go.Scatter(
    x=X[clusters == 1, 0], y=X[clusters == 1, 1],
    mode='markers', marker_color='#c9e9f6', name='Cluster 1'
))

# Cluster 2
fig.add_trace(go.Scatter(
    x=X[clusters == 2, 0], y=X[clusters == 2, 1],
    mode='markers', marker_color='#7D3AC1', name='Cluster 2'
))

# Centroids
fig.add_trace(go.Scatter(
    x=kmeans.cluster_centers_[:, 0], y=kmeans.cluster_centers_[:,1],
    mode='markers', marker_color='#CAC9CD', marker_symbol=4, marker_size=13, name='Centroids'
))

fig.update_layout(
    template='plotly_dark',
    width=1000, height=500,
    title='KMeans Clustering Results'
)
fig.show()



<a id="5"></a>

<h1 style='background:#007fff;border:0; color:white;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.25);
    transform: rotateX(10deg); border-radius: 10px;
    '><center>Clustering Evaluation</center></h1>

# **Clustering Evaluation**

After performing KMeans clustering, it is important to **assess the quality of the clusters**. This can be done using **internal evaluation metrics**, which do not require ground truth labels.

## **1. Silhouette Score**

```python
  from sklearn.metrics import silhouette_score
  sil_score = silhouette_score(X, clusters)
  print(f"Silhouette Score: {sil_score:.3f}")
```

**What it does:**

- Measures how similar a point is to its **own cluster** compared to other clusters.

**Formula:**

<div style="text-align:center;">
s(i) = (b(i) - a(i)) / max(a(i), b(i))
</div>

Where:

- a(i) = average distance between point i and all other points in the **same cluster**
- b(i) = average distance between point i and all points in the **nearest other cluster**

**Interpretation:**

- Range: -1 to 1 → higher is better
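As a check on the formula, the sketch below computes the mean silhouette by hand for the 1D toy data `[2, 4, 5, 8, 10]` split into `[2, 4, 5]` and `[8, 10]`. It is an illustrative plain-Python version, not scikit-learn's implementation; the name `silhouette_1d` is hypothetical, and it assumes at least two clusters and no duplicate-only clusters.

```python
def silhouette_1d(points, labels):
    """Mean silhouette coefficient for 1D points (illustrative, O(n^2))."""
    cluster_ids = sorted(set(labels))
    scores = []
    for i, (x, own_label) in enumerate(zip(points, labels)):
        # Other members of the point's own cluster (the point itself is excluded).
        own = [p for j, p in enumerate(points) if labels[j] == own_label and j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(abs(x - p) for p in own) / len(own)
        # b: smallest mean distance from x to the points of any other cluster.
        b = min(
            sum(abs(x - p) for p, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in cluster_ids
            if other != own_label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


score = silhouette_1d([2, 4, 5, 8, 10], [0, 0, 0, 1, 1])
print(round(score, 3))  # 0.613
```

For point 5, for example, a = 2 and b = 4, so s = (4 - 2) / 4 = 0.5: it sits closer to the other cluster than its neighbours do, which pulls its score down.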

## **2. Davies-Bouldin Index (DBI)**

```python
  from sklearn.metrics import davies_bouldin_score
  db_score = davies_bouldin_score(X, clusters)
  print(f"Davies-Bouldin Index: {db_score:.3f}")
```

**What it does:**

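- Measures the **average similarity** between each cluster and the cluster most similar to it, where similarity compares within-cluster scatter to the distance between centroids.

**Formula:**

<div style="text-align:center;">
DB = (1 / k) * Σ<sub>i=1</sub><sup>k</sup> max<sub>j≠i</sub> [ (σ<sub>i</sub> + σ<sub>j</sub>) / d(μ<sub>i</sub>, μ<sub>j</sub>) ]
</div>

Where:

- σ<sub>i</sub> = average distance between the points of cluster i and its centroid μ<sub>i</sub>
- d(μ<sub>i</sub>, μ<sub>j</sub>) = distance between the centroids of clusters i and j
- k = number of clusters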

**Interpretation:**

- Lower DBI → better clustering
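The index can be computed by hand on the 1D toy data `[2, 4, 5, 8, 10]`. This is a plain-Python sketch with a hypothetical helper name, `davies_bouldin_1d`; it takes σ to be the mean distance of a cluster's points to its centroid, which is my understanding of what `davies_bouldin_score` uses as well.

```python
def davies_bouldin_1d(points, labels):
    """Davies-Bouldin index for 1D points (illustrative sketch)."""
    ids = sorted(set(labels))
    members = {c: [x for x, lab in zip(points, labels) if lab == c] for c in ids}
    centroid = {c: sum(m) / len(m) for c, m in members.items()}
    # sigma: mean distance of a cluster's points to its own centroid.
    sigma = {c: sum(abs(x - centroid[c]) for x in m) / len(m) for c, m in members.items()}
    # For each cluster, keep the largest similarity ratio to any other cluster.
    worst = [
        max((sigma[i] + sigma[j]) / abs(centroid[i] - centroid[j]) for j in ids if j != i)
        for i in ids
    ]
    return sum(worst) / len(ids)


db = davies_bouldin_1d([2, 4, 5, 8, 10], [0, 0, 0, 1, 1])
print(round(db, 3))  # 0.396
```

Here the scatter values are 10/9 and 1 and the centroids are 16/3 apart, giving (10/9 + 1) / (16/3) = 19/48 ≈ 0.396.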

## **3. Calinski-Harabasz Index (Variance Ratio Criterion)**

```python
  from sklearn.metrics import calinski_harabasz_score
  ch_score = calinski_harabasz_score(X, clusters)
  print(f"Calinski-Harabasz Score: {ch_score:.3f}")
```

**What it does:**

- Measures the ratio of **between-cluster dispersion** to **within-cluster dispersion**.
- Higher values indicate **well-separated and dense clusters**.

**Formula:**

<div style="text-align:center;">
CH = ( <span style="font-weight:bold;">Tr(B<sub>k</sub>)/(k-1)</span> ) &nbsp;/&nbsp; ( <span style="font-weight:bold;">Tr(W<sub>k</sub>)/(n-k)</span> )
</div>

Where:

- Tr(B<sub>k</sub>) = trace of **between-cluster dispersion matrix**
- Tr(W<sub>k</sub>) = trace of **within-cluster dispersion matrix**
- k = number of clusters
- n = total number of data points

**Interpretation:**

- **Higher CH score → better clustering**
- Indicates clusters are **compact (low within-cluster variance)** and **well-separated (high between-cluster variance)**

**Usage Tip:**

- Use CH index **when you want to compare clustering results for different k values**.
- Especially useful in **KMeans and hierarchical clustering** to find the optimal number of clusters.
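The ratio can also be worked out by hand on the 1D toy data `[2, 4, 5, 8, 10]`. The sketch below is plain Python with a hypothetical helper name, `calinski_harabasz_1d`; in one dimension the traces reduce to simple sums, and the within-cluster term is the same quantity as the SSE.

```python
def calinski_harabasz_1d(points, labels):
    """Calinski-Harabasz index for 1D points (illustrative sketch)."""
    ids = sorted(set(labels))
    n, k = len(points), len(ids)
    overall_mean = sum(points) / n
    between = 0.0  # Tr(B_k): size-weighted squared distance of centroids to the overall mean
    within = 0.0   # Tr(W_k): squared distance of points to their own centroid (the SSE)
    for c in ids:
        members = [x for x, lab in zip(points, labels) if lab == c]
        mu = sum(members) / len(members)
        between += len(members) * (mu - overall_mean) ** 2
        within += sum((x - mu) ** 2 for x in members)
    return (between / (k - 1)) / (within / (n - k))


ch = calinski_harabasz_1d([2, 4, 5, 8, 10], [0, 0, 0, 1, 1])
print(round(ch, 2))  # 15.36
```

The between and within terms (about 34.13 and 6.67) add up to 40.8, the total squared deviation from the overall mean, so a higher CH score means more of the total spread is explained by the split into clusters.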

## **4. Inertia / SSE (Sum of Squared Errors)**

```python
  inertia = kmeans.inertia_
  print(f"Inertia (SSE): {inertia:.3f}")
```

**Formula:**

<div style="text-align:center;">
SSE = Σ<sub>i=1</sub><sup>n</sup> || x<sub>i</sub> - μ<sub>C<sub>i</sub></sub> ||²
</div>

**Interpretation:**

- Lower SSE → tighter clusters

## **5. Silhouette Coefficient per Sample (Optional Visualization)**

```python
  from sklearn.metrics import silhouette_samples
  import matplotlib.pyplot as plt
  import numpy as np

  sample_silhouette_values = silhouette_samples(X, clusters)
  plt.bar(range(len(sample_silhouette_values)), sample_silhouette_values)
  plt.xlabel("Sample Index")
  plt.ylabel("Silhouette Value")
  plt.title("Silhouette Coefficient per Sample")
  plt.show()
```

- Useful to **see variation** of silhouette score across points
- Identifies clusters with weak cohesion or overlapping clusters

## **6. Gap Statistic** _(Advanced, optional)_

```python
  # Example using gap statistic library
  from gap_statistic import OptimalK
  optimalK = OptimalK(parallel_backend='joblib')
  n_clusters = optimalK(X, cluster_array=np.arange(1,11))
  print(f"Optimal number of clusters according to Gap Statistic: {n_clusters}")
```

- Compares the observed **within-cluster dispersion** with the dispersion expected for data drawn from a uniform reference distribution
- Can be used to **validate Elbow method results**

## **Comparison Table of Clustering Metrics**

<table>
  <tr>
    <th>Metric</th>
    <th>What it Measures</th>
    <th>Interpretation</th>
    <th>Equation</th>
    <th>When to Use</th>
  </tr>
  <tr>
    <td>Silhouette Score</td>
    <td>Cohesion vs separation</td>
    <td>Higher = better, 1 = ideal</td>
    <td>s(i) = (b(i) - a(i)) / max(a(i), b(i))</td>
    <td>Good general-purpose metric; compares cluster tightness & separation</td>
  </tr>
  <tr>
    <td>Davies-Bouldin Index</td>
    <td>Average cluster similarity</td>
    <td>Lower = better</td>
    <td>DB = (1/k) Σ<sub>i=1</sub><sup>k</sup> max<sub>j≠i</sub> (σ<sub>i</sub> + σ<sub>j</sub>)/d(μ<sub>i</sub>, μ<sub>j</sub>)</td>
    <td>When comparing compactness & separation of clusters</td>
  </tr>
  <tr>
    <td>Calinski-Harabasz Index</td>
    <td>Between-cluster vs within-cluster variance</td>
    <td>Higher = better</td>
    <td>CH = (Tr(B<sub>k</sub>)/(k-1)) / (Tr(W<sub>k</sub>)/(n-k))</td>
    <td>Useful when clusters are well-separated and convex</td>
  </tr>
  <tr>
    <td>Inertia / SSE</td>
    <td>Total squared distance to centroids</td>
    <td>Lower = better</td>
    <td>SSE = Σ<sub>i=1</sub><sup>n</sup> ||x<sub>i</sub> - μ<sub>C<sub>i</sub></sub>||²</td>
    <td>Used with Elbow method; quick indicator of cluster tightness</td>
  </tr>
  <tr>
    <td>Gap Statistic</td>
    <td>Within-cluster dispersion vs random</td>
    <td>Higher gap → better k</td>
    <td>Gap(k) = E<sup>*</sup>[log(W<sub>k</sub>)] - log(W<sub>k</sub>)</td>
    <td>Advanced method to validate optimal number of clusters</td>
  </tr>
</table>


In [67]:
# --- Clustering Evaluation ---
sil_score = silhouette_score(X, clusters)
db_score = davies_bouldin_score(X, clusters)

print(f"Silhouette Score: {sil_score:.3f}  → higher is better (max=1)")
print(f"Davies-Bouldin Index: {db_score:.3f}  → lower is better (min=0)")


Silhouette Score: 0.553  → higher is better (max=1)
Davies-Bouldin Index: 0.662  → lower is better (min=0)


<a id="6"></a>

<h1 style='background:#007fff;border:0; color:white;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.25);
    transform: rotateX(10deg); border-radius: 10px;
    '><center>Thank You</center></h1>

# Thank You

**Thank you for going through this notebook**

**If you have any feedback please let me know**

**For KMeans from Scratch Implementation please refer to this [notebook]**
