# 🧪 Lab Handout: K-Means Clustering
---

**Learning Objectives**:

- Understand the steps of K-Means clustering

- Understand how the **Elbow Method** and **Silhouette Score** are used to find the optimal `k`





## 🧩 K-Means Clustering

**K-means clustering** is a method for partitioning data into K clusters so that each data point belongs to the cluster whose central point, or **centroid**, is closest to it.

### Steps of the K-Means Algorithm
1. Start with **K centroids** placed at random positions.  
   Example: with K = 2, two centroids are selected at random.  
2. Compute the **distance** from every point to each centroid and assign each point to the cluster of its nearest centroid.  
3. Move each centroid to the **center of gravity** (the mean) of the points assigned to it.  
4. Re-assign every point based on its distance to the updated centroids.  
5. Adjust the centroids again.  
6. Repeat steps 4 and 5 until **data points stop changing clusters** (convergence).
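
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the scikit-learn implementation: the function name `kmeans_sketch`, the choice to start from randomly chosen data points, and the toy array `X_demo` are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-Means following the numbered steps (illustrative only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct, randomly chosen data points as initial centroids
    # (an assumption; any random starting positions would do).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop once no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Steps 3 and 5: move each centroid to the mean of its points.
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[i] = members.mean(axis=0)
    return labels, centroids

# Toy data: two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X_demo, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.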

---

### 🧮 **SSE** (Sum of Squared Errors) or **WCSS** (Within-Cluster Sum of Squares)
To find the **optimal number of clusters (K)** using SSE:
- For a range of K values, plot the **sum of squared distances** from each data point to its cluster's centroid.  
- Then, select the **K** where the decrease in SSE starts to **level off**, known as the **"elbow point."**

***

## [scikit-learn KMeans Reference](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

`from sklearn.cluster import KMeans`

### ⚙️ Key Parameters of `KMeans`

| Parameter | Description |
|------------|-------------|
| **n_clusters** | Number of clusters (K) to form. |
| **init** | Method for initializing centroids: `'k-means++'` *(default)*, `'random'`, or an array of initial centroids that you provide. |
| **n_init** | Number of times the algorithm runs with different centroid seeds (best result kept). |
| **max_iter** | Maximum number of iterations for a single run. |
| **random_state** | Controls random number generation for reproducibility. |

---

### 🧩 Main Methods

| Method | Description |
|---------|-------------|
| **fit(X)** | Compute K-Means clustering on dataset `X`. |
| **fit_predict(X)** | Fit model and return cluster labels for each data point. |
| **predict(X)** | Assign new samples to the nearest existing cluster centroid. |
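
As a hedged illustration of how the parameters and methods in these tables fit together, the sketch below builds a `KMeans` model and calls `fit`, `fit_predict`, and `predict`. The arrays `X_train` and `X_new` are made-up toy data, and the parameter values are chosen only for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three points near (1, 1) and three near (8.5, 9).
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
                    [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])

model = KMeans(
    n_clusters=2,       # K
    init="k-means++",   # centroid initialization method
    n_init=10,          # number of restarts; the best run is kept
    max_iter=300,       # iteration cap for a single run
    random_state=42,    # fixed seed for reproducibility
)

model.fit(X_train)                         # learn the centroids only
train_labels = model.fit_predict(X_train)  # refit and return a label per row

# Assign previously unseen samples to the nearest learned centroid.
X_new = np.array([[0.0, 0.0], [10.0, 10.0]])
new_labels = model.predict(X_new)
print(train_labels, new_labels)
```

Calling `fit` and then `fit_predict` here trains the model twice; in practice one of the two is enough.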

---

### 📊 Important Attributes

| Attribute | Description |
|------------|-------------|
| **cluster_centers_** | Coordinates of the cluster centroids. |
| **labels_** | Index (0, 1, 2, …) of the cluster each data point belongs to. |
| **inertia_** | Sum of squared distances (SSE) of samples to their nearest centroid. |
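
A short sketch of reading these attributes from a fitted model follows. The four-point array `X_points` is a made-up example, small enough that the result can be checked by hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two pairs of points, one pair at x = 0 and one at x = 10.
X_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
fitted = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_points)

print(fitted.cluster_centers_)  # one row per centroid, one column per feature
print(fitted.labels_)           # one cluster index per input row
print(fitted.inertia_)          # SSE of the final clustering
```

For this data the centroids are (0, 1) and (10, 1), and every point lies exactly 1 unit from its centroid, so the inertia works out to 4. The order of the centroids, and therefore which cluster is labelled 0, may vary.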

---

📝 **Note:**  
- Always scale your data (e.g., using `StandardScaler`) before applying K-Means.  
- Choose an appropriate `K` using the **Elbow Method**.  
- K-Means assumes **spherical**, equally sized clusters and is sensitive to **outliers**.
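
The scaling advice can be sketched as follows. The income and age columns in `X_raw` are invented for illustration; the point is only that `StandardScaler` puts both features on a comparable scale before distances are computed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is yearly income, column 1 is age.
# Without scaling, income would dominate every Euclidean distance.
X_raw = np.array([[30_000.0, 25.0], [32_000.0, 27.0],
                  [88_000.0, 60.0], [90_000.0, 62.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)  # each column: mean 0, std 1

kmeans_scaled = KMeans(n_clusters=2, n_init=10, random_state=0)
scaled_labels = kmeans_scaled.fit_predict(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
print(scaled_labels)
```

New samples should be transformed with the same fitted `scaler` (via `scaler.transform`) before being passed to `predict`.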

***

## Elbow Method to determine optimal number of clusters


The **Elbow Method** helps determine the **optimal number of clusters (K)** in a K-Means model.  
It is based on analyzing the **Sum of Squared Errors (SSE)**, also known as **inertia** in scikit-learn.

---

#### **Concept**

- As K increases, the **SSE (inertia)**, i.e., the sum of squared distances of samples to their nearest cluster center, generally **decreases**, because more centroids place every point closer to one of them.  
- Initially, this reduction is large, but after a certain K, the improvement slows down.  
- The point where the **rate of decrease changes sharply** forms an **"elbow"** in the curve; this K is usually a good choice.
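
A minimal sketch of the idea: fit K-Means for several values of K and record the inertia each time. The synthetic data from `make_blobs` (three well-separated groups at assumed centers) and the range K = 1 to 8 are choices made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups, so the elbow is expected near K = 3.
X_blobs, _ = make_blobs(n_samples=300,
                        centers=[[0, 0], [5, 5], [-5, 5]],
                        cluster_std=0.6, random_state=0)

k_values = list(range(1, 9))
sse_per_k = []
for k in k_values:
    km_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    sse_per_k.append(km_k.inertia_)  # SSE for this value of K

for k, sse in zip(k_values, sse_per_k):
    print(f"K={k}: SSE={sse:.1f}")
```

Plotting `k_values` against `sse_per_k` as a line chart (for example with matplotlib) gives the elbow curve. On this data the SSE should drop steeply up to K = 3 and flatten afterwards.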

---

#### **Formula**

The SSE (or inertia) is a measure of how well data points are clustered around their respective centroids. It is calculated as follows:

$$ SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2 $$

Where:


$$
C_i : \text{The } i^{\text{th}} \text{ cluster}
$$

$$
\mu_i : \text{The centroid of cluster } i
$$ 

$$
x_j : \text{The } j^{\text{th}} \text{ data point (where } x_j \in C_i \text{ if it belongs to cluster } i)
$$ 

$$
k : \text{The total number of clusters}
$$ 

$$
\|x_j - \mu_i\|^2 : \text{The squared Euclidean distance between the data point } x_j \text{ and the centroid } \mu_i
$$
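
To make the formula concrete, here is a small worked example in NumPy. The four points and their cluster assignments are made up, and the loop mirrors the double sum term by term.

```python
import numpy as np

# Hypothetical data: four points already assigned to k = 2 clusters.
points = np.array([[1.0, 1.0], [3.0, 1.0], [8.0, 8.0], [8.0, 10.0]])
assignments = np.array([0, 0, 1, 1])
k = 2

sse = 0.0
for i in range(k):                               # outer sum over clusters C_i
    cluster_points = points[assignments == i]    # all x_j in C_i
    mu_i = cluster_points.mean(axis=0)           # centroid of cluster i
    sse += np.sum((cluster_points - mu_i) ** 2)  # squared Euclidean distances

print(sse)  # 4.0
```

Cluster 0 has centroid (2, 1) and cluster 1 has centroid (8, 9). Each point is 1 unit from its centroid, so each of the four terms contributes 1 and the SSE is 4.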
