# K-Means Clustering

## Problem Type
**K-Means Clustering** is primarily used for:
- **Clustering** problems
- **Unsupervised** learning

### How K-Means Clustering Works
- **Partitioning method:**
  - Divides a dataset into `k` distinct, non-overlapping clusters.
- **Centroid-based clustering:**
  - Each cluster is represented by its centroid, which is the mean of all points in that cluster.
- **Iterative refinement:**
  - Initializes `k` centroids (at random or with `k-means++`), then assigns each point to the nearest centroid.
  - Centroids are recalculated as the mean of the points assigned to them.
  - The process repeats until centroids no longer change significantly (convergence).
- **Objective function:**
  - Minimizes the sum of squared distances (inertia) between each point and its corresponding centroid.
- **Cluster assignment:**
  - Points are assigned to the cluster with the nearest centroid, often using Euclidean distance.
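
The steps above can be sketched in plain NumPy. This is a minimal illustration of the assign/update loop with random initialization, not scikit-learn's implementation; the function name `kmeans_sketch`, its parameters, and the toy data are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=300, tol=1e-4, seed=0):
    """Plain assign/update loop with random initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point to every centroid.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)

        # Update step: each centroid moves to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Convergence check: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift <= tol:
            break

    # Final assignment and inertia (sum of squared distances to the nearest centroid).
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    inertia = sq_dists.min(axis=1).sum()
    return labels, centroids, inertia


# Toy data (assumed for illustration): two well-separated blobs in 2D.
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels, centroids, inertia = kmeans_sketch(X_toy, k=2)
print(centroids.round(2), round(inertia, 2))
```

On data this cleanly separated the loop usually settles within a few iterations. On harder data the result depends on the starting centroids, which is why library implementations add smarter initialization and repeated runs.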

### Key Tuning Parameters
- **`n_clusters`:**
  - **Description:** Number of clusters (`k`) to form.
  - **Impact:** Directly influences the clustering result; higher values can capture more granularity but may overfit the data.
  - **Default:** `8` in scikit-learn, although this is rarely the right choice; set it explicitly for your data.
- **`init`:**
  - **Description:** Method for initializing centroids (`k-means++`, `random`).
  - **Impact:** `k-means++` reduces the chances of poor clustering results by spreading out initial centroids.
  - **Default:** `k-means++`.
- **`max_iter`:**
  - **Description:** Maximum number of iterations to run the algorithm.
  - **Impact:** More iterations allow better convergence but increase computational time.
  - **Default:** `300`.
- **`n_init`:**
  - **Description:** Number of times the algorithm will be run with different centroid seeds.
  - **Impact:** Multiple initializations can avoid local minima, improving the final clustering solution.
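  - **Default:** `'auto'` in scikit-learn 1.4 and later (1 run for `k-means++`, 10 runs for `random`); `10` in earlier versions.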
- **`tol`:**
  - **Description:** Relative tolerance on the change in centroid positions (Frobenius norm of the difference between two consecutive iterations) used to declare convergence.
  - **Impact:** Lower values may lead to more precise clusters but require more iterations.
  - **Default:** `1e-4`.
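
The effect of `init` and `n_init` can be seen by comparing several single runs with one multi-start run. The sketch below is illustrative only: the synthetic `make_blobs` data and the variable names are assumptions, and the exact inertia values will depend on the scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): 5 Gaussian blobs in 2D.
X_demo, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=0)

# Single runs from random starting centroids can end in different local minima.
single_run_inertias = []
for seed in range(5):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X_demo)
    single_run_inertias.append(km.inertia_)
    print(f"seed={seed}: inertia={km.inertia_:.1f}, iterations={km.n_iter_}")

# n_init=10 repeats the fit ten times and keeps the run with the lowest inertia.
best_of_ten = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X_demo)
print(f"best of 10 (k-means++): inertia={best_of_ten.inertia_:.1f}")
```

If the single-run inertias differ noticeably from one seed to the next, the data has several local minima and a larger `n_init` is likely worth the extra compute.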

### Pros vs Cons

| Pros                                                  | Cons                                                   |
|-------------------------------------------------------|--------------------------------------------------------|
| Simple and easy to implement                           | Requires the number of clusters (`k`) to be predefined  |
| Scales well to large datasets                          | Sensitive to initial centroid positions, which can lead to local minima |
| Efficient: each iteration costs time linear in the number of data points, clusters, and features | Assumes spherical clusters, which may not fit complex data distributions |
| Works well with compact, well-separated clusters       | Struggles with clusters of varying sizes and densities  |
| Can be used as a preprocessing step for other algorithms | Not deterministic unless the random seed is fixed; results may vary with different initializations |
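
The spherical-cluster limitation is easy to reproduce. The following sketch uses the two-moons toy dataset (an assumption chosen for illustration) and compares the K-Means labels with the true moon membership via the adjusted Rand index, where 1.0 would mean perfect agreement.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: compact in neither the spherical nor the convex sense.
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_moons)

# Agreement between K-Means clusters and the true moons (1.0 = perfect match).
ari = adjusted_rand_score(y_moons, km.labels_)
print(f"Adjusted Rand index on two moons: {ari:.2f}")
```

Because K-Means separates the two centroids with a straight boundary, it cuts across both moons instead of following their shape, so the score stays well below 1.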

### Evaluation Metrics
- **Inertia (Within-cluster Sum of Squares):**
  - **Description:** Measures the sum of squared distances between each point and its corresponding centroid.
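  - **Good Value:** Lower values indicate tighter clusters, but inertia always falls as `k` grows, so compare it only at a fixed `k` or use its relative decrease across `k` to choose the number of clusters.
  - **Bad Value:** High values relative to other runs at the same `k` suggest loose clusters or a poor local minimum; little drop when `k` increases suggests the extra cluster adds no structure.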
- **Silhouette Score:**
  - **Description:** Measures how similar points are to their own cluster compared to other clusters.
  - **Good Value:** Values close to 1 indicate well-separated clusters.
  - **Bad Value:** Values near 0 suggest overlapping clusters, while negative values indicate points might be assigned to the wrong cluster.
- **Elbow Method:**
  - **Description:** Plots the inertia for different values of `k` to find the "elbow point," where adding more clusters does not significantly improve the model.
  - **Good Value:** The elbow point suggests the optimal number of clusters.
  - **Bad Value:** A smooth curve with no clear elbow suggests ambiguity in the optimal `k`.
- **Davies-Bouldin Index:**
  - **Description:** Measures the average similarity ratio of each cluster with its most similar cluster (lower is better).
  - **Good Value:** Values close to 0 indicate well-separated clusters.
  - **Bad Value:** Higher values suggest that clusters are not distinct.
- **Dunn Index:**
  - **Description:** Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance (higher is better).
  - **Good Value:** Higher values indicate well-separated and compact clusters.
  - **Bad Value:** Lower values suggest poor cluster separation and cohesion.
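
These metrics can be computed side by side for several values of `k`. The sketch below uses `silhouette_score` and `davies_bouldin_score` from `sklearn.metrics`; scikit-learn does not ship a Dunn index, so `dunn_index` is a hypothetical helper written for this example using one common definition (minimum point-to-point distance between clusters over maximum cluster diameter).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score, pairwise_distances, silhouette_score
from sklearn.preprocessing import StandardScaler


def dunn_index(X, labels):
    """Hypothetical helper: min inter-cluster distance / max cluster diameter."""
    dists = pairwise_distances(X)
    clusters = np.unique(labels)
    # Largest distance between two points of the same cluster (diameter).
    max_diameter = max(dists[np.ix_(labels == c, labels == c)].max() for c in clusters)
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dists[np.ix_(labels == a, labels == b)].min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter


X_iris = StandardScaler().fit_transform(load_iris().data)

scores = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_iris)
    scores.append({
        "k": k,
        "inertia": km.inertia_,
        "silhouette": silhouette_score(X_iris, km.labels_),
        "davies_bouldin": davies_bouldin_score(X_iris, km.labels_),
        "dunn": dunn_index(X_iris, km.labels_),
    })

for row in scores:
    print(
        f"k={row['k']}: inertia={row['inertia']:.1f}, "
        f"silhouette={row['silhouette']:.3f}, "
        f"davies_bouldin={row['davies_bouldin']:.3f}, dunn={row['dunn']:.3f}"
    )
```

The metrics do not always agree on the best `k`, so it is safer to read them together than to rely on one alone.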



In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

In [None]:
# Standardize features to zero mean and unit variance so no single feature dominates the distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Calculate within-cluster sum of squares (WCSS) for different K values
wcss = []
for k in range(1, 11):
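    # Set n_init explicitly so the result does not depend on the scikit-learn version's default
    model = KMeans(n_clusters=k, n_init=10, random_state=42)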
    model.fit(X_scaled)
    wcss.append(model.inertia_)  # Inertia is the WCSS value

# Plot the Elbow curve
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.show()

In [None]:
# Apply K-means clustering
model = KMeans(
    n_clusters=4,
    init='k-means++',
    max_iter=300,
    n_init=10,
    tol=0.0001,
    random_state=42,
)
model.fit(X_scaled)

# Get cluster assignments and centroids
cluster_labels = model.labels_
centroids = model.cluster_centers_

In [None]:
# Visualize the clusters (using PCA for 2D visualization)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Project the centroids into the same PCA space as the points
centroids_pca = pca.transform(centroids)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis', edgecolor='k')
plt.scatter(centroids_pca[:, 0], centroids_pca[:, 1], c='red', marker='X', s=100, label='Centroids')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-means Clustering of Iris Dataset')
plt.legend()
plt.show()