```{contents}
```

### Hyperparameter tuning

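The **main hyperparameter in K-Means is `k`, the number of clusters**. The choice of `k` is critical because it determines how coarsely or finely the data is partitioned, and every cluster-quality measure depends on it. Auxiliary parameters such as `init`, `n_init`, and `max_iter` mainly affect how reliably and how quickly the algorithm reaches a good solution for a given `k`.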
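As a point of reference, here is a minimal sketch of where these parameters appear in practice. It assumes scikit-learn's `KMeans`, where `k` is spelled `n_clusters`; the synthetic data from `make_blobs` and the specific values are illustrative choices, not recommendations from this document.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

model = KMeans(
    n_clusters=3,      # k: number of clusters
    init="k-means++",  # initialization strategy
    n_init=10,         # number of restarts with different seeds
    max_iter=300,      # cap on iterations per restart
    random_state=0,    # fixed seed for reproducibility
)
labels = model.fit_predict(X)

print(model.cluster_centers_.shape)  # one row per cluster, one column per feature
print(model.inertia_)                # WCSS of the best restart
```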

#### 1. Choosing `k`

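* **Elbow Method**: Plot the cost function (within-cluster sum of squares, WCSS) against `k`. The best achievable WCSS never increases as `k` grows, so look for the "elbow": the point beyond which adding clusters yields only small reductions. It marks a reasonable trade-off between variance explained and model complexity.
* **Silhouette Score**: Measures cohesion (how close points are to others in their own cluster) against separation (how far they are from the nearest other cluster). It ranges from -1 to 1, and higher values indicate better-defined clusters.
* **Gap Statistic**: Compares the log WCSS of the actual clustering with the log WCSS expected for reference data that has no cluster structure, such as points drawn uniformly at random over the same range. A larger gap suggests that `k` captures real structure.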
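The three criteria above can be computed side by side. The sketch below assumes scikit-learn and NumPy, uses synthetic blobs as stand-in data, and implements only a simplified gap (mean log WCSS on uniform reference data minus log WCSS on the real data). The published gap-statistic selection rule also uses a standard-error term, which is omitted here. The helper `fit` and the variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data: four well-separated 2-D blobs.
blob_centers = [[-6, -6], [-6, 6], [6, -6], [6, 6]]
X, _ = make_blobs(n_samples=400, centers=blob_centers, cluster_std=1.0, random_state=0)


def fit(data, k):
    """Fit K-Means with k clusters and a fixed seed."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)


# Reference datasets for the gap: uniform samples over X's bounding box.
rng = np.random.default_rng(0)
refs = [rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape) for _ in range(5)]

wcss, sil, gap = {}, {}, {}
for k in range(1, 9):
    model = fit(X, k)
    wcss[k] = model.inertia_  # scikit-learn's name for WCSS
    if k > 1:  # the silhouette needs at least two clusters
        sil[k] = silhouette_score(X, model.labels_)
    ref_log_wcss = np.mean([np.log(fit(r, k).inertia_) for r in refs])
    gap[k] = ref_log_wcss - np.log(wcss[k])

best_sil_k = max(sil, key=sil.get)
for k in wcss:
    print(k, round(wcss[k], 1), sil.get(k), round(gap[k], 3))
print("k with highest silhouette:", best_sil_k)
```

Plotting `wcss` against `k` gives the elbow curve. On real data the three criteria can disagree, so it is worth reading them together instead of trusting a single number.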

#### 2. Initialization method (`init`)

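* **k-means++** (default): Chooses initial centroids that are spread out across the data, which typically speeds up convergence and lowers the risk of a poor local minimum.
* **Random**: Picks initial centroids at random, which risks converging to a poor local minimum. Setting `n_init > 1` mitigates this.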
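One way to see the difference between the two strategies is to run each from many single starts and compare the spread of the resulting WCSS. This is a sketch assuming scikit-learn; the data, the number of seeds, and the helper `single_start_wcss` are illustrative. The outcome depends on the data and seeds, so no particular numbers are claimed.

```python
from statistics import mean

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: eight blobs, so there are many ways to start badly.
X, _ = make_blobs(n_samples=500, centers=8, cluster_std=0.7, random_state=1)


def single_start_wcss(init, seeds=range(20)):
    """WCSS from single-start runs (n_init=1), one per seed."""
    return [
        KMeans(n_clusters=8, init=init, n_init=1, random_state=s).fit(X).inertia_
        for s in seeds
    ]


pp = single_start_wcss("k-means++")
rnd = single_start_wcss("random")

# A wide gap between min and max means some starts got stuck in poor minima.
print("k-means++: min/mean/max =", min(pp), mean(pp), max(pp))
print("random:    min/mean/max =", min(rnd), mean(rnd), max(rnd))
```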

#### 3. Number of runs (`n_init`)

* Run clustering multiple times with different centroid seeds, then keep the run with the lowest WCSS. A higher `n_init` reduces sensitivity to bad initialization, at the cost of proportionally more compute.
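The idea behind `n_init` can be written out by hand: run several single-start fits and keep the one with the lowest WCSS. The sketch below assumes scikit-learn and synthetic data; the manual loop is only an illustration of the concept, not a description of the library's internals.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=6, cluster_std=1.0, random_state=2)

# Ten independent single-start runs, each with its own seed.
runs = [
    KMeans(n_clusters=6, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]
best = min(runs, key=lambda m: m.inertia_)  # keep the lowest-WCSS run

# The same idea, delegated to the library via n_init.
builtin = KMeans(n_clusters=6, init="random", n_init=10, random_state=0).fit(X)

print("WCSS per run:", [round(m.inertia_, 1) for m in runs])
print("best manual restart:", round(best.inertia_, 1))
print("n_init=10:", round(builtin.inertia_, 1))
```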

#### 4. Maximum iterations (`max_iter`)

* Caps the number of iterations in a single run, so the algorithm stops even if it has not yet converged. Defaults (such as 300) are usually sufficient; lower values trade solution quality for speed.
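To illustrate how the iteration cap interacts with convergence, the sketch below fits the same model twice from the same starting centroids, once with a very small `max_iter` and once with a generous one. It assumes scikit-learn, whose fitted `KMeans` exposes the number of iterations actually run as `n_iter_`; the data and the value `max_iter=2` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Overlapping blobs, so convergence takes more than a couple of iterations.
X, _ = make_blobs(n_samples=2000, centers=10, cluster_std=2.5, random_state=0)

# Same seed and init, so both runs start from the same centroids.
common = dict(n_clusters=10, init="random", n_init=1, random_state=0)
capped = KMeans(max_iter=2, **common).fit(X)
full = KMeans(max_iter=300, **common).fit(X)

print("capped:", capped.n_iter_, "iterations, WCSS", round(capped.inertia_, 1))
print("full:  ", full.n_iter_, "iterations, WCSS", round(full.inertia_, 1))
```

Because each iteration can only lower WCSS, the capped run is an earlier snapshot of the same trajectory. If `n_iter_` equals `max_iter`, the run may have stopped before converging.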

---

### Handling overfitting and underfitting in K-Means

Though clustering is unsupervised (no labels), the concepts of underfitting/overfitting still apply in terms of **cluster quality**.

#### Underfitting (too simple clusters)

* **Cause**:

  * Choosing too few clusters (`k` too small).
  * Poor initialization.
* **Symptoms**:

  * High WCSS (points lie far from their cluster centroids).
  * Low silhouette score.
* **Fix**:

  * Increase `k`.
  * Use `k-means++` instead of random init.
  * Increase `n_init`.
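The underfitting symptoms can be reproduced on data with a known structure. This sketch assumes scikit-learn and uses four synthetic blobs, then compares a deliberately small `k` with the true number of groups. On real data the true `k` is unknown, so the comparison is only illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated blobs: the "right" k is known to be 4 here.
blob_centers = [[-6, -6], [-6, 6], [6, -6], [6, 6]]
X, _ = make_blobs(n_samples=400, centers=blob_centers, cluster_std=1.0, random_state=0)

scores = {}
for k in (2, 4):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    scores[k] = (model.inertia_, silhouette_score(X, model.labels_))
    print(f"k={k}: WCSS={scores[k][0]:.1f}, silhouette={scores[k][1]:.3f}")
```

With `k=2`, each cluster has to absorb two blobs, which shows up as a much higher WCSS and a lower silhouette than with `k=4`.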

#### Overfitting (too many clusters)

* **Cause**:

  * Choosing `k` too large.
* **Symptoms**:

  * Very low WCSS, but the clusters do not generalize. In the extreme, each point forms its own cluster and WCSS is zero.
  * Silhouette score declines once `k` exceeds the number of natural groupings.
* **Fix**:

  * Use the elbow method or silhouette score to cap `k`.
  * When several values of `k` score similarly, prefer the smaller one (Occam's razor).
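The overfitting symptoms are easiest to see by pushing `k` all the way up to the number of samples. The sketch below assumes scikit-learn and a small synthetic dataset with three groups. The silhouette is skipped at `k = n_samples` because it is not defined when every point is its own cluster.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Small illustrative dataset: 60 points in three well-separated blobs.
blob_centers = [[-8, 0], [0, 8], [8, 0]]
X, _ = make_blobs(n_samples=60, centers=blob_centers, cluster_std=1.0, random_state=0)

wcss, sil = {}, {}
for k in (3, 10, 30, 60):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = model.inertia_
    if k < len(X):  # silhouette requires fewer clusters than samples
        sil[k] = silhouette_score(X, model.labels_)
    print(k, round(wcss[k], 3), sil.get(k))
```

WCSS keeps shrinking and reaches essentially zero at `k = 60`, which is why it cannot be used on its own to pick `k`. The silhouette, in contrast, drops once the natural groups start being split.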

---

**In short**

* **Tune `k`** carefully using elbow, silhouette, or gap statistic.
* **Use `k-means++` and a sufficiently large `n_init`** (for example, 10 or more) for stability.
* **Balance `k`** to avoid under/overfitting: too low = coarse clusters, too high = fragmented clusters.
