### K-Means Clustering

**Concept Overview:**
K-Means Clustering is a popular unsupervised learning algorithm that partitions a dataset into \( k \) clusters. Each data point belongs to the cluster with the nearest mean (the centroid), which serves as the prototype of that cluster.

**Real-World Example:**
Consider a retail business wanting to segment its customers based on purchasing behavior. By applying K-Means Clustering, the business can group customers into different clusters, such as high spenders, occasional buyers, and bargain hunters.

**Mathematics Behind K-Means Clustering:**

1. **Initialization**: Select \( k \) initial centroids randomly from the dataset.
2. **Assignment Step**: Assign each data point to the nearest centroid. The nearest centroid is usually found using the Euclidean distance.
   
   For a data point \( x_i \) and centroid \( \mu_j \), each with \( M \) features, the Euclidean distance is:
   \[
   \text{distance}(x_i, \mu_j) = \sqrt{\sum_{m=1}^{M} (x_{im} - \mu_{jm})^2}
   \]
3. **Update Step**: Recalculate the centroids by taking the mean of all data points assigned to each centroid.
   
   For each centroid \( \mu_j \):
   \[
   \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
   \]
   where \( C_j \) is the set of points assigned to centroid \( j \).

4. **Convergence**: Repeat the assignment and update steps until the centroids no longer change significantly, or a maximum number of iterations is reached.
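
The assignment and update steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `assign_points` and `update_centroids` and the four toy points are assumptions made for this example, and the update assumes every cluster keeps at least one point.

```python
import numpy as np

def assign_points(data, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    # Broadcasting gives an (n_points, k, n_features) array of differences;
    # the norm over the last axis is the Euclidean distance to each centroid.
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)

def update_centroids(data, labels, k):
    """Update step: mean of the points assigned to each cluster."""
    # Assumption: no cluster is empty (an empty cluster would give a NaN mean).
    return np.array([data[labels == j].mean(axis=0) for j in range(k)])

# Hypothetical toy data: two obvious groups of two points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
start = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = assign_points(points, start)
moved = update_centroids(points, labels, k=2)
print(labels)  # [0 0 1 1]
print(moved)   # centroids move to (0, 0.5) and (10, 10.5)
```

Repeating these two calls until `moved` stops changing gives the full algorithm.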

**Example:**

Let's say we have the following data points representing customers based on their annual income and spending score:

\[
\{ (15, 39), (15, 81), (16, 6), (16, 77), (17, 40), (17, 76), (18, 6), (18, 94), (19, 3), (19, 72) \}
\]

We want to cluster these data points into \( k = 2 \) clusters.

1. **Initialization**: Randomly select 2 centroids, e.g., \( \mu_1 = (15, 39) \) and \( \mu_2 = (19, 72) \).

2. **Assignment Step**: Calculate the distance from each point to both centroids and assign the point to the nearer one.

   For example, for point \( (15, 81) \):
   \[
   \text{distance}((15, 81), (15, 39)) = \sqrt{(15-15)^2 + (81-39)^2} = 42
   \]
   \[
   \text{distance}((15, 81), (19, 72)) = \sqrt{(15-19)^2 + (81-72)^2} = \sqrt{97} \approx 9.849
   \]

   Since \( 9.849 < 42 \), assign \( (15, 81) \) to the cluster with centroid \( \mu_2 \).

   Repeating this for all points gives the initial assignment:

   - Cluster 1: \( \{(15, 39), (16, 6), (17, 40), (18, 6), (19, 3)\} \)
   - Cluster 2: \( \{(15, 81), (16, 77), (17, 76), (18, 94), (19, 72)\} \)

3. **Update Step**: Recalculate the centroids of the two clusters.

   For Cluster 1:
   \[
   \mu_1 = \left( \frac{15+16+17+18+19}{5}, \frac{39+6+40+6+3}{5} \right) = (17, 18.8)
   \]

   For Cluster 2:
   \[
   \mu_2 = \left( \frac{15+16+17+18+19}{5}, \frac{81+77+76+94+72}{5} \right) = (17, 80)
   \]

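4. **Convergence**: Repeat steps 2 and 3 until the centroids no longer change significantly. Here, reassigning the points to the updated centroids \( (17, 18.8) \) and \( (17, 80) \) leaves both clusters unchanged, so the algorithm has converged.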
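
The arithmetic in this example can be checked with a short NumPy script. It is only a sketch under the example's own assumptions (the same ten points and the same starting centroids); the helper name `nearest` is introduced here for illustration.

```python
import numpy as np

data = np.array([
    [15, 39], [15, 81], [16, 6], [16, 77], [17, 40],
    [17, 76], [18, 6], [18, 94], [19, 3], [19, 72],
], dtype=float)
centroids = np.array([[15.0, 39.0], [19.0, 72.0]])  # initial centroids from step 1

def nearest(data, centroids):
    # Euclidean distance from every point to every centroid, then the closest one
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return distances, np.argmin(distances, axis=1)

# Step 2: distances for the point (15, 81) and the full assignment
distances, labels = nearest(data, centroids)
print(np.round(distances[1], 3))  # roughly 42 and 9.849

# Step 3: each new centroid is the mean of its cluster
new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(2)])
print(new_centroids)  # (17, 18.8) and (17, 80)

# Step 4: a second assignment pass with the updated centroids
_, labels_again = nearest(data, new_centroids)
print(np.array_equal(labels, labels_again))  # True: the clusters did not change
```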

### Summary:

K-Means Clustering is a simple yet powerful technique for grouping data points into clusters based on similarity. By understanding the underlying mathematics and steps involved, you can effectively apply K-Means to real-world problems, such as customer segmentation.

### Python Implementation:

```python
import numpy as np

def k_means_clustering(data, k, max_iterations=100, tol=1e-8):
    # Randomly pick k distinct data points as the initial centroids
    centroids = data[np.random.choice(data.shape[0], k, replace=False)].astype(float)

    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in data:
            distances = np.linalg.norm(point - centroids, axis=1)
            closest_centroid = np.argmin(distances)
            clusters[closest_centroid].append(point)

        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid to avoid a NaN mean)
        new_centroids = np.array([
            np.mean(cluster, axis=0) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ])

        # Check for convergence: stop once the centroids have stopped moving
        if np.allclose(centroids, new_centroids, atol=tol):
            break
        centroids = new_centroids

    return centroids, clusters

# Example usage
data = np.array([
    [15, 39], [15, 81], [16, 6], [16, 77], [17, 40], [17, 76],
    [18, 6], [18, 94], [19, 3], [19, 72]
])
k = 2
centroids, clusters = k_means_clustering(data, k)

print("Centroids:", centroids)
print("Clusters:", clusters)
```

Output from one run (the initial centroids are random, so the cluster order, and occasionally the grouping itself, can differ between runs):

```
Centroids: [[17.  18.8]
 [17.  80. ]]
Clusters: [[array([15, 39]), array([16,  6]), array([17, 40]), array([18,  6]), array([19,  3])], [array([15, 81]), array([16, 77]), array([17, 76]), array([18, 94]), array([19, 72])]]
```


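
Because the starting centroids are random, separate runs can stop at different groupings of the same data. One way to compare two results is the within-cluster sum of squared distances, where a lower value means tighter clusters. The sketch below is a minimal illustration: the helper name `within_cluster_ss` is hypothetical, and the second grouping is the one the algorithm settles on if it happens to start from the centroids \( (16, 6) \) and \( (15, 39) \).

```python
import numpy as np

def within_cluster_ss(clusters):
    """Sum of squared distances from each point to its own cluster mean."""
    total = 0.0
    for cluster in clusters:
        points = np.asarray(cluster, dtype=float)
        total += np.sum((points - points.mean(axis=0)) ** 2)
    return total

# Grouping reported in the sample output above
run_a = [
    [[15, 39], [16, 6], [17, 40], [18, 6], [19, 3]],
    [[15, 81], [16, 77], [17, 76], [18, 94], [19, 72]],
]
# Another grouping the algorithm can stop at (start: centroids (16, 6) and (15, 39))
run_b = [
    [[16, 6], [18, 6], [19, 3]],
    [[15, 39], [15, 81], [16, 77], [17, 40], [17, 76], [18, 94], [19, 72]],
]

score_a = within_cluster_ss(run_a)
score_b = within_cluster_ss(run_b)
print(score_a, score_b)  # run_a has the lower (better) score
```

Running the algorithm several times and keeping the result with the lowest score is a common way to reduce this sensitivity to initialization.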