# Lesson 2: Enhancing Machine Learning Expertise: Mini-Batch K-Means Clustering Explained

## Introduction
Welcome back to our exploration of clustering algorithms! Today, we'll cover a faster variant of the k-means algorithm: **mini-batch k-means**. It follows the same assign-and-update idea as k-means, but processes only a small random sample of the data in each iteration, which cuts computation time at the cost of a typically small loss in clustering quality. Let's discuss its Python implementation.

## Understanding the Mini-Batch Concept
In machine learning, a **mini-batch** is a small subset of the data, randomly selected at each iteration of an algorithm. Because each iteration touches only the batch rather than the full dataset, its cost depends on the batch size instead of the dataset size. For mini-batch k-means, this makes each assignment-and-update step much cheaper than in standard k-means.
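As a minimal sketch of the idea, the snippet below draws one mini-batch from a dataset with NumPy. The dataset `X`, the generator `rng`, and the batch size of 20 are illustrative assumptions, not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded generator, for reproducibility
X = rng.normal(size=(1000, 2))      # hypothetical dataset: 1000 points in 2D

batch_size = 20                     # illustrative value
# Draw batch_size distinct row indices, then slice out the mini-batch.
idx = rng.choice(len(X), size=batch_size, replace=False)
batch = X[idx]

print(batch.shape)  # (20, 2)
```

Only these 20 rows would be processed in one iteration, regardless of how many rows `X` has.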

## Generating a Dataset and Preliminaries
Before delving into the mini-batch k-means implementation, we must establish preparatory functions and a working dataset. Our dataset consists of two distinct clusters of 100 points each, drawn from normal distributions centered at (3, 3) and (-3, -3). We also need two helpers: one that computes the **Euclidean distance**, used to assign each data point to its closest centroid, and one that randomly initializes the centroids.

The formula for Euclidean distance is:

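\[
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
\]

Here \(a_i\) and \(b_i\) are the \(i\)-th coordinates of the points \(a\) and \(b\), and \(n\) is the number of dimensions. This represents the straight-line distance between two points.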
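To make the formula concrete, here is a small worked example with two hypothetical 2D points, `a = (0, 0)` and `b = (3, 4)`. It computes the distance once by following the formula step by step and once with `np.linalg.norm`.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Sum the squared coordinate differences, then take the square root.
d_manual = np.sqrt(np.sum((a - b) ** 2))
# np.linalg.norm computes the same quantity for a 1-D difference vector.
d_norm = np.linalg.norm(a - b)

print(d_manual, d_norm)  # 5.0 5.0
```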

```python
import numpy as np
import matplotlib.pyplot as plt

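np.random.seed(0)  # fix the seed so the results are reproducible
# Two Gaussian blobs of 100 points each, centered at (3, 3) and (-3, -3)
data = np.vstack([
    np.random.normal(loc=3, scale=1, size=(100, 2)),
    np.random.normal(loc=-3, scale=1, size=(100, 2)),
])

def euclidean_distance(a, b):
    # Norm over the last axis, so broadcast inputs give one distance per pair
    return np.linalg.norm(a - b, axis=-1)

def initialize_centers(data, k):
    # Pick k distinct data points as initial centroids;
    # replace=False avoids choosing the same point twice
    idx = np.random.choice(len(data), size=k, replace=False)
    return data[idx, :]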
```

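This implementation of the `euclidean_distance` function assumes `a` and `b` are NumPy arrays whose shapes are equal or broadcast-compatible. With `axis=-1`, `np.linalg.norm` computes the vector 2-norm (the **Euclidean norm**) along the last axis, so the coordinates of each point collapse into a single distance while any leading dimensions are preserved.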
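The sketch below illustrates how broadcasting turns this into a table of point-to-centroid distances. The `points` and `centers` arrays are small made-up values chosen so the shapes are easy to follow.

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 1.0], [6.0, 8.0]])   # 3 points in 2D
centers = np.array([[0.0, 0.0], [6.0, 8.0]])              # 2 hypothetical centroids

# (3, 1, 2) - (1, 2, 2) broadcasts to (3, 2, 2); the norm over the last
# axis collapses the coordinates, leaving one distance per (point, centroid).
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)

print(dists.shape)               # (3, 2)
print(np.argmin(dists, axis=1))  # [0 0 1]
```

Row `i` of `dists` holds the distances from point `i` to each centroid, so `np.argmin` along `axis=1` yields the index of the nearest centroid for every point.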

## Python Mini-Batch K-Means Algorithm
Let's put theory into practice by implementing the mini-batch k-means. The `mini_batch_kMeans` function accepts:

- **data**: 2D coordinates representing data points.
- **k**: The number of clusters.
- **iterations**: The number of algorithm iterations.
- **batch_size**: The number of data points randomly selected in each iteration.

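The algorithm starts by initializing the centroids. In each iteration it then selects a random mini-batch, calculates the Euclidean distance from every batch point to every centroid, assigns each point to its closest centroid, and replaces each centroid with the mean of the batch points assigned to it. A centroid that receives no points in a batch is left unchanged.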

```python
# Implement mini-batch K-Means
def mini_batch_kMeans(data, k, iterations=10, batch_size=20):
    centers = initialize_centers(data, k)
    for _ in range(iterations):
        # Draw a random mini-batch of row indices
        idx = np.random.choice(len(data), size=batch_size)
        batch = data[idx, :]
        # Pairwise distances, shape (batch_size, k)
        dists = euclidean_distance(batch[:, None, :], centers[None, :, :])
        # Index of the nearest centroid for each batch point
        labels = np.argmin(dists, axis=1)
        for i in range(k):
            # Skip centroids that received no points in this batch
            if np.any(labels == i):
                centers[i] = np.mean(batch[labels == i], axis=0)
    return centers

centers = mini_batch_kMeans(data, k=2)
```
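The function above overwrites each centroid with the mean of the current batch, so earlier batches have no lasting influence. A common alternative keeps a running count per centroid and moves the centroid toward each new point with a step size of 1/count. The following is only a sketch of that idea under stated assumptions: the function name `mini_batch_kmeans_running`, the `seed` parameter, and the use of `np.random.default_rng` are illustrative choices and not part of the lesson's code.

```python
import numpy as np

def mini_batch_kmeans_running(data, k, iterations=10, batch_size=20, seed=0):
    """Mini-batch k-means with a per-centroid running-average update (sketch)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points; astype(float) returns a copy.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # how many points each centroid has absorbed so far
    for _ in range(iterations):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        dists = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as the count grows
            centers[j] = (1 - eta) * centers[j] + eta * x
    return centers

# Hypothetical data with the same shape as the lesson's dataset
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(3, 1, size=(100, 2)), rng.normal(-3, 1, size=(100, 2))])
centers = mini_batch_kmeans_running(data, k=2)
print(centers.shape)  # (2, 2)
```

Because the step size decreases as a centroid accumulates points, later batches nudge the centroid rather than replace it, which tends to make the result less sensitive to any single batch.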

## Interpreting the Results
After obtaining the final centroids, it's time to visualize the formed clusters. Each point is colored according to the centroid it is assigned to, and the red dots mark the centroid positions.

```python
# Assign every data point to its nearest final centroid
labels = np.argmin(euclidean_distance(data[:, None, :], centers[None, :, :]), axis=1)

plt.scatter(data[:, 0], data[:, 1], c=labels, s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.show()
```

*[Figure: scatter plot of the data points with the final centroids marked in red.]*



## Strengths, Drawbacks, and Applications
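The main advantages of the mini-batch k-means algorithm are **computational speed** and applicability to **large datasets**, since each iteration processes only a small batch. However, because every update is based on a random sample rather than the full dataset, the final centroids are noisier and the clustering is usually somewhat less accurate than with classic k-means. This trade-off makes the algorithm well suited to **large-scale data mining** operations where time and resources are constrained.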
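One way to put a number on clustering quality is the within-cluster sum of squared distances, often called inertia: lower values mean points sit closer to their nearest centroid. The helper below is a minimal sketch; the name `inertia` and the tiny hand-checkable dataset are assumptions made for illustration.

```python
import numpy as np

def inertia(data, centers):
    """Sum of squared distances from each point to its nearest centroid."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
    return float(np.sum(np.min(dists, axis=1) ** 2))

# Tiny example: two points per cluster
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([[0.0, 1.0], [10.0, 1.0]])   # centroids at the cluster means
bad = np.array([[0.0, 0.0], [10.0, 0.0]])    # centroids off-center

print(inertia(data, good))  # 4.0
print(inertia(data, bad))   # 8.0
```

Computing this value for centroids obtained with different batch sizes or iteration counts gives a simple basis for comparing runs.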

## Lesson Summary and Practice
Today's exploration introduced the efficient **mini-batch k-means** clustering algorithm and implemented it in Python. Experiment with different values of `k`, `iterations`, and `batch_size` to understand how they affect the output. Stay tuned for more engaging exercises in the next lesson!

