# Mini Batch K-means clustering algorithm

K-means is one of the most popular clustering algorithms, mainly because of its good time performance. As the datasets being analyzed grow, however, its computation time increases, in part because the standard algorithm needs the whole dataset in main memory. For this reason, several methods have been proposed to reduce the time and memory cost of the algorithm. One alternative approach is the Mini Batch K-means algorithm.

The main idea of the Mini Batch K-means algorithm is to use small random batches of data of a fixed size, so that they can be stored in memory. In each iteration, a new random sample is drawn from the dataset and used to update the clusters, and this is repeated until convergence. Each mini batch updates the clusters using a convex combination of the prototype values and the data, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of data points assigned to a cluster so far. As the number of iterations increases, the effect of new data is reduced, so convergence can be detected when no changes in the clusters occur over several consecutive iterations.
Empirical results suggest that it can obtain a substantial saving in computational time at the expense of some loss of cluster quality, but no extensive study of the algorithm has been done to measure how characteristics of the dataset, such as the number of clusters or its size, affect the partition quality.
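
The update rule described above can be written compactly. As a sketch, assume $x$ is a data point from the current batch, $c$ is the centroid it is assigned to, and $n_c$ counts the points assigned to $c$ so far (these symbols are introduced here for illustration only):

$$
n_c \leftarrow n_c + 1, \qquad \eta = \frac{1}{n_c}, \qquad c \leftarrow (1 - \eta)\, c + \eta\, x
$$

With $n_c$ starting at zero, this makes each centroid the running mean of the points assigned to it, so later points move it less and less.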

In each iteration, the algorithm takes a small, randomly chosen batch of the dataset. Each data point in the batch is assigned to a cluster based on the previous locations of the cluster centroids. The algorithm then updates the centroid locations using the new points from the batch. This is a gradient-descent-style update, which is significantly faster than a full batch K-means update.
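
The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `mini_batch_kmeans`, its parameters, the random initialisation, and the fixed iteration count are all assumptions made for the example.

```python
import numpy as np

def mini_batch_kmeans(X, n_clusters, batch_size=45, n_iter=100, seed=0):
    """Sketch of the mini-batch K-means loop with per-center learning rates."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise centers with randomly chosen data points
    # (scikit-learn uses k-means++ by default instead).
    init_idx = rng.choice(len(X), size=n_clusters, replace=False)
    centers = X[init_idx].astype(float)
    counts = np.zeros(n_clusters)  # points assigned to each center so far

    for _ in range(n_iter):
        # Draw a fresh random batch from the dataset.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assignment step: nearest center for every point in the batch,
        # using the center locations from the previous iteration.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Update step: convex combination of the center and the point,
        # with learning rate 1 / (points assigned to that center so far).
        for x, k in zip(batch, nearest):
            counts[k] += 1
            eta = 1.0 / counts[k]
            centers[k] = (1.0 - eta) * centers[k] + eta * x
    return centers

# Usage on two synthetic, well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.5, (200, 2)),
                    rng.normal(5, 0.5, (200, 2))])
centers_demo = mini_batch_kmeans(X_demo, n_clusters=2)
print(centers_demo)
```

The sketch runs for a fixed number of iterations for simplicity; a fuller version would stop once the centers stop changing over several consecutive batches, as described above.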

In [1]:
from sklearn.cluster import MiniBatchKMeans, KMeans 
from sklearn.metrics.pairwise import pairwise_distances_argmin 
from sklearn.datasets import make_blobs
import numpy as np

In [4]:
# Generate synthetic data in X
batch_size = 45
centers = [[1, 1], [-2, -1], [1, -2], [1, 9]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples = 3000,
                            centers = centers,
                            cluster_std = 0.9)

# perform the mini batch K-means
mbk = MiniBatchKMeans(init ='k-means++', n_clusters = n_clusters,
                      batch_size = batch_size, n_init = 10,
                      max_no_improvement = 10, verbose = 0)

mbk.fit(X) 
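# Use the fitted centers directly: sorting each column independently
# would mix coordinates from different centroids.
mbk_means_cluster_centers = mbk.cluster_centers_
mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)

# print the cluster label of each data point
print(mbk_means_labels)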


[3 3 3 ... 2 0 1]
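
To make the trade-off between run time and cluster quality concrete, the following sketch fits both `KMeans` and `MiniBatchKMeans` on the same kind of synthetic blobs and reports fit time and inertia. The variable names and the fixed `random_state` are assumptions added for reproducibility. Timings depend on the machine, and on a dataset this small the difference may be negligible.

```python
import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

centers = [[1, 1], [-2, -1], [1, -2], [1, 9]]
X_cmp, _ = make_blobs(n_samples=3000, centers=centers,
                      cluster_std=0.9, random_state=0)

models = [
    ("KMeans", KMeans(n_clusters=4, n_init=10, random_state=0)),
    ("MiniBatchKMeans", MiniBatchKMeans(n_clusters=4, batch_size=45,
                                        n_init=10, random_state=0)),
]

results = {}
for name, model in models:
    start = time.perf_counter()
    model.fit(X_cmp)
    elapsed = time.perf_counter() - start
    # inertia_ is the sum of squared distances to the closest center
    results[name] = (elapsed, model.inertia_)

for name, (seconds, inertia) in results.items():
    print(f"{name}: {seconds:.3f} s, inertia = {inertia:.1f}")
```

A lower inertia means tighter clusters, so comparing the two values gives a rough measure of how much quality the mini-batch variant gives up on this data.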


In [2]:
from sklearn import datasets
iris = datasets.load_iris()


In [7]:
X = iris["data"] 

In [8]:
X.shape

(150, 4)

In [39]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [74]:
# Configure the estimator; fit() draws batches internally,
# while partial_fit() accepts batches supplied manually
kmeans = MiniBatchKMeans(n_clusters=3, random_state=0,
                         batch_size=10)
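
Fitting manually on batches can be done with `partial_fit`, which updates the centroids from one batch at a time. The sketch below is one way to do it on the iris data. The names `X_iris` and `kmeans_manual`, the shuffling, and the single pass over the data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_iris

X_iris = load_iris()["data"]
kmeans_manual = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=10)

# Shuffle the rows so each batch mixes samples from across the dataset
rng = np.random.default_rng(0)
order = rng.permutation(len(X_iris))

# One pass over the data in batches of 10 samples
for start in range(0, len(X_iris), 10):
    batch = X_iris[order[start:start + 10]]
    kmeans_manual.partial_fit(batch)

print(kmeans_manual.cluster_centers_)
```

This pattern is useful when the data arrives as a stream or does not fit in memory, since only one batch has to be loaded at a time.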

In [75]:
kmeans = MiniBatchKMeans(n_clusters=3,
                         random_state=0,
                         batch_size=10,
                         max_iter=6).fit(X)

In [76]:
kmeans.cluster_centers_

array([[5.00238095, 3.39880952, 1.4702381 , 0.23571429],
       [6.60126582, 3.03417722, 5.51012658, 2.00506329],
       [5.8358209 , 2.74179104, 4.20746269, 1.3119403 ]])

In [77]:
# Predict the cluster of four new samples
y_kmeans = kmeans.predict([[1, 0, 0, 0], [3, 4, 3, 4], [3, 3, 3, 3], [2, 1, 2, 2]])
y_kmeans

array([0, 2, 2, 0], dtype=int32)

In [47]:
y = iris["target"]
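
Since the true species labels are available, one way to check how well the clusters match them is the adjusted Rand index, which ignores the arbitrary numbering of the clusters. The following is a minimal sketch. The variable names are assumptions, and the model is refitted here so that the example is self-contained.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris_data = load_iris()
X_eval, y_eval = iris_data["data"], iris_data["target"]

model = MiniBatchKMeans(n_clusters=3, random_state=0,
                        batch_size=10, max_iter=6).fit(X_eval)

# 1.0 means perfect agreement with the species labels;
# values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(y_eval, model.labels_)
print(f"Adjusted Rand index: {ari:.3f}")
```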

In [81]:
# Visualising the clusters
import matplotlib.pyplot as plt

# Cluster labels for all 150 samples (y_kmeans above holds only 4 predictions)
labels = kmeans.labels_

# Cluster numbers are arbitrary and do not map directly to iris species
plt.scatter(X[labels == 0, 0], X[labels == 0, 1], s = 50, c = 'red', label = 'Cluster 0')
plt.scatter(X[labels == 1, 0], X[labels == 1, 1], s = 50, c = 'blue', label = 'Cluster 1')
plt.scatter(X[labels == 2, 0], X[labels == 2, 1], s = 50, c = 'green', label = 'Cluster 2')

# Plotting the centroids of the clusters (same two features as the points)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'yellow', label = 'Centroids')

plt.legend()
plt.show()