# Mini Batch K-Means Clustering with Sklearn

This notebook shows how to cluster text documents with Mini Batch K-Means in scikit-learn (Sklearn).

Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

* Method: [Mini Batch K-Means](http://scikit-learn.org/stable/modules/clustering.html#mini-batch-kmeans)
* Dataset: Sklearn 20 newsgroups
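The mini-batch idea described above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not scikit-learn's actual implementation: the helper `mini_batch_kmeans_step`, the synthetic two-blob data, and the batch size of 32 are all hypothetical choices made for the example. Each center keeps a count of the samples assigned to it so far and moves toward new samples with a per-center learning rate of `1 / count`, which amounts to a streaming average.

```python
import numpy as np

def mini_batch_kmeans_step(centers, counts, batch):
    """Apply one mini-batch update to `centers` (illustrative sketch only)."""
    # Assign every sample in the batch to its nearest center first.
    distances = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = distances.argmin(axis=1)
    # Then move each assigned center toward its samples, one sample at a time.
    for x, j in zip(batch, nearest):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate
        centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers, counts

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs (hypothetical data, not the newsgroups set).
points = np.vstack([rng.normal(0.0, 0.5, size=(500, 2)),
                    rng.normal(5.0, 0.5, size=(500, 2))])

centers = points[[0, 500]].copy()  # one starting point taken from each blob
counts = np.zeros(len(centers))

for _ in range(50):
    batch = points[rng.choice(len(points), size=32, replace=False)]
    centers, counts = mini_batch_kmeans_step(centers, counts, batch)

print(centers.round(2))
```

Each iteration touches only 32 of the 1000 points, yet the centers should settle close to the two blob means.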

## Imports

In [None]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import MiniBatchKMeans

import seaborn as sb
import matplotlib.pyplot as plt
from matplotlib import rcParams  # pylab is discouraged; import rcParams from matplotlib directly

%matplotlib inline
rcParams['figure.figsize'] = 10, 8
sb.set_style('whitegrid')

## Load and Prepare the Data

In [None]:
# Perform analysis on all the categories
categories = None

In [None]:
# Load the data
data = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

In [None]:
# Show some information on the data
print("%d documents" % len(data.data))
print("%d categories" % len(data.target_names))

In [None]:
# Get the labels for the data
labels = data.target

In [None]:
# Get the true K of the dataset
true_k = np.unique(labels).shape[0]
print("True K: {}".format(true_k))

### Extract the features from the training data using a sparse vectorizer

In [None]:
# number of features (dimensions) to extract from the text
n_features = 10000

In [None]:
# Perform an IDF (Inverse Document Frequency) normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=n_features,
                           stop_words='english',
                           alternate_sign=False,
                           norm=None,
                           binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())

X = vectorizer.fit_transform(data.data)

print("n_samples: %d, n_features: %d" % X.shape)
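
To see what the hashing plus IDF pipeline produces, here is a small sketch on a toy corpus. The three documents and the tiny `n_features=32` are made up for illustration; the pipeline itself mirrors the cell above. `TfidfTransformer` applies L2 normalization by default, so every non-empty row should end up with unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, used only to illustrate the pipeline.
toy_docs = ["the cat sat on the mat",
            "dogs chase cats",
            "space shuttle launch delayed"]

toy_vectorizer = make_pipeline(
    HashingVectorizer(n_features=32, stop_words='english',
                      alternate_sign=False, norm=None),
    TfidfTransformer(),  # reweights by IDF and L2-normalizes each row
)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# Row-wise Euclidean norms of the sparse result.
row_norms = np.asarray(np.sqrt(toy_X.multiply(toy_X).sum(axis=1))).ravel()

print(toy_X.shape)
print(row_norms)
```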

### Perform dimensionality reduction using LSA

In [None]:
# Number of components used in the dimensionality reduction
n_components = 100

In [None]:
# Vectorizer results are normalized, which makes KMeans behave as spherical k-means for better results.
# Since LSA/SVD results are not normalized, we have to redo the normalization.
svd = TruncatedSVD(n_components=n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X = lsa.fit_transform(X)

print("n_samples: %d, n_features: %d" % X.shape)

In [None]:
explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))
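
The need for the second normalization can be checked on a small random matrix. This is a sketch with made-up data (`toy_matrix` stands in for the TF-IDF matrix, and 5 components is an arbitrary choice): the rows coming out of `TruncatedSVD` generally do not have unit length, while the rows after `Normalizer` do.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(0)
toy_matrix = rng.random((50, 20))  # hypothetical stand-in for the TF-IDF matrix

toy_svd = TruncatedSVD(n_components=5, random_state=0)
reduced = toy_svd.fit_transform(toy_matrix)

# Normalizer() copies by default, so `reduced` is left untouched here.
renormalized = Normalizer().fit_transform(reduced)

norms_before = np.linalg.norm(reduced, axis=1)
norms_after = np.linalg.norm(renormalized, axis=1)

print(norms_before[:3])
print(norms_after[:3])
```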

## Fit a Mini Batch K-Means Clustering Model

In [None]:
# Instantiate the model
km = MiniBatchKMeans(n_clusters=true_k,
                     init='k-means++',
                     n_init=1,
                     init_size=1000,
                     batch_size=1000,
                     verbose=False)

In [None]:
# Fit the model
km.fit(X)
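
Besides `fit`, `MiniBatchKMeans` also offers `partial_fit`, which updates the model one chunk at a time and is useful when the data does not fit in memory. The following is a minimal sketch on synthetic data; the two blobs, the chunk count, and the name `online_km` are assumptions made for the example and are unrelated to the newsgroups model above.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Two synthetic blobs, shuffled to mimic an unordered data stream.
blob_a = rng.normal(0.0, 0.3, size=(300, 2))
blob_b = rng.normal(4.0, 0.3, size=(300, 2))
stream = rng.permutation(np.vstack([blob_a, blob_b]))

online_km = MiniBatchKMeans(n_clusters=2, n_init=1, random_state=0)
for chunk in np.array_split(stream, 6):
    online_km.partial_fit(chunk)  # incremental update with one chunk

# Assign unseen points to the learned clusters.
new_points = np.array([[0.1, -0.1], [3.9, 4.2]])
predicted = online_km.predict(new_points)
print(predicted)
```

The cluster indices themselves are arbitrary; what matters is that the two new points land in different clusters.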

## Evaluate the Model

### Homogeneity

Measures the homogeneity of a cluster labeling given a ground truth. A clustering result satisfies homogeneity if all of its clusters contain only data points that are members of a single class.

Score between 0.0 and 1.0
* 1.0 stands for perfectly homogeneous labeling

In [None]:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
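
A tiny hand-made example can help build intuition for the two extremes. The label lists below are hypothetical: splitting every sample into its own cluster keeps each cluster pure, while merging everything into one cluster mixes both classes.

```python
from sklearn import metrics

true_classes = [0, 0, 1, 1]
over_split = [0, 1, 2, 3]  # every sample in its own cluster: all clusters are pure
merged = [0, 0, 0, 0]      # a single cluster containing both classes

h_over_split = metrics.homogeneity_score(true_classes, over_split)
h_merged = metrics.homogeneity_score(true_classes, merged)

print("over-split: %0.3f, merged: %0.3f" % (h_over_split, h_merged))
```

A high homogeneity score alone therefore says little, because over-splitting is never penalized.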

### Completeness

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

Score between 0.0 and 1.0
* 1.0 stands for perfectly complete labeling

In [None]:
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
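
Completeness is the mirror image of homogeneity: swapping the roles of the true and predicted labels turns one score into the other. The sketch below checks this on a small made-up labeling (`classes_a` and `clusters_b` are hypothetical).

```python
from sklearn import metrics

classes_a = [0, 0, 0, 1, 1, 1]
clusters_b = [0, 0, 1, 1, 2, 2]

completeness_ab = metrics.completeness_score(classes_a, clusters_b)
homogeneity_ba = metrics.homogeneity_score(clusters_b, classes_a)

print("completeness(a, b): %0.3f" % completeness_ab)
print("homogeneity(b, a):  %0.3f" % homogeneity_ba)
```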

### V-Measure

The harmonic mean of homogeneity and completeness.

Score between 0.0 and 1.0.
* 1.0 stands for perfect labeling (both homogeneous and complete).

In [None]:
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
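
The harmonic-mean relationship can be verified directly. In this sketch the labelings are made up, and the default `beta=1.0` of `v_measure_score` is assumed, which weights homogeneity and completeness equally.

```python
from sklearn import metrics

true_classes = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 2, 2]

h = metrics.homogeneity_score(true_classes, predicted)
c = metrics.completeness_score(true_classes, predicted)
harmonic_mean = 2 * h * c / (h + c)

v = metrics.v_measure_score(true_classes, predicted)

print("harmonic mean: %0.3f, V-measure: %0.3f" % (harmonic_mean, v))
```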

### Adjusted Rand-Index

Computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.

Similarity score with an upper bound of 1.0; values can be negative when the agreement is worse than expected by chance.
* Random labelings have an ARI close to 0.0.
* 1.0 stands for perfect match.

In [None]:
print("Adjusted Rand-Index: %.3f" % metrics.adjusted_rand_score(labels, km.labels_))
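
Two properties of the score are easy to check on synthetic labels: it ignores how the clusters are numbered, and it stays near zero for unrelated random labelings. The label arrays below are hypothetical and generated only for this illustration.

```python
import numpy as np
from sklearn import metrics

# Same partition, different cluster numbering.
ari_renamed = metrics.adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])

# Two independent random labelings with 5 labels each.
rng = np.random.default_rng(0)
random_a = rng.integers(0, 5, size=1000)
random_b = rng.integers(0, 5, size=1000)
ari_random = metrics.adjusted_rand_score(random_a, random_b)

print("renamed clusters: %.3f" % ari_renamed)
print("random labelings: %.3f" % ari_random)
```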

### Silhouette Score

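Measures how well defined the clusters are by comparing, for each sample, the mean distance to the other samples in its own cluster with the mean distance to the samples in the nearest other cluster. Unlike the metrics above, it does not use the ground-truth labels.

Score between -1 and 1
* 1: best (dense, well-separated clusters)
* 0: indicates overlapping clusters
* -1: worst (samples are generally assigned to the wrong cluster)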

In [None]:
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, km.labels_, sample_size=1000))
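
For intuition, the sketch below compares a sensible labeling with a shuffled one on two tight synthetic blobs. The data and variable names are made up for this example. On larger datasets, the `sample_size` argument used above trades accuracy for speed, since the full score needs all pairwise distances.

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
# Two tight, well-separated synthetic blobs of 100 points each.
blobs = np.vstack([rng.normal(0.0, 0.2, size=(100, 2)),
                   rng.normal(5.0, 0.2, size=(100, 2))])

correct_labels = np.repeat([0, 1], 100)
shuffled_labels = rng.permutation(correct_labels)  # labels unrelated to geometry

s_correct = metrics.silhouette_score(blobs, correct_labels)
s_shuffled = metrics.silhouette_score(blobs, shuffled_labels)

print("correct labels:  %0.3f" % s_correct)
print("shuffled labels: %0.3f" % s_shuffled)
```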