# Clustering text documents using k-means
Example from [sklearn documentation](http://scikit-learn.org/stable/auto_examples/text/document_clustering.html) by Peter Prettenhofer and Lars Buitinck.

This example shows how scikit-learn can be used to cluster
documents by topic using a bag-of-words approach. The features are stored in
a scipy.sparse matrix instead of a standard numpy array.
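
To see why a sparse format pays off for bag-of-words features, here is a minimal sketch in plain Python. It does not use scipy itself; the toy documents and the `to_sparse` helper are made up for illustration. Each document is stored as a mapping from column index to count, so zero entries take no space.

```python
# Toy bag-of-words: most entries of a document-term matrix are zero,
# so storing only the non-zero counts saves memory.
docs = ["space shuttle launch", "graphics image file", "space image"]
vocab = sorted(set(word for doc in docs for word in doc.split()))
col = dict((word, j) for j, word in enumerate(vocab))

def to_sparse(doc):
    """Map column index -> count for the words present in one document."""
    row = {}
    for word in doc.split():
        row[col[word]] = row.get(col[word], 0) + 1
    return row

sparse_rows = [to_sparse(doc) for doc in docs]
dense_cells = len(docs) * len(vocab)
stored_cells = sum(len(row) for row in sparse_rows)
print("dense cells: %d, stored non-zeros: %d" % (dense_cells, stored_cells))
```

With thousands of documents and a vocabulary of 10000 terms, the gap between the two numbers becomes much larger, since each document only uses a small part of the vocabulary.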

k-means is very sensitive to feature scaling. In this case, IDF weighting
substantially improves the quality of the clustering, as measured against
the "ground truth" provided by the class labels of the 20 newsgroups dataset.
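
The effect of IDF weighting can be sketched by hand. The snippet below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is, to my understanding, the default behaviour of `TfidfVectorizer` (`smooth_idf=True`). The toy documents and the `idf` helper are made up for illustration.

```python
from math import log

# Three tiny tokenized documents (illustrative only).
docs = [
    "the space shuttle".split(),
    "the graphics card".split(),
    "the moon and the shuttle".split(),
]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency of a term."""
    # df = number of documents that contain the term at least once
    df = sum(1 for doc in docs if term in doc)
    return log((1.0 + n_docs) / (1.0 + df)) + 1.0

for term in ["the", "shuttle", "graphics"]:
    print("%-8s idf = %.3f" % (term, idf(term)))
```

A term that occurs in every document ("the") gets the smallest weight, while a term confined to one document ("graphics") gets the largest, so the distances k-means computes are dominated by topic-specific words rather than by common ones.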

Note: because k-means optimizes a non-convex objective function, it will likely
end up in a local optimum. Several runs with independent random initializations
might be necessary to find a good solution.
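
The restart idea can be illustrated with a minimal sketch in plain Python: a 1-D version of Lloyd's algorithm is run several times from random starting centers, and the run with the lowest inertia (sum of squared distances to the nearest center) is kept. The `kmeans_1d` helper and the toy data are made up for illustration; this is not scikit-learn's implementation. In `KMeans`, the number of such restarts is set by the `n_init` parameter.

```python
import random

def kmeans_1d(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm on 1-D data from a random start."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / float(len(c)) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return inertia, centers

# Three well-separated groups of three points each.
points = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 20.0, 20.5, 21.0]
rng = random.Random(0)
runs = [kmeans_1d(points, 3, rng) for _ in range(10)]
best_inertia, best_centers = min(runs)
print("inertia per run: %s" % [round(r[0], 2) for r in runs])
print("best inertia: %.2f" % best_inertia)
```

Runs that start with two centers in the same group can get stuck with a higher inertia, which is why keeping the best of several runs helps.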

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans

#### Download data

In [2]:
%%time
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
print("Loading 20-newsgroups dataset for categories: %s" % categories)
# subset='all' loads both the train and the test split
dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True)
print("%d articles in %d categories" % (len(dataset.data), len(dataset.target_names)))
labels = dataset.target
# Use the number of distinct newsgroup labels as the number of clusters
true_k = np.unique(labels).shape[0]

Loading 20-newsgroups dataset for categories: ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
3387 articles in 4 categories
CPU times: user 649 ms, sys: 67.7 ms, total: 716 ms
Wall time: 725 ms


#### Feature extraction
Extract features from the dataset using a sparse vectorizer.

In [3]:
%%time
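# Drop English stop words, terms found in fewer than 2 documents (min_df=2)
# and terms found in more than 50% of the documents (max_df=0.5);
# keep at most the 10000 most frequent remaining terms.
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000, min_df=2, max_df=0.5)
X = vectorizer.fit_transform(dataset.data)
print("Dimensions feature matrix: %s" % (X.shape,))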

Dimensions feature matrix: (3387, 10000)
CPU times: user 928 ms, sys: 42 ms, total: 970 ms
Wall time: 950 ms


#### Dimension reduction
We could reduce the number of dimensions by performing an SVD (see next lesson), but we would then lose the actual meaning of the words in each cluster. We skip this step for now, since we haven't covered SVD yet and we would like to inspect the words in each cluster.

In [4]:
# Set to None to skip dimensionality reduction; otherwise choose an integer
n_components = None

In [7]:
%%time
if n_components is not None:
    print("Performing dimensionality reduction using LSA")
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(n_components)
    lsa = make_pipeline(svd, Normalizer(copy=False))

    X_red = lsa.fit_transform(X)

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: %d%%" % (explained_variance * 100))
    # Report the shape of the reduced matrix, not of the original X
    print("Dimensions feature matrix: %s" % (X_red.shape,))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.01 µs


In [8]:
if n_components is not None:
    X = X_red
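
The code comment above about normalization and spherical k-means rests on a simple identity: for unit-length vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity, so minimizing Euclidean distance on normalized rows is the same as maximizing cosine similarity. A minimal check in plain Python, with made-up vectors and helper names:

```python
from math import sqrt

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
cosine_similarity = dot(a, b)
print("squared distance: %.6f" % squared_distance)
print("2 - 2 * cosine:   %.6f" % (2 - 2 * cosine_similarity))
```

The two printed values agree up to floating-point rounding, which is why the LSA output is passed through `Normalizer` again before clustering.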

#### Applying k-means
Do the actual clustering.

In [9]:
%%time
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, verbose=False)
km.fit(X)

CPU times: user 252 ms, sys: 2.6 ms, total: 254 ms
Wall time: 254 ms


In [11]:
# The first four scores compare the clusters with the true newsgroup labels;
# the silhouette coefficient only uses the features and the cluster assignment.
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f" % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, km.labels_, sample_size=1000))

Homogeneity: 0.572
Completeness: 0.645
V-measure: 0.606
Adjusted Rand-Index: 0.566
Silhouette Coefficient: 0.007
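
The first three scores are related: the V-measure is the harmonic mean of homogeneity and completeness, which is consistent with the values above (2 * 0.572 * 0.645 / (0.572 + 0.645) ≈ 0.606). The sketch below computes the three scores from their entropy-based definitions as I understand them, in plain Python; the toy labels and helper names are made up for illustration and this is not scikit-learn's code.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = float(len(labels))
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b): uncertainty left about a once b is known."""
    n = float(len(a))
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * log(c / float(b_counts[bv]))
                for (av, bv), c in joint.items())

def homogeneity(classes, clusters):
    """1.0 when every cluster contains members of a single class."""
    h = entropy(classes)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(classes, clusters) / h

def completeness(classes, clusters):
    """1.0 when every class ends up in a single cluster."""
    return homogeneity(clusters, classes)

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h, c = homogeneity(classes, clusters), completeness(classes, clusters)
    return 0.0 if h + c == 0 else 2.0 * h * c / (h + c)

# Class 1 is split over two clusters: pure clusters, but not complete.
classes = [0, 0, 1, 1]
clusters = [0, 0, 1, 2]
print("Homogeneity: %0.3f" % homogeneity(classes, clusters))
print("Completeness: %0.3f" % completeness(classes, clusters))
print("V-measure: %0.3f" % v_measure(classes, clusters))
```

In the toy labeling every cluster is pure, so homogeneity is perfect, while completeness is penalized because one class is spread over two clusters.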


#### Show some words in each cluster

In [13]:
top = 12
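if n_components is None:
    print("Top %d terms per cluster:" % top)
    # Sort the feature weights of each centroid in descending order
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    # On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        top_terms = " ".join(terms[ind] for ind in order_centroids[i, :top])
        print("Cluster %d: %s" % (i, top_terms))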

Top 12 terms per cluster:
Cluster 0: sgi keith livesey wpd solntze jon caltech com schneider morality allan cco
Cluster 1: graphics university com image thanks posting host nntp ac computer file 3d
Cluster 2: space nasa henry access digex toronto gov pat alaska shuttle com moon
Cluster 3: god com sandvik people jesus article don say christian bible religion believe
