# Clustering with scikit-learn

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. It is a central tool of exploratory data mining, statistical data analysis, and machine learning. Clustering is a form of unsupervised learning, in that the datapoints are grouped without information about labels.

Cluster analysis itself is not one specific algorithm, but the name of a general task. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find them. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

We will cover here how to implement two simple clustering methods using scikit-learn:
- k-means
- Mixture of Gaussians

For the full documentation, we encourage you to have a look at the [official scikit-learn documentation on clustering](http://scikit-learn.org/stable/modules/clustering.html).

## Generate toy data

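For the purposes of this tutorial, we will generate a toy dataset of $N=1000$ datapoints in $D=2$ dimensions. The data are drawn in equal parts from four isotropic Gaussians centered at $(\pm 1, \pm 1)$, each with standard deviation $0.5$.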

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

# Generate toy data
N = 1000
D = 2
X = np.zeros((N, D))
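# One mean per cluster; each cluster receives N//4 datapoints
means_k = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
for k in range(4):
    # Row indices of the datapoints that belong to cluster k
    idx_k = np.arange(N//4) + k*(N//4)
    # Isotropic Gaussian noise with standard deviation 0.5 around the mean
    X[idx_k, :] = means_k[k] + 0.5*np.random.randn(N//4, D)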

# Plot the data
plt.figure()
plt.scatter(X[:,0], X[:,1])

## k-means

The k-means algorithm clusters data by trying to separate samples into $K$ groups of equal variance, minimizing a criterion known as the within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified in advance. It scales well to a large number of samples and has been used across a wide range of application areas.
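Because $K$ must be chosen by the user, a common heuristic is to fit the model for several values of $K$ and look at how the within-cluster sum-of-squares decreases. The sketch below is one way to do this, assuming a dataset generated like the toy data in this tutorial; it reads the criterion from the `inertia_` attribute of the fitted `KMeans` object. The variable names, the range of $K$, and the fixed seeds are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data similar to the tutorial's: four isotropic Gaussian blobs (assumed setup)
rng = np.random.default_rng(0)
means = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
X = np.vstack([m + 0.5 * rng.standard_normal((250, 2)) for m in means])

# Fit k-means for several values of K and record the within-cluster sum-of-squares
inertias = {}
for K in range(1, 7):
    model = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    inertias[K] = model.inertia_
    print(f"K={K}: within-cluster sum-of-squares = {model.inertia_:.1f}")
```

The criterion keeps decreasing as $K$ grows, so it cannot be minimized over $K$ directly; one typically looks for the value of $K$ after which the improvement levels off.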

The k-means algorithm divides a set of $N$ samples into $K$ disjoint clusters, each described by the mean $\mu_k$ of the samples in the cluster. The means are commonly called the cluster *centroids*; note that they are generally not data points, although they live in the same space. The k-means algorithm aims to choose centroids that minimize the within-cluster sum-of-squares criterion, i.e.,
$$
\min_{\mu_1,\ldots,\mu_K,\; z_1,\ldots,z_N} \sum_{n=1}^{N} \|x_n-\mu_{z_n}\|^2,
$$
where $z_n\in\{1,\ldots,K\}$ is the indicator variable of the cluster assigned to datapoint $n$.
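To make the criterion concrete, the following sketch evaluates the sum above by hand for a fitted model and compares it with the value scikit-learn stores in `inertia_`. It assumes data generated like the toy dataset of this tutorial; the names `X`, `model`, and `wcss` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data similar to the tutorial's: four isotropic Gaussian blobs (assumed setup)
rng = np.random.default_rng(0)
means = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
X = np.vstack([m + 0.5 * rng.standard_normal((250, 2)) for m in means])

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# mu_{z_n}: the centroid assigned to each datapoint
assigned_centroids = model.cluster_centers_[model.labels_]

# Sum over n of ||x_n - mu_{z_n}||^2
wcss = np.sum((X - assigned_centroids) ** 2)

print(wcss, model.inertia_)
```

The two printed numbers should agree up to floating-point error.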

**Example of usage.** We use `sklearn.cluster.KMeans`. See [this page](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) for the full documentation.

In [None]:
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=4).fit(X)

**Obtaining the labels.** `kmeans_model.labels_` gives us the labels for all datapoints.

In [None]:
all_labels = kmeans_model.labels_

We can use this information, e.g., to plot the results.

In [None]:
# Plot the clustered data (using a different color per cluster)
plt.figure()
for k in range(4):
    idx_k = (all_labels==k)
    plt.scatter(X[idx_k,0], X[idx_k,1])
plt.show()

**Predictions on unseen data.** We can obtain the labels of new datapoints as follows.

In [None]:
Xnew = np.random.randn(10, D)
labels_new = kmeans_model.predict(Xnew)
print(labels_new)
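For k-means, predicting the label of a new datapoint amounts to finding its nearest centroid. The sketch below reproduces `predict` by hand with a Euclidean distance computation, assuming data generated like the toy dataset above; all variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data similar to the tutorial's: four isotropic Gaussian blobs (assumed setup)
rng = np.random.default_rng(0)
means = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
X = np.vstack([m + 0.5 * rng.standard_normal((250, 2)) for m in means])
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Ten unseen datapoints
Xnew = rng.standard_normal((10, 2))

# Distance from every new point to every centroid, shape (10, 4)
dists = np.linalg.norm(Xnew[:, None, :] - model.cluster_centers_[None, :, :], axis=2)

# Index of the nearest centroid for each new point
labels_manual = dists.argmin(axis=1)

print(labels_manual)
print(model.predict(Xnew))
```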

**Obtain the cluster means.** The cluster means can be obtained as shown below.

In [None]:
cluster_means = kmeans_model.cluster_centers_

We can also make a plot.

In [None]:
# Plot the clustered data (one color per cluster) and mark the cluster means with crosses
plt.figure()
for k in range(4):
    idx_k = (all_labels==k)
    plt.scatter(X[idx_k,0], X[idx_k,1])
plt.scatter(cluster_means[:,0], cluster_means[:,1], marker='x')
plt.show()

**Limitations of k-means.** For a list of the limitations of this algorithm, we refer the reader to [this page](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html#sphx-glr-auto-examples-cluster-plot-kmeans-assumptions-py).

## Mixture of Gaussians

Gaussian mixture models are a type of probabilistic model. For each cluster, they assume that the data in that cluster has been generated from a Gaussian distribution of certain mean and covariance. The goal is thus to find the parameters (means and covariances) of these Gaussian distributions.

Formally, the likelihood of datapoint $x_n$ given that it belongs to cluster $k$ is
$$
p(x_n|z_n=k) = \frac{1}{(2\pi)^{D/2}|\Sigma_k|^{1/2}} \exp\left\{- \frac{1}{2} (x_n-\mu_k)^\top \Sigma_k^{-1}(x_n-\mu_k) \right\},
$$
where $z_n$ is a cluster indicator variable, $\mu_k$ is the cluster mean, and $\Sigma_k$ is the cluster covariance. For our toy data, $D=2$, so the normalizing constant reduces to $2\pi|\Sigma_k|^{1/2}$.
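As a sanity check on the formula, the sketch below evaluates the Gaussian density by hand and compares it with `scipy.stats.multivariate_normal`. It uses the normalizing constant $(2\pi)^{D/2}|\Sigma_k|^{1/2}$, which equals $2\pi|\Sigma_k|^{1/2}$ for $D=2$. The helper `gaussian_density` and the example values of `x`, `mu`, and `Sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Density of a D-dimensional Gaussian evaluated at a single point x."""
    D = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Illustrative values (not taken from a fitted model)
x = np.array([0.5, 1.2])
mu = np.array([1.0, 1.0])
Sigma = np.array([[0.25, 0.05],
                  [0.05, 0.25]])

p_manual = gaussian_density(x, mu, Sigma)
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(p_manual, p_scipy)
```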

The joint probability $p(x_n,z_n)$ has an additional parameter: the *weight* of each cluster. Formally,
$$
p(x_n,z_n=k) = p(x_n|z_n=k)p(z_n=k)= w_k p(x_n|z_n=k),
$$
where $w_k$ is the prior probability of cluster $k$ (this allows modeling data clustered in uneven groups).
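The joint probability also gives the posterior over clusters via Bayes' rule, $p(z_n=k|x_n) \propto w_k\, p(x_n|z_n=k)$. The sketch below computes the joint by hand from the parameters of a fitted `GaussianMixture` and checks that normalizing it matches `predict_proba`. It assumes data generated like the toy dataset above and the default full covariance matrices; variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Toy data similar to the tutorial's: four isotropic Gaussian blobs (assumed setup)
rng = np.random.default_rng(0)
means = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
X = np.vstack([m + 0.5 * rng.standard_normal((250, 2)) for m in means])

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)

# Column k holds p(x_n, z_n=k) = w_k * p(x_n | z_n=k) for every datapoint
joint = np.column_stack([
    w * multivariate_normal(mean=m, cov=S).pdf(X)
    for w, m, S in zip(gmm.weights_, gmm.means_, gmm.covariances_)
])

# Bayes' rule: divide by the marginal p(x_n) = sum_k p(x_n, z_n=k)
posterior = joint / joint.sum(axis=1, keepdims=True)

print(np.abs(posterior - gmm.predict_proba(X)).max())
```

The printed maximum difference should be close to zero.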

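These equations form the model specification. In a fully Bayesian treatment, one would additionally place priors over $w_k$, $\mu_k$, and $\Sigma_k$; the estimator used below instead fits these parameters by maximum likelihood.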

**Example of usage.** The module `sklearn.mixture` provides a Gaussian mixture model, which is fit to the data via the expectation-maximization (EM) algorithm. See the full documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture). Its usage is similar to k-means.
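Since EM is an iterative procedure, it can be useful to check whether it converged and how many iterations it took. The sketch below reads the `converged_`, `n_iter_`, and `lower_bound_` attributes of a fitted `GaussianMixture`, assuming data generated like the toy dataset above. The values of `max_iter` and `random_state` are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data similar to the tutorial's: four isotropic Gaussian blobs (assumed setup)
rng = np.random.default_rng(0)
means = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
X = np.vstack([m + 0.5 * rng.standard_normal((250, 2)) for m in means])

gmm = GaussianMixture(n_components=4, max_iter=500, random_state=0).fit(X)

print("converged:", gmm.converged_)           # whether EM met its tolerance
print("EM iterations:", gmm.n_iter_)          # iterations used by the best run
print("lower bound:", gmm.lower_bound_)       # bound on the log-likelihood at the end of EM
print("mean log-likelihood:", gmm.score(X))   # average log p(x_n) under the fitted model
```

EM only finds a local optimum, so results can depend on the initialization; the `n_init` parameter of `GaussianMixture` runs several initializations and keeps the best one.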

In [None]:
from sklearn.mixture import GaussianMixture
gmm_model = GaussianMixture(n_components=4).fit(X)

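**Obtaining the labels.** Unlike `KMeans`, a fitted `GaussianMixture` has no `labels_` attribute; we obtain the labels by calling `predict` on the training data.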

In [None]:
all_labels = gmm_model.predict(X)

We can plot the clustered data.

In [None]:
# Plot the clustered data (using a different color per cluster)
plt.figure()
for k in range(4):
    idx_k = (all_labels==k)
    plt.scatter(X[idx_k,0], X[idx_k,1])
plt.show()

**Predictions on unseen data.** As with k-means, `predict` assigns new datapoints to clusters.

In [None]:
Xnew = np.random.randn(10, D)
labels_new = gmm_model.predict(Xnew)
print(labels_new)
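Because the mixture is a probabilistic model, it can also return soft assignments. The sketch below, which assumes data generated like the toy dataset above, obtains the posterior probability of each cluster with `predict_proba` and compares the most probable cluster with the output of `predict`. Variable names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data similar to the tutorial's: four isotropic Gaussian blobs (assumed setup)
rng = np.random.default_rng(0)
means = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
X = np.vstack([m + 0.5 * rng.standard_normal((250, 2)) for m in means])
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)

# Ten unseen datapoints
Xnew = rng.standard_normal((10, 2))

hard_labels = gmm.predict(Xnew)        # one label per datapoint
soft_labels = gmm.predict_proba(Xnew)  # shape (10, 4); each row sums to 1

print(hard_labels)
print(np.round(soft_labels, 2))
```

Rows of `soft_labels` whose largest entry is well below 1 correspond to points that lie between clusters.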

**Obtain the cluster parameters.** We can obtain the cluster means, covariances, and weights, as shown below.

In [None]:
gmm_means = gmm_model.means_
gmm_cov = gmm_model.covariances_
gmm_weight = gmm_model.weights_

We can, e.g., plot the means:

In [None]:
# Plot the clustered data (one color per cluster) and mark the fitted means with crosses
plt.figure()
for k in range(4):
    idx_k = (all_labels==k)
    plt.scatter(X[idx_k,0], X[idx_k,1])
plt.scatter(gmm_means[:,0], gmm_means[:,1], marker='x')
plt.show()

Note that the covariance matrices are approximately diagonal, and that the weights are approximately uniform. This is a consequence of how we generated the data (with zero covariance between the two dimensions, and with the same number of datapoints in each group).

In [None]:
print(gmm_cov)

In [None]:
print(gmm_weight)
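The observation about the covariances and weights can also be checked numerically rather than by eye. The sketch below assumes data generated like the toy dataset above (standard deviation $0.5$, so a true variance of $0.25$ on the diagonal, and true weights of $0.25$); the use of `n_init=5` and a fixed seed is an illustrative choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data similar to the tutorial's: four isotropic Gaussian blobs (assumed setup)
rng = np.random.default_rng(0)
means = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
X = np.vstack([m + 0.5 * rng.standard_normal((250, 2)) for m in means])

gmm = GaussianMixture(n_components=4, n_init=5, random_state=0).fit(X)

# covariances_ has shape (4, 2, 2) for the default covariance_type="full"
off_diagonal = gmm.covariances_[:, 0, 1]
diagonal = np.stack([gmm.covariances_[:, 0, 0], gmm.covariances_[:, 1, 1]], axis=1)

print("off-diagonal entries:", np.round(off_diagonal, 3))  # expected near 0
print("diagonal entries:\n", np.round(diagonal, 3))        # expected near 0.25
print("weights:", np.round(gmm.weights_, 3))               # expected near 0.25
```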