# Unsupervised Learning

Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data **without labeled responses**. The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find **hidden patterns** or groupings in data.

Note that because there are no labels, there is no such thing as a training and test set in the usual sense: we have no idea what the *correct* clustering looks like!

## K-Means

One of the most popular clustering algorithms is K-means. The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields.
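
To make the inertia criterion concrete, here is a minimal sketch that recomputes it by hand and compares it with the value scikit-learn reports in `inertia_`. The toy data, the variable names and the `n_init` and `random_state` settings are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two well-separated 2-D blobs.
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10.0])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Inertia by hand: sum of squared distances between each sample
# and the center of the cluster it was assigned to.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_toy - assigned_centers) ** 2)

# Both numbers should agree up to floating-point error.
print(manual_inertia, km.inertia_)
```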

Though normally we have no access to the true clusters, let's use the Iris dataset so that we have a baseline against which to compare how well this algorithm performs:

In [None]:
from sklearn import cluster, datasets
import matplotlib.pyplot as plt
%matplotlib inline

iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris) 

print(k_means.labels_[::10])

print(y_iris[::10])
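
The cluster ids printed above are arbitrary: cluster 0 need not correspond to class 0, so comparing the two arrays element by element can be misleading. One possible way to compare them, sketched below, is the adjusted Rand index from `sklearn.metrics`, which ignores how the clusters are numbered. The `n_init` and `random_state` settings are assumptions added to make the run repeatable.

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# 1.0 means the two partitions are identical;
# values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(iris.target, k_means.labels_)
print(ari)

# Renumbering the clusters does not change the score.
permuted = (k_means.labels_ + 1) % 3
ari_permuted = adjusted_rand_score(iris.target, permuted)
print(ari_permuted)
```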

What happens when more clusters are used? Plot the ground truth, n_clusters=3 and n_clusters=4 below (2 dimensions are enough):

In [None]:
# %load 5_Machine_Learning/kmeans.py

f, ax = plt.subplots(1,3, figsize=(20,5))
ax[0].scatter(X_iris[:,0], X_iris[:,1], c=y_iris)
ax[0].set_title("ground truth")
ax[1].scatter(X_iris[:,0], X_iris[:,1], c=k_means.labels_)
ax[1].set_title("kmeans: 3 clusters")

k_means = cluster.KMeans(n_clusters=4)
k_means.fit(X_iris)

ax[2].scatter(X_iris[:,0], X_iris[:,1], c=k_means.labels_)
ax[2].set_title("kmeans: 4 clusters")

But as we said, you usually do not know the correct number of clusters, so how do we find it? For this we need to take a look at Gaussian mixture models and their extension to Dirichlet process models. Unfortunately this is a more complex topic, so we will not go into it extensively.
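
As a rough illustration of how a Gaussian mixture model can help with this choice, the sketch below fits mixtures with different numbers of components to the Iris data and keeps the one with the lowest Bayesian information criterion (BIC). This is one common heuristic, not the Dirichlet process approach; the candidate range and the `random_state` are assumptions, and the selected value need not match the true number of species.

```python
from sklearn import datasets
from sklearn.mixture import GaussianMixture

X_iris = datasets.load_iris().data

# Candidate numbers of components (an arbitrary range for illustration).
candidate_k = range(1, 7)
bic_scores = []
for k in candidate_k:
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=0).fit(X_iris)
    # Lower BIC is better: it trades goodness of fit against model size.
    bic_scores.append(gmm.bic(X_iris))

best_k = candidate_k[bic_scores.index(min(bic_scores))]
print(best_k)
```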

## Dirichlet Process Models

Here is a quick working example, though.

In [None]:
import numpy as np
from sklearn import mixture
from scipy import linalg
import itertools
import matplotlib as mpl
color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold',
                              'darkorange'])

# Number of samples per component
n_samples = 500

# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          .7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]

In [None]:
# Fit a Dirichlet process Gaussian mixture using five components
dpgmm = mixture.BayesianGaussianMixture(n_components=5,
                                        covariance_type='full').fit(X)

# n_components: max number of components
# covariance_type: shape of covariance matrix
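
To see that `n_components` acts only as an upper bound, one can inspect the fitted mixture weights and count how many components actually receive samples. The sketch below regenerates the same synthetic data so that it runs on its own; `max_iter=500` and `random_state=0` are assumptions added for repeatability and are not part of the cell above.

```python
import numpy as np
from sklearn import mixture

# Same synthetic two-component data as in the cells above.
np.random.seed(0)
n_samples = 500
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          .7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]

dpgmm = mixture.BayesianGaussianMixture(
    n_components=5, covariance_type='full',
    max_iter=500, random_state=0).fit(X)

# Mixture weights: components the model does not need get small weights.
print(np.round(dpgmm.weights_, 3))

# Components that have at least one sample assigned to them.
used = np.unique(dpgmm.predict(X))
print(len(used), "of", dpgmm.n_components, "components have samples assigned")
```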

In [None]:
def plot_results(X, Y_, means, covariances, index, title):
    plt.figure(figsize=(20,12))
    splot = plt.subplot(2, 1, 1 + index)
    for i, (mean, covar, color) in enumerate(zip(
            means, covariances, color_iter)):
        v, w = linalg.eigh(covar)
        v = 2. * np.sqrt(2.) * np.sqrt(v)
        u = w[0] / linalg.norm(w[0])
        # as the DP will not use every component it has access to
        # unless it needs it, we shouldn't plot the redundant
        # components.
        if not np.any(Y_ == i):
            continue
        plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)

        # Plot an ellipse to show the Gaussian component
        angle = np.arctan(u[1] / u[0])
        angle = 180. * angle / np.pi  # convert to degrees
        # Pass angle by keyword: recent matplotlib releases no longer accept it positionally
        ell = mpl.patches.Ellipse(mean, v[0], v[1], angle=180. + angle, color=color)
        ell.set_clip_box(splot.bbox)
        ell.set_alpha(0.5)
        splot.add_artist(ell)

    plt.xlim(-9., 5.)
    plt.ylim(-3., 6.)
    plt.xticks(())
    plt.yticks(())
    plt.title(title)


In [None]:
plot_results(X, dpgmm.predict(X), dpgmm.means_, dpgmm.covariances_, 1,
             'Bayesian Gaussian Mixture with a Dirichlet process prior')