# Survey of Unsupervised Learning Methods

In this survey of learning methods, we'll look at implementations of the following algorithms.

- K-Means
- DBSCAN
- Principal Component Analysis

## Preparing the Environment

To perform these machine learning tasks, we'll make use of the following libraries and their dependencies.

- scikit-learn (imported as `sklearn`) - a machine learning library
- seaborn - a data visualization library
- pandas - a library providing the data structures in which we'll store our data

In [None]:
import sys
!{sys.executable} -m pip install scikit-learn seaborn pandas

We'll also configure plotting.  First we ensure that generated plots appear in the notebook itself.

In [None]:
%matplotlib inline

Next, we set the figure size for plots.

In [None]:
import seaborn as sns

sns.set(rc={'figure.figsize':(12,8)})

## Lab

For this lab, we'll work with the [Iris Flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) again. Though there are existing categories, we'll try to determine clusters using the K-means and DBSCAN algorithms.

### Loading the Data

We start by loading the data.  This data set is among the example datasets included with Seaborn.

In [None]:
iris_data = sns.load_dataset('iris')
iris_data.head()

Recall that there are 150 rows of data; each of the three species has 50 sets of measurements.

In [None]:
len(iris_data)

### Explore the Data

Before applying the clustering algorithms, it's helpful to explore the data in order to get an understanding of the input values.  An understanding of the data will help to evaluate the models produced by the algorithms and determine how well they perform.

Looking at the pair-wise scatter plots, we see that there are at least two distinct clusters.

In [None]:
sns.pairplot(iris_data)

In terms of petal/sepal length and width, our data is four-dimensional. Looking at two-dimensional "slices" doesn't show us whether obvious groups of points exist in higher dimensions.  While we cannot easily visualize four-dimensional data, we can visualize three-dimensional data by creating scatter plots of three of the four dimensions.  There are four ways of choosing three dimensions from the four.

In [None]:
import itertools
input_data = iris_data.drop('species', axis=1)
three_dimensions = list(itertools.combinations(input_data.columns, 3))
three_dimensions
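
As a quick sanity check on that count, the number of ways to choose k items from n is the binomial coefficient, which the standard library exposes as `math.comb`. The sketch below hard-codes column names that are assumed to match the four iris measurement columns, so that it runs without loading the data.

```python
import itertools
import math

# Column names assumed to match the four iris measurement columns.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Enumerate every way of choosing three of the four columns.
triples = list(itertools.combinations(columns, 3))

# The enumeration should agree with the binomial coefficient C(4, 3).
print(len(triples), math.comb(len(columns), 3))
```

The same check applies to pairs of columns: `math.comb(4, 2)` gives the number of two-dimensional scatter plots.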

The following code creates four three-dimensional scatter plots.  Note that we use the `%matplotlib notebook` command to make these interactive. You might have to run the cell twice to see the plots.

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

for column_names in three_dimensions:
    x_dim, y_dim, z_dim = column_names
    
    # extract data from DataFrame
    x_data = input_data[x_dim]
    y_data = input_data[y_dim]
    z_data = input_data[z_dim]

    # create figure with 3d projection
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    
    # plot the data
    ax.scatter(x_data, y_data, z_data)
    
    # set axes labels
    ax.set_xlabel(x_dim)
    ax.set_ylabel(y_dim)
    ax.set_zlabel(z_dim)
    plt.show()

We'll return to inlining plots.

In [None]:
%matplotlib inline

### K-Means

We start by loading the *KMeans* class and creating an instance specifying the number of desired clusters.  We use the *fit()* method to find clusters.

In [None]:
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=3)
clusters.fit(input_data)
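
To give a sense of what *fit()* is doing, here is a minimal pure-Python sketch of the standard K-means iteration (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. This is not scikit-learn's implementation; the `kmeans` function, the toy points, and the naive initialization from the first `k` points are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Toy Lloyd's algorithm; returns (labels, centers)."""
    # Naive initialization: use the first k points as the starting centers.
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers


# Two well-separated groups of hypothetical 2-D points.
toy_points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
toy_labels, toy_centers = kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centers)
```

Real implementations typically use smarter initialization and stop once the centers no longer move, but the two alternating steps are the core of the method.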

We can see which cluster the algorithm assigned each row of data to using the *labels_* property.

In [None]:
clusters.labels_

We can compare cluster labels with the species labels.

In [None]:
for i, label in enumerate(clusters.labels_):
    print(label, iris_data.iloc[i].species)
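
Printing 150 pairs is hard to scan. One way to summarize the comparison is to count how often each (cluster, species) pair occurs. The sketch below uses `collections.Counter` on short, made-up label lists (hypothetical values, not the output of the cell above) so that it runs on its own; with the real data, `pd.crosstab(clusters.labels_, iris_data.species)` should produce a similar table.

```python
from collections import Counter

# Hypothetical cluster labels and species names, for illustration only.
example_clusters = [1, 1, 0, 0, 2, 0, 2, 2]
example_species = ["setosa", "setosa", "versicolor", "versicolor",
                   "versicolor", "virginica", "virginica", "virginica"]

# Count how many rows fall into each (cluster, species) combination.
pair_counts = Counter(zip(example_clusters, example_species))

for (cluster, species), count in sorted(pair_counts.items()):
    print(f"cluster {cluster} / {species}: {count}")
```

Keep in mind that K-means cluster numbers are arbitrary, so what matters is how cleanly each species falls into a single cluster, not which number that cluster received.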

We can create scatter plots and use the cluster labels to color the markers.

In [None]:
plot_data = input_data.copy()
plot_data['cluster'] = clusters.labels_
sns.pairplot(data=plot_data, hue='cluster', vars=input_data.columns)

Compare this with the known classifications.

In [None]:
sns.pairplot(data=iris_data, hue='species')

We can see the location of each cluster's centroid.

In [None]:
clusters.cluster_centers_

To visualize these, we first create a DataFrame with the data.

In [None]:
import pandas as pd
centers = pd.DataFrame(clusters.cluster_centers_, columns=input_data.columns)
centers
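
Each K-means centroid is the mean of the points assigned to its cluster. The sketch below illustrates that relationship on a handful of hypothetical points and labels (assumptions, not values from the iris data). With the real data, something like `plot_data.groupby('cluster').mean()` should closely match the values in `centers`.

```python
from statistics import fmean

# Hypothetical 2-D points and their cluster labels, for illustration only.
sample_points = [(1.0, 2.0), (3.0, 4.0), (10.0, 10.0), (12.0, 14.0)]
sample_labels = [0, 0, 1, 1]

# Group points by label, then average each coordinate within a group.
centroids = {}
for label in sorted(set(sample_labels)):
    members = [p for p, lab in zip(sample_points, sample_labels) if lab == label]
    centroids[label] = tuple(fmean(dim) for dim in zip(*members))

print(centroids)
```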

We can create pair-wise scatter plots showing the location of the computed clusters' centers and coloring based on species.  There are six ways of choosing two dimensions from four.

In [None]:
two_dimensions = itertools.combinations(input_data.columns, 2)
for column_names in two_dimensions:
    x_dim, y_dim = column_names
    
    facet = sns.lmplot(x=x_dim, y=y_dim, hue="species", data=iris_data, fit_reg=False)
    
    # plot centers
    facet.ax.plot(centers[x_dim], centers[y_dim], 'ok')

### DBSCAN

To use the DBSCAN algorithm, we start by importing the necessary class, creating an instance, and fitting the data.  Note that we do not specify the desired number of clusters for the DBSCAN algorithm.

In [None]:
from sklearn.cluster import DBSCAN
clusters = DBSCAN()
clusters.fit(input_data)
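
Instead of a cluster count, DBSCAN is controlled by a neighborhood radius and a minimum neighbor count (`eps` and `min_samples` in scikit-learn). The following is a simplified pure-Python sketch of the idea, not scikit-learn's implementation: points with enough neighbors are core points, clusters grow outward from core points, and anything unreachable is labeled -1 as noise. The `dbscan` function and the toy points are assumptions made for illustration.

```python
import math


def dbscan(points, eps, min_samples):
    """Toy DBSCAN; returns one label per point, with -1 meaning noise."""
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


# Two tight hypothetical groups plus one isolated point.
demo_points = [(0.0, 0.0), (0.0, 0.3), (0.3, 0.0),
               (5.0, 5.0), (5.0, 5.3), (5.3, 5.0),
               (10.0, 0.0)]
demo_labels = dbscan(demo_points, eps=0.5, min_samples=3)
print(demo_labels)
```

The number of clusters falls out of the density of the data and the two parameters, which is why it is not passed in up front.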

We can see which cluster the algorithm assigned each row of data to using the *labels_* property.

In [None]:
clusters.labels_

As before, we can display the cluster number and corresponding species for each row of data for comparison.

In [None]:
for index, value in enumerate(clusters.labels_):
    print(value, iris_data.iloc[index].species)

Unlike K-means, DBSCAN does not compute clusters based on center points.  To get a visual sense of how the clusters compare to the original data, we can create pair-wise scatter plots using the DBSCAN results for marker hues. We first add the results as a new column to the data.

In [None]:
results = input_data.copy()
results["cluster"] = clusters.labels_
results.head()

In [None]:
two_dimensions = itertools.combinations(input_data.columns, 2)
for column_names in two_dimensions:
    x_dim, y_dim = column_names    
    sns.lmplot(x=x_dim, y=y_dim, hue="cluster", data=results, fit_reg=False)

While the algorithm assigned three distinct labels, the label -1 marks points that DBSCAN treats as noise rather than as a cluster, and the remaining clusters don't correspond to the known species classifications.

### Principal Component Analysis

In the iris data set, there are four dimensions of input (independent) data that can be used to analyze the data.  While we have been able to get meaningful results using all four dimensions, it would be simpler to work with fewer dimensions (and easier to visualize).  Principal Component Analysis (PCA) reduces the number of dimensions by projecting the data onto a small number of new axes, the principal components, chosen to capture as much of the data's variance as possible.

To perform PCA, we use the *PCA* class.

In [None]:
from sklearn.decomposition import PCA
reduced_data = PCA(n_components=2).fit_transform(input_data)

Using *fit_transform()* transforms the source data into values in the desired number of dimensions.  To plot the transformed data, we can create a DataFrame and copy the original species data.

In [None]:
reduced_data = pd.DataFrame(reduced_data, columns=['PC1', 'PC2'])
plot_data = reduced_data.copy()
plot_data['species'] = iris_data.species
sns.lmplot(x='PC1', y='PC2', data=plot_data, hue='species', fit_reg=False)
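
To make the idea of a direction of greatest variance concrete, here is a minimal sketch that finds the first principal component of 2-D toy data using power iteration on the covariance matrix. This is an illustrative approach, not how scikit-learn computes PCA (its documentation describes SVD-based solvers); the function name and the data are assumptions.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the direction of greatest variance via power iteration."""
    n, d = len(rows), len(rows[0])
    # Center the data: PCA works on deviations from the column means.
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Repeatedly multiply a vector by the covariance matrix and renormalize;
    # it converges toward the eigenvector with the largest eigenvalue.
    vec = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


# Hypothetical points scattered around the line y = x.
toy_rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
pc1 = first_principal_component(toy_rows)
print(pc1)  # both components should be close to 0.707 in magnitude
```

Projecting each centered row onto this direction gives a one-dimensional summary of the data, which is the same kind of reduction that *fit_transform()* performs for the requested number of components.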

**Save the notebook file and submit it on Blackboard for this week's lab.**

## Additional Resources

- [Comparing different clustering algorithms on toy datasets](https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html)

## Exercise

Use K-means to find three clusters of the reduced-dimensionality data created using the PCA algorithm, `reduced_data`.  Create a scatter plot with marker coloring based on the clusters found using K-means.  **Save the notebook and submit it for this week's exercise.**