# Unsupervised learning

**Unsupervised Learning** addresses a different sort of problem. Here the data has no labels,
and we are interested in finding similarities between the objects in question. In a sense,
you can think of unsupervised learning as a means of discovering labels from the data itself.  

Unsupervised learning comprises tasks such as *dimensionality reduction*, *clustering*, and
*density estimation*. For example, in the iris data discussed above, we can use unsupervised
methods to determine combinations of the measurements which best display the structure of the
data. As we'll see below, such a projection of the data can be used to visualize the
four-dimensional dataset in two dimensions. 

## Dimensionality reduction with PCA

Principal Component Analysis (PCA) is a dimensionality reduction technique that finds the linear combinations of variables that explain the most variance.
Consider the iris dataset. It has 4 features, so it cannot be visualized in a single 2D plot. We are going to extract 2 combinations of the sepal and petal dimensions to visualize it:

In [None]:
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

# Matplotlib 3.6+ names this style 'seaborn-v0_8-poster'; older versions use 'seaborn-poster'
plt.style.use('seaborn-v0_8-poster')
%matplotlib inline

In [None]:
iris = datasets.load_iris()
# Feature matrix (150 samples x 4 measurements) and the species labels
X, y = iris.data, iris.target
print("The dataset shape:", X.shape)

Using PCA, we can reduce the dimensionality from 4 to 2 and visualize the result.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)
print("Reduced dataset shape:", X_reduced.shape)

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='RdYlBu')
plt.xlabel('First component')
plt.ylabel('Second component')
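
For intuition about what the projection involves, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper `pca_project` and the toy data `X_toy` are hypothetical names introduced only for this example.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its top principal components (illustrative sketch)."""
    # PCA works on centered data: subtract the per-feature mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance explained by each direction, as a fraction of the total
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ components.T, ratio[:n_components]

# Hypothetical toy data: 200 points in 4D whose variance lies mostly in 2 directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * [5.0, 2.0]
mixing = rng.normal(size=(2, 4))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

X_toy_2d, ratio = pca_project(X_toy, n_components=2)
print("Projected shape:", X_toy_2d.shape)
print("Fraction of variance kept:", ratio.sum())
```

Because the toy data is built from two latent directions plus a little noise, two components should retain nearly all of its variance.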

## Clustering with K-means

K-means is an algorithm for **unsupervised clustering**: that is, finding clusters in data based on the data attributes alone (not the labels).

K-means is a relatively easy-to-understand algorithm. It searches for cluster centers such that each center is the mean of the points assigned to it, and every point is closer to its own center than to any other.
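
The description above corresponds to alternating two steps: assign each point to its nearest center, then move each center to the mean of its points. The following is a minimal NumPy sketch of that loop (commonly called Lloyd's algorithm). It is not scikit-learn's `KMeans`, which by default uses a more careful initialization; `kmeans_sketch` and the toy data `X_blobs` are hypothetical names used only here.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Alternate assignment and update steps until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # Simple initialization (an assumption): use distinct data points as centers
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points is left where it is)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two well-separated 2D blobs
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X_blobs = np.vstack([blob_a, blob_b])

centers, labels = kmeans_sketch(X_blobs, n_clusters=2)
print("Cluster centers:\n", centers)
```

Note that the cluster numbering is arbitrary: which blob gets label 0 depends on the initialization, which is why predicted cluster indices need not match the true class indices.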

Let's look at how K-means operates on the Iris dataset we looked at previously. To emphasize that this is unsupervised, we color the points by the cluster K-means assigns to them rather than by the true species labels:

### Train K-means

In [None]:
from sklearn.cluster import KMeans

In [None]:
k_means = KMeans(n_clusters=3, random_state=2)
k_means.fit(X)
y_pred = k_means.predict(X)

plt.figure(figsize=(10,8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred, cmap='RdYlBu')
plt.xlabel('First component')
plt.ylabel('Second component')

## Exercise

Visualization is just one use of PCA. Sometimes we have high-dimensional data and want to use PCA to reduce its dimensionality while keeping a certain amount of the information in the transformed data.

In this exercise, please use PCA on the Iris data and keep the components that explain 95% of the variance of the original data.

In [None]:
## Your solution


In [None]:
%load ../solutions/solution_04.py