# PC Lab 8: Unsupervised Learning
---

_Unsupervised learning_ is the branch of machine learning in which no response variable $y$ is available. Unsupervised learning techniques are most often used for exploratory purposes or as a preprocessing step in a supervised context. Unsupervised learning is more prone to subjectivity because results are harder (or even impossible) to validate, so the results should be interpreted with care. Those interested can have a look at the paper ["Clustering: Science or Art"](http://proceedings.mlr.press/v27/luxburg12a/luxburg12a.pdf), which summarizes a number of criticisms and gives some pointers concerning the evaluation of clustering algorithms.

In this PC lab we will look at two frequently applied unsupervised learning techniques, namely principal components analysis and k-means clustering. We will end with a general scheme in which both techniques are combined.

![unsupervised](https://analystprep.com/study-notes/wp-content/uploads/2021/03/Img_12.jpg)

## 1. Principal components analysis for dimensionality reduction

![gaussianscatterpca](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/800px-GaussianScatterPCA.svg.png)

A popular area of unsupervised learning is _Dimensionality Reduction_, in which one tries to reduce the number of variables for visualization purposes or as a preprocessing step for clustering or classification/regression techniques. An established technique that you will encounter in most statistics courses is _Principal Components Analysis_ (PCA).

Assume a _normalized_ $n\times p$ data matrix $\mathbf{X}$. 
    
#### **Goal:** find the direction in $\mathbf{X}$ with the largest variance (i.e., the most information). 

In other words, we need to find a linear combination of the inputs:

$$ z_{i1} = \phi_{11}x_{i1}+\phi_{21}x_{i2}+\ldots+\phi_{p1}x_{ip},$$

where the coefficients $\phi_{11},\ldots,\phi_{p1}$ are called the loadings in PCA nomenclature, such that the variance is maximized:

$$\text{maximize}_{\phi_{11},\ldots,\phi_{p1}}\Big\{\frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{p}\phi_{j1}x_{ij}\Big)^{2}\Big\}\quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2}=1.$$
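
The optimization problem above can be checked numerically. The following is a minimal sketch, assuming synthetic data (the names `X_demo`, `phi_1` and `z_1` are made up for this illustration). It takes the first right singular vector of the normalized data matrix as the loading vector and compares the variance along it with the variance along a random unit-norm direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 3 correlated features
n, p = 200, 3
latent = rng.normal(size=(n, 1))
X_raw = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.3 * rng.normal(size=(n, p))

# Normalize: zero mean and unit variance per column
X_demo = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The right singular vectors of X are the loading vectors; the first one
# maximizes the variance of the scores subject to the unit-norm constraint
_, singular_values, Vt = np.linalg.svd(X_demo, full_matrices=False)
phi_1 = Vt[0]                 # loadings phi_11, ..., phi_p1
z_1 = X_demo @ phi_1          # scores z_11, ..., z_n1

var_z1 = np.mean(z_1 ** 2)    # the objective (1/n) * sum_i z_i1^2

# Any other unit-norm direction should give a variance that is no larger
random_dir = rng.normal(size=p)
random_dir /= np.linalg.norm(random_dir)
var_random = np.mean((X_demo @ random_dir) ** 2)

print("first loading vector:", phi_1)
print("variance along phi_1:", var_z1)
print("variance along a random direction:", var_random)
```

The sign of `phi_1` is arbitrary: flipping all loadings gives the same variance.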

Those interested in an informal explanation of PCA can check out this intuitive ['dining table-tale'](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues).

<div class="alert alert-success">

<b>EXERCISE 1.1</b>: 
**a) Have a look at the [Iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) dataset. Reduce the dataset using PCA and visualize its first two components using a scatterplot. Don't forget to preprocess your data. Do you see distinctive groups?
b) How much variance is captured in the first three components?**
</div>

In [2]:
import numpy as np

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris, load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale
from sklearn.utils import resample

import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import display, HTML
from IPython.display import Image

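# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')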
%matplotlib inline

In [3]:
#Preprocessing: 
iris = load_iris()
X_train = iris.data
labels = iris.target

In [None]:
##1a): 
#http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html


In [None]:
##1b): 

## 2. K-means clustering

K-means clustering aims to partition the data into $K$ clusters such that the within-cluster variation is minimized:

$$ \text{minimize}_{C_{1},\ldots,C_{K}} \Big\{ \sum_{k=1}^{K}W(C_{k})\Big\},$$

where the most popular choice for $W(C_{k})$ is based on the squared Euclidean distance between all pairs of samples in the cluster:

$$W(C_{k})=\frac{1}{|C_{k}|}\sum_{i,i'\in C_{k}}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^{2}.$$
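
A useful property of this definition is that the pairwise form of $W(C_{k})$ equals twice the sum of squared distances of the samples to the cluster centroid, which is why the algorithm works with cluster means. The sketch below checks this numerically on a small made-up cluster (the names `cluster`, `W_pairwise` and `W_centroid` are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(1)
cluster = rng.normal(size=(6, 2))   # hypothetical cluster C_k: 6 samples, 2 features

# W(C_k) as defined above: squared distances summed over all ordered pairs (i, i'),
# divided by the cluster size
diffs = cluster[:, None, :] - cluster[None, :, :]   # shape (6, 6, 2)
W_pairwise = np.sum(diffs ** 2) / len(cluster)

# Equivalent form: twice the sum of squared distances to the centroid
centroid = cluster.mean(axis=0)
W_centroid = 2 * np.sum((cluster - centroid) ** 2)

print(W_pairwise, W_centroid)
```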

K-means clustering uses the following three steps, of which steps two and three are repeated until convergence is reached: 

1) The first step chooses the initial centroids; the easiest way of doing this is by choosing K samples at random from the dataset. 

2) In the second step each element of the dataset is assigned to its nearest centroid. 

3) New centroids are computed by taking the mean of all samples assigned to each previous centroid. 
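
The three steps above can be sketched in plain NumPy. This is an illustrative, unoptimized version and not the implementation used by scikit-learn. The function name `kmeans_sketch`, its parameters and the two synthetic blobs are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, K, n_iter=100, seed=0):
    """Plain K-means following the three steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct samples as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[assignment == k].mean(axis=0) if np.any(assignment == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return assignment, centroids

# Two well-separated synthetic blobs (made-up data for illustration)
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X_blobs = np.vstack([blob_a, blob_b])

assignment, centroids = kmeans_sketch(X_blobs, K=2)
print(centroids)
```

Because the initial centroids are random, different seeds can lead to different local optima on harder data, which is why library implementations typically run several initializations.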

<div class="alert alert-success">

<b>EXERCISE 2.1</b>: 
**Cluster the Iris dataset by means of 2-means and 3-means clustering. Compare the clustering results by visualizing the data in the space induced by the first two principal components.**
</div>

## 3. Combining unsupervised techniques

Often you will find that several unsupervised techniques are combined when exploratory analyses are conducted. This is typically the case when the number of variables is high and you might suffer from the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). In these cases, the approaches laid out above can be combined using the following scheme, which can be adapted to your research question:  

1) Compute the principal components using PCA; 

2) Select a reduced number of components based on the explained variance; 

3) Search for a meaningful number of clusters K; 

4) Cluster your data using these final settings. 
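
The scheme above could look as follows on a small synthetic dataset. This is a sketch under stated assumptions: the data come from `make_blobs`, the 90% variance threshold is arbitrary, and the mean silhouette score is used as one possible criterion for choosing K; other criteria may suit your research question better. All variable names are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale

# Synthetic stand-in data (an assumption): 300 samples, 20 features, 4 true groups
X_syn, _ = make_blobs(n_samples=300, n_features=20, centers=4,
                      cluster_std=1.0, random_state=0)
X_syn = scale(X_syn)

# 1) Compute the principal components
pca = PCA().fit(X_syn)

# 2) Keep the smallest number of components reaching 90% explained variance
#    (the 90% threshold is an arbitrary choice for this sketch)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_comp = int(np.argmax(cum_var >= 0.90)) + 1
X_red = PCA(n_components=n_comp).fit_transform(X_syn)

# 3) Search for a meaningful K, here by the mean silhouette score
scores = {}
for k in range(2, 9):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_red)
    scores[k] = silhouette_score(X_red, labels_k)
best_k = max(scores, key=scores.get)

# 4) Cluster with the final settings
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_red)
print("components kept:", n_comp, "| chosen K:", best_k)
```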

We will use this approach to analyze a more challenging dataset, called the [`digits`-dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html). This dataset consists of images of the handwritten digits 0-9, which have been preprocessed into feature vectors of length 64. 

<div class="alert alert-success">

<b>EXERCISE 3</b>: 
**Apply the approach illustrated above to the digits dataset. Store and compare the components which explain 50% and 90% of the variance. Choose an 'optimal' number of clusters. What do you think of the result?**
</div>

In [None]:
digits = load_digits()
X = scale(digits.data)
y = digits.target

In [None]:
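def return_noc_pca(var, threshold):
    """Return the smallest number of components whose cumulative
    explained variance ratio (the entries of `var`) reaches `threshold`."""
    sumvar = 0.
    i = 0
    # Stop at len(var) so that a threshold that is never reached cannot raise an IndexError
    while sumvar < threshold and i < len(var):
        sumvar += var[i]
        i += 1
    return i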
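
The loop in `return_noc_pca` can also be written with a cumulative sum. The following is a hypothetical vectorized variant (the name `noc_from_cumsum` and the toy variance ratios are made up for this illustration), shown as a sketch rather than a replacement.

```python
import numpy as np

def noc_from_cumsum(var, threshold):
    """Hypothetical vectorized variant: number of leading components whose
    cumulative variance ratio first reaches `threshold`."""
    cum_var = np.cumsum(var)
    # First index where cum_var >= threshold, converted to a count;
    # capped at len(var) in case the threshold is never reached
    return min(int(np.searchsorted(cum_var, threshold)) + 1, len(var))

# Made-up explained variance ratios, for illustration only
toy_var = np.array([0.5, 0.3, 0.15, 0.05])
print(noc_from_cumsum(toy_var, 0.5), noc_from_cumsum(toy_var, 0.9))  # prints: 1 3
```

In practice `var` would be the `explained_variance_ratio_` attribute of a fitted `PCA` object.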