# Dimensionality Reduction using PCA, t-SNE, and UMAP <a id="title"></a>

This notebook assumes you are familiar with basic machine learning vocabulary.

---

## Table of Contents
[Introduction](#intro) <br>
[0. Imports](#imports) <br>
[1. MNIST dataset and scaling](#mnist) <br>
[2. Principal Component Analysis (PCA)](#pca) <br>
- [2a. Fit, transform, and visualize using training set](#pca_train) <br>
- [2b. Transform and visualize test set](#pca_test) <br>
- [2c. Variances and the inverse function](#pca_inverse) <br>
- [2d. Fit and transform to 28 PCs](#pca_28) <br>
- [2e. Fit and transform to 95% variance](#pca_95) <br>

[3. t-distributed Stochastic Neighbor Embedding (t-SNE)](#tsne) <br>
- [3a. Fit, transform, and visualize using training set](#tsne_train) <br>
- [3b. Find an anomaly](#anomaly) <br>
- [3c. Fit, transform, and visualize using 28 PCs](#tsne_28) <br>

[4. Uniform Manifold Approximation and Projection (UMAP)](#umap) <br>
- [4a. Fit, transform, and visualize using training set](#umap_train) <br>
- [4b. Transform and visualize test set](#umap_test) <br>
- [4c. Inverse function](#umap_inverse) <br>
- [4d. UMAP using 28 PCs exercises](#exercises) <br>

[5. Conclusions](#con) <br>
[Appendix: TSVD](#append) <br>
[Additional Resources](#add) <br>
[About this Notebook](#about) <br>
[Citations](#cite) <br>

## Introduction <a id="intro"></a>

Analyzing and understanding high dimensional data is difficult, but we are much better equipped to analyze and understand 2-D and 3-D data. Dimensionality reduction is a family of machine learning techniques that reduce high dimensional data to a low dimensional representation. We can use dimensionality reduction to look for patterns in data and better understand its overall structure, which is otherwise hard to see. By understanding the data's structure in a low dimensional space, we can make inferences about the data in the high dimensional space, such as which samples are similar and which are dissimilar. In addition, machine learning on high dimensional data is computationally expensive. By working in a low dimensional space, we can train models quickly without using every feature available to us.

**The purpose of this notebook is to demonstrate principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) as dimensionality reduction techniques on the MNIST dataset.**

## 0. Imports <a id="imports"></a>

We use `numpy` for arrays, `matplotlib` for plotting, `tensorflow` for getting MNIST data, `sklearn` for PCA and t-SNE, and `umap` for UMAP.

If you do not have some of the packages, please follow the conda installation guides:
- [Numpy](https://numpy.org/install/)
- [Matplotlib](https://matplotlib.org/stable/users/installing/index.html)
- [Tensorflow](https://docs.anaconda.com/anaconda/user-guide/tasks/tensorflow/)
- [Scikit-learn](https://scikit-learn.org/stable/install.html)
- [UMAP](https://umap-learn.readthedocs.io/en/latest/)

In [None]:
from time import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
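import sklearn.decomposition  # PCA (a bare `import sklearn` does not reliably load submodules)
import sklearn.manifold  # t-SNE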
from umap import UMAP

## 1. MNIST dataset and scaling <a id="mnist"></a>

MNIST is a popular image data set of handwritten digits from 0 to 9. We use it to showcase the different dimensionality reduction techniques. Here are some qualities of the dataset:
- 60k training samples
- 10k testing samples
- 10 classifications (digits 0-9)
- 28x28 images
- 8-bit gray scaled (0-255 pixel values)

Why is MNIST such a good dataset to use for learning ML? 
- Relatively small images (only 784 features)
- Relatively large dataset (70k samples)
- 10 unique, well-defined labels (all the digits are clearly different from each other)
- Very clean dataset (little noise)
    - background pixels are 0 and signal pixels are nearly 255, so pixel intensities are approximately binary
    - the digits are well centered, meaning pixels for similar parts of a digit should consistently be in the same vicinity
    
We retrieve our data using `tensorflow`.

In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

We define some global variables and min-max normalize the images so pixels range between 0 and 1 (normalizing data is a common practice in machine learning). We also flatten our 28x28 images into 784-feature arrays, in which each pixel is a feature.

In [None]:
# Global variables
x_train_size = x_train.shape[0]
x_test_size = x_test.shape[0]
x_length = x_train.shape[1]
norm = x_train.max()

# Scale images
x_train_scale = x_train / norm
x_test_scale = x_test / norm

# Flatten arrays
x_train_scale_flat = x_train_scale.reshape(x_train_size, x_length ** 2)
x_test_scale_flat = x_test_scale.reshape(x_test_size, x_length ** 2)
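
As a quick sanity check of the scaling and flattening steps, here is a minimal sketch on a small synthetic array. The random `images` array is a hypothetical stand-in for MNIST, and all variable names are chosen so they do not clash with the notebook's own.

```python
import numpy as np

# Hypothetical stand-in for MNIST: five random 8-bit "images" of size 28x28
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
images[0, 0, 0], images[0, 0, 1] = 0, 255  # make sure both extremes are present

# Min-max scale to [0, 1] (the minimum is 0, so dividing by the max is enough)
scaled = images / images.max()

# Flatten each 28x28 image into one row of 784 features
flat = scaled.reshape(images.shape[0], -1)

# Flattening is lossless: reshaping back recovers the scaled images
restored = flat.reshape(-1, 28, 28)

print(flat.shape, flat.min(), flat.max())
```

Using `-1` in `reshape` lets NumPy infer the feature count, which avoids hard-coding 784.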

Here are the first 16 samples in the training set.

In [None]:
fig, axs = plt.subplots(4,4,figsize=[10,10])
for i in range (4):
    for j in range (4):
        axs[i,j].imshow(x_train_scale[i*4+j])
plt.tight_layout()

## 2. Principal Component Analysis (PCA) <a id="pca"></a>

Principal component analysis [(PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) is usually the first method to try for dimensionality reduction because it is simple and linear. It is built on linear algebra, specifically singular value decomposition (SVD). The goal of PCA is to find the axes of greatest variation and project the data onto those axes, which conserves global structure. Because it relies only on linear algebra, the embedding is essentially deterministic: results generally do not change between runs. It is also exceptionally fast, which is why practitioners often use it first to explore data. In addition, fitting more components doesn't change the results of previous components, i.e., a 3-D reduction is the 2-D reduction with an added dimension. Furthermore, axes are fit in order of decreasing variation: the first principal component holds the most variation, the second holds the second most, and so on, so each added component contributes less variance than the one before. Here are some more complete resources:

- [Stat Quest Main Ideas Video (6 minutes, simple language)](https://www.youtube.com/watch?v=HMOI_lkzW08)
- [Stat Quest Detailed Video (22 minutes, simple language)](https://www.youtube.com/watch?v=FgakZw6K1QQ&t=231s)
- [Computerphile Video (20 minutes, intermediate language)](https://www.youtube.com/watch?v=TJdH6rPA-TI)
- [Steve Brunton Video (14 minutes, advanced language)](https://www.youtube.com/watch?v=fkf4IBRSeEc&t=645s)
- [A Tutorial on Principal Component Analysis Paper](https://arxiv.org/pdf/1404.1100.pdf)
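
To make the link between PCA and SVD concrete, here is a minimal NumPy sketch on toy data. It is an illustration under simplifying assumptions (a small dense matrix, the hypothetical helper `pca_svd`), not how `scikit-learn` implements PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (not MNIST): 200 samples with 5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

def pca_svd(X, n_components):
    """Project X onto its top principal axes using SVD (illustrative helper)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S ** 2 / (len(X) - 1)
    scores = X_centered @ Vt[:n_components].T
    return scores, explained_variance[:n_components]

scores_2d, var_2d = pca_svd(X, 2)
scores_3d, var_3d = pca_svd(X, 3)

# The 3-D reduction is the 2-D reduction plus one extra column
print(np.allclose(scores_2d, scores_3d[:, :2]))
# Variance per component, in non-increasing order
print(var_3d)
```

The two checks correspond to the two properties described above: earlier components are unaffected by adding more, and each added component carries no more variance than the previous one.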


There is also functional PCA [(FPCA)](https://en.wikipedia.org/wiki/Functional_principal_component_analysis), which can be used for time series. Instead of building eigenvectors to project data points to, FPCA builds eigenfunctions to project time series to. Here's the Python implementation if it's of interest: [`scikit-fda`](https://fda.readthedocs.io/en/latest/auto_examples/plot_fpca.html#sphx-glr-auto-examples-plot-fpca-py).

### 2a. Fit, transform, and visualize using training set <a id="pca_train"></a>

We use [scikit-learn for PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). We reduce to the first two principal components for simplicity. In addition, we set `whiten=True`, which rescales each principal component to unit variance so that the components are uncorrelated and share a common scale.

In [None]:
pca = sklearn.decomposition.PCA(n_components=2, whiten=True)
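
To see what `whiten=True` changes, here is a small sketch on synthetic data (an assumption for illustration, not MNIST). It compares the per-component variance of the transformed data with and without whitening.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data whose features have very different scales
X_toy = rng.normal(size=(500, 6)) * np.array([10, 5, 2, 1, 0.5, 0.1])

plain = PCA(n_components=2, whiten=False).fit_transform(X_toy)
white = PCA(n_components=2, whiten=True).fit_transform(X_toy)

# Without whitening, each component keeps its own variance;
# with whitening, every component is rescaled to unit variance
print('Variance without whitening:', plain.var(axis=0, ddof=1))
print('Variance with whitening:   ', white.var(axis=0, ddof=1))
```

Whitening only rescales the axes; it does not change which samples are neighbors along a given component, and it does not make the data Gaussian.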

We fit and transform the data from the input space to the "PCA space", or the reduced space formed from PCA.

In [None]:
t0_pca = time()
pca_mnist_train = pca.fit_transform(x_train_scale_flat)
t1_pca = time()

print ('Time spent fitting model: {:.4f} seconds'.format(t1_pca-t0_pca))

Let's check the shape to make sure we have two principal components as features.

In [None]:
pca_mnist_train.shape

Now, we'll plot the training set in the PCA space, where the x-axis is the first principal component, the y-axis is the second principal component, and each data point represents an image.

In [None]:
plt.figure(figsize=[10,10])
plt.title('PCA using MNIST (training): unlabeled')
plt.scatter(pca_mnist_train[:, 0], pca_mnist_train[:, 1], s=5, alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2')

An interesting shape, but hard to tell if any meaningful groups formed. Let's color the plot by digit to get a better idea of where the digits are.

In [None]:
plt.figure(figsize=[10,10])
plt.title('PCA using MNIST (training): labeled')
for i in range (10):
    mask = y_train == i
    plt.scatter(pca_mnist_train[:, 0][mask], pca_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()

Similar digits clearly cluster together, but there is a good amount of overlap. Let's plot ten times by digit to get a better sense of the overlapping regions.

In [None]:
for i in range (10):
    # Plot digit
    plt.figure(figsize=[10,10])
    plt.title('PCA using MNIST (training): labeled {}'.format(i))
    mask = y_train == i
    plt.scatter(pca_mnist_train[:, 0][mask], pca_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
    # Plot transparent background
    for j in range (10):
        mask2 = y_train == j
        plt.scatter(pca_mnist_train[:, 0][mask2], pca_mnist_train[:, 1][mask2], 
                    s=1, alpha=0.1, color='C{}'.format(j))
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.legend()

We could be convinced that a simple classification/clustering algorithm on just two digits in the PCA space would yield good results, illustrating that the 784-D input space is reasonably well represented in the 2-D PCA space. The clusters are somewhat interpretable as well:
- 0 and 1 are the most distant clusters, which makes sense because they have no common features in the input space (0s are curvy and 1s are straight).
- 4, 7, and 9 overlap, which makes sense because all these digits contain a straight line and some feature at the top of the digit.
- 2, 3, 5, 6, and 8 all overlap, which makes sense because all these digits contain a curve and are generally more "complex" than other digits.
- 1 has a tight distribution, which makes sense because there are fewer degrees of freedom for drawing a 1.
- 2 and 5 have wide distributions, which makes sense because there are many ways to draw a 2 and a 5.

It's nice that some of these small mental checks of the PCA space and where samples are embedded with respect to their classification are understandable, i.e., nothing about the embedding is too surprising.
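
The claim that a simple classifier could separate two digits in a 2-D PCA space can be tried out with a short sketch. To keep it fast and self-contained, it assumes the small 8x8 digits dataset bundled with `scikit-learn` as a stand-in for MNIST, and logistic regression is just one possible choice of "simple" classifier.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 8x8 digits, restricted to the classes 0 and 1
X_small, y_small = load_digits(n_class=2, return_X_y=True)
X_small = X_small / 16.0  # pixel values in this dataset range from 0 to 16

X_fit, X_held, y_fit, y_held = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0, stratify=y_small)

# Fit PCA on the training split only, then classify in the 2-D PCA space
pca_small = PCA(n_components=2).fit(X_fit)
clf = LogisticRegression().fit(pca_small.transform(X_fit), y_fit)
accuracy = clf.score(pca_small.transform(X_held), y_held)

print('Held-out accuracy on 0 vs 1 in 2-D PCA space: {:.3f}'.format(accuracy))
```

Results on the full 28x28 MNIST images may differ, and pairs of digits that overlap in the plots above (for example 4 and 9) would be expected to score lower.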

### 2b. Transform and visualize test set <a id="pca_test"></a>

To see if this representation generalizes, we transform the test set.

In [None]:
pca_mnist_test = pca.transform(x_test_scale_flat)

Now we'll plot the test set in the PCA space, with the training set as a transparent background.

In [None]:
plt.figure(figsize=[10,10])
for i in range (10):
    plt.title('PCA using MNIST (test): labeled')
    mask = y_train == i
    plt.scatter(pca_mnist_train[:, 0][mask], pca_mnist_train[:, 1][mask], 
                s=1, alpha=0.1, color='C{}'.format(i))
    mask2 = y_test == i
    plt.scatter(pca_mnist_test[:, 0][mask2], pca_mnist_test[:, 1][mask2], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()

Again, there is some overlap so let's plot by digit to clearly see the samples.

In [None]:
for i in range (10):
    # Plot digit
    plt.figure(figsize=[10,10])
    plt.title('PCA using MNIST (test): labeled {}'.format(i))
    mask = y_test == i
    plt.scatter(pca_mnist_test[:, 0][mask], pca_mnist_test[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
    # Plot transparent background
    for j in range (10):
        mask2 = y_train == j
        plt.scatter(pca_mnist_train[:, 0][mask2], pca_mnist_train[:, 1][mask2], 
                    s=1, alpha=0.1, color='C{}'.format(j))
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.legend()

The digits in the test set are reduced near their associated labels, meaning the PCA space generalizes to unseen data.

### 2c. Variances and the inverse function <a id="pca_inverse"></a>

Remember, PCA finds the axes of most variance within the data's input space and projects the data onto those axes. We can calculate the total percentage of variance the PCA space has.

In [None]:
print ('First PC Variance: {:.4f}'.format(pca.explained_variance_ratio_[0]))
print ('Second PC Variance: {:.4f}'.format(pca.explained_variance_ratio_[1]))
print ('Total Variance: {:.4f}'.format(pca.explained_variance_ratio_.sum()))

The first principal component contains 9.7% of the data's variance, and the second principal component contains 7.1% of the data's variance. The total variance contained is 16.8%.

PCA also has an inverse transformation function that goes from the PCA space to original input space. The more variance contained, the better the inverse transformations are. Let's inverse transform the first 8 samples from the training set.

In [None]:
subset = 8
pca_mnist_inverse = pca.inverse_transform(pca_mnist_train[:subset]).reshape(subset, x_length, x_length)

Now, let's plot the input samples with the inverse samples.

In [None]:
fig, axs = plt.subplots(2, subset, figsize=[20, 5])
for i in range (subset):
    axs[0, i].imshow(x_train_scale[i])
    axs[1, i].imshow(pca_mnist_inverse[i])
plt.tight_layout()

Since the PCA space doesn't contain a lot of the data's variance, the inverse transforms are not that good. Most inverses do not look like their parent sample.
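
The relationship between retained variance and inverse quality can be quantified with a reconstruction error. The sketch below assumes the small 8x8 digits dataset from `scikit-learn` as a stand-in for MNIST, and uses mean squared error as one reasonable measure.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_demo, _ = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # scale pixels to [0, 1]

errors = {}
for k in (2, 8, 32):
    # svd_solver='full' requests the exact (non-randomized) decomposition
    model = PCA(n_components=k, svd_solver='full').fit(X_demo)
    X_back = model.inverse_transform(model.transform(X_demo))
    errors[k] = np.mean((X_demo - X_back) ** 2)
    print('{:2d} components: variance kept = {:.3f}, reconstruction MSE = {:.5f}'.format(
        k, model.explained_variance_ratio_.sum(), errors[k]))
```

As more components are kept, the retained variance rises and the reconstruction error falls, which is the behavior described above.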

Since we have an inverse function, we can also sample from the PCA space and inverse to generate "new digits", which will help us understand how MNIST is mapped onto the PCA space. 

First, we'll make a grid of points as our samples.

In [None]:
x_min, x_max = -2, 4
y_min, y_max = -3, 3
n = 10
pca_manifold = []
for i in np.linspace(y_max,y_min,n):
    for j in np.linspace(x_min,x_max,n):
        pca_manifold.append([j, i])
pca_manifold = np.array(pca_manifold)
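
The same grid can be built without explicit loops. The sketch below uses hypothetical names (`gx_min`, `steps`, and so on) so it does not overwrite the notebook's variables, and checks that `np.meshgrid` reproduces the loop ordering (rows run from the top of the plot to the bottom, columns from left to right).

```python
import numpy as np

gx_min, gx_max = -2, 4
gy_min, gy_max = -3, 3
steps = 10

# Loop version, mirroring the cell above: outer loop over y, inner loop over x
loop_grid = np.array([[px, py]
                      for py in np.linspace(gy_max, gy_min, steps)
                      for px in np.linspace(gx_min, gx_max, steps)])

# Vectorized version: meshgrid varies x along columns and y along rows
grid_x, grid_y = np.meshgrid(np.linspace(gx_min, gx_max, steps),
                             np.linspace(gy_max, gy_min, steps))
mesh_grid = np.column_stack([grid_x.ravel(), grid_y.ravel()])

print(np.allclose(loop_grid, mesh_grid))
```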

Next, we inverse transform the grid points to the high dimensional space.

In [None]:
pca_manifold_inverse = pca.inverse_transform(pca_manifold).reshape(n, n, x_length, x_length)

Finally, we can plot the grid with each "digit" representing the inverse of a grid point near those coordinates, e.g., (-2,-1) in PCA space is mapped onto (-2,-1) on the grid.

In [None]:
# Create image with all inverses
manifold = np.zeros((n*x_length, n*x_length))
for i in range (n):
    for j in range (n):
        manifold[i*x_length:(i+1)*x_length, j*x_length:(j+1)*x_length] = pca_manifold_inverse[i, j]

fig, axs = plt.subplots(1,2,figsize=[20,10])

# Plot training set in PCA space
axs[0].set_title('PCA using MNIST (training): labeled')
for i in range (10):
    mask = y_train == i
    axs[0].scatter(pca_mnist_train[:, 0][mask], pca_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
axs[0].set_xlabel('PC1')
axs[0].set_ylabel('PC2')
axs[0].set_xlim(x_min, x_max)
axs[0].set_ylim(y_min, y_max)
axs[0].legend()

# Plot inverse grid
axs[1].set_title('PCA using MNIST: manifold')
axs[1].imshow(manifold, extent=[x_min,x_max,y_min,y_max])
axs[1].set_xlabel('PC1')
axs[1].set_ylabel('PC2')

The inverse grid hardly looks realistic, but the map reinforces that the PCA space in general represents our data:

- on the bottom left (-2,-3), we have our 1s
- on the bottom right (4,-3), we have our 0s
- on the top left (-2,3), we have our 9s
- on the top right (4,3), we have a combination of a 0 and a 9.
    - the PCA space didn't have any samples originally there so the inverse function is trying its best to predict what might be there.

### 2d. Fit and transform to 28 PCs <a id="pca_28"></a>

In order to learn a better representation, let's try keeping the first 28 principal components and seeing how good the inverse transformation is. **Note: the author likes to use the square root of the number of features (especially for square images) as a starting point for low dimensional representation, which is why here we choose 28.**

In [None]:
pca_28 = sklearn.decomposition.PCA(n_components=28, whiten=True)

We fit and transform the samples to 28 principal components.

In [None]:
t0_pca_28 = time()
pca_28_mnist_train = pca_28.fit_transform(x_train_scale_flat)
t1_pca_28 = time()

print ('Time spent fitting model: {:.4f} seconds'.format(t1_pca_28-t0_pca_28))
print ('Total Variance: {:.4f}'.format(pca_28.explained_variance_ratio_.sum()))

Even after adding another 26 principal components, PCA is still really fast. The total amount of variance contained in the first 28 principal components is 71.6%, which is much higher than the 2-D representation. To show fitting more components doesn't change the results of the previous components, we plot a histogram of the differences between the first principal components of the 2-D and 28-D spaces.

In [None]:
plt.title('Difference Histogram')
plt.hist(pca_mnist_train[:, 0] - pca_28_mnist_train[:, 0], bins=100)
plt.xlabel('pca - pca_28')
plt.ylabel('Frequency')

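The differences are negligible (~$10^{-5}$), most likely due to floating point precision and, depending on the scikit-learn version, the approximate randomized SVD solver that `PCA` may select automatically for data of this size.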

Let's replot the same samples as before with their 28-component inverses.

In [None]:
subset = 8
pca_28_mnist_inverse = pca_28.inverse_transform(pca_28_mnist_train[:subset]).reshape(subset, x_length, x_length)

fig, axs = plt.subplots(2, subset, figsize=[20, 5])
for i in range (subset):
    axs[0, i].imshow(x_train_scale[i])
    axs[1, i].imshow(pca_28_mnist_inverse[i])
plt.tight_layout()

The inverses are a bit better and can for the most part be recognized as their parent sample. However, the inverses are a little blurry/noisy.

### 2e. Fit and transform to 95% variance <a id="pca_95"></a>

Let's try keeping the number of components that together contain 95% of the variance, which is a common heuristic. We can do this by passing the desired fraction as `n_components` in `PCA()`.

In [None]:
# Define model
pca_95 = sklearn.decomposition.PCA(n_components=0.95, whiten=True)

# Fit
pca_95_mnist_train = pca_95.fit_transform(x_train_scale_flat)
print ('Number of PCs: ', pca_95_mnist_train.shape[1])
print ('Total Variance: {:.4f}'.format(pca_95.explained_variance_ratio_.sum()))

# Evaluate inverses
subset = 8
pca_95_mnist_inverse = pca_95.inverse_transform(pca_95_mnist_train[:subset]).reshape(subset, x_length, x_length)

# Plot inverses
fig, axs = plt.subplots(2, subset, figsize=[20, 5])
for i in range (subset):
    axs[0, i].imshow(x_train_scale[i])
    axs[1, i].imshow(pca_95_mnist_inverse[i])
plt.tight_layout()

The 154-D inverses are almost identical to their parent samples, with just some noise around the digits.

Since we have a lot of components, we can plot the components by their ratios to see how much variance each component is adding.

In [None]:
fig, axs = plt.subplots(1,2,figsize=[20,10])

axs[0].set_title('PCA 154 Variance Plot')
axs[0].scatter(np.arange(pca_95_mnist_train.shape[1]), pca_95.explained_variance_ratio_)
axs[0].set_xlabel('Component')
axs[0].set_ylabel('Variance Ratio')

axs[1].set_title('PCA 154 Variance Plot (Log Scale)')
axs[1].scatter(np.arange(pca_95_mnist_train.shape[1]), pca_95.explained_variance_ratio_)
axs[1].set_xlabel('Component')
axs[1].set_ylabel('Variance Ratio')
axs[1].set_yscale('log')

We can see that as components get added, the amount of variance they're adding decreases.
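
The component count that a fractional `n_components` selects can be recovered by hand from the cumulative explained variance. The sketch below assumes the small 8x8 digits dataset from `scikit-learn` as a stand-in, so the count it prints will not be 154.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_demo, _ = load_digits(return_X_y=True)

# Keep every component so the full variance spectrum is available
full = PCA(svd_solver='full').fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 95%
n_needed = int(np.searchsorted(cumulative, 0.95, side='right') + 1)

# Let scikit-learn choose the count from the same threshold
auto = PCA(n_components=0.95, svd_solver='full').fit(X_demo)

print('By hand:', n_needed, '| chosen by PCA:', auto.n_components_)
```

Plotting `cumulative` against the component index is a common companion to the per-component plot above, since the 95% crossing point can be read directly off the curve.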

If you remember the first embedding plot we made, all of the digits were essentially in one big blob. These next two methods perform a lot better in placing similar samples close to each other and distinguishing groups of data with boundaries in the embedding.

## 3. t-distributed Stochastic Neighbor Embedding (t-SNE) <a id="tsne"></a>

t-distributed stochastic neighbor embedding [(t-SNE)](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) is a nonlinear dimensionality reduction method, mostly used as a visualization tool. Here's the `scikit-learn` definition:

    t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint
    probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low
    dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with
    different initializations we can get different results.

It is extremely useful for exploratory data analysis (EDA) since its purpose is to put similar samples near each other, i.e., there's an emphasis on local structure. That being said, the global structure is not well represented and the distances between far samples in the embedded space are not similar to their respective distances in the input space.
 
It can only fit data to a reduced space, i.e., it cannot embed new samples after fitting. This weakness means that if you received more data, the embedding would have to be fit again from scratch. One could in theory [train a neural network to learn the embedding](https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf), but that is beyond the scope of this tutorial. Another weakness is that it is relatively slow compared to PCA. Here are some more complete resources:

- [Stat Quest Video (12 minutes, simple language)](https://www.youtube.com/watch?v=NEaUSP4YerM)
- [Visualizing Data using t-SNE Talk (55 minutes, advanced language)](https://www.youtube.com/watch?v=RJVL80Gg3lA&t=15s)
- [t-SNE Original Paper](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
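
Two of the points above can be checked directly on a fitted model: the Kullback-Leibler divergence that t-SNE minimizes is exposed after fitting, and the estimator offers no way to embed new samples. This is a minimal sketch on a 300-sample subset of the 8x8 digits dataset from `scikit-learn` (an assumption to keep the run short), not on MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X_demo, _ = load_digits(return_X_y=True)
X_demo = X_demo[:300] / 16.0  # small subset so the fit takes seconds

tsne_demo = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne_demo.fit_transform(X_demo)

print('Embedding shape:', embedding.shape)
# Final value of the cost function after optimization
print('Final KL divergence:', tsne_demo.kl_divergence_)
# TSNE only provides fit / fit_transform, so new samples cannot be projected
print('Has a transform method:', hasattr(tsne_demo, 'transform'))
```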

### 3a. Fit, transform, and visualize using training set <a id="tsne_train"></a>

We also use [scikit-learn for t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Since the method is stochastic, different runs will give different results unless `random_state` is fixed. The main hyperparameter for t-SNE is perplexity, which is loosely the effective number of nearest neighbors each point considers. Results can be sensitive to perplexity, and changing it can noticeably alter the reduced space. Here, we reduce to two components and use the default perplexity of 30.

In [None]:
tsne = sklearn.manifold.TSNE(n_components=2, perplexity=30, random_state=42)
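
To get a feel for the perplexity hyperparameter before committing to a long run, one option is to compare a few values on a small subset. The sketch below assumes a 200-sample subset of the 8x8 digits dataset from `scikit-learn`; the perplexity values are arbitrary examples, and scikit-learn expects perplexity to be smaller than the number of samples.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X_demo, _ = load_digits(return_X_y=True)
X_demo = X_demo[:200] / 16.0  # small subset so each fit takes seconds

embeddings = {}
for perplexity in (5, 30):
    model = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = model.fit_transform(X_demo)
    print('perplexity={:2d}: KL divergence = {:.3f}'.format(
        perplexity, model.kl_divergence_))
```

Each entry of `embeddings` can be passed to `plt.scatter` to compare the layouts side by side. KL divergences from different perplexities are not directly comparable as quality scores, so visual inspection remains the main guide.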

We will fit and transform the training set to the "t-SNE space", or the reduced space formed from t-SNE. **Warning: this may take several minutes.**

In [None]:
t0_tsne = time()
tsne_mnist_train = tsne.fit_transform(x_train_scale_flat)
t1_tsne = time()

print ('Time spent fitting model: {:.4f} seconds'.format(t1_tsne-t0_tsne))

By comparison, t-SNE is a lot slower than PCA. Let's check the shape to make sure we have two components as features.

In [None]:
tsne_mnist_train.shape

Now we plot the training set in the t-SNE space. The x and y axes are meaningless, and the focus of the analysis should be on the samples' locations with respect to their neighbors.

In [None]:
plt.figure(figsize=[10,10])
plt.title('t-SNE using MNIST (training): unlabeled')
plt.scatter(tsne_mnist_train[:, 0], tsne_mnist_train[:, 1], s=5, alpha=0.5)
plt.xlabel('TSNE1')
plt.ylabel('TSNE2')

By eye, there are arguably 11 distinct clusters, a much clearer separation than PCA gave. Let's color the plot by digit to see where the digits are.

In [None]:
plt.figure(figsize=[10,10])
plt.title('t-SNE using MNIST (training): labeled')
for i in range (10):
    mask = y_train == i
    plt.scatter(tsne_mnist_train[:, 0][mask], tsne_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
plt.xlabel('TSNE1')
plt.ylabel('TSNE2')
plt.legend()

Most of the clusters are independent with small margins dividing each one up. Let's plot by digit to get some finer details.

In [None]:
for i in range (10):
    # Plot digit
    plt.figure(figsize=[10,10])
    plt.title('t-SNE using MNIST (training): labeled {}'.format(i))
    mask = y_train == i
    plt.scatter(tsne_mnist_train[:, 0][mask], tsne_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
    # Plot background
    for j in range (10):
        mask2 = y_train == j
        plt.scatter(tsne_mnist_train[:, 0][mask2], tsne_mnist_train[:, 1][mask2], 
                    s=1, alpha=0.1, color='C{}'.format(j))
    plt.xlabel('TSNE1')
    plt.ylabel('TSNE2')
    plt.legend()

We could be convinced that a simple classification/clustering algorithm on several digits in the t-SNE space would produce good results, illustrating that the 784-D input space is well represented in the 2-D t-SNE space. In addition, t-SNE far outperforms PCA at keeping digits from overlapping and at forming distinct groups. The 3 cluster (red) is split, indicating there could be two different local structures of 3s or that the model needed more iterations (see `n_iter`, renamed `max_iter` in recent scikit-learn releases, in the documentation).

### 3b. Find an anomaly <a id="anomaly"></a>

The "anomalies", or samples that aren't part of their respective digit clusters, had trouble embedding to their cluster either because they are true anomalies or because the hyperparameters we chose weren't optimal enough. Let's find an anomaly and see if there's a reason why it joined a specific cluster.

In [None]:
# Choose the 9 with the greatest x coordinate as an anomaly
num = 9
candidates = np.where(y_train == num)[0]  # indices of all 9s in the training set
index = candidates[np.argmax(tsne_mnist_train[candidates, 0])]

# Plot t-SNE space
fig, axs = plt.subplots(1,2,figsize=[16,8])
for i in range (10):
    mask = y_train == i
    axs[0].scatter(tsne_mnist_train[:, 0][mask], tsne_mnist_train[:, 1][mask], 
                s=1, alpha=0.1, label=i, color='C{}'.format(i))

# Plot the anomaly in t-SNE space as black
axs[0].scatter(tsne_mnist_train[index][0], tsne_mnist_train[index][1], s=10, color='k', label='anomaly')
axs[0].set_title('t-SNE using MNIST (training): labeled')
axs[0].set_xlabel('TSNE1')
axs[0].set_ylabel('TSNE2')
axs[0].legend()

# Plot the anomaly
axs[1].imshow(x_train_scale[index])

This 9 (black) was grouped with the 0s (dark blue). Even though the anomaly 9 has a small stem, it could easily be mistaken as a 0.
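
Picking the sample with the greatest x coordinate works for this particular embedding, but it depends on where the clusters happen to land. A slightly more general approach, sketched here on a synthetic 2-D embedding (the data and the helper `farthest_from_class_center` are hypothetical), is to look for the sample farthest from the center of its own class.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D embedding: two well-separated classes of 100 points each
emb = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
                 rng.normal(loc=[10, 10], scale=0.5, size=(100, 2))])
labels = np.repeat([0, 1], 100)
emb[150] = [0.0, 0.0]  # plant one class-1 point inside the class-0 cluster

def farthest_from_class_center(emb, labels, target):
    """Return the index of the target-class point farthest from its class median."""
    idx = np.where(labels == target)[0]
    center = np.median(emb[idx], axis=0)  # the median is robust to a few stray points
    dist = np.linalg.norm(emb[idx] - center, axis=1)
    return idx[np.argmax(dist)]

anomaly_index = farthest_from_class_center(emb, labels, target=1)
print(anomaly_index)  # prints 150, the planted point
```

Because distances between far-apart points in a t-SNE embedding are not very meaningful, this should be read as a way to flag candidates for inspection rather than as a rigorous outlier score.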

### 3c. Fit, transform, and visualize using 28 PCs <a id="tsne_28"></a>

In practice, it is recommended to use another dimensionality reduction technique, e.g. PCA, before t-SNE to reduce noise and computation time. Using a smaller number of features that well represent the original data set will suffice for data visualization. We use the same hyperparameters as before for the model.

In [None]:
tsne_28 = sklearn.manifold.TSNE(n_components=2, perplexity=30, random_state=42)

We use the first 28 principal components of the training set from earlier as features to fit.

In [None]:
t0_tsne_28 = time()
tsne_28_mnist_train = tsne_28.fit_transform(pca_28_mnist_train)
t1_tsne_28 = time()

print ('Time spent fitting model: {:.4f} seconds'.format(t1_tsne_28-t0_tsne_28))

Using 28 features instead of 784 slightly decreased the amount of time t-SNE took to model. We could imagine extremely high dimensional data (>10k features) would benefit more from an initial reduction.
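
The two-step pattern (a linear reduction followed by t-SNE) can be sketched compactly. The example assumes a 300-sample subset of the 8x8 digits dataset from `scikit-learn`, and reduces its 64 features to 8 to mirror the square-root heuristic used above.

```python
from time import perf_counter

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_demo, _ = load_digits(return_X_y=True)
X_demo = X_demo[:300] / 16.0  # small subset so the fit takes seconds

# Step 1: linear reduction from 64 features to 8 principal components
X_reduced = PCA(n_components=8).fit_transform(X_demo)

# Step 2: nonlinear embedding of the reduced features
start = perf_counter()
emb_two_step = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)
elapsed = perf_counter() - start

print('Reduced shape:', X_reduced.shape, '| embedding shape:', emb_two_step.shape)
print('t-SNE time on reduced features: {:.2f} seconds'.format(elapsed))
```

On data this small the time saved is minor; the benefit is expected to grow with the number of input features.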

Let's plot the new embedding.

In [None]:
plt.figure(figsize=[10,10])
for i in range (10):
    plt.title('t-SNE using 28 MNIST PCs (training): labeled')
    mask = y_train == i
    plt.scatter(tsne_28_mnist_train[:, 0][mask], tsne_28_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
plt.xlabel('TSNE1')
plt.ylabel('TSNE2')
plt.legend()

Since the noise was reduced using PCA, the 3s cluster (red) is not split anymore. Now we have 10 distinct clusters, one for each digit. There are still some anomalies, which we can visualize by plotting the digits individually.

In [None]:
for i in range (10):
    # Plot digit
    plt.figure(figsize=[10,10])
    plt.title('t-SNE using 28 MNIST PCs (training): labeled {}'.format(i))
    mask = y_train == i
    plt.scatter(tsne_28_mnist_train[:, 0][mask], tsne_28_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
    # Plot background
    for j in range (10):
        mask2 = y_train == j
        plt.scatter(tsne_28_mnist_train[:, 0][mask2], tsne_28_mnist_train[:, 1][mask2], 
                    s=1, alpha=0.1, color='C{}'.format(j))
    plt.xlabel('TSNE1')
    plt.ylabel('TSNE2')
    plt.legend()

t-SNE overall outperforms PCA in local structure connectivity with similar samples being close to each other in the reduced space. It can also distinctly form clusters of digits either using the original features or principal components. However, some drawbacks include computational complexity, no prediction after training, and no inverse function from the reduced space back to the original space. For these reasons, t-SNE is more useful as a data visualization tool. Thankfully, our last method addresses these drawbacks with the added benefit of focusing on global structure as well.

## 4. Uniform Manifold Approximation and Projection (UMAP) <a id="umap"></a>

Uniform manifold approximation and projection [(UMAP)](https://github.com/lmcinnes/umap) is a nonlinear dimensionality reduction technique that has gained widespread use and popularity since its introduction in 2018. UMAP constructs a neighbor graph in the high dimensional space and projects that graph to a low dimensional space. It does so by making three assumptions:

- The data is uniformly distributed on a Riemannian manifold;
- The Riemannian metric is locally constant (or can be approximated as such);
- The manifold is locally connected.

UMAP aims to conserve both local and global structure; the distances between similar samples in the high dimensional space are consistent with those distances in the low dimensional space, and the distances between clusters are more meaningful than in t-SNE. Because of this, clusters tend to form and push away from each other, unlike in t-SNE. The mathematical theory behind UMAP is involved and beyond the scope of this tutorial, but it gives the algorithm a rigorous foundation.

In addition, it is fast: slower than PCA but much faster than t-SNE. Furthermore, new data can be projected into the reduced space after training, i.e., embedding new samples is possible. Lastly, there is an inverse transform from the low dimensional space back to the high dimensional space. Here are some more complete resources:

- [StatQuest UMAP: Main Ideas (19 minutes, simple language)](https://www.youtube.com/watch?v=eN0wFzBA4Sc)
- [StatQuest UMAP: Mathematical Details (16 minutes, simple language)](https://www.youtube.com/watch?v=jth4kEvJ3P8)
- [AI Coffee Bean Video (9 minutes, intermediate language)](https://www.youtube.com/watch?v=6BPl81wGGP8)
- [McInnes (author) Scipy Talk (26 minutes, advanced language)](https://www.youtube.com/watch?v=nq6iPZVUxZU&t=843s)
- [UMAP original paper](https://arxiv.org/pdf/1802.03426.pdf)
- [Understanding UMAP (article)](https://pair-code.github.io/understanding-umap/)

### 4a. Fit, transform, and visualize using training set <a id="umap_train"></a>

We use a [Python implementation of UMAP](https://umap-learn.readthedocs.io/en/latest/) whose API is similar to that of `sklearn`. The two main hyperparameters for this algorithm are `n_neighbors` and `min_dist`. The former determines the number of neighbors used for the local approximation, and larger values conserve more global structure. The latter determines how tightly the data is packed in the reduced space, and smaller values conserve more local structure. Tuning these hyperparameters changes the overall embedding smoothly. We reduce to two components and use the default hyperparameter values.

In [None]:
umap = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)

Like PCA and t-SNE, we fit and transform the training data to the "UMAP space", or the reduced space formed from UMAP.

In [None]:
t0_umap = time()
umap_mnist_train = umap.fit_transform(x_train_scale_flat)
t1_umap = time()

print ('Time spent fitting model: {:.4f} seconds'.format(t1_umap-t0_umap))

Not as fast as PCA, but way faster than t-SNE. Let's check the shape to make sure we have two features.

In [None]:
umap_mnist_train.shape

Now we plot the training set in the UMAP space. Like t-SNE, the axes are arbitrary and analysis should focus on samples' locations with respect to their neighbors and clusters' locations with respect to each other.

In [None]:
plt.figure(figsize=[10,10])
plt.title('UMAP using MNIST (training): unlabeled')
plt.scatter(umap_mnist_train[:, 0], umap_mnist_train[:, 1], s=5, alpha=0.5)
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')

There are clearly 10 distinct clusters with a good amount of separation between them. Let's color the plot so we can see where each digit lies.

In [None]:
plt.figure(figsize=[10,10])
plt.title('UMAP using MNIST (training): labeled')
for i in range (10):
    mask = y_train == i
    plt.scatter(umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

All the digits formed their own groups. Since there are some "anomalies", let's plot by digit to get finer details of where they lie.

In [None]:
for i in range (10):
    # Plot digit
    plt.figure(figsize=[10,10])
    plt.title('UMAP using MNIST (training): labeled')
    mask = y_train == i
    plt.scatter(umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
    # Plot background
    for j in range (10):
        mask2 = y_train == j
        plt.scatter(umap_mnist_train[:, 0][mask2], umap_mnist_train[:, 1][mask2], 
                    s=1, alpha=0.1, color='C{}'.format(j))
    plt.xlabel('UMAP1')
    plt.ylabel('UMAP2')
    plt.legend()

We could be convinced that a simple classification/clustering algorithm on all the digits in the UMAP space would yield excellent results, illustrating that the 784-D input space is well represented in the 2-D UMAP space. The clusters are also relatively dense compared to PCA and t-SNE. The clusters are interpretable as well:

- 0 and 1 remain distant as we saw in PCA.
- 4, 7, and 9 remain close like in PCA.
- 2, 5, and 8 have the most overlap as we saw in t-SNE.
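
The claim that a simple classifier would do well in the UMAP space can be checked directly. The sketch below shows one way to do it with a k-nearest-neighbors classifier. To stay self-contained it runs on synthetic 2-D blobs that stand in for a well-separated embedding (an assumption, not the notebook's data), so the printed accuracy says nothing about MNIST itself.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for a 2-D embedding (assumption): 10 well-separated blobs on a grid
centers = np.array([[4.0 * (i % 5), 4.0 * (i // 5)] for i in range(10)])
embedding, labels = make_blobs(n_samples=2000, centers=centers,
                               cluster_std=0.5, random_state=42)

# Hold out a quarter of the points to estimate accuracy
emb_fit, emb_eval, lab_fit, lab_eval = train_test_split(
    embedding, labels, test_size=0.25, random_state=42, stratify=labels)

# Classify each held-out point by majority vote of its 15 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=15).fit(emb_fit, lab_fit)
accuracy = knn.score(emb_eval, lab_eval)
print('k-NN accuracy in the 2-D space: {:.3f}'.format(accuracy))
```

In the notebook, one would fit on `umap_mnist_train` with `y_train` and, once the test set has been transformed, score on the transformed test set with `y_test`.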

### 4b. Transform and visualize test set <a id="umap_test"></a>

To see if this representation generalizes, we transform the test set.

In [None]:
umap_mnist_test = umap.transform(x_test_scale_flat)

Now we plot the test set in the UMAP space with the training set as a transparent background.

In [None]:
plt.figure(figsize=[10,10])
plt.title('UMAP using MNIST (test): labeled')
for i in range (10):
    mask = y_train == i
    plt.scatter(umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
                s=1, alpha=0.1, color='C{}'.format(i))
    mask2 = y_test == i
    plt.scatter(umap_mnist_test[:, 0][mask2], umap_mnist_test[:, 1][mask2], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
plt.legend()

The test samples land nicely on their associated digit clusters in the UMAP space. To see the test set anomalies, we plot by digit.

In [None]:
for i in range (10):
    # Plot digit
    plt.figure(figsize=[10,10])
    plt.title('UMAP using MNIST (test): labeled')
    mask = y_test == i
    plt.scatter(umap_mnist_test[:, 0][mask], umap_mnist_test[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
    # Plot background
    for j in range (10):
        mask2 = y_train == j
        plt.scatter(umap_mnist_train[:, 0][mask2], umap_mnist_train[:, 1][mask2], 
                    s=1, alpha=0.1, color='C{}'.format(j))
    plt.xlabel('UMAP1')
    plt.ylabel('UMAP2')
    plt.legend()
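
The per-digit plotting cells in this section repeat the same pattern: one highlighted digit drawn over a faint background of every class. A small helper could capture that pattern. The function name `plot_digit_over_background` and its arguments are hypothetical, and the random points below only stand in for real embeddings.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_digit_over_background(fg_xy, fg_labels, bg_xy, bg_labels, digit, ax=None):
    """Scatter one digit (foreground) over all classes (faint background)."""
    if ax is None:
        _, ax = plt.subplots(figsize=[10, 10])
    # Background: every class, small and transparent
    for j in np.unique(bg_labels):
        mask = bg_labels == j
        ax.scatter(bg_xy[mask, 0], bg_xy[mask, 1],
                   s=1, alpha=0.1, color='C{}'.format(int(j)))
    # Foreground: only the requested digit, larger and more opaque
    mask = fg_labels == digit
    ax.scatter(fg_xy[mask, 0], fg_xy[mask, 1],
               s=5, alpha=0.5, label=int(digit), color='C{}'.format(int(digit)))
    ax.set_xlabel('UMAP1')
    ax.set_ylabel('UMAP2')
    ax.legend()
    return ax

# Stand-in data (assumption): random 2-D points with labels 0-9
rng = np.random.default_rng(0)
demo_xy = rng.normal(size=(500, 2))
demo_labels = np.arange(500) % 10
ax = plot_digit_over_background(demo_xy, demo_labels, demo_xy, demo_labels, digit=3)
```

With the notebook's arrays, the test-set loop would reduce to calling the helper once per digit with `umap_mnist_test`, `y_test`, `umap_mnist_train`, and `y_train`.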

### 4c. Inverse function <a id="umap_inverse"></a>

As mentioned, UMAP has an inverse function going from the low dimensional space to the original input space. Here, we inverse transform the first 8 samples of the training set. **Note: the inverse transformation takes longer than PCA's inverse.**

In [None]:
subset = 8
umap_mnist_inverse = umap.inverse_transform(umap_mnist_train[:subset]).reshape(subset, x_length, x_length)

Now we plot the original samples with their inverses.

In [None]:
fig, axs = plt.subplots(2, subset, figsize=[20, 5])
for i in range (subset):
    axs[0, i].imshow(x_train_scale[i])
    axs[1, i].imshow(umap_mnist_inverse[i])
plt.tight_layout()

Even with just a 2-D representation, the inverses are pretty convincing and outperform PCA's for a low dimensional representation. However, one of UMAP's weaknesses is that the inverse function scales poorly as the dimension of the reduced space increases. While PCA's inverses increase in quality with more dimensions (as we saw from 2-D to 154-D), UMAP's inverses decrease in quality and become too computationally expensive at dimensions higher than 3-D, making the inverse function essentially useless after that.
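
Inverse "quality" can also be measured rather than judged by eye, for example with the mean squared error between each sample and its reconstruction. The sketch below defines a hypothetical helper, `reconstruction_mse`, and demonstrates it with PCA on the small `load_digits` data set (a stand-in for MNIST), where the error should shrink as components are added.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

def reconstruction_mse(original, reconstructed):
    """Per-sample mean squared error between images and their reconstructions."""
    original = np.asarray(original, dtype=float).reshape(len(original), -1)
    reconstructed = np.asarray(reconstructed, dtype=float).reshape(len(reconstructed), -1)
    return np.mean((original - reconstructed) ** 2, axis=1)

# Stand-in data (assumption): 8x8 digits scaled to [0, 1]
X = load_digits().data / 16.0

mse_by_dim = {}
for k in (2, 16, 32):
    pca_k = PCA(n_components=k, random_state=42).fit(X)
    # Round trip: input space -> k-D PCA space -> input space
    X_back = pca_k.inverse_transform(pca_k.transform(X))
    mse_by_dim[k] = reconstruction_mse(X, X_back).mean()
    print('{:2d} components: mean MSE = {:.5f}'.format(k, mse_by_dim[k]))
```

The same helper could be applied to the notebook's arrays, e.g. `reconstruction_mse(x_train_scale[:subset], umap_mnist_inverse)`, to compare UMAP's inverses with PCA's on equal footing.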

As we did with PCA, we evaluate the inverse function on a grid of points in the UMAP space to visualize how the inverses smoothly change. First, we make a grid of points as our samples.

In [None]:
x_min, x_max = -6, 16
y_min, y_max = -6, 16
n = 15
umap_manifold = []
for i in np.linspace(y_max,y_min,n):
    for j in np.linspace(x_min,x_max,n):
        umap_manifold.append([j, i])
umap_manifold = np.array(umap_manifold)
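
For reference, the same grid can be built without explicit loops. This is a sketch of an equivalent construction using `np.meshgrid`; the name `grid_points` is introduced here only for illustration.

```python
import numpy as np

x_min, x_max = -6, 16
y_min, y_max = -6, 16
n = 15

xs = np.linspace(x_min, x_max, n)   # left to right
ys = np.linspace(y_max, y_min, n)   # top to bottom, matching image row order
xx, yy = np.meshgrid(xs, ys)        # xx[i, j] = xs[j], yy[i, j] = ys[i]

# Flatten row by row into (n*n, 2) points, ordered like the nested loops
grid_points = np.column_stack([xx.ravel(), yy.ravel()])
print(grid_points.shape)
```

The rows run from `y_max` down to `y_min` so that the first row of points corresponds to the top row of the image plotted later.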

Next, we inverse transform the grid points to the high dimensional space.

In [None]:
umap_manifold_inverse = umap.inverse_transform(umap_manifold).reshape(n, n, x_length, x_length)

Now, we plot the manifold of the UMAP space.

In [None]:
# Create image of inverses
manifold = np.zeros((n*x_length, n*x_length))
for i in range (n):
    for j in range (n):
        manifold[i*x_length:(i+1)*x_length, j*x_length:(j+1)*x_length] = umap_manifold_inverse[i, j]

fig, axs = plt.subplots(1,2,figsize=[20,10])

# Plot UMAP space
axs[0].set_title('UMAP using MNIST (training): labeled')
for i in range (10):
    mask = y_train == i
    axs[0].scatter(umap_mnist_train[:, 0][mask], umap_mnist_train[:, 1][mask], 
                s=5, alpha=0.5, label=i, color='C{}'.format(i))
axs[0].set_xlabel('UMAP1')
axs[0].set_ylabel('UMAP2')
axs[0].set_xlim(x_min, x_max)
axs[0].set_ylim(y_min, y_max)
axs[0].legend()

# Plot inverses
axs[1].set_title('UMAP using MNIST: manifold')
axs[1].imshow(manifold, extent=[x_min,x_max,y_min,y_max])
axs[1].set_xlabel('UMAP1')
axs[1].set_ylabel('UMAP2')
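
The nested loop that assembles the `manifold` image can also be written as a single transpose and reshape. The sketch below uses a hypothetical helper, `tile_images`, and checks it against the loop on small random tiles that stand in for the inverse-transformed digits.

```python
import numpy as np

def tile_images(tiles):
    """Arrange an (n_rows, n_cols, h, w) stack of images into one 2-D mosaic."""
    n_rows, n_cols, h, w = tiles.shape
    # Bring each tile's pixel rows next to its grid row, then merge the axes
    return tiles.transpose(0, 2, 1, 3).reshape(n_rows * h, n_cols * w)

# Stand-in tiles (assumption): a 3x3 grid of random 4x4 "images"
rng = np.random.default_rng(0)
n_demo, length = 3, 4
tiles = rng.random((n_demo, n_demo, length, length))
mosaic = tile_images(tiles)

# Reference built with the same slicing loop as the cell above
reference = np.zeros((n_demo * length, n_demo * length))
for i in range(n_demo):
    for j in range(n_demo):
        reference[i*length:(i+1)*length, j*length:(j+1)*length] = tiles[i, j]

print(np.array_equal(mosaic, reference))
```

The transpose moves the pixel-row axis next to the grid-row axis, so the reshape concatenates tiles in the same order as the slicing loop.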

The UMAP manifold is a lot more interesting than the PCA manifold. For starters, the blank regions inverse transform to an 8, so for some reason UMAP takes that digit as a "default" guess. Most clusters invert to their labeled digit, e.g., points near the 0 cluster (~13.5, 7.5) inverse transform to a 0. The 4, 7, and 9 clusters smoothly transition into each other, reinforcing the similarities between these digits.

### 4d. UMAP using 28 PCs exercises <a id="exercises"></a>

It is also recommended to use another dimensionality reduction technique before UMAP for the same reasons as t-SNE. We leave that as an exercise for the reader.

1. Define a UMAP model using the same hyperparameters as before: 
    - `umap_28 = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)`
2. Fit, transform, and visualize using 28 PCs:
    - `umap_28_mnist_train = umap_28.fit_transform(pca_28_mnist_train)`
3. Transform and visualize test set by:
    - 3a. Transforming the test set to 28-D PCA space
        - `pca_28_mnist_test = pca_28.transform(x_test_scale_flat)`
    - 3b. Transforming from PCA space to UMAP space **(Hint: Reuse the plotting code and replace with the `umap_28` objects accordingly)**
        - `umap_28_mnist_test = umap_28.transform(pca_28_mnist_test)`
4. Visualize inverses by:
    - 4a. Inverse transforming 8 samples from UMAP space to PCA space
        - `umap_28_mnist_inverse = umap_28.inverse_transform(umap_28_mnist_train[:8])`
    - 4b. Inverse transforming those samples from PCA space to input space **(Hint: Reuse the plotting code from `pca_28` objects)**
        - `pca_28_mnist_inverse = pca_28.inverse_transform(umap_28_mnist_inverse).reshape(8,x_length,x_length)`
    - If the final reconstructions are comparable to the original samples, then UMAP represents the PCA space well
5. Analyze an anomaly as in [Section 3b.](#anomaly) **(Hint: Reuse the plotting code and replace with the `umap_28` objects accordingly)**

Training may take a similar amount of time, and the UMAP spaces obtained from the original input pixels and from the principal components should look pretty similar.

## 5. Conclusions <a id="con"></a>

Dimensionality reduction is an excellent tool for discovering structure in high dimensional data. Principal component analysis is a fast linear algorithm that projects data onto axes of most variance, preserving global structure. t-distributed stochastic neighbor embedding is a slower nonlinear algorithm that reduces the similarities of neighbors in a high dimensional space to a low dimensional space, preserving local structure. Uniform manifold approximation and projection is a relatively fast algorithm that projects graphs in a high dimensional space to a low dimensional space, preserving a mix of global and local structure. There are many dimensionality reduction techniques, but using these three in unison is a substantial start to any machine learning based exploratory data analysis.

**Thank you and congratulations for completing the notebook!**

In [None]:
# Time spent for each model
print ('Time spent fitting PCA: {:.4f} seconds'.format(t1_pca-t0_pca))
print ('Time spent fitting PCA_28: {:.4f} seconds'.format(t1_pca_28-t0_pca_28))
print ('Time spent fitting t-SNE: {:.4f} seconds'.format(t1_tsne-t0_tsne))
print ('Time spent fitting t-SNE_28: {:.4f} seconds'.format(t1_tsne_28-t0_tsne_28))
print ('Time spent fitting UMAP: {:.4f} seconds'.format(t1_umap-t0_umap))

## Appendix: TSVD <a id="append"></a>

There is a lot of debate about how to choose the optimal number of principal components in PCA. Ideally, we want to choose the number of components that maximizes signal and minimizes noise. Most data sets do not need to retain 95% of the variance since a lot of that variance is noise. The easy answer is that everything is data specific, which is true, but there are various methods for choosing the number of components. One such method is [optimal truncated singular value decomposition](https://arxiv.org/pdf/1305.5870.pdf). By making some assumptions about the data's noise, one can calculate the number of components theoretically needed to remove a majority of the noise. Here are some videos that explain the method:

- [Singular Value Decomposition Overview](https://www.youtube.com/watch?v=gXbThCXjZFM)
- [SVD and Optimal Truncation](https://www.youtube.com/watch?v=9vJDjkx825k&t=3s)
- [Optimal TSVD (Python)](https://www.youtube.com/watch?v=epoHE2rex0g&t=326s)
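
As a rough sketch of the idea, the linked paper gives an approximate threshold for the case where the noise is white with unknown level: keep the singular values larger than $\omega(\beta)$ times the median singular value, where $\beta$ is the matrix's aspect ratio (short side over long side) and $\omega(\beta) \approx 0.56\beta^3 - 0.95\beta^2 + 1.82\beta + 1.43$. The function names and the synthetic rank-5 matrix below are assumptions made for illustration. For PCA, the data would be mean-centered first.

```python
import numpy as np

def omega(beta):
    """Approximate threshold coefficient for unknown noise level."""
    return 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43

def optimal_rank(X):
    """Count singular values above omega(beta) * median singular value."""
    m, n = X.shape
    beta = min(m, n) / max(m, n)               # aspect ratio in (0, 1]
    s = np.linalg.svd(X, compute_uv=False)     # singular values, descending
    tau = omega(beta) * np.median(s)
    return int(np.sum(s > tau)), tau

# Synthetic example (assumption): rank-5 signal plus unit-variance white noise
rng = np.random.default_rng(0)
true_rank = 5
signal = rng.normal(size=(200, true_rank)) @ rng.normal(size=(true_rank, 300))
noisy = signal + rng.normal(size=(200, 300))

rank, tau = optimal_rank(noisy)
print('Estimated rank: {} (threshold = {:.2f})'.format(rank, tau))
```

The estimate is only as good as the white-noise assumption, so on real data it is best treated as a starting point rather than a final answer.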


## Additional Resources <a id="add"></a>

Machine learning is a dense and rapidly evolving field of study. Becoming an expert takes years of practice and patience, but hopefully this notebook brought you closer in that direction. Here are some of the author's favorite resources for learning about machine learning and data science:

- [Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/ml-intro)
- [scikit-learn Python Library](https://scikit-learn.org/stable/index.html) (go-to for most ML algorithms besides neural networks)
- [StatQuest YouTube Channel](https://www.youtube.com/c/joshstarmer)
- [DeepLearningAI YouTube Channel](https://www.youtube.com/c/Deeplearningai/videos)
- [Towards Data Science](https://towardsdatascience.com/) (articles about data science and machine learning, some involving example blocks of code)
- Advanced search on [arxiv](https://arxiv.org/search/advanced) (e.g. search term "machine learning" in Abstract for Subject astro-ph) to see what others are currently doing
- Google, YouTube, and Wikipedia in general

## About this Notebook <a id="about"></a>

**Author:** Fred Dauphin, DeepWFC3

**Updated on:** 2023-02-08

## Citations <a id="cite"></a>

If you use `numpy`, `matplotlib`, `sklearn`, or `umap` for published research, please cite the authors. Follow these links for more information about citing `numpy`, `matplotlib`, `sklearn`, and `umap`:

* [Citing `numpy`](https://numpy.org/doc/stable/license.html)
* [Citing `matplotlib`](https://matplotlib.org/stable/users/project/license.html#:~:text=Matplotlib%20only%20uses%20BSD%20compatible,are%20acceptable%20in%20matplotlib%20toolkits.)
* [Citing `sklearn`](https://scikit-learn.org/stable/about.html#citing-scikit-learn)
* [Citing `umap`](https://github.com/lmcinnes/umap/blob/master/LICENSE.txt)

***
[Top of Page](#title)