# Case Study: Visualizing MNIST with t-SNE

In this case study, we will once again take a look at the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). We will use [t-distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) (t-SNE) to visualize this dataset. Let's get started.

## Setup

The setup is exactly the same as in the PCA and MDS case studies.

In [1]:
import mnist
import altair as alt
import pandas as pd
import numpy as np

alt.data_transformers.disable_max_rows()

training_set = mnist.train_images()
training_labels = mnist.train_labels()

M = training_set.reshape((60000, 28*28), order="C").astype(float)

## t-SNE computation

Like MDS, t-SNE is a [manifold learning algorithm](https://scikit-learn.org/stable/modules/manifold.html). While MDS tries to preserve all pairwise distances between data points, t-SNE only tries to ensure that nearby data points stay nearby in the embedding. Conceptually, this gives t-SNE more flexibility in where it places the data points in the embedding.
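To make the idea of "only nearby points matter" concrete, here is a minimal NumPy sketch of the Gaussian neighbor weights that t-SNE builds in the high dimensional space. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the helper name `neighbor_weights` is hypothetical, and the bandwidth `sigma` is fixed by hand, whereas the actual algorithm chooses it separately for each point to match the requested perplexity.

```python
import numpy as np

def neighbor_weights(X, i, sigma=1.0):
    """Gaussian similarity of every point to point i, normalized to sum to 1.

    Hypothetical helper for illustration; sigma is fixed here, while t-SNE
    tunes it per point from the perplexity.
    """
    sq_dists = np.sum((X - X[i]) ** 2, axis=1)
    w = np.exp(-sq_dists / (2.0 * sigma ** 2))
    w[i] = 0.0  # a point is not considered its own neighbor
    return w / w.sum()

# Toy 1-D data: three points close together and one far away.
X = np.array([[0.0], [0.5], [1.0], [10.0]])
p = neighbor_weights(X, i=0)
print(np.round(p, 4))
```

The far-away point receives a weight that is practically zero, so its exact distance has almost no influence on where the first point ends up. MDS, in contrast, would try to reproduce that large distance as well.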

Once again, we will use [scikit-learn's implementation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE) of the t-SNE algorithm.

In [2]:
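from sklearn.manifold import TSNE

# Scale all pixel values uniformly from [0, 255] to [0, 1]
M_scaled = M / 255

# Embed the first 10000 images into 2 dimensions (default perplexity)
tsne = TSNE(n_components=2)
r = tsne.fit_transform(M_scaled[:10000])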

Because "nearby"-ness is measured by the distance between two data points, we divide every pixel by the same constant (255) rather than using [standardization](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling). Standardization rescales each pixel by its own standard deviation, which changes the relative distances between data points. Another important parameter of the t-SNE algorithm is the "perplexity", which roughly sets the number of nearby points taken into consideration for each data point during optimization. The appropriate value is dataset dependent. In this case study, we will use the default perplexity, which is 30. Feel free to play with this parameter and see how it affects the result.
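The effect of the two scaling choices on distances can be checked on a small example. The sketch below uses random numbers as a hypothetical stand-in for pixel data (it is not MNIST), with columns of deliberately different spread, and the helper `pairwise_distances` is written here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for image data: 6 "images" with 4 "pixels" each,
# where the columns have very different spreads.
X = rng.random((6, 4)) * np.array([255.0, 255.0, 10.0, 1.0])

def pairwise_distances(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_raw = pairwise_distances(X)
d_uniform = pairwise_distances(X / 255)

# Standardization: zero mean and unit variance for each column separately.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std = pairwise_distances(X_std)

# Uniform scaling shrinks every distance by the same factor of 255.
print(np.allclose(d_uniform * 255, d_raw))

# Standardization changes different distances by different factors.
off_diagonal = ~np.eye(len(X), dtype=bool)
ratios = d_std[off_diagonal] / d_raw[off_diagonal]
print(ratios.min(), ratios.max())
```

With uniform scaling, which points count as "nearby" is unchanged. After standardization the ratio between new and old distances varies from pair to pair, so the neighborhood structure that t-SNE sees can be different.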

We choose to only embed the first 10000 data points due to limitations in our computing hardware. If you have a powerful machine, feel free to change this.

## Visualization with t-SNE

Let's now visualize the 2D embedding of the dataset computed with t-SNE.

In [3]:
df = pd.DataFrame({"x": r[:,0], "y": r[:,1], "label":training_labels[:10000]})
alt.Chart(df).mark_point().encode(x="x:Q", y="y:Q", color="label:N")

In the above visualization, data points with different labels form clear and separable clusters. This is a big improvement over the results from PCA or MDS. Let us look at the facet view as well.

In [4]:
(
    alt.Chart(df)
    .mark_point()
    .encode(x="x:Q", y="y:Q", color="label:N")
    .facet("label:N", columns=2)
)

In the facet view, it is clear that most data points with a given label form a nice cluster. There are still outliers. Tuning parameters such as the perplexity may reduce the number of outliers, but that is beyond the scope of this course.
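For readers who want to experiment anyway, the following is a rough sketch of a perplexity sweep. To keep it quick and self-contained, it assumes a small stand-in dataset (300 of scikit-learn's 8x8 digit images from `load_digits`) rather than MNIST, and the perplexity values 5, 30, and 50 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Stand-in data (assumption): 300 of scikit-learn's 8x8 digit images.
# Pixel values in this dataset range from 0 to 16, so we scale uniformly.
X = load_digits().data[:300] / 16.0

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    # kl_divergence_ is the final value of the objective t-SNE minimizes
    print(perplexity, embeddings[perplexity].shape, tsne.kl_divergence_)
```

Each entry of `embeddings` can be plotted in the same way as `r` above. The most reliable way to compare the results is to look at the plots, since the objective values for different perplexities are not directly comparable.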

Similar to MDS, it is also possible to combine PCA with t-SNE, where we first project the dataset using the first N principal components and then use t-SNE to compute a lower dimensional embedding. We will leave this as an optional exercise for you to try in your own time.
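As a starting point for this exercise, here is a minimal sketch of the two-step pipeline. It again assumes a small stand-in dataset (scikit-learn's `load_digits`) so that it runs quickly, and the number of principal components, `n_pcs = 30`, is an arbitrary illustrative choice rather than a recommended value.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in data (assumption): 300 of scikit-learn's 8x8 digit images,
# scaled uniformly to [0, 1] (pixel values range from 0 to 16).
X = load_digits().data[:300] / 16.0

# Step 1: project onto the first N principal components.
n_pcs = 30  # hypothetical choice of N
X_reduced = PCA(n_components=n_pcs).fit_transform(X)

# Step 2: compute the 2D embedding from the reduced data.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)

print(X_reduced.shape, embedding.shape)
```

To try this on MNIST, replace `X` with `M_scaled[:10000]` and pick a value of N that suits the 784-dimensional data.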

## Summary

In this case study, we visualized the MNIST dataset using t-SNE. Like MDS, t-SNE seeks a lower dimensional embedding of high dimensional data. Unlike MDS, t-SNE only tries to keep nearby data points nearby in the embedding. This allows t-SNE to produce a better embedding of the MNIST dataset, with clearly visible clusters for data points of different labels.