analysis/pca_vs_tsne.Rmd

---
title: "PCA vs. t-SNE and UMAP: an illustration"
author: Peter Carbonetto
output: workflowr::wflow_html
---

Here we contrast the use of a simple linear dimensionality reduction
technique, PCA, with two nonlinear dimensionality reduction methods,
*t*-SNE and UMAP.

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
                      fig.align = "center",dpi = 120)
```

Load the packages used in the analysis below.
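
The chunk below is a minimal sketch of this step, with the package
list inferred from the functions called later: `select_loadings` and
`pca_plot` are assumed to come from fastTopics, `Rtsne` from Rtsne,
`umap` from uwot, `labs` and `ggsave` from ggplot2, and `plot_grid`
from cowplot. It also assumes that `fit` (the topic model fit) and
`samples` (the sample annotations, including the `cluster` column) are
already in the workspace.

```{r load-pkgs, message=FALSE}
# Assumed package list, inferred from the function calls below.
library(fastTopics)  # select_loadings, pca_plot
library(Rtsne)       # Rtsne
library(uwot)        # umap
library(ggplot2)     # labs, ggsave
library(cowplot)     # plot_grid
```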

To begin, draw a random subset of 2,000 cells from the B, CD14+ and
CD34+ clusters identified above. (The main reason for taking a random
subset is that we don't want to wait a long time for *t*-SNE and UMAP
to complete.)

```{r subset-fit}
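# Fix the seed so that the random subset is reproducible.
set.seed(5)
# Select the rows (cells) assigned to the B, CD14+ and CD34+ clusters.
rows <- which(with(samples,
                   cluster == "B" |
                   cluster == "CD14+" |
                   cluster == "CD34+"))
rows <- sort(sample(rows,2000))
# Keep only the topic proportions (loadings) for the selected cells.
fit2 <- select_loadings(fit,loadings = rows)
# Cluster labels for the selected cells; drop = TRUE removes factor
# levels that no longer appear in the subset.
x    <- samples$cluster[rows,drop = TRUE]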
```

Next, run PCA on the topic proportions for this random subset of 2,000
samples.

```{r pca}
p8 <- pca_plot(fit2,fill = x) + labs(fill = "cluster")
```

Run *t*-SNE on the topic proportions.

```{r tsne}
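# Run t-SNE directly on the topic proportions: pca = FALSE skips the
# initial PCA step, and normalize = FALSE skips the internal
# normalization of the input.
tsne <- Rtsne(fit2$L,dims = 2,pca = FALSE,normalize = FALSE,perplexity = 100,
              theta = 0.1,max_iter = 1000,eta = 200,verbose = FALSE)
# Store the embedding in element "x" so that it can be passed to
# pca_plot in place of a PCA result.
tsne$x <- tsne$Y
colnames(tsne$x) <- c("tsne1","tsne2")
p9 <- pca_plot(fit2,out.pca = tsne,fill = x) + labs(fill = "cluster")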
```

Then run UMAP on the topic proportions.

```{r umap}
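# Run UMAP directly on the topic proportions; scale = "none" leaves
# the input unscaled.
out.umap <- umap(fit2$L,n_neighbors = 30,metric = "euclidean",n_epochs = 1000,
                 min_dist = 0.1,scale = "none",learning_rate = 1,
                 verbose = FALSE)
# umap returns a matrix of coordinates here; wrap it in a list with
# element "x" so that it can be passed to pca_plot.
out.umap <- list(x = out.umap)
colnames(out.umap$x) <- c("umap1","umap2")
p10 <- pca_plot(fit2,out.pca = out.umap,fill = x) + labs(fill = "cluster")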
```

Here are the PCA, *t*-SNE and UMAP 2-d embeddings, side by side:

```{r pca-vs-tsne, fig.width=8, fig.height=2}
plot_grid(p8,p9,p10,nrow = 1)
```

In the projection of the samples onto the first two PCs, the B-cell
cluster is distinct from the others, whereas the CD14+ and CD34+ cells
are not as well separated from each other.

By contrast, this difference in how well the clusters separate is not
captured in the *t*-SNE and UMAP embeddings. This illustrates the
tendency of *t*-SNE and UMAP to accentuate clusters in the data at the
risk of distorting or obscuring finer-scale substructure.
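
One rough way to put a number on "how well two clusters separate" in a
2-d embedding is the distance between the two cluster centroids
divided by the average within-cluster spread. The sketch below is only
an illustration: `separation` is a hypothetical helper (it is not part
of fastTopics or any other package used here), and it is demonstrated
on simulated toy data rather than on the embeddings above.

```{r separation-sketch}
# Hypothetical helper: distance between the centroids of clusters a
# and b, divided by the average within-cluster spread (root mean
# squared distance to the centroid). Larger values indicate
# better-separated clusters.
separation <- function (Y, labels, a, b) {
  spread <- function (Z)
    sqrt(mean(rowSums(sweep(Z,2,colMeans(Z))^2)))
  Ya <- Y[labels == a,,drop = FALSE]
  Yb <- Y[labels == b,,drop = FALSE]
  d  <- sqrt(sum((colMeans(Ya) - colMeans(Yb))^2))
  return(d/mean(c(spread(Ya),spread(Yb))))
}

# Toy data (an assumption for illustration only): two well-separated
# 2-d Gaussian clusters of 50 points each.
set.seed(1)
Y.toy <- rbind(matrix(rnorm(100),50,2),
               matrix(rnorm(100,mean = 10),50,2))
labels.toy <- rep(c("a","b"),each = 50)
separation(Y.toy,labels.toy,"a","b")
```

Because the ratio is unaffected by the overall scale of an embedding,
the same helper could be used to compare, say,
`separation(prcomp(fit2$L)$x[,1:2],x,"CD14+","CD34+")` against
`separation(tsne$Y,x,"CD14+","CD34+")`. It is a crude summary, and
distances in *t*-SNE and UMAP embeddings should be interpreted with
caution.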

```{r pca-vs-tsne-for-paper, echo=FALSE}
ggsave("pca_vs_tsne.eps",plot_grid(p8,p9,p10,nrow = 1),height = 2,width = 8)
```

Note that the first two PCs should be sufficient for capturing most of
the structure in the topic proportions, as they explain more than 96%
of the variance:

```{r prcomp-summary}
summary(prcomp(fit2$L))
```
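
The "Proportion of Variance" row reported by `summary` is the squared
standard deviation of each PC divided by the sum of the squared
standard deviations over all PCs. The sketch below reproduces this
calculation by hand on simulated toy data; `L.toy` is a made-up
stand-in for the topic proportions, not the data analyzed above.

```{r pve-sketch}
# Toy "topic proportions" (an assumption for illustration only): 200
# samples and 3 topics, with each row normalized to sum to 1.
set.seed(1)
L.toy <- matrix(rexp(600),200,3)
L.toy <- L.toy/rowSums(L.toy)

# Proportion of variance explained (PVE) by each PC.
pca.toy <- prcomp(L.toy)
pve.toy <- pca.toy$sdev^2/sum(pca.toy$sdev^2)

# Cumulative PVE; the second entry is the PVE of the first two PCs.
cumsum(pve.toy)
```

In this toy example the first two PCs explain essentially all of the
variance, because proportions over three topics that sum to 1 lie in a
2-d plane. Replacing `L.toy` with `fit2$L` gives the corresponding
numbers for the analysis above.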