---
title: "PCA vs. t-SNE and UMAP: an illustration"
author: Peter Carbonetto
output: workflowr::wflow_html
---
Here we contrast use of a simple linear dimensionality reduction
technique, PCA, with nonlinear dimensionality reduction methods
*t*-SNE and UMAP.
```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
                      fig.align = "center",dpi = 120)
```
Load the packages used in the analysis below.
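No package-loading chunk appears at this point, so the following is only a sketch of what it might look like. The package names are an assumption inferred from the functions called below: `select_loadings` and `pca_plot` (fastTopics), `Rtsne` (Rtsne), `umap` with `n_neighbors` and `scale` arguments (uwot), `labs` and `ggsave` (ggplot2), and `plot_grid` (cowplot).

```r
# Packages inferred from the function calls in this document (an
# assumption; the original list of packages is not given).
pkgs <- c("fastTopics","Rtsne","uwot","ggplot2","cowplot")

# Check which packages are installed, and report any that are missing
# instead of stopping at the first failure.
installed <- vapply(pkgs,requireNamespace,logical(1),quietly = TRUE)
if (!all(installed))
  message("Not installed: ",paste(pkgs[!installed],collapse = ", "))

# Attach the packages that are available.
for (pkg in pkgs[installed])
  library(pkg,character.only = TRUE)
```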
To begin, draw a random subset of 2,000 cells from the B, CD14+ and
CD34+ clusters identified in the earlier clustering analysis; the code
below assumes that the `fit` and `samples` objects from that analysis
are already loaded. (The main reason for taking a random subset is to
avoid a long wait for *t*-SNE and UMAP to complete.)
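```{r subset-fit}
set.seed(5)
# Select the cells assigned to the B, CD14+ or CD34+ clusters.
rows <- which(with(samples,
                   cluster == "B" |
                   cluster == "CD14+" |
                   cluster == "CD34+"))
# Take a random subset of 2,000 of these cells.
rows <- sort(sample(rows,2000))
fit2 <- select_loadings(fit,loadings = rows)
# Keep the cluster labels for the subset, dropping unused factor levels.
x <- samples$cluster[rows,drop = TRUE]
```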
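To see what the subsetting step does in isolation, here is a minimal sketch on made-up data. The `toy` data frame and its cluster labels are hypothetical stand-ins for `samples`; only base R is used.

```r
set.seed(1)

# Hypothetical stand-in for "samples": 10,000 cells with made-up labels.
labels <- c("B","CD14+","CD34+","NK","T")
toy <- data.frame(cluster = factor(sample(labels,10000,replace = TRUE)))

# Indices of the cells in the three clusters of interest.
keep <- which(toy$cluster %in% c("B","CD14+","CD34+"))

# Random subset of 2,000 of these cells; sorting preserves the
# original ordering of the cells.
keep <- sort(sample(keep,2000))

# With drop = TRUE, the factor levels that no longer occur ("NK" and
# "T") are removed.
y <- toy$cluster[keep,drop = TRUE]
levels(y)
table(y)
```

Dropping the unused levels matters for plotting: otherwise the empty clusters would still be listed in the legend.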
Next, run PCA on the topic proportions for this random subset of 2,000
samples.
```{r pca}
p8 <- pca_plot(fit2,fill = x) + labs(fill = "cluster")
```
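As a rough illustration of what is being plotted, the sketch below computes the first two PCs directly with `prcomp`. It assumes that `pca_plot` does something equivalent to running `prcomp` on the matrix of topic proportions; the matrix `L` here is simulated, not the real `fit2$L`.

```r
set.seed(1)
n <- 200
k <- 6

# Simulated topic proportions (an assumption standing in for fit2$L):
# each row is nonnegative and sums to 1.
L <- matrix(rgamma(n * k,shape = 0.5),n,k)
L <- L / rowSums(L)
colnames(L) <- paste0("k",1:k)

# prcomp centers the columns by default, and returns the projection of
# the samples onto the PCs in the "x" element.
pca <- prcomp(L)

# The first two columns give the 2-d embedding.
pcs <- pca$x[,1:2]
head(pcs)
```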
Run *t*-SNE on the topic proportions.
```{r tsne}
tsne <- Rtsne(fit2$L,dims = 2,pca = FALSE,normalize = FALSE,perplexity = 100,
              theta = 0.1,max_iter = 1000,eta = 200,verbose = FALSE)
# Store the embedding in an "x" element so that it can be passed to
# pca_plot in place of the PCA results.
tsne$x <- tsne$Y
colnames(tsne$x) <- c("tsne1","tsne2")
p9 <- pca_plot(fit2,out.pca = tsne,fill = x) + labs(fill = "cluster")
```
Then run UMAP on the topic proportions.
```{r umap}
out.umap <- umap(fit2$L,n_neighbors = 30,metric = "euclidean",n_epochs = 1000,
                 min_dist = 0.1,scale = "none",learning_rate = 1,
                 verbose = FALSE)
# As with t-SNE, wrap the embedding in a list with an "x" element so
# that it can be passed to pca_plot in place of the PCA results.
out.umap <- list(x = out.umap)
colnames(out.umap$x) <- c("umap1","umap2")
p10 <- pca_plot(fit2,out.pca = out.umap,fill = x) + labs(fill = "cluster")
```
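Both the *t*-SNE and UMAP chunks reuse `pca_plot` by storing the embedding in a list element named `x`, which is where `prcomp` keeps its PCs. That wrapping step could be factored out as sketched below. The helper `as_embedding` is hypothetical (it is not part of fastTopics), and the embedding here is just random coordinates.

```r
# Hypothetical helper: wrap an embedding (one row per sample) in a list
# with an "x" element, mimicking the layout of prcomp output.
as_embedding <- function (Y, prefix) {
  Y <- as.matrix(Y)
  colnames(Y) <- paste0(prefix,seq_len(ncol(Y)))
  return(list(x = Y))
}

# Stand-in embedding: random 2-d coordinates for 5 samples.
set.seed(1)
Y <- matrix(rnorm(10),5,2)
out <- as_embedding(Y,"tsne")
colnames(out$x)
```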
Here are the PCA, *t*-SNE and UMAP 2-d embeddings, side by side:
```{r pca-vs-tsne, fig.width=8, fig.height=2}
plot_grid(p8,p9,p10,nrow = 1)
```
In the projection of the samples onto the first two PCs, the B-cell
cluster is distinct from the others, whereas the CD14+ and CD34+ cells
are less well separated.
By contrast, this lack of clear separation between the CD14+ and CD34+
cells is not apparent in the *t*-SNE and UMAP embeddings. This
illustrates the tendency of *t*-SNE and UMAP to accentuate clusters in
the data at the risk of distorting or obscuring finer-scale
substructure.
```{r pca-vs-tsne-for-paper, echo=FALSE}
ggsave("pca_vs_tsne.eps",plot_grid(p8,p9,p10,nrow = 1),height = 2,width = 8)
```
Note that the first two PCs should be sufficient to capture nearly all
of the structure in the topic proportions, as together they explain
more than 96% of the variance:
```{r prcomp-summary}
summary(prcomp(fit2$L))
```
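The "Cumulative Proportion" row of this summary can also be computed by hand from the standard deviations returned by `prcomp`. The sketch below uses simulated topic proportions (an assumption standing in for `fit2$L`), so the proportion it reports will not match the figure quoted above.

```r
set.seed(1)

# Simulated topic proportions; each row sums to 1.
L <- matrix(rgamma(200 * 6,shape = 0.5),200,6)
L <- L / rowSums(L)
pca <- prcomp(L)

# The variance of each PC is its squared standard deviation, so the
# proportion of variance explained is its share of the total.
pve <- pca$sdev^2 / sum(pca$sdev^2)

# Proportion of variance explained by the first two PCs together.
sum(pve[1:2])

# This agrees, up to rounding, with the value reported by summary.
summary(pca)$importance["Cumulative Proportion",2]
```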