/
single_cell_rnaseq_practical.Rmd
231 lines (182 loc) · 7.79 KB
/
single_cell_rnaseq_practical.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
---
title: "Analysis of single-cell RNA-seq data using a topic model, Part 2: practical implementation"
author: Peter Carbonetto
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
toc_float: no
highlight: textmate
theme: readable
vignette: >
%\VignetteIndexEntry{Analysis of single-cell RNA-seq data using a topic model, Part 2: practical implementation}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
This vignette continues from [Part 1][vignette-part-1], where we
introduced the basic steps of a topic modeling analysis for
single-cell data. Here we give more detailed guidance and discuss
complications that may arise.
```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
fig.align = "center",dpi = 120)
```
We begin our analysis by loading the packages. Then we set the seed so
that the results can be reproduced.
```{r load-pkgs, message=FALSE, warning=FALSE}
library(Matrix)
library(fastTopics)
library(ggplot2)
library(cowplot)
set.seed(1)
```
Now we load the data set and the $K = 6$ pre-fitted topic model.
```{r load-data}
data(pbmc_facs)
counts <- pbmc_facs$counts
fit <- pbmc_facs$fit
```
Assessing convergence of model fitting
--------------------------------------
The topic model was fitted using the `fit_topic_model` function. This
function, as we mentioned, hides most of the complexities of model
fitting. Nevertheless, it is a good idea to check that the model
fitting has converged to a maximum-likelihood solution. This is
easily done with 'fastTopics':
```{r plot-loglik, fig.height=2, fig.width=4}
plot_progress(fit,x = "iter",add.point.every = 10,colors = "black") +
theme_cowplot(font_size = 10)
```
Internally, `fit_topic_model` first fits a non-negative matrix
factorization (NMF) to the count data, then converts the fitted NMF to
a topic model, so the plot shows the improvement in the model fit over
time. Judging by this plot, the parameter estimates get close to a
maxium-likelihood solution after about 150 updates.
For larger or more complex data sets, more updates may be needed to
obtain a good fit. For guidance, see `help(fit_topic_model)` and
`help(fit_poisson_nmf)`.
Evaluating fit: single-cell likelihoods
---------------------------------------
The topic model can be used to calculate *a likelihood for each cell:*
```{r loglik-1}
loglik <- loglik_multinom_topic_model(counts,fit)
```
This can be used to assess how well the topic model "fits" each cell.
```{r loglik-2, fig.height=2, fig.width=4.5}
pdat <- data.frame(loglik)
ggplot(pdat,aes(loglik)) +
geom_histogram(bins = 64,color = "white",fill = "black",size = 0.25) +
labs(y = "number of cells") +
theme_cowplot(font_size = 10)
```
Most of the poorly fit cells are in the CD34+ subpopulation:
```{r loglik-3, fig.height=2, fig.width=5.5}
subpop_colors <- c("dodgerblue","forestgreen","darkmagenta","skyblue","gold")
pdat <- data.frame(loglik = loglik,subpop = pbmc_facs$samples$subpop)
ggplot(pdat,aes(x = loglik,fill = subpop)) +
geom_histogram(bins = 64,color = "white",size = 0.25) +
scale_fill_manual(values = subpop_colors) +
labs(y = "number of cells") +
theme_cowplot(font_size = 10)
```
Visualizing the topics without cell labels
------------------------------------------
In [Part 1][single_cell_rnaseq_practical], we made use of additional
information about the cells to help us interpret the topics. Even
without the cell labels, the Structure plot can still be effective:
```{r structure-plot-without-labels, fig.width=6.5, fig.height=1.5, results="hide", message=FALSE}
topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue",
"gold","darkorange")
structure_plot(fit,colors = topic_colors)
```
From this Structure plot, the topics clearly distinguish the B cells
(dark blue), CD14+ monocytes (green) and CD34+ cells (purple). The NK
cells (light blue) and T cells (cells high proportions of yellow and
orange) are harder to distinguish, and this is reflected in sharing of
topics (mostly topic 6) among NK and T cells.
Of course, without the cell labels, we cannot know that the topics
correspond to these cell types without further downstream
analyses---for this, we performed a GoM DE analysis (see
[Part 1][vignette-part-1]).
Identifying clusters from the topic proportions
-----------------------------------------------
In more complex data sets, some structure may not show up well without
additional fine-tuning of the plot.
One simple strategy that often works well is to first identify
clusters in the topic proportions matrix, $L$. Here we use $k$-means
to identify clusters, but other methods could be used.
```{r clustering-1}
set.seed(1)
pca <- prcomp(fit$L)$x
clusters <- kmeans(pca,centers = 6,iter.max = 100)$cluster
summary(factor(clusters))
```
We chose 6 clusters, but to be clear the best number of clusters may
not be the same as the number of topics.
Now we create another Structure plot in which the cells are grouped
according to the clusters:
```{r structure-plot-by-cluster-1, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE}
structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25,
grouping = clusters) +
theme(axis.text.x = element_blank())
```
k-means somewhat arbitrarily split the T cells (cells with high
proportions of topics 5 and 6) into clusters 1 and 3. Therefore, we
merge these two clusters:
```{r clustering-2}
clusters[clusters == 3] <- 1
clusters <- factor(clusters)
```
We have found that visual inspection of the clusters followed by
manual refinement is often be needed to get the "right" clustering.
Now we can label the clusters by these cell types in the plot.
(Noting that this labeling is usually not possible until downstream
analyses have been performed.)
```{r structure-plot-by-cluster-2, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE}
levels(clusters) <- c("T","CD14+","B","CD34+","NK")
structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25,
grouping = clusters)
```
It is also sometimes helpful to visualize the structure in PCA plots
which show the projection of the cells onto PCs of the topic
proportions:
```{r pca-plot-1, fig.width=3.5, fig.height=2.5, message=FALSE, warning=FALSE}
cluster_colors <- c("gold","forestgreen","dodgerblue","darkmagenta","skyblue")
pca_plot(fit,fill = clusters) +
scale_fill_manual(values = cluster_colors)
```
When there are many overlapping points, plotting the *density* can
sometimes show the clusters more clearly:
```{r pca-plot-2, fig.width=3.75, fig.height=2.5, message=FALSE}
pca_hexbin_plot(fit,bins = 24)
```
More on differential expression analysis
----------------------------------------
In [Part 1][vignette-part-1], we performed a DE analysis using the
topic model.
```{r diff-count-analysis}
de <- pbmc_facs$de
```
When the volcano plot shows many overlapping differentially expressed
genes, it can be helpful to explore the results interactively. The
function `volcano_plotly` creates an *interactive volcano plot* that can
be viewed in a Web browser. For example, here is an interactive
volcano plot for the T cells topic:
```{r volcano-plotly, warning=FALSE}
genes <- pbmc_facs$genes
p <- volcano_plotly(de,k = 6,file = "volcano_plot_t_cells.html",
labels = genes$symbol,ymax = 100)
```
This call creates a new file `volcano_plot_t_cells.html`. The
interactive volcano plot can also be viewed [here][volcano-plotly], or
by calling `print(p)` in R.
Session info
------------
This is the version of R and the packages that were used to generate
these results.
```{r session-info}
sessionInfo()
```
[vignette-part-1]: https://stephenslab.github.io/fastTopics/articles/single_cell_rnaseq_basic.html
[volcano-plotly]: https://stephenslab.github.io/fastTopics/articles/volcano_plot_t_cells.html
[zheng-2017]: https://doi.org/10.1038/ncomms14049