/
sc_analysis.Rmd
478 lines (337 loc) · 12.7 KB
/
sc_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
---
title: "10X Single cell sequencing data analysis - Chinese"
output: rmarkdown::html_vignette
author:
- name: Yue You
- name: Luyi Tian
vignette: >
%\VignetteIndexEntry{10X Single cell sequencing data analysis - Chinese}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## Description
In this workshop (presented in Mandarin), you will learn how to analyse single-cell RNA-sequencing count data produced by the Chromium 10x platform using R/Bioconductor. This will include reading the data into R, pre-processing data, normalization, feature selection, dimensionality reduction and downstream analysis, such as clustering and cell type annotation.
Expectation: You will learn how to generate common plots for analysis and visualisation of single cell gene expfression data, such as diagnostic plots to assess the data quality as well as dimensionality reduction techniques such as principal components analysis and t-distributed stochastic neighbourhood embedding (t-SNE). The material we will be covering on single-cell RNA-sequencing analysis is a subset of the work of Amerzquita et al. (2020) Nature Methods,17:137–145 available at https://osca.bioconductor.org.
Pre-requisites: The course is aimed at PhD students, Master’s students, and third & fourth year undergraduate students. Some basic R knowledge is assumed - this is not an introduction to R course. If you are not familiar with the R statistical programming language it is compulsory that you work through an introductory R course before you attend this workshop.
### Participation
After the lecture, participants are expected to follow along the hands-on session. we highly recommend participants bringing your own laptop.
### _R_ / _Bioconductor_ packages used
The following R/Bioconductor packages will be explicitly used:
* DropletUtils
* scran
* scater
* singleR
### Time outline
| Activity | Time |
|----------------------------------|------|
| Introduction to scRNA-seq | 30m |
| Analysis workflow | 1.5h |
| Hands on session | 30m |
| Q & A | 30m |
## 数据准备
读取10xPBMC4k数据。
```{r}
#--- loading ---#
unzip(system.file("extdata", "pbmc4k.zip", package = "zhejiang2020"),exdir = ".")
library(DropletUtils)
fname <- "pbmc4k"
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
```
### 质量控制。
由于10x 技术会使得外部RNA,也就是不包含微滴的里的细胞被同时测序。在分析之前,我们需要确保得到的barcode数据都相对应活细胞。
这里,我们使用`emptyDrops`来筛选“真”细胞。
```{r}
#--- cell-detection ---#
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
```
对于得到的基因,根据gene ID 进行基因注释,得到基因名,并且知道其所在染色体(用于之后的细胞质量控制)。
```{r}
#--- gene-annotation ---#
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,
column="SEQNAME", keytype="GENEID")
```
```{r}
unfiltered <- sce.pbmc
```
这里,我们认为线粒体基因表达量高的barcode对应细胞质mRNA已经流出的破损细胞。并用这个指标来进行筛查过滤。
```{r}
#--- quality-control ---#
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]
```
```{r}
summary(high.mito)
```
```{r}
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- high.mito
gridExtra::grid.arrange(
plotColData(unfiltered, y="sum", colour_by="discard") +
scale_y_log10() + ggtitle("Total count"),
plotColData(unfiltered, y="detected", colour_by="discard") +
scale_y_log10() + ggtitle("Detected features"),
plotColData(unfiltered, y="subsets_Mito_percent",
colour_by="discard") + ggtitle("Mito percent"),
ncol=2
)
```
```{r}
plotColData(unfiltered, x="sum", y="subsets_Mito_percent",
colour_by="discard") + scale_x_log10()
```
### 数据标准化
由于在mRNA分子的捕获,逆转录,测序过程中存在一定的技术误差,相同细胞的计数深度可能不相同。我们需要在进行细胞之间的比对之前,先尽可能的排除采样效应。进行数据标准化,使得细胞之间存在可比性。这里我们使用`scran`(Lun et al,2016a),一种允许更多细胞异质性的方法。
```{r}
#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)
```
```{r}
summary(sizeFactors(sce.pbmc))
```
### 特征选择
为了减轻下游分析工具的计算负担,减少数据中的噪声,我们需要先进行特征选择。
```{r}
#--- variance-modelling ---#
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)
```
```{r}
plot(dec.pbmc$mean, dec.pbmc$total, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(dec.pbmc)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
```
特征选择后,单细胞表达矩阵的维数可以通过专门的降维算法进一步降低。使得数据可以直观可视化,并且将数据简化为基本组成部分。
```{r}
#--- dimensionality-reduction ---#
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)
set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")
```
```{r}
ncol(reducedDim(sce.pbmc, "PCA"))
```
## 聚类分析
### 使用`louvain`算法进行聚类
`louvain`算法是一个基于图的聚类算法,我们首先构建Shared Nearest Neighbor(SNN)图,输入为使用主成分分析(PCA)降维后得到的矩阵。
在得到SNN图`g`之后,我们使用它作为`louvain`算法的输入。
```{r}
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_louvain(g)$membership
sce.pbmc$cluster <- factor(clust)
table(clust)
```
使用tSNE对聚类进行可视化。
```{r}
plotTSNE(sce.pbmc, colour_by="cluster",text_by="cluster")
```
使用UMAP对聚类进行可视化。
```{r}
plotUMAP(sce.pbmc, colour_by="cluster",text_by="cluster")
```
### 寻找聚类特异表达基因
在获得了聚类之后,我们想要知道每个聚类都哪些高表达的基因,这些基因往往是能区分不同细胞类型的marker gene,也可以帮助我们了解不同聚类的生物学功能。
```{r}
library(RColorBrewer)
```
这里,我们通过比对某一cluster和其他所有cluster的表达来找到marker gene, 并且只挑选显示上调基因。
```{r}
markers.pbmc <- findMarkers(sce.pbmc, sce.pbmc$cluster,
pval.type="all", direction="up")
```
```{r}
markers.pbmc[[1]][order(markers.pbmc[[1]]$FDR),]
```
查看marker gene。
```{r}
# NOTE: this is not efficient for large iteration.
marker_genes = c()
for (i in 1:length(markers.pbmc)){
tmp = markers.pbmc[[i]][order(markers.pbmc[[i]]$FDR),]
marker_genes = c(marker_genes, rownames(tmp)[1:5])
}
marker_genes = unique(marker_genes)
marker_genes
```
查看特定基因在不同cluster的表达量。
```{r}
plotExpression(sce.pbmc, x="cluster",features="MS4A1")
```
```{r}
plotExpression(sce.pbmc, x="cluster",features="GZMK")
```
使用热图对marker gene进行可视化。
```{r}
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
anno_df = as.data.frame(colData(sce.pbmc))
anno_df = anno_df[,"cluster", drop=FALSE]
col_cluster = getPalette(nlevels(anno_df$cluster))
names(col_cluster) = levels(anno_df$cluster)
annotation_colors = list(cluster=col_cluster)
tmp_expr = logcounts(sce.pbmc)[marker_genes,]
tmp_expr = t(scale(t(tmp_expr)))
tmp_expr[tmp_expr<(-2.5)]=-2.5
tmp_expr[tmp_expr>2.5]=2.5
colnames(tmp_expr) = colnames(sce.pbmc)
pheatmap::pheatmap(tmp_expr[,order(sce.pbmc$cluster)],
cluster_cols = FALSE,
cluster_rows = FALSE,
annotation_col = anno_df,
annotation_colors=annotation_colors,
show_colnames = FALSE,
fontsize_row=6)
```
## 使用`singleR`进行细胞类型注释
```{r}
library(SingleR)
```
我们选择 BlueprintEncodeData() 来作为参考数据。(Martens and Stunnenberg 2013; The ENCODE Project Consortium 2012)
```{r}
ref <- BlueprintEncodeData()
pred <- SingleR(test=sce.pbmc, ref=ref, labels=ref$label.main)
table(pred$labels)
```
```{r}
plotScoreHeatmap(pred)
```
利用每个cluster和对每个细胞计算得到的可能细胞类型画heatmap。
```{r}
tab <- table(Assigned=pred$pruned.labels, Cluster=sce.pbmc$cluster)
# Adding a pseudo-count of 10 to avoid strong color jumps with just 1 cell.
library(pheatmap)
pheatmap(log2(tab+10), color=colorRampPalette(c("white", "blue"))(101))
```
# 数据整合
这里我们使用`MNN`算法整合多个数据并消除批次效应(batch effects)
首先对两个PBMC单细胞数据集进行预处理
```{r}
#--- loading ---#
library(TENxPBMCData)
pbmc4k <- TENxPBMCData('pbmc4k')
#--- quality-control ---#
is.mito <- grep("MT", rowData(pbmc4k)$Symbol_TENx)
library(scater)
stats <- perCellQCMetrics(pbmc4k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
pbmc4k <- pbmc4k[,!high.mito]
#--- normalization ---#
pbmc4k <- logNormCounts(pbmc4k)
#--- variance-modelling ---#
library(scran)
dec4k <- modelGeneVar(pbmc4k)
chosen.hvgs <- getTopHVGs(dec4k, prop=0.1)
#--- dimensionality-reduction ---#
set.seed(10000)
pbmc4k <- runPCA(pbmc4k, subset_row=chosen.hvgs, ncomponents=25,
BSPARAM=BiocSingular::RandomParam())
set.seed(100000)
pbmc4k <- runTSNE(pbmc4k, dimred="PCA")
set.seed(1000000)
pbmc4k <- runUMAP(pbmc4k, dimred="PCA")
#--- clustering ---#
g <- buildSNNGraph(pbmc4k, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
pbmc4k$cluster <- factor(clust)
```
```{r}
#--- loading ---#
library(TENxPBMCData)
pbmc3k <- TENxPBMCData('pbmc3k')
#--- quality-control ---#
is.mito <- grep("MT", rowData(pbmc3k)$Symbol_TENx)
library(scater)
stats <- perCellQCMetrics(pbmc3k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
pbmc3k <- pbmc3k[,!high.mito]
#--- normalization ---#
pbmc3k <- logNormCounts(pbmc3k)
#--- variance-modelling ---#
library(scran)
dec3k <- modelGeneVar(pbmc3k)
chosen.hvgs <- getTopHVGs(dec3k, prop=0.1)
#--- dimensionality-reduction ---#
set.seed(10000)
pbmc3k <- runPCA(pbmc3k, subset_row=chosen.hvgs, ncomponents=25,
BSPARAM=BiocSingular::RandomParam())
set.seed(100000)
pbmc3k <- runTSNE(pbmc3k, dimred="PCA")
set.seed(1000000)
pbmc3k <- runUMAP(pbmc3k, dimred="PCA")
#--- clustering ---#
g <- buildSNNGraph(pbmc3k, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_louvain(g)$membership
pbmc3k$cluster <- factor(clust)
```
由于两个数据集所表达的基因不尽相同,我们取两个数据集表达基因的交集并且过滤掉不同的基因,以保证两个数据集的每一行都是一样的基因。
```{r}
universe <- intersect(rownames(pbmc3k), rownames(pbmc4k))
length(universe)
```
```{r}
# Subsetting the SingleCellExperiment object.
pbmc3k <- pbmc3k[universe,]
pbmc4k <- pbmc4k[universe,]
# Also subsetting the variance modelling results, for convenience.
dec3k <- dec3k[universe,]
dec4k <- dec4k[universe,]
```
```{r}
combined.dec <- combineVar(dec3k, dec4k)
chosen.hvgs <- combined.dec$bio > 0
sum(chosen.hvgs)
```
```{r}
library(batchelor)
rescaled <- multiBatchNorm(pbmc3k, pbmc4k)
pbmc3k <- rescaled[[1]]
pbmc4k <- rescaled[[2]]
```
```{r}
# Synchronizing the metadata for cbind()ing.
rowData(pbmc3k) <- rowData(pbmc4k)
colData(pbmc3k) = colData(pbmc3k)[,colnames(colData(pbmc4k))]
pbmc3k$batch <- "3k"
pbmc4k$batch <- "4k"
uncorrected <- cbind(pbmc3k,pbmc4k)
# Using RandomParam() as it is more efficient for file-backed matrices.
library(scater)
uncorrected <- runPCA(uncorrected, subset_row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam())
```
```{r}
uncorrected <- runTSNE(uncorrected, dimred="PCA")
plotTSNE(uncorrected, colour_by="batch")
```
```{r}
# Using randomized SVD here, as this is faster than
# irlba for file-backed matrices.
mnn.out <- fastMNN(pbmc3k, pbmc4k, d=50, k=20, subset.row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out
```
```{r}
library(scater)
mnn.out <- runTSNE(mnn.out, dimred="corrected")
mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")
```
```{r}
ggplot()+
geom_point(alpha=0.5,size=0.5,aes(x=reducedDim(mnn.out,"TSNE")[,1],y=reducedDim(mnn.out,"TSNE")[,2],col=mnn.out$batch))+
theme_bw()
```