-
Notifications
You must be signed in to change notification settings - Fork 86
/
pbmc_vignette.Rmd
860 lines (680 loc) · 34.9 KB
/
pbmc_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
---
title: "Analyzing PBMC scATAC-seq"
output: html_document
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
---
```{r init, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "../vignette_data/pbmc_vignette")
```
```{r, include=FALSE}
options(width = 600)
local({
hook_output <- knitr::knit_hooks$get('output')
knitr::knit_hooks$set(output = function(x, options) {
if (!is.null(options$max.height))options$attr.output <- c(
options$attr.output,
sprintf('style="max-height: %s;"', options$max.height)
)
hook_output(x, options)
})
})
```
For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of human
peripheral blood mononuclear cells (PBMCs) provided by 10x Genomics. The
following files are used in this vignette, all available through the 10x
Genomics website:
* The [Raw data](https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.h5)
* The [Metadata](https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_singlecell.csv)
* The [fragments file](https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz)
* The fragments file [index](https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz.tbi)
<details>
<summary>**View data download code**</summary>
To download all the required files, you can run the following lines in a shell:
```{sh eval=FALSE}
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz.tbi
```
</details>
First load in Signac, Seurat, and some other packages we will be using for
analyzing human data.
```{r setup, message=FALSE}
library(Signac)
library(Seurat)
library(GenomicRanges)
library(ggplot2)
library(patchwork)
```
## Pre-processing workflow
When pre-processing chromatin data, Signac uses information from two related
input files, both of which can be created using CellRanger:
* **Peak/Cell matrix**. This is analogous to the gene expression count matrix used to analyze single-cell RNA-seq. However, instead of genes, each row of the matrix represents a region of the genome (a peak), that is predicted to represent a region of open chromatin. Each value in the matrix represents the number of Tn5 integration sites for each single barcode (i.e. a cell) that map within each peak. You can find more detail on the [10X Website](https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/output/matrices).
* **Fragment file**. This represents a full list of all unique fragments across all single cells. It is a substantially larger file, is slower to work with, and is stored on-disk (instead of in memory). However, the advantage of retaining this file is that it contains all fragments associated with each single cell, as opposed to only fragments that map to peaks. More information about the fragment file can be found on the 10x Genomics [website](https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/output/fragments) or on the [sinto website](https://timoast.github.io/sinto/basic_usage.html#create-scatac-seq-fragments-file).
We start by creating a Seurat object using the peak/cell matrix and cell
metadata generated by `cellranger-atac`, and store the path to the fragment
file on disk in the Seurat object:
```{r warning=FALSE, message=FALSE}
counts <- Read10X_h5(filename = "10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
file = "10k_pbmc_ATACv2_nextgem_Chromium_Controller_singlecell.csv",
header = TRUE,
row.names = 1
)
chrom_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
fragments = "10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz",
min.cells = 10,
min.features = 200
)
pbmc <- CreateSeuratObject(
counts = chrom_assay,
assay = "peaks",
meta.data = metadata
)
```
```{r}
pbmc
```
<details>
<summary>**What if I don't have an H5 file?**</summary>
If you do not have the `.h5` file, you can still run an analysis. You may have
data that is formatted as three files, a counts file (`.mtx`), a cell barcodes
file, and a peaks file. If this is the case you can load the data using the
`Matrix::readMM()` function:
```{r eval=FALSE}
counts <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx")
barcodes <- readLines("filtered_peak_bc_matrix/barcodes.tsv")
peaks <- read.table("filtered_peak_bc_matrix/peaks.bed", sep="\t")
peaknames <- paste(peaks$V1, peaks$V2, peaks$V3, sep="-")
colnames(counts) <- barcodes
rownames(counts) <- peaknames
```
Alternatively, you might only have a fragment file. In this case you can create
a count matrix using the `FeatureMatrix()` function:
```{r eval=FALSE}
fragpath <- '10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz'
# Define cells
# If you already have a list of cell barcodes to use you can skip this step
total_counts <- CountFragments(fragpath)
cutoff <- 1000 # Change this number depending on your dataset!
barcodes <- total_counts[total_counts$frequency_count > cutoff, ]$CB
# Create a fragment object
frags <- CreateFragmentObject(path = fragpath, cells = barcodes)
# First call peaks on the dataset
# If you already have a set of peaks you can skip this step
peaks <- CallPeaks(frags)
# Quantify fragments in each peak
counts <- FeatureMatrix(fragments = frags, features = peaks, cells = barcodes)
```
</details>
The ATAC-seq data is stored using a custom assay, the `ChromatinAssay`. This
enables some specialized functions for analysing genomic single-cell assays such
as scATAC-seq. By printing the assay we can see some of the additional
information that can be contained in the `ChromatinAssay`, including motif
information, gene annotations, and genome information.
```{r}
pbmc[['peaks']]
```
For example, we can call `granges` on a Seurat object with a `ChromatinAssay`
set as the active assay (or on a `ChromatinAssay`) to see the genomic ranges
associated with each feature in the object. See the
[object interaction vignette](data_structures.html) for more information
about the `ChromatinAssay` class.
```{r}
granges(pbmc)
```
We then remove the features that correspond to chromosome scaffolds e.g. (KI270713.1)
or other sequences instead of the (22+2) standard chromosomes.
```{r}
peaks.keep <- seqnames(granges(pbmc)) %in% standardChromosomes(granges(pbmc))
pbmc <- pbmc[as.vector(peaks.keep), ]
```
We can also add gene annotations to the `pbmc` object for the human genome.
This will allow downstream functions to pull the gene annotation information
directly from the object.
Multiple patches are released for each genome assembly. When dealing with mapped
data (such as the 10x Genomics files we will be using), it is advisable to use
the annotations from the same assembly patch that was used to perform the mapping.
From the [dataset summary](https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_web_summary.html),
we can see that the reference package 10x Genomics used to perform
the mapping was “GRCh38-2020-A”, which [corresponds to](https://www.10xgenomics.com/support/software/cell-ranger/downloads/cr-ref-build-steps#ref-2020-a) the Ensembl v98 patch release.
For more information on the various Ensembl releases, you can refer to [this site](https://www.ensembl.org/info/website/archives/assembly.html).
```{r warning=FALSE, message=FALSE}
library(AnnotationHub)
ah <- AnnotationHub()
# Search for the Ensembl 98 EnsDb for Homo sapiens on AnnotationHub
query(ah, "EnsDb.Hsapiens.v98")
```
```{r message=FALSE, warning=FALSE}
ensdb_v98 <- ah[["AH75011"]]
```
```{r annotations, message=FALSE, warning=FALSE, cache=FALSE}
# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = ensdb_v98)
# change to UCSC style since the data was mapped to hg38
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg38"
```
```{r}
# add the gene information to the object
Annotation(pbmc) <- annotations
```
## Computing QC Metrics
We can now compute some QC metrics for the scATAC-seq experiment. We currently
suggest the following metrics below to assess data quality. As with scRNA-seq,
the expected range of values for these parameters will vary depending on your
biological system, cell viability, and other factors.
* Nucleosome banding pattern: The histogram of DNA fragment sizes (determined from the paired-end sequencing reads) should exhibit a strong nucleosome banding pattern corresponding to the length of DNA wrapped around a single nucleosome. We calculate this per single cell, and quantify the approximate ratio of mononucleosomal to nucleosome-free fragments (stored as `nucleosome_signal`)
* Transcriptional start site (TSS) enrichment score: The [ENCODE project](https://www.encodeproject.org/) has defined an ATAC-seq targeting score based on the ratio of fragments centered at the TSS to fragments in TSS-flanking regions (see https://www.encodeproject.org/data-standards/terms/). Poor ATAC-seq experiments typically will have a low TSS enrichment score. We can compute this metric for each cell with the `TSSEnrichment()` function, and the results are stored in metadata under the column name `TSS.enrichment`.
* Total number of fragments in peaks: A measure of cellular sequencing depth / complexity. Cells with very few reads may need to be excluded due to low sequencing depth. Cells with extremely high levels may represent doublets, nuclei clumps, or other artefacts.
* Fraction of fragments in peaks: Represents the fraction of all fragments that fall within ATAC-seq peaks. Cells with low values (i.e. <15-20%) often represent low-quality cells or technical artifacts that should be removed. Note that this value can be sensitive to the set of peaks used.
* Ratio reads in genomic blacklist regions: The [ENCODE project](https://www.encodeproject.org/) has provided a list of [blacklist regions](https://github.com/Boyle-Lab/Blacklist), representing reads which are often associated with artefactual signal. Cells with a high proportion of reads mapping to these areas (compared to reads mapping to peaks) often represent technical artifacts and should be removed. ENCODE blacklist regions for human (hg19 and hg38), mouse (mm9 and mm10), Drosophila (dm3 and dm6), and C. elegans (ce10 and ce11) are included in the Signac package. The `FractionCountsInRegion()` function can be used to calculate the fraction of all counts within a given set of regions per cell. We can use this function and the blacklist regions to find the fraction of blacklist counts per cell.
Note that the last three metrics can be obtained from the output of CellRanger
(which is stored in the object metadata), but can also be calculated for non-10x
datasets using Signac (more information at the end of this document).
```{r message=FALSE, warning=FALSE}
# compute nucleosome signal score per cell
pbmc <- NucleosomeSignal(object = pbmc)
# compute TSS enrichment score per cell
pbmc <- TSSEnrichment(object = pbmc)
# add fraction of reads in peaks
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100
# add blacklist ratio
pbmc$blacklist_ratio <- FractionCountsInRegion(
object = pbmc,
assay = 'peaks',
regions = blacklist_hg38_unified
)
```
The relationship between variables stored in the object metadata can
be visualized using the `DensityScatter()` function. This can also be used
to quickly find suitable cutoff values for different QC metrics by setting
`quantiles=TRUE`:
```{r}
DensityScatter(pbmc, x = 'nCount_peaks', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
```
We can also look at the fragment length periodicity for all the cells, and group
by cells with high or low nucleosomal signal strength. You can see that cells
that are outliers for the mononucleosomal / nucleosome-free ratio (based on the
plots above) have different nucleosomal banding patterns. The remaining cells
exhibit a pattern that is typical for a successful ATAC-seq experiment.
```{r message=FALSE, warning=FALSE}
pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = pbmc, group.by = 'nucleosome_group')
```
We can plot the distribution of each QC metric separately using a violin plot:
```{r message=FALSE, warning=FALSE, fig.width=18, fig.height=6}
VlnPlot(
object = pbmc,
features = c('nCount_peaks', 'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal', 'pct_reads_in_peaks'),
pt.size = 0.1,
ncol = 5
)
```
Finally we remove cells that are outliers for these QC metrics. The exact QC
thresholds used will need to be adjusted according to your dataset.
```{r}
pbmc <- subset(
x = pbmc,
subset = nCount_peaks > 9000 &
nCount_peaks < 100000 &
pct_reads_in_peaks > 40 &
blacklist_ratio < 0.01 &
nucleosome_signal < 4 &
TSS.enrichment > 4
)
pbmc
```
## Normalization and linear dimensional reduction
* Normalization: Signac performs term frequency-inverse document frequency (TF-IDF) normalization. This is a two-step normalization procedure, that both normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to more rare peaks.
* Feature selection: The low dynamic range of scATAC-seq data makes it challenging to perform variable feature selection, as we do for scRNA-seq. Instead, we can choose to use only the top _n_% of features (peaks) for dimensional reduction, or remove features present in less than _n_ cells with the `FindTopFeatures()` function. Here we will use all features, though we have seen very similar results when using only a subset of features (try setting min.cutoff to 'q75' to use the top 25% all peaks), with faster runtimes. Features used for dimensional reduction are automatically set as `VariableFeatures()` for the Seurat object by this function.
* Dimension reduction: We next run singular value decomposition (SVD) on the TD-IDF matrix, using the features (peaks) selected above. This returns a reduced dimension representation of the object (for users who are more familiar with scRNA-seq, you can think of this as analogous to the output of PCA).
The combined steps of TF-IDF followed by SVD are known as latent semantic
indexing (LSI), and were first introduced for the analysis of scATAC-seq data by
[Cusanovich et al. 2015](https://www.science.org/doi/10.1126/science.aab1601).
```{r message=FALSE, warning=FALSE}
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
```
The first LSI component often captures sequencing depth (technical variation)
rather than biological variation. If this is the case, the component should be
removed from downstream analysis. We can assess the correlation between each LSI
component and sequencing depth using the `DepthCor()` function:
```{r}
DepthCor(pbmc)
```
Here we see there is a very strong correlation between the first LSI component
and the total number of counts for the cell. We will perform downstream steps
without this component as we don't want to group cells together based on their
total sequencing depth, but rather by their patterns of accessibility at
cell-type-specific peaks.
## Non-linear dimension reduction and clustering
Now that the cells are embedded in a low-dimensional space we can use methods
commonly applied for the analysis of scRNA-seq data to perform graph-based
clustering and non-linear dimension reduction for visualization. The functions
`RunUMAP()`, `FindNeighbors()`, and `FindClusters()` all come from the Seurat
package.
```{r message=FALSE, warning=FALSE}
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
```
## Create a gene activity matrix
The UMAP visualization reveals the presence of multiple cell groups in human
blood. If you are familiar with
[scRNA-seq analyses of PBMC](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html),
you may recognize the presence of certain myeloid and lymphoid populations
in the scATAC-seq data. However, annotating and interpreting clusters is more
challenging in scATAC-seq data as much less is known about the functional roles
of noncoding genomic regions than is known about protein coding regions (genes).
We can try to quantify the activity of each gene in the genome by
assessing the chromatin accessibility associated with the gene, and create a
new gene activity assay derived from the scATAC-seq data. Here we will use a
simple approach of summing the fragments intersecting the gene body and promoter
region (we also recommend exploring the
[Cicero](https://cole-trapnell-lab.github.io/cicero-release/) tool, which can
accomplish a similar goal, and we provide a vignette showing how to run Cicero
within a Signac workflow [here](cicero.html)).
To create a gene activity matrix, we extract gene coordinates and extend them
to include the 2 kb upstream region (as promoter accessibility is often
correlated with gene expression). We then count the number of fragments for each
cell that map to each of these regions, using the using the `FeatureMatrix()`
function. These steps are automatically performed by the `GeneActivity()`
function:
```{r message=FALSE, warning=FALSE}
gene.activities <- GeneActivity(pbmc)
```
```{r message=FALSE, warning=FALSE}
# add the gene activity matrix to the Seurat object as a new assay and normalize it
pbmc[['RNA']] <- CreateAssayObject(counts = gene.activities)
pbmc <- NormalizeData(
object = pbmc,
assay = 'RNA',
normalization.method = 'LogNormalize',
scale.factor = median(pbmc$nCount_RNA)
)
```
Now we can visualize the activities of canonical marker genes to help interpret
our ATAC-seq clusters. Note that the activities will be much noisier than
scRNA-seq measurements. This is because they represent measurements from sparse
chromatin data, and because they assume a general correspondence between gene
body/promoter accessibility and gene expression which may not always be the
case. Nonetheless, we can begin to discern populations of monocytes, B, T, and
NK cells based on these gene activity profiles. However, further subdivision of
these cell types is challenging based on supervised analysis alone.
```{r fig.width=12, fig.height=10}
DefaultAssay(pbmc) <- 'RNA'
FeaturePlot(
object = pbmc,
features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3
)
```
## Integrating with scRNA-seq data
To help interpret the scATAC-seq data, we can classify cells based on an
scRNA-seq experiment from the same biological system (human PBMC). We utilize
methods for cross-modality integration and label transfer, described
[here](https://doi.org/10.1016/j.cell.2019.05.031), with a more in-depth
tutorial [here](https://satijalab.org/seurat/v3.0/atacseq_integration_vignette.html).
We aim to identify shared correlation patterns in the gene activity matrix and
scRNA-seq dataset to identify matched biological states across the two
modalities. This procedure returns a classification score for each cell for each
scRNA-seq-defined cluster label.
Here we load a pre-processed scRNA-seq dataset for human PBMCs, also provided by
10x Genomics. You can download the raw data for this experiment from the 10x
[website](https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3),
and view the code used to construct this object on
[GitHub](https://github.com/satijalab/Integration2019/blob/master/preprocessing_scripts/pbmc_10k_v3.R).
Alternatively, you can download the pre-processed Seurat object
[here](https://signac-objects.s3.amazonaws.com/pbmc_10k_v3.rds).
```{r warning=FALSE, message=FALSE}
# Load the pre-processed scRNA-seq data for PBMCs
pbmc_rna <- readRDS("pbmc_10k_v3.rds")
pbmc_rna <- UpdateSeuratObject(pbmc_rna)
```
```{r}
transfer.anchors <- FindTransferAnchors(
reference = pbmc_rna,
query = pbmc,
reduction = 'cca'
)
predicted.labels <- TransferData(
anchorset = transfer.anchors,
refdata = pbmc_rna$celltype,
weight.reduction = pbmc[['lsi']],
dims = 2:30
)
pbmc <- AddMetaData(object = pbmc, metadata = predicted.labels)
```
```{r fig.width=12}
plot1 <- DimPlot(
object = pbmc_rna,
group.by = 'celltype',
label = TRUE,
repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')
plot2 <- DimPlot(
object = pbmc,
group.by = 'predicted.id',
label = TRUE,
repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')
plot1 + plot2
```
You can see that the scRNA-based classifications are consistent with
the UMAP visualization that was computed using the scATAC-seq data. Notice, however,
that a small population of cells are predicted to be platelets in the scATAC-seq
dataset. This is unexpected as platelets are not nucleated and should not be detected
by scATAC-seq. It is possible that the cells predicted to be platelets could
instead be the platelet-precursors megakaryocytes, which largely reside in the
bone marrow but are rarely found in the peripheral blood of healthy patients,
such as the individual these PBMCs were drawn from. Given the already extreme
rarity of megakaryocytes within normal bone marrow (< 0.1%), this scenario seems unlikely.
<details>
<summary>**A closer look at the "platelets"**</summary>
Plotting the prediction score for the cells assigned to each label reveals that
the “platelet” cells received relatively low scores (< 0.8), indicating a low confidence
in the assigned cell identity. In most cases, the next most likely cell identity
predicted for these cells was “CD4 naive”.
```{r fig.width=12}
VlnPlot(pbmc, 'prediction.score.max', group.by = 'predicted.id')
```
```{r max.height='300px'}
# Identify the metadata columns that start with "prediction.score."
metadata_attributes <- colnames(pbmc[[]])
prediction_score_attributes <- grep("^prediction.score.", metadata_attributes, value = TRUE)
prediction_score_attributes <- setdiff(prediction_score_attributes, "prediction.score.max")
# Extract the prediction score attributes for these cells
predicted_platelets <- which(pbmc$predicted.id == "Platelet")
platelet_scores <- pbmc[[]][predicted_platelets, prediction_score_attributes]
# Order the columns by their average values in descending order
ordered_columns <- names(sort(colMeans(platelet_scores, na.rm = TRUE), decreasing = TRUE))
ordered_platelet_scores_df <- platelet_scores[, ordered_columns]
print(ordered_platelet_scores_df)
```
</details>
As there are only a very small number of cells classified as “platelets” (< 20),
it is difficult to figure out their precise cellular identity. Larger datasets
would be required to confidently identify specific peaks for this population of
cells, and further analysis performed to correctly annotate them.
For downstream analysis we will thus remove the extremely rare cell states that
were predicted, retaining only cell annotations with >20 cells total.
```{r}
predicted_id_counts <- table(pbmc$predicted.id)
# Identify the predicted.id values that have more than 20 cells
major_predicted_ids <- names(predicted_id_counts[predicted_id_counts > 20])
pbmc <- pbmc[, pbmc$predicted.id %in% major_predicted_ids]
```
For downstream analyses, we can simply reassign the identities of each cell
from their UMAP cluster index to the per-cell predicted labels. It is also
possible to consider merging the cluster indexes and predicted labels.
```{r}
# change cell identities to the per-cell predicted labels
Idents(pbmc) <- pbmc$predicted.id
```
<details>
<summary>**To combine clustering and label transfer results**</summary>
Alternatively, in instances where we wish to combine our scATAC-seq clustering
and label transfer results, we can reassign the identities of all cells within
each UMAP cluster to the most common predicted label for that cluster.
```{r eval=FALSE}
# replace each cluster label with its most likely predicted label
for(i in levels(pbmc)) {
cells_to_reid <- WhichCells(pbmc, idents = i)
newid <- names(which.max(table(pbmc$predicted.id[cells_to_reid])))
Idents(pbmc, cells = cells_to_reid) <- newid
}
```
</details>
## Find differentially accessible peaks between cell types
To find differentially accessible regions between clusters of cells, we can
perform a differential accessibility (DA) test. A simple approach is to perform
a Wilcoxon rank sum test, and the [presto](https://github.com/immunogenomics/presto)
package has implemented an extremely fast Wilcoxon test that can be run on a
Seurat object.
<details>
<summary>**Install presto**</summary>
```{r eval=FALSE}
if (!requireNamespace("remotes", quietly = TRUE))
install.packages('remotes')
remotes::install_github('immunogenomics/presto')
```
</details>
Here we will focus on comparing Naive CD4 cells and CD14 monocytes, but any groups of
cells can be compared using these methods. We can also visualize these marker
peaks on a violin plot, feature plot, dot plot, heat map, or any
[visualization tool in Seurat](https://satijalab.org/seurat/articles/visualization_vignette.html).
```{r message=TRUE, warning=FALSE}
# change back to working with peaks instead of gene activities
DefaultAssay(pbmc) <- 'peaks'
# wilcox is the default option for test.use
da_peaks <- FindMarkers(
object = pbmc,
ident.1 = "CD4 Naive",
ident.2 = "CD14+ Monocytes",
test.use = 'wilcox',
min.pct = 0.1
)
head(da_peaks)
```
<details>
<summary>**Logistic regression with total fragment number as the latent variable**</summary>
Another approach for differential testing is to utilize logistic regression
for, as suggested by
[Ntranos et al. 2018](https://www.biorxiv.org/content/10.1101/258566v2)
for scRNA-seq data, and add the total number of fragments as a latent variable
to mitigate the effect of differential sequencing depth on the result.
```{r eval=FALSE}
# change back to working with peaks instead of gene activities
DefaultAssay(pbmc) <- 'peaks'
# Use logistic regression, set total fragment no. as latent var
lr_da_peaks <- FindMarkers(
object = pbmc,
ident.1 = "CD4 Naive",
ident.2 = "CD14+ Monocytes",
test.use = 'LR',
latent.vars = 'nCount_peaks',
min.pct = 0.1
)
head(lr_da_peaks)
```
</details>
```{r, cache=FALSE, fig.width=12}
plot1 <- VlnPlot(
object = pbmc,
features = rownames(da_peaks)[1],
pt.size = 0.1,
idents = c("CD4 Naive","CD14+ Monocytes")
)
plot2 <- FeaturePlot(
object = pbmc,
features = rownames(da_peaks)[1],
pt.size = 0.1
)
plot1 | plot2
```
Peak coordinates can be difficult to interpret alone. We can find the closest
gene to each of these peaks using the `ClosestFeature()` function.
```{r warning=FALSE, message=FALSE}
open_cd4naive <- rownames(da_peaks[da_peaks$avg_log2FC > 3, ])
open_cd14mono <- rownames(da_peaks[da_peaks$avg_log2FC < -3, ])
closest_genes_cd4naive <- ClosestFeature(pbmc, regions = open_cd4naive)
closest_genes_cd14mono <- ClosestFeature(pbmc, regions = open_cd14mono)
```
```{r}
head(closest_genes_cd4naive)
```
```{r}
head(closest_genes_cd14mono)
```
We could follow up with this result by doing gene ontology enrichment analysis
on the gene sets returned by `ClosestFeature()`,and there are many R packages
that can do this (see the[`GOstats`](https://bioconductor.org/packages/release/bioc/html/GOstats.html) or [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html)
packages for example).
<details>
<summary>**GO enrichment analysis with clusterProfiler**</summary>
```{r include=FALSE}
if (!requireNamespace("clusterProfiler", quietly = TRUE))
BiocManager::install("clusterProfiler")
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE))
BiocManager::install("org.Hs.eg.db")
if (!requireNamespace("enrichplot", quietly = TRUE))
BiocManager::install("enrichplot")
```
```{r message=FALSE}
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)
```
```{r fig.width=12, fig.height=10, message=FALSE, warning=FALSE}
cd4naive_ego <- enrichGO(gene = closest_genes_cd4naive$gene_id,
keyType = "ENSEMBL",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE)
barplot(cd4naive_ego,showCategory = 20)
```
```{r fig.width=12, fig.height=10, message=FALSE, warning=FALSE}
cd14mono_ego <- enrichGO(gene = closest_genes_cd14mono$gene_id,
keyType = "ENSEMBL",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE)
barplot(cd14mono_ego,showCategory = 20)
```
</details>
## Plotting genomic regions
We can plot the frequency of Tn5 integration across regions of the genome for
cells grouped by cluster, cell type, or any other metadata stored in the object
for any genomic region using the `CoveragePlot()` function.
These represent pseudo-bulk accessibility tracks, where signal from
all cells within a group have been averaged together to visualize the DNA
accessibility in a region (thanks to Andrew Hill for giving the inspiration for
this function in his excellent [blog post](http://andrewjohnhill.com/blog/2019/04/12/streamlining-scatac-seq-visualization-and-analysis/)). Alongside these accessibility tracks we can visualize other
important information including gene annotation, peak coordinates, and genomic
links (if they're stored in the object). See the
[visualization vignette](visualization.html) for more information.
For plotting purposes, it's nice to have related cell types grouped together. We
can automatically sort the plotting order according to similarities across the
annotated cell types by running the `SortIdents()` function:
```{r message=FALSE, warning=FALSE}
pbmc <- SortIdents(pbmc)
```
We can then visualize the DA peaks open in CD4 naive cells and CD14 monocytes,
near some key marker genes for these cell types, CD4 and LYZ respectively.
Here we'll highlight the DA peaks regions in grey.
```{r message=FALSE, warning=FALSE, out.width='90%', fig.height=6}
# find DA peaks overlapping gene of interest
regions_highlight <- subsetByOverlaps(StringToGRanges(open_cd4naive), LookupGeneCoords(pbmc, "CD4"))
CoveragePlot(
object = pbmc,
region = "CD4",
region.highlight = regions_highlight,
extend.upstream = 1000,
extend.downstream = 1000
)
```
```{r message=FALSE, warning=FALSE, out.width='90%', fig.height=6}
regions_highlight <- subsetByOverlaps(StringToGRanges(open_cd14mono), LookupGeneCoords(pbmc, "LYZ"))
CoveragePlot(
object = pbmc,
region = "LYZ",
region.highlight = regions_highlight,
extend.upstream = 1000,
extend.downstream = 5000
)
```
We can also create an interactive version of these plots using the
`CoverageBrowser()` function. Here is a recorded demonstration showing how we
can use `CoverageBrowser()` to browser the genome and adjust plotting parameters
interactively. The "Save plot" button in `CoverageBrowser()` will add the
current plot to a list of `ggplot` objects that is returned when the browser
session is ended by pressing the "Done" button, allowing interesting views
to be saved during an interactive session.
<iframe width="560" height="315" src="https://www.youtube.com/embed/S9b5rN32IC8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<details>
<summary>**Working with datasets that were not quantified using CellRanger**</summary>
The CellRanger software from 10x Genomics generates several useful QC metrics
per-cell, as well as a peak/cell matrix and an indexed fragments file. In the
above vignette, we utilize the CellRanger outputs, but provide alternative
functions in Signac for many of the same purposes here.
### Generating a peak/cell or bin/cell matrix
The `FeatureMatrix` function can be used to generate a count matrix containing
any set of genomic ranges in its rows. These regions could be a set of peaks, or
bins that span the entire genome.
```{r eval=FALSE}
# not run
# peak_ranges should be a set of genomic ranges spanning the set of peaks to be quantified per cell
peak_matrix <- FeatureMatrix(
fragments = Fragments(pbmc),
features = peak_ranges
)
```
For convenience, we also include a `GenomeBinMatrix()` function that will
generate a set of genomic ranges spanning the entire genome for you, and run
`FeatureMatrix()` internally to produce a genome bin/cell matrix.
```{r eval=FALSE}
# not run
bin_matrix <- GenomeBinMatrix(
fragments = Fragments(pbmc),
genome = seqlengths(pbmc),
binsize = 5000
)
```
### Counting fraction of reads in peaks
The function `FRiP()` will count the fraction of reads in peaks for each cell,
given a peak/cell assay and a bin/cell assay. Note that this can be run on a
subset of the genome, so that a bin/cell assay does not need to be computed for
the whole genome. This will return a Seurat object will metadata added
corresponding to the fraction of reads in peaks for each cell.
```{r eval=FALSE}
# not run
total_fragments <- CountFragments('10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz')
rownames(total_fragments) <- total_fragments$CB
pbmc$fragments <- total_fragments[colnames(pbmc), "frequency_count"]
pbmc <- FRiP(
object = pbmc,
assay = 'peaks',
total.fragments = 'fragments'
)
```
### Counting fragments in genome blacklist regions
The ratio of reads in genomic blacklist regions, that are known to artifactually
accumulate reads in genome sequencing assays, can be diagnostic of low-quality
cells. We provide blacklist region coordinates for several genomes (hg19, hg38,
mm9, mm10, ce10, ce11, dm3, dm6) in the Signac package for convenience. These
regions were provided by the ENCODE consortium, and we encourage users to cite
their [paper](https://doi.org/10.1038/s41598-019-45839-z) if you use the regions
in your analysis. The `FractionCountsInRegion()` function can be used to
calculate the fraction of all counts within a given set of regions per cell.
We can use this function and the blacklist regions to find the fraction of
blacklist counts per cell.
```{r eval=FALSE}
# not run
pbmc$blacklist_fraction <- FractionCountsInRegion(
object = pbmc,
assay = 'peaks',
regions = blacklist_hg38_unified
)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
saveRDS(object = pbmc, file = "pbmc.rds")
```
</details>
<details>
<summary>**Session Info**</summary>
```{r}
sessionInfo()
```
</details>