/
QC_Plots.Rmd
376 lines (282 loc) · 18.7 KB
/
QC_Plots.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
---
title: "Plotting #2: QC Plots"
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
output: rmarkdown::html_vignette
theme: united
df_print: kable
vignette: >
%\VignetteIndexEntry{Plotting #2: QC Plots}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
***
<style>
p.caption {
font-size: 0.9em;
}
</style>
```{r setup, include=FALSE}
all_times <- list() # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
now <- NULL
function(before, options) {
if (before) {
now <<- Sys.time()
} else {
res <- difftime(Sys.time(), now, units = "secs")
all_times[[options$label]] <<- res
}
}
}))
knitr::opts_chunk$set(
tidy = TRUE,
tidy.opts = list(width.cutoff = 95),
message = FALSE,
warning = FALSE,
time_it = TRUE
)
```
# QC Metrics & Plots
One of the first steps in all scRNA-seq analyses is performing a number of QC checks and plots so that data can be appropriately filtered. scCustomize contains a number of functions that can be used to quickly and easily generate some of the most relevant QC plots.
For this tutorial, I will be utilizing HCA bone marrow cell data from the SeuratData package.
```{r init}
library(ggplot2)
library(dplyr)
library(magrittr)
library(patchwork)
library(Seurat)
library(scCustomize)
library(qs)
# Load Example Dataset
hca_bm <- hcabm40k.SeuratData::hcabm40k
# Add pseudo group variable just for this vignette
hca_bm@meta.data$group[hca_bm@meta.data$orig.ident == "MantonBM1" | hca_bm@meta.data$orig.ident == "MantonBM2" | hca_bm@meta.data$orig.ident == "MantonBM3" | hca_bm@meta.data$orig.ident == "MantonBM4"] <- "Group 1"
hca_bm@meta.data$group[hca_bm@meta.data$orig.ident == "MantonBM5" | hca_bm@meta.data$orig.ident == "MantonBM6" | hca_bm@meta.data$orig.ident == "MantonBM7" | hca_bm@meta.data$orig.ident == "MantonBM8"] <- "Group 2"
```
```{r include=FALSE}
hca_bm <- UpdateSeuratObject(hca_bm)
```
```{r include=FALSE}
accepted_names <- Add_Mito_Ribo(object = hca_bm, list_species_names = TRUE)
```
## Adding QC Metrics
This is the starting point of nearly every single cell analysis and scCustomize contains a number of helper functions to make the process fast and easy.
## Add Mitochondrial and Ribosomal Gene Percentages
scCustomize contains easy wrapper function to automatically add both Mitochondrial and Ribosomal count percentages to meta.data slot. If you are using mouse, human, rat, zebrafish, drosophila, marmoset, or macaque data all you need to do is specify the `species` parameter.
```{r eval=FALSE}
# These defaults can be run just by providing accepted species name
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "Human")
```
*NOTE: This function works seemlessly for both Seurat and LIGER objects.*
To view list of accepted values for default species names simply set `list_species_names = TRUE`.
```{r echo=FALSE}
accepted_names %>%
kableExtra::kbl() %>%
kableExtra::kable_styling(bootstrap_options = c("bordered", "condensed", "responsive", "striped"))
```
However custom prefixes can be used for non-human/mouse/marmoset species with different annotations. Simply specify `species = other` and supply feature lists or regex patterns for your species of interest.
*NOTE: If desired please submit issue on GitHub for additional default species. Please include regex pattern or list of genes for both mitochondrial and ribosomal genes and I will add additional built-in defaults to the function.*
```{r eval=FALSE}
# Using gene name patterns
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "other", mito_pattern = "regexp_pattern", ribo_pattern = "regexp_pattern")
# Using feature name lists
mito_gene_list <- c("gene1", "gene2", "etc")
ribo_gene_list <- c("gene1", "gene2", "etc")
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "other", mito_features = mito_gene_list, ribo_features = ribo_gene_list)
# Using combination of gene lists and gene name patterns
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "Human", mito_features = mito_gene_list, ribo_pattern = "regexp_pattern")
```
#### Use of Ensembl IDs
scCustomize contains built in list of ensembl IDs that correspond to mitochondrial and ribosomal genes for all default species. If your object using ensembl IDs as features names then simply add `ensembl_ids` parameter.
```{r eval=FALSE}
# Using gene name patterns
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "Human", ensembl_ids = TRUE)
```
## Add Cell Complexity/Novelty QC Metrics
In addition to metrics like number of features and UMIs it can often be helpful to analyze the complexity of expression within a single cell. scCustomize provides functions to add two of these metrics to meta data.
### Cell Complexity (log10(nFeature) / log10(nCount))
scCustomize contains easy shortcut function to add a measure of cell complexity/novelty that can sometimes be useful to filter low quality cells. The metric is calculated by calculating the result of log10(nFeature) / log10(nCount).
```{r eval = FALSE}
# These defaults can be run just by providing accepted species name
hca_bm <- Add_Cell_Complexity(object = hca_bm)
```
*NOTE: This function works seemlessly for both Seurat and LIGER objects.*
### Add Top Percent Expression QC Metric
Additionally, (or alternatively), scCustomize contains another metric of complexity which is the top percent expression. The user supplies an integer value for `num_top_genes` (default is 50) which species the number of genes and the function returns percentage of counts occupied by top XX genes in each cell.
```{r eval = FALSE}
# These defaults can be run just by providing accepted species name
hca_bm <- Add_Top_Gene_Pct_Seurat(seurat_object = hca_bm, num_top_genes = 50)
```
## Add QC Metrics from Pathway Gene Lists
In addition to those standard QC metrics it can be helpful when using network based QC analysis to add the percent of expression of genes related to common pathways. This function and the network-based analysis is further extension of the analysis/QC from our recent publication: Gazestani & Kamath et al., 2023 ([*Cell*](https://doi.org/10.1016/j.cell.2023.08.005)).
In scCustomize the percent of gene expression from the following gene lists can be added as part of the `Add_Cell_QC_Metrics` (see next section for more details):
* Immediate Early Genes (for human and mouse only)
- Can be used in part to examine potential impact of dissociation or post-mortem signatures ([*Marsh et al., 2022*](https://doi.org/10.1038/s41593-022-01022-8)) or to identify acutely perturbed populations (gene list from [*Wu et al., 2017*](https://doi.org/10.1016/j.neuron.2017.09.026))
* Oxidative Phosphorylation, Apoptosis, & DNA Repair (all default species except Marmoset)
- Species specific gene lists from [MSigDB Hallmark Gene Sets](https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp?collection=H)
## Add All Cell QC Metrics with Single Function
To simplify the process of adding cell QC metrics scCustomize contains a wrapper function which can be customized to add all or some of the available QC metrics. The default parameters of the function `Add_Cell_QC_Metrics` will add:
* Mitochondrial and Ribosomal Percentages (default and custom species).
* Cell Complexity (log10(nFeature) / log10(nCount).
* Top XX Gene Percentage.
* Percentage of counts for IEG (human and mouse only).
* OXPHOS, APOP, and DNA Repair pathways (supported species only).
* Cell Cycle Scoring (Human only).
```{r}
hca_bm <- Add_Cell_QC_Metrics(seurat_object = hca_bm, species = "human")
```
### Customizing metrics
By changing the function parameters you can add a subset of available QC metrics. For instance:
```{r eval=FALSE}
hca_bm <- Add_Cell_QC_Metrics(seurat_object = hca_bm, species = "human", add_mito_ribo = TRUE, add_complexity = FALSE, add_top_pct = TRUE, add_IEG = TRUE, add_MSigDB = FALSE, add_cell_cycle = FALSE)
```
## Plotting QC Metrics
scCustomize has a number of quick QC plotting options for ease of use.
*NOTE: Most scCustomize plotting functions contain `...` parameter to allow user to supply any of the parameters for the original Seurat function that is being used under the hood.*
## `VlnPlot`-Based QC Plots
scCustomize contains a number of shortcut/wrapper functions for QC plotting, which are wrappers around `Seurat::VlnPlot()`/`VlnPlot_scCustom`.
* `QC_Plots_Genes()` Plots genes/features per cell/nucleus.
* `QC_Plots_UMIs()` Plots UMIs per cell/nucleus.
* `QC_Plots_Mito()` Plots mito% (named "percent_mito") per cell/nucleus.
* `QC_Plots_Complexity()` Plots cell complexity metric (log10GenesPerUMI) per cell/nucleus.
* `QC_Plots_Feature()` Plots "feature" per cell/nucleus. Using parameter `feature` to allow plotting of any applicable named feature in object\@meta.data slot.
* `QC_Plots_Combined_Vln()` Returns patchwork plot of `QC_Plots_Genes()`, `QC_Plots_UMIs()`, & `QC_Plots_Mito()`.
scCustomize functions have the added benefit of:
* Feature to plot set by default (except for `QC_Plots_Feature`).
* Added high/low cutoff parameters to allow for easy visualization of potential cutoff thresholds.
```{r eval=FALSE}
# All functions contain
p1 <- QC_Plots_Genes(seurat_object = hca_bm, low_cutoff = 600, high_cutoff = 5500)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000)
p3 <- QC_Plots_Mito(seurat_object = hca_bm, high_cutoff = 20)
p4 <- QC_Plots_Complexity(seurat_object = hca_bm, high_cutoff = 0.8)
```
```{r echo=FALSE}
# All functions contain
p1 <- QC_Plots_Genes(seurat_object = hca_bm, low_cutoff = 600, high_cutoff = 5500, pt.size = 0.1)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0.1)
p3 <- QC_Plots_Mito(seurat_object = hca_bm, high_cutoff = 20, pt.size = 0.1)
p4 <- QC_Plots_Complexity(seurat_object = hca_bm, high_cutoff = 0.8, pt.size = 0.1)
```
```{r, fig.height=7, fig.width=13}
wrap_plots(p1, p2, p3, p4, ncol = 4)
```
### Additional parameters
In addition to being able to supply Seurat parameters with `...` these plots like many others in scCustomize contain other additional parameters to customize plot output without need for post-plot ggplot2 modifications
* `plot_title`: Change plot title
* `x_axis_label`/`y_axis_label`: Change axis labels.
* `x_lab_rotate`: Should x-axis label be rotated 45 degrees?
* `y_axis_log`: Should y-axis in linear or log10 scale.
* `plot_median` & `median_size`: Plot a line at the median of each x-axis identity.
* `plot_boxplot`: Plot boxplot on top of the violin.
```{r, fig.height=5, fig.width=13, fig.align='center', fig.cap="*Setting `y_axis_log` can be very helpful for initial plots where outliers skew the visualization of the majority of the data without excluding data by setting y-axis limit.*"}
p1 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0.1)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0.1, y_axis_log = TRUE)
wrap_plots(p1, p2, ncol = 2)
```
```{r, fig.height=5, fig.width=13, fig.align='center', fig.cap="*Plotting median values by setting the `plot_median = TRUE` parameter.*"}
p1 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0, plot_median = TRUE)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0, y_axis_log = TRUE, plot_median = TRUE)
wrap_plots(p1, p2, ncol = 2)
```
```{r, fig.height=5, fig.width=13, fig.align='center', fig.cap="*Plotting boxplot by setting the `plot_boxplot = TRUE` parameter.*"}
p1 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0, plot_boxplot = TRUE)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0, y_axis_log = TRUE, plot_boxplot = TRUE)
wrap_plots(p1, p2, ncol = 2)
```
### Combined Plotting Function
As a shortcut you can return single patchwork plot of the 3 main QC Plots (Genes, UMIs, %Mito) by using single function, `QC_Plots_Combined_Vln()`.
```{r, fig.height=7, fig.width=13}
QC_Plots_Combined_Vln(seurat_object = hca_bm, feature_cutoffs = c(600, 5500), UMI_cutoffs = c(1200, 45000), mito_cutoffs = 20, pt.size = 0.1)
```
## `FeatureScatter`-Based QC Plots
scCustomize contains 3 functions which wrap `Seurat::FeatureScatter()` with added visualization of potential cutoff thresholds and some additional functionality:
* `QC_Plot_UMIvsGene()` Plots genes vs UMIs per cell/nucleus
* `QC_Plot_GenevsFeature()` Plots Genes vs. "feature" per cell/nucleus. Using parameter `feature1` to allow plotting of any applicable named feature in object\@meta.data slot.
* `QC_Plot_UMIvsFeature()` Plots UMIs vs. "feature" per cell/nucleus. Using parameter `feature1` to allow plotting of any applicable named feature in object\@meta.data slot.
### New/Modified functionality
* Better default color palettes
* `shuffle = TRUE` by default to prevent hiding of datasets
* Ability to set & visualize potential cutoff thresholds (similar to VlnPlot based QC Plots above)
* Report potential post filtering correlation in addition to whole dataset correlation when using `QC_Plot_UMIvsGene` (based on values provided to high and low cutoff parameters)
```{r eval=FALSE}
# All functions contain
QC_Plot_UMIvsGene(seurat_object = hca_bm, low_cutoff_gene = 600, high_cutoff_gene = 5500, low_cutoff_UMI = 500, high_cutoff_UMI = 50000)
QC_Plot_GenevsFeature(seurat_object = hca_bm, feature1 = "percent_mito", low_cutoff_gene = 600, high_cutoff_gene = 5500, high_cutoff_feature = 20)
```
```{r echo=FALSE, fig.height=7, fig.width=13, fig.align='center'}
# All functions contain
p1 <- QC_Plot_UMIvsGene(seurat_object = hca_bm, low_cutoff_gene = 600, high_cutoff_gene = 5500, low_cutoff_UMI = 1200, high_cutoff_UMI = 45000, x_axis_label = "UMIs per Cell")
p2 <- QC_Plot_GenevsFeature(seurat_object = hca_bm, feature1 = "percent_mito", low_cutoff_gene = 600, high_cutoff_gene = 5500, high_cutoff_feature = 20)
wrap_plots(p1, p2, ncol = 2)
```
### Color data by continuous meta data variable
`QC_Plot_UMIvsGene` contains the ability to color points by continuous meta data variables.
This can be used to plot % of mito reads in addition to UMI vs. Gene comparisons
```{r eval=FALSE}
QC_Plot_UMIvsGene(seurat_object = hca_bm, meta_gradient_name = "percent_mito", low_cutoff_gene = 600, high_cutoff_gene = 5500, high_cutoff_UMI = 45000)
QC_Plot_UMIvsGene(seurat_object = hca_bm, meta_gradient_name = "percent_mito", low_cutoff_gene = 600, high_cutoff_gene = 5500, high_cutoff_UMI = 45000, meta_gradient_low_cutoff = 20)
```
```{r echo=FALSE, fig.height=7, fig.width=13, fig.cap="*`QC_Plot_UMIvsGene()` when using `meta_gradient_name` outputs plot colored by meta data variable (left) to view only points above potential cutoff `meta_gradient_low_cutoff` can be specified to alter the plotting (right).*", fig.align='center'}
p1 <- QC_Plot_UMIvsGene(seurat_object = hca_bm, meta_gradient_name = "percent_mito", low_cutoff_gene = 600, high_cutoff_gene = 5500, high_cutoff_UMI = 45000)
p2 <- QC_Plot_UMIvsGene(seurat_object = hca_bm, meta_gradient_name = "percent_mito", low_cutoff_gene = 600, high_cutoff_gene = 5500, high_cutoff_UMI = 45000, meta_gradient_low_cutoff = 20)
wrap_plots(p1, p2, ncol = 2)
```
### Combination Plots
If you are interested in viewing `QC_Plot_UMIvsGene` both by discrete grouping variable and by continuous variable without writing function twice you can use `combination = TRUE` and plot output will contain both plots.
```{r, fig.height=7, fig.width=13, fig.cap="*`QC_Plot_UMIvsGene()` when using `combination = TRUE` will output both the Gene x UMI by active identity and with meta data gradient coloring.*", fig.align='center'}
QC_Plot_UMIvsGene(seurat_object = hca_bm, meta_gradient_name = "percent_mito", low_cutoff_gene = 600, high_cutoff_gene = 5500, high_cutoff_UMI = 45000, meta_gradient_low_cutoff = 20, combination = TRUE)
```
## Histogram-Based QC Plots
Finally, scCustomize contains a function to plot QC metrics as histogram instead of violin for visualization of the distribution of feature.
```{r, fig.height=7, fig.width=7, fig.align='center'}
QC_Histogram(seurat_object = hca_bm, features = "percent_mito", low_cutoff = 15)
```
```{r, fig.height=7, fig.width=13, fig.cap="*`QC_Histogram` also supports splitting plots by meta.data variables*", fig.align='center'}
QC_Histogram(seurat_object = hca_bm, features = "percent_mito", low_cutoff = 15, split.by = "group")
```
## Analyze Median QC Values per Sample/Library
scCustomize also contains a few helpful functions for returning and plotting the median values for these metrics on per sample/library basis.
### Calculate Median Values & Return data.frame
scCustomize contains function `Median_Stats` to quickly calculate the medians for basic QC stats (Genes/, UMIs/, %Mito/Cell, etc) and return a data.frame.
```{r}
median_stats <- Median_Stats(seurat_object = hca_bm, group_by_var = "orig.ident")
```
```{r echo=FALSE}
median_stats %>%
kableExtra::kbl() %>%
kableExtra::kable_styling(bootstrap_options = c("bordered", "condensed", "responsive", "striped"))
```
The `Median_Stats` function has some column names stored by default but will also calculate medians for additional meta.data columns using the optional `median_var` parameter
```{r eval=FALSE}
median_stats <- Median_Stats(seurat_object = hca_bm, group_by_var = "orig.ident", median_var = "meta_data_column_name")
```
### Plotting Median Values
scCustomize also contains a few functions to plot some of these median value calculations, which can be used on their own without need to return data.frame first.
* `Plot_Median_Genes()`
* `Plot_Median_UMIs()`
* `Plot_Median_Mito()`
* `Plot_Median_Other()`
- *Used to plot any other numeric variable present in object meta.data slot.*
```{r eval=FALSE}
Plot_Median_Genes(seurat_object = hca_bm, group_by = "group")
Plot_Median_UMIs(seurat_object = hca_bm, group_by = "group")
Plot_Median_Mito(seurat_object = hca_bm, group_by = "group")
Plot_Median_Other(seurat_object = hca_bm, median_var = "percent_ribo", group_by = "group")
```
```{r echo=FALSE, fig.align='center', fig.height=10, fig.width=13}
p1 <- Plot_Median_Genes(seurat_object = hca_bm, group_by = "group")
p2 <- Plot_Median_UMIs(seurat_object = hca_bm, group_by = "group")
p3 <- Plot_Median_Mito(seurat_object = hca_bm, group_by = "group")
p4 <- Plot_Median_Other(seurat_object = hca_bm, median_var = "percent_ribo", group_by = "group")
wrap_plots(p1, p2, p3, p4, ncol = 2)
```
### Plot Number of Cells/Nuclei per Sample
scCustomize also contains plotting function to plot the number of cells or nuclei per sample.
Since the HCA Bone Marrow dataset has exactly the same number of cells per sample we will use the microglia object from the Analysis Plots vignette.
```{r include=FALSE}
marsh_mouse_micro <- qread(file = "assets/marsh_2020_micro.qs")
```
```{r echo=FALSE, fig.align='center', fig.height=6, fig.width=6}
Plot_Cells_per_Sample(seurat_object = marsh_mouse_micro, group_by = "Transcription_Method")
```