/
Introduction.Rmd
372 lines (289 loc) · 11.8 KB
/
Introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
---
title: "CuratedAtlasQueryR"
output: "html_vignette"
params:
demo_metadata: true
vignette: >
%\VignetteIndexEntry{CuratedAtlasQueryR}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
knit: >
(function(x, ...){
proj_root <- rprojroot::find_package_root_file() |> normalizePath()
rmarkdown::render(
x,
output_format = "html_document",
params = list(
demo_metadata = TRUE
)
)
rmarkdown::render(
x,
output_file = "README.md",
output_format = "github_document",
output_dir = proj_root,
knit_root_dir = proj_root,
params = list(
demo_metadata = FALSE
)
)
})
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
root_dir <- knitr::opts_knit$get("root.dir")
if (!is.null(root_dir)){
# This hack fixes the relative image paths.
# See https://github.com/rstudio/rmarkdown/issues/2473
knitr::opts_knit$set(
output.dir = root_dir
)
}
proj_root <- rprojroot::find_package_root_file() |> normalizePath()
# Utility function for figures to force them to have the correct path
find_figure <- function(names){
rprojroot::find_package_root_file() |>
file.path("man", "figures", names)
}
METADATA_URL = if (params$demo_metadata)
CuratedAtlasQueryR::SAMPLE_DATABASE_URL else
CuratedAtlasQueryR::get_database_url
```
<!-- badges: start -->
[![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
<!-- badges: end -->
`CuratedAtlasQuery` is a query interface that allow the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria.
Harmonised data is stored in the ARDC Nectar Research Cloud, and most
`CuratedAtlasQuery` functions interact with Nectar via web requests, so a
network connection is required for most functionality.
```{r, echo=FALSE, out.height = c("139px"), out.width = "120x" }
find_figure("logo.png") |> knitr::include_graphics()
```
```{r, echo=FALSE, out.height = c("58px"), out.width = c("155x", "129px", "202px", "219px", "180px")}
c(
"svcf_logo.jpeg",
"czi_logo.png",
"bioconductor_logo.jpg",
"vca_logo.png",
"nectar_logo.png"
) |>
find_figure() |>
knitr::include_graphics()
```
# Query interface
## Installation
```{r, eval=FALSE}
devtools::install_github("stemangiola/CuratedAtlasQueryR")
```
## Load the package
```{r, message=FALSE, warning=FALSE}
library(CuratedAtlasQueryR)
```
## Load and explore the metadata
### Load the metadata
```{r, eval=FALSE}
# Note: in real applications you should use the default value of remote_url
metadata <- get_metadata(remote_url = METADATA_URL)
metadata
```
```{r, echo=FALSE}
# Note: a custom cache is used here ONLY for R CHECK compliance purposes. Users will NOT need to specify a custom cache
metadata <- get_metadata(remote_url = METADATA_URL, cache_directory = tempdir())
metadata
```
The `metadata` variable can then be re-used for all subsequent queries.
### Explore the tissue
```{r}
metadata |>
dplyr::distinct(tissue, file_id)
```
## Download single-cell RNA sequencing counts
### Query raw counts
```{r}
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment()
single_cell_counts
```
### Query counts scaled per million
This is helpful if just few genes are of interest, as they can be compared across samples.
```{r}
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment(assays = "cpm")
single_cell_counts
```
### Extract only a subset of genes
```{r}
single_cell_counts =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_single_cell_experiment(assays = "cpm", features = "PUM1")
single_cell_counts
```
### Extract the counts as a Seurat object
This convert the H5 SingleCellExperiment to Seurat so it might take long time and occupy a lot of memory depending on how many cells you are requesting.
```{r}
single_cell_counts_seurat =
metadata |>
dplyr::filter(
ethnicity == "African" &
stringr::str_like(assay, "%10x%") &
tissue == "lung parenchyma" &
stringr::str_like(cell_type, "%CD4%")
) |>
get_seurat()
single_cell_counts_seurat
```
## Save your `SingleCellExperiment`
The returned `SingleCellExperiment` can be saved with two modalities, as `.rds` or as `HDF5`.
### Saving as RDS (fast saving, slow reading)
Saving as `.rds` has the advantage of being fast, andd the `.rds` file occupies very little disk space as it only stores the links to the files in your cache.
However it has the disadvantage that for big `SingleCellExperiment` objects, which merge a lot of HDF5 from your `get_single_cell_experiment`, the display and manipulation is going to be slow.
In addition, an `.rds` saved in this way is not portable: you will not be able
to share it with other users.
```{r}
single_cell_counts |> saveRDS("single_cell_counts.rds")
```
### Saving as HDF5 (slow saving, fast reading)
Saving as `.hdf5` executes any computation on the `SingleCellExperiment` and writes it to disk as a monolithic `HDF5`.
Once this is done, operations on the `SingleCellExperiment` will be comparatively very fast.
The resulting `.hdf5` file will also be totally portable and sharable.
However this `.hdf5` has the disadvantage of being larger than the corresponding `.rds` as it includes a copy of the count information, and the saving process is going to be slow for large objects.
```{r}
single_cell_counts |> HDF5Array::saveHDF5SummarizedExperiment("single_cell_counts", replace = TRUE)
```
## Visualise gene transcription
We can gather all CD14 monocytes cells and plot the distribution of HLA-A across all tissues
```{r, results='hide'}
suppressPackageStartupMessages({
library(ggplot2)
})
# Plots with styling
counts <- metadata |>
# Filter and subset
dplyr::filter(cell_type_harmonised == "cd14 mono") |>
dplyr::filter(file_id_db != "c5a05f23f9784a3be3bfa651198a48eb") |>
# Get counts per million for HCA-A gene
get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
suppressMessages() |>
# Add feature to table
tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
# Rank x axis
tibble::as_tibble()
# Plot by disease
counts |>
dplyr::with_groups(disease, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, rm.na=TRUE))) |>
# Plot
ggplot(aes(forcats::fct_reorder(disease, median_count,.desc = TRUE), `HLA.A`,color = file_id)) +
geom_jitter(shape=".") +
# Style
guides(color="none") +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
xlab("Disease") +
ggtitle("HLA-A in CD14 monocytes by disease")
```
```{r echo=FALSE}
find_figure("HLA_A_disease_plot.png") |> knitr::include_graphics()
```
```{r, results='hide'}
# Plot by tissue
counts |>
dplyr::with_groups(tissue_harmonised, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, rm.na=TRUE))) |>
# Plot
ggplot(aes(forcats::fct_reorder(tissue_harmonised, median_count,.desc = TRUE), `HLA.A`,color = file_id)) +
geom_jitter(shape=".") +
# Style
guides(color="none") +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) +
xlab("Tissue") +
ggtitle("HLA-A in CD14 monocytes by tissue") +
theme(legend.position = "none")
```
```{r echo=FALSE}
find_figure("HLA_A_tissue_plot.png") |> knitr::include_graphics()
```
```{r}
metadata |>
# Filter and subset
dplyr::filter(cell_type_harmonised=="nk") |>
# Get counts per million for HCA-A gene
get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
suppressMessages() |>
# Plot
tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
ggplot(aes(tissue_harmonised, `HLA.A`, color = file_id)) +
theme_bw() +
theme(
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
legend.position = "none"
) +
geom_jitter(shape=".") +
xlab("Tissue") +
ggtitle("HLA-A in nk cells by tissue")
```
## Obtain Unharmonised Metadata
Various metadata fields are *not* common between datasets, so it does not
make sense for these to live in the main metadata table. However, we can
obtain it using the `get_unharmonised_metadata()` function. This function
returns a data frame with one row per dataset, including the `unharmonised`
column which contains unharmnised metadata as a nested data frame.
```{r}
harmonised <- metadata |> dplyr::filter(tissue == "kidney blood vessel")
unharmonised <- get_unharmonised_metadata(harmonised)
unharmonised
```
Notice that the columns differ between each dataset's data frame:
```{r}
dplyr::pull(unharmonised) |> head(2)
```
# Cell metadata
Dataset-specific columns (definitions available at cellxgene.cziscience.com)
`cell_count`, `collection_id`, `created_at.x`, `created_at.y`, `dataset_deployments`, `dataset_id`, `file_id`, `filename`, `filetype`, `is_primary_data.y`, `is_valid`, `linked_genesets`, `mean_genes_per_cell`, `name`, `published`, `published_at`, `revised_at`, `revision`, `s3_uri`, `schema_version`, `tombstone`, `updated_at.x`, `updated_at.y`, `user_submitted`, `x_normalization`
Sample-specific columns (definitions available at cellxgene.cziscience.com)
`sample_`, `sample_name`, `age_days`, `assay`, `assay_ontology_term_id`, `development_stage`, `development_stage_ontology_term_id`, `ethnicity`, `ethnicity_ontology_term_id`, `experiment___`, `organism`, `organism_ontology_term_id`, `sample_placeholder`, `sex`, `sex_ontology_term_id`, `tissue`, `tissue_harmonised`, `tissue_ontology_term_id`, `disease`, `disease_ontology_term_id`, `is_primary_data.x`
Cell-specific columns (definitions available at cellxgene.cziscience.com)
`cell_`, `cell_type`, `cell_type_ontology_term_idm`, `cell_type_harmonised`, `confidence_class`, `cell_annotation_azimuth_l2`, `cell_annotation_blueprint_singler`
Through harmonisation and curation we introduced custom column, not present in the original CELLxGENE metadata
- `tissue_harmonised`: a coarser tissue name for better filtering
- `age_days`: the number of days corresponding to the age
- `cell_type_harmonised`: the consensus call identity (for immune cells) using the original and three novel annotations using Seurat Azimuth and SingleR
- `confidence_class`: an ordinal class of how confident `cell_type_harmonised` is. 1 is complete consensus, 2 is 3 out of four and so on.
- `cell_annotation_azimuth_l2`: Azimuth cell annotation
- `cell_annotation_blueprint_singler`: SingleR cell annotation using Blueprint reference
- `cell_annotation_blueprint_monaco`: SingleR cell annotation using Monaco reference
- `sample_id_db`: Sample subdivision for internal use
- `file_id_db`: File subdivision for internal use
- `sample_`: Sample ID
- `.sample_name`: How samples were defined
# RNA abundance
The `raw` assay includes RNA abundance in the positive real scale (not transformed with non-linear functions, e.g. log sqrt). Originally CELLxGENE include a mix of scales and transformations specified in the `x_normalization` column.
The `cpm` assay includes counts per million.
# Session Info
```{r}
sessionInfo()
```