vignettes/Introduction.Rmd

---
title: "CuratedAtlasQueryR"
output: "html_vignette"
params:
  demo_metadata: true
vignette: >
  %\VignetteIndexEntry{CuratedAtlasQueryR}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
knit: >
    (function(x, ...){
        proj_root <- rprojroot::find_package_root_file() |> normalizePath()
        rmarkdown::render(
            x,
            output_format = "html_document",
            params = list(
                demo_metadata = TRUE
            )
        )
        rmarkdown::render(
            x,
            output_file = "README.md",
            output_format = "github_document",
            output_dir = proj_root,
            knit_root_dir = proj_root,
            params = list(
                demo_metadata = FALSE
            )
        )
    })
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
root_dir <- knitr::opts_knit$get("root.dir")
if (!is.null(root_dir)){
    # This hack fixes the relative image paths. 
    # See https://github.com/rstudio/rmarkdown/issues/2473
    knitr::opts_knit$set(
        output.dir = root_dir
    )
}
proj_root <- rprojroot::find_package_root_file() |> normalizePath() 
# Utility function for figures to force them to have the correct path
find_figure <- function(names){
    rprojroot::find_package_root_file() |>
        file.path("man", "figures", names) 
}
METADATA_URL = if (params$demo_metadata)
       CuratedAtlasQueryR::SAMPLE_DATABASE_URL else
       CuratedAtlasQueryR::get_database_url

```

<!-- badges: start -->
[![Lifecycle:maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
<!-- badges: end -->

`CuratedAtlasQuery` is a query interface that allow the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria.

Harmonised data is stored in the ARDC Nectar Research Cloud, and most 
`CuratedAtlasQuery` functions interact with Nectar via web requests, so a 
network connection is required for most functionality.

```{r, echo=FALSE, out.height = c("139px"), out.width = "120x" }
find_figure("logo.png") |> knitr::include_graphics()
```

```{r, echo=FALSE, out.height = c("58px"), out.width = c("155x", "129px", "202px", "219px", "180px")}
c(
  "svcf_logo.jpeg", 
  "czi_logo.png", 
  "bioconductor_logo.jpg",
  "vca_logo.png",
  "nectar_logo.png"
) |> 
    find_figure() |>
    knitr::include_graphics()
```

# Query interface

## Installation

```{r, eval=FALSE}
devtools::install_github("stemangiola/CuratedAtlasQueryR")
```

## Load the package

```{r, message=FALSE, warning=FALSE}
library(CuratedAtlasQueryR)
```

## Load and explore the metadata

### Load the metadata

```{r, eval=FALSE}
# Note: in real applications you should use the default value of remote_url 
metadata <- get_metadata(remote_url = METADATA_URL)
metadata
```

```{r, echo=FALSE}
# Note: a custom cache is used here ONLY for R CHECK compliance purposes. Users will NOT need to specify a custom cache
metadata <- get_metadata(remote_url = METADATA_URL, cache_directory = tempdir())
metadata
```
 

The `metadata` variable can then be re-used for all subsequent queries.

### Explore the tissue 

```{r}
metadata |>
    dplyr::distinct(tissue, file_id) 
```

## Download single-cell RNA sequencing counts 

### Query raw counts

```{r}
single_cell_counts = 
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
        stringr::str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment()

single_cell_counts
```

### Query counts scaled per million

This is helpful if just few genes are of interest, as they can be compared across samples.

```{r}
single_cell_counts = 
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
        stringr::str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment(assays = "cpm")

single_cell_counts
```

### Extract only a subset of genes

```{r}
single_cell_counts = 
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
        stringr::str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment(assays = "cpm", features = "PUM1")

single_cell_counts
```

### Extract the counts as a Seurat object

This convert the H5 SingleCellExperiment to Seurat so it might take long time and occupy a lot of memory depending on how many cells you are requesting.

```{r}
single_cell_counts_seurat = 
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
        stringr::str_like(assay, "%10x%") &
        tissue == "lung parenchyma" &
        stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_seurat()

single_cell_counts_seurat
```

## Save your `SingleCellExperiment`

The returned `SingleCellExperiment` can be saved with two modalities, as `.rds` or as `HDF5`.

### Saving as RDS (fast saving, slow reading)

Saving as `.rds` has the advantage of being fast, andd the `.rds` file occupies very little disk space as it only stores the links to the files in your cache.

However it has the disadvantage that for big `SingleCellExperiment` objects, which merge a lot of HDF5 from your `get_single_cell_experiment`, the display and manipulation is going to be slow.
In addition, an `.rds` saved in this way is not portable: you will not be able
to share it with other users.

```{r}
single_cell_counts |> saveRDS("single_cell_counts.rds")
```

### Saving as HDF5 (slow saving, fast reading)

Saving as `.hdf5` executes any computation on the `SingleCellExperiment` and writes it to disk as a monolithic `HDF5`.
Once this is done, operations on the `SingleCellExperiment` will be comparatively very fast.
The resulting `.hdf5` file will also be totally portable and sharable.

However this `.hdf5` has the disadvantage of being larger than the corresponding `.rds` as it includes a copy of the count information, and the saving process is going to be slow for large objects.

```{r}
single_cell_counts |> HDF5Array::saveHDF5SummarizedExperiment("single_cell_counts", replace = TRUE)
```

## Visualise gene transcription

We can gather all CD14 monocytes cells and plot the distribution of HLA-A across all tissues

```{r, results='hide'}
suppressPackageStartupMessages({
    library(ggplot2)
})

# Plots with styling
counts <- metadata |>
  # Filter and subset
  dplyr::filter(cell_type_harmonised == "cd14 mono") |>
  dplyr::filter(file_id_db != "c5a05f23f9784a3be3bfa651198a48eb") |> 
  
  # Get counts per million for HCA-A gene
  get_single_cell_experiment(assays = "cpm", features = "HLA-A") |> 
  suppressMessages() |>
  
  # Add feature to table
  tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |> 
    
  # Rank x axis
  tibble::as_tibble()

# Plot by disease
counts |>
  dplyr::with_groups(disease, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, rm.na=TRUE))) |> 
  
  # Plot
  ggplot(aes(forcats::fct_reorder(disease, median_count,.desc = TRUE), `HLA.A`,color = file_id)) +
  geom_jitter(shape=".") +
    
  # Style
  guides(color="none") +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) + 
  xlab("Disease") + 
  ggtitle("HLA-A in CD14 monocytes by disease") 
```

```{r echo=FALSE}
find_figure("HLA_A_disease_plot.png") |> knitr::include_graphics()
```

```{r, results='hide'}
# Plot by tissue
counts |> 
  dplyr::with_groups(tissue_harmonised, ~ .x |> dplyr::mutate(median_count = median(`HLA.A`, rm.na=TRUE))) |> 
  
  # Plot
  ggplot(aes(forcats::fct_reorder(tissue_harmonised, median_count,.desc = TRUE), `HLA.A`,color = file_id)) +
  geom_jitter(shape=".") +
    
  # Style
  guides(color="none") +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1)) + 
  xlab("Tissue") + 
  ggtitle("HLA-A in CD14 monocytes by tissue") + 
  theme(legend.position = "none")
```
```{r echo=FALSE}
find_figure("HLA_A_tissue_plot.png") |> knitr::include_graphics()
```

```{r}
metadata |> 
    
  # Filter and subset
  dplyr::filter(cell_type_harmonised=="nk") |> 

  # Get counts per million for HCA-A gene 
  get_single_cell_experiment(assays = "cpm", features = "HLA-A") |> 
  suppressMessages() |>

  # Plot
  tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |> 
  ggplot(aes(tissue_harmonised, `HLA.A`, color = file_id)) +
  theme_bw() +
  theme(
      axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
      legend.position = "none"
  ) + 
  geom_jitter(shape=".") + 
  xlab("Tissue") + 
  ggtitle("HLA-A in nk cells by tissue")
```

## Obtain Unharmonised Metadata

Various metadata fields are *not* common between datasets, so it does not
make sense for these to live in the main metadata table. However, we can
obtain it using the `get_unharmonised_metadata()` function. This function
returns a data frame with one row per dataset, including the `unharmonised` 
column which contains unharmnised metadata as a nested data frame.

```{r}
harmonised <- metadata |> dplyr::filter(tissue == "kidney blood vessel")
unharmonised <- get_unharmonised_metadata(harmonised)
unharmonised
```

Notice that the columns differ between each dataset's data frame:

```{r}
dplyr::pull(unharmonised) |> head(2)
```

# Cell metadata

Dataset-specific columns (definitions available at cellxgene.cziscience.com)

`cell_count`, `collection_id`, `created_at.x`, `created_at.y`, `dataset_deployments`, `dataset_id`, `file_id`, `filename`, `filetype`, `is_primary_data.y`, `is_valid`, `linked_genesets`, `mean_genes_per_cell`, `name`, `published`, `published_at`, `revised_at`, `revision`, `s3_uri`, `schema_version`, `tombstone`, `updated_at.x`, `updated_at.y`, `user_submitted`, `x_normalization`

Sample-specific columns (definitions available at cellxgene.cziscience.com)

`sample_`, `sample_name`, `age_days`, `assay`, `assay_ontology_term_id`, `development_stage`, `development_stage_ontology_term_id`, `ethnicity`, `ethnicity_ontology_term_id`, `experiment___`, `organism`, `organism_ontology_term_id`, `sample_placeholder`, `sex`, `sex_ontology_term_id`, `tissue`, `tissue_harmonised`, `tissue_ontology_term_id`, `disease`, `disease_ontology_term_id`, `is_primary_data.x`

Cell-specific columns (definitions available at cellxgene.cziscience.com)

`cell_`, `cell_type`, `cell_type_ontology_term_idm`, `cell_type_harmonised`, `confidence_class`, `cell_annotation_azimuth_l2`, `cell_annotation_blueprint_singler` 

Through harmonisation and curation we introduced custom column, not present in the original CELLxGENE metadata

- `tissue_harmonised`: a coarser tissue name for better filtering
- `age_days`: the number of days corresponding to the age
- `cell_type_harmonised`: the consensus call identity (for immune cells) using the original and three novel annotations using Seurat Azimuth and SingleR
- `confidence_class`: an ordinal class of how confident `cell_type_harmonised` is. 1 is complete consensus, 2 is 3 out of four and so on.             
- `cell_annotation_azimuth_l2`: Azimuth cell annotation
- `cell_annotation_blueprint_singler`: SingleR cell annotation using Blueprint reference
- `cell_annotation_blueprint_monaco`: SingleR cell annotation using Monaco reference
- `sample_id_db`: Sample subdivision for internal use
- `file_id_db`: File subdivision for internal use
- `sample_`: Sample ID
- `.sample_name`: How samples were defined

# RNA abundance

The `raw` assay includes RNA abundance in the positive real scale (not transformed with non-linear functions, e.g. log sqrt). Originally CELLxGENE include a mix of scales and transformations specified in the `x_normalization` column.

The `cpm` assay includes counts per million.

# Session Info
```{r}
sessionInfo()
```