vignettes/tidy_single_cell.Rmd

---
title: "Single-cell Tidy Transcriptomics - analysis of single-cell RNA sequencing data with R tidy principles"
author:
  - name: Stefano Mangiola
    affiliation: The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC 3052, Melbourne, Australia; Department of Medical Biology, The University of Melbourne, Parkville, VIC 3010, Melbourne, Australia
  - name: Maria Doyle
    affiliation: Peter MacCallum Cancer Centre, 305 Grattan Street, Parkville, Melbourne, Victoria, Australia
date: "11 November 2020"
vignette: >
  %\VignetteIndexEntry{Single-cell Tidy Transcriptomics - analysis of single-cell RNA sequencing data with R tidy principles}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: "`r file.path(system.file(package='zhejiang2020', 'vignettes'), 'tidytranscriptomics.bib')`"
output:
  BiocStyle::html_document:
    fig_caption: true
---

![](tidybulk_logo.png){width=100px}


## Introduction

# Set-up

```{r setup2, eval=TRUE, message=FALSE}

library(zhejiang2020)

# Bioconductor
library(scran)
library(scater)
library(EnsDb.Hsapiens.v86)

# tidyverse core packages
library(tibble)
library(dplyr)
library(tidyr)
library(readr)
library(magrittr)
library(ggplot2)
library(ggbeeswarm)

#library(tidyHeatmap)
library(SingleCellExperiment)
library(tidySingleCellExperiment)
```


### Participation

After the lecture, participants are expected to follow along the hands-on session. we highly recommend participants bringing your own laptop.

### _R_ / _Bioconductor_ packages used

The following R/Bioconductor packages will be explicitly used: 

* tidySingleCellExperiment
* DropletUtils
* scran
* scater
* singleR

### Time outline

| Activity                         | Time |
|----------------------------------|------|
| Analysis workflow                | 30m  |
| Q & A                            | 20m  |


## Data loading

We get the original SingleCellExperiment data used in the previous session.

```{r}
zhejiang2020::single_cell_experiment
```

We can get a tidy representation where cell-wise information is displayed. The dataframe is displayed as `tibble abstraction`, to indicate that appears and act as a `tibble`, but underlies a `SingleCellExperiment`.

```{r}
counts = 
  zhejiang2020::single_cell_experiment %>% 
  tidy()

counts
```

If we need, we can extract transcript information too. A regular tibble will be returned for independent visualisation and analyses.

```{r}
counts %>%
  join_transcripts("ENSG00000228463")
```

As before, we add gene symbols to the data

```{r}
#--- gene-annotation ---#
rownames(counts) <- 
  uniquifyFeatureNames(
    rowData(counts)$ID,
    rowData(counts)$Symbol
  )
```

We check the mitochondrial expression for all cells

```{r}
# Gene product location
location <- mapIds(
  EnsDb.Hsapiens.v86, 
  keys=rowData(counts)$ID, 
  column="SEQNAME",
  keytype="GENEID"
)

#--- quality-control ---#
counts_annotated = 
  counts %>%
  
  # Join mitochondrion statistics
  left_join(
    perCellQCMetrics(., subsets=list(Mito=which(location=="MT"))) %>%
    as_tibble(rownames="cell"),
    by="cell"
  ) %>%
  
  # Label cells
  mutate(high_mitochondrion = isOutlier(subsets_Mito_percent, type="higher")) 
```

We can plot various statistics

```{r}
counts_annotated %>%
  plotColData(
    y = "subsets_Mito_percent",
    colour_by = "high_mitochondrion"
  ) +
  ggtitle("Mito percent")

counts_annotated %>%
  ggplot(aes(x=1,y=subsets_Mito_percent,
             color = high_mitochondrion, 
             alpha=high_mitochondrion,
             size= high_mitochondrion
            )) +
  ggbeeswarm::geom_quasirandom() +

  # Customisation
  scale_color_manual(values=c("black", "#e11f28")) +
  scale_size_discrete(range = c(0, 2))

```

We can filter the the alive cells using dplyr function.

```{r}
counts_filtered = 
  counts_annotated %>%
  
  # Filter data
  filter(!high_mitochondrion)

counts_filtered
```


### Scaling

As before, we can use standard Bioconductor utilities for calculating scaled log counts.

```{r}
#--- normalization ---#
set.seed(1000)

# Calculate clusters
clusters <- quickCluster(counts_filtered)

# Add scaled counts
counts_scaled <- 
  counts_filtered %>%
  computeSumFactors(cluster=clusters) %>%
  logNormCounts()

counts_scaled %>%
  join_transcripts("CD79B")
```


### Detect variable gene-transcripts

As before, we can use standard Bioconductor utilities to identify variable genes.

```{r}
#--- variance-modelling ---#
set.seed(1001)
gene_variability <- modelGeneVarByPoisson(counts_scaled)
top_variable <- getTopHVGs(gene_variability, prop=0.1)
```

### Dimensionality reduction

As before, we can use standard Bioconductor utilities to calculate reduced dimensions.

```{r}
#--- dimensionality-reduction ---#
set.seed(10000)
counts_reduction <- 
  counts_scaled %>%
  denoisePCA(subset.row=top_variable, technical=gene_variability) %>%
  runTSNE(dimred="PCA") %>%
  runUMAP(dimred="PCA")

counts_reduction
```

## Clustering

We use mutate function from dplyr to attach the cluster label to the existing dataset.

```{r}
counts_cluster <-
  counts_reduction %>%
  mutate(
    cluster = 
      buildSNNGraph(., k=10, use.dimred = 'PCA') %>%
      igraph::cluster_louvain() %$%
      membership %>%
      as.factor()
  )
  
counts_cluster

```

We can customise the tSNE plot plotting with ggplot.

```{r}
plotTSNE(counts_cluster, colour_by="cluster",text_by="cluster")

counts_cluster %>%
  ggplot(aes(
    TSNE1, TSNE2, 
    color=cluster, 
    size = 1/subsets_Mito_percent 
  )) +
  geom_point(alpha=0.2) +
  theme_bw()
```