<a href="https://colab.research.google.com/github/tuonglab/scRNA_workshop/blob/master/notebook/scRNA_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# scRNA-seq workshop analysis demo!

Before we dive into the demo, let's first install the necessary packages.

In [None]:
# set up the notebook
!pip install -qqq scanpy[leiden] harmonypy celltypist
# then clone the repository so that we have all the data and notebooks ready to go
!git clone https://github.com/tuonglab/scRNA_workshop.git

# Single-cell RNA seq analysis Demo

This demo will show you the common steps involved to get you started on single cell analysis in Python using [`Scanpy`](https://scanpy.readthedocs.io/en/stable/), the toolkit for analysing single-cell gene expression data.

<a href="https://scanpy.readthedocs.io/en/stable/"><img src="https://scanpy.readthedocs.io/en/stable/_static/Scanpy_Logo_BrightFG.svg" alt="Scanpy logo" width="100"></a>


## Preprocessing and Quality Control

First, import packages needed for single-cell RNA seq analysis.

In [None]:
import os

import scanpy as sc
import pandas as pd

# change to working directory
os.chdir("scRNA_workshop")

### Reading in files for analysis

For this demo, we have already saved the starting raw data as an `.h5ad` file, a common file format in single-cell analysis. You can read the file with the `read_h5ad` function from the [`anndata`](https://anndata.readthedocs.io/) package, which Scanpy also exposes as `sc.read_h5ad`.

This file contains the raw counts of the cells and genes, as well as the metadata associated with the cells and genes.

The file is saved in the `data` folder.


<a href="https://anndata.readthedocs.io/"><img src="https://raw.githubusercontent.com/scverse/anndata/main/docs/_static/img/anndata_schema.svg" alt="anndata_schema" width="500"></a>

The dataset we will be demo-ing today is on the human prostate.

<a href="https://www.frontiersin.org/journals/endocrinology/articles/10.3389/fendo.2022.1006101/full"><img src="https://www.frontiersin.org/files/Articles/1006101/fendo-13-1006101-HTML-r1/image_m/fendo-13-1006101-g001.jpg" alt="human prostate schema" width="500"></a>


In [None]:
adata = sc.read_h5ad("data/prostate_demo.h5ad")
adata

## Standard Quality Control

A very common QC step is to assess the mitochondrial content.

High mitochondrial content is often associated with poor quality cells. We can calculate the percentage of mitochondrial genes in each cell and plot it.

In [None]:
# mitochondrial genes, "MT-" for human, "mt-" for mouse
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=[
        "mt",
    ],
    inplace=True,
    log1p=True,
)
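
To make the three QC metrics concrete, here is a minimal sketch that recomputes them by hand with NumPy on a made-up count matrix. The gene names and counts are invented for illustration, and this is only meant to mirror what the columns added by `calculate_qc_metrics` represent, not how Scanpy computes them internally.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns). All values are made up.
gene_names = np.array(["MT-CO1", "MT-ND1", "ACTB", "CD3E"])
counts = np.array([
    [10, 10, 70, 10],  # healthy-looking cell
    [40, 40, 20, 0],   # most counts come from mitochondrial genes
    [0, 0, 5, 0],      # nearly empty droplet
])

# Flag mitochondrial genes by prefix, like adata.var["mt"] in the cell above.
is_mt = np.char.startswith(gene_names, "MT-")

n_genes_by_counts = (counts > 0).sum(axis=1)  # genes detected per cell: 4, 3, 1
total_counts = counts.sum(axis=1)             # counts per cell: 100, 100, 5
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts  # 20%, 80%, 0%

print(n_genes_by_counts, total_counts, pct_counts_mt)
```

The second toy cell would stand out in the violin plot because 80% of its counts are mitochondrial.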

In [None]:
sc.pl.violin(
    adata,
    [
        "n_genes_by_counts",  # the number of genes with at least one count in each cell
        "total_counts",  # the total UMI counts per cell
        "pct_counts_mt",  # the percentage of counts that come from mitochondrial genes
    ],
    jitter=0.4,
    multi_panel=True,
)

Continue processing with "good" cells only.

In [None]:
# remove cells that express fewer than 200 genes
sc.pp.filter_cells(adata, min_genes=200)
# remove genes that are expressed in fewer than 3 cells
sc.pp.filter_genes(adata, min_cells=3)
# always check after you have done some filtering to ensure that you are happy with the results
adata
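
The two filters are simply boolean masks over the rows and columns of the count matrix. The sketch below applies the same idea to a tiny invented matrix, with much smaller thresholds than the 200 genes and 3 cells used above so that the effect is visible.

```python
import numpy as np

# Toy count matrix: 4 cells x 5 genes. All values are made up.
counts = np.array([
    [1, 2, 0, 3, 0],
    [4, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [2, 1, 0, 5, 0],
])
min_genes, min_cells = 2, 3  # toy thresholds, chosen only for this example

# Keep cells in which at least `min_genes` genes are detected.
keep_cells = (counts > 0).sum(axis=1) >= min_genes
counts = counts[keep_cells]

# Then keep genes detected in at least `min_cells` of the remaining cells.
keep_genes = (counts > 0).sum(axis=0) >= min_cells
counts = counts[:, keep_genes]

print(counts.shape)  # 3 cells and 2 genes survive
```

Because the gene filter runs after the cell filter, the order of the two calls can change which genes are kept.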

## Normalisation

In [None]:
# Normalise (library-size correct) the data matrix 𝐗 to 10,000 counts per cell, so that counts become comparable between cells.
sc.pp.normalize_total(adata, target_sum=1e4)

# Logarithmise the data:
sc.pp.log1p(adata)
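
As a rough illustration of what these two calls do, the sketch below normalises a toy matrix by hand. The two invented cells have the same composition but were sequenced at different depths, so they should look identical after normalisation.

```python
import numpy as np

# Toy counts: 2 cells x 3 genes. All values are made up.
counts = np.array([
    [1.0, 3.0, 6.0],     # 10 counts in total
    [10.0, 30.0, 60.0],  # same composition, sequenced 10x deeper
])
target_sum = 1e4

# Divide every cell by its own size factor so each cell sums to target_sum.
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalised = counts / size_factors

# log1p(x) = log(1 + x) compresses the range and keeps zeros at zero.
log_normalised = np.log1p(normalised)

print(normalised)
```

Both rows become 1000, 3000 and 6000, which is why depth differences no longer dominate the comparison between cells.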

## Highly Variable Feature/Gene selection

Identify and inspect highly variable genes.

In [None]:
# (Expects logarithmised data)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
# stash the normalised counts in .raw, before we subset to only highly variable genes
adata.raw = adata
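
The intuition behind highly variable gene selection is that genes whose variance is large relative to their mean carry the most information about differences between cells. The sketch below ranks three invented genes by a plain variance-to-mean ratio. It is a deliberately simplified stand-in: Scanpy's default flavour also normalises the dispersion within bins of mean expression, which is left out here.

```python
import numpy as np

# Toy normalised (not logged) expression: 4 cells x 3 genes. Values are made up.
expr = np.array([
    [5.0, 0.0, 2.0],
    [5.0, 10.0, 2.0],
    [5.0, 0.0, 3.0],
    [5.0, 10.0, 1.0],
])

gene_mean = expr.mean(axis=0)
gene_var = expr.var(axis=0, ddof=1)  # sample variance per gene
dispersion = gene_var / gene_mean    # simplified dispersion, no binning

# Gene indices ordered from most to least variable.
ranking = np.argsort(dispersion)[::-1]
print(ranking)  # gene 1 first, then gene 2, then the constant gene 0
```

Gene 0 has the same mean as gene 1 but never changes between cells, so it ranks last.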

## Dimensionality Reduction

### Step 1: Subset to only highly variable genes

In [None]:
# Actually do the filtering for PCA
adata = adata[:, adata.var["highly_variable"]].copy()
adata

### Step 2: Regress out effects of "total_counts" per cell and percentage of mitochondrial counts

In [None]:
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
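
Conceptually, regressing out a covariate means fitting a linear model per gene and keeping only the residuals. Here is a minimal sketch for a single invented gene and a single covariate, using ordinary least squares from NumPy. The numbers are made up, and the `signal` vector is a hypothetical stand-in for biological variation.

```python
import numpy as np

# Toy covariate: sequencing depth of 4 cells (made-up values).
total_counts = np.array([1000.0, 2000.0, 3000.0, 4000.0])

# Hypothetical biological signal, chosen to be uncorrelated with depth.
signal = np.array([1.0, -1.0, -1.0, 1.0])

# Observed expression = depth effect + baseline + signal.
gene_expr = 0.001 * total_counts + 2.0 + signal

# Design matrix with an intercept column and the covariate.
design = np.column_stack([np.ones_like(total_counts), total_counts])
coef, *_ = np.linalg.lstsq(design, gene_expr, rcond=None)

# The residuals are what remains after removing the fitted depth effect.
residuals = gene_expr - design @ coef
print(coef, residuals)
```

The fitted coefficients recover the baseline (2.0) and the depth slope (0.001), and the residuals equal the signal we put in.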

### Step 3: Scale each gene to zero mean and unit variance. Clip values exceeding 10 standard deviations.

In [None]:
sc.pp.scale(adata, max_value=10)
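
Scaling turns each gene into a z-score, and the clipping step stops a handful of extreme cells from dominating the PCA. The sketch below shows this for one invented gene with a single outlier cell. It only caps the upper side, and it uses the sample standard deviation. Treat both choices as assumptions of the sketch rather than a description of Scanpy's internals.

```python
import numpy as np

# One toy gene measured in 200 cells: silent everywhere except one outlier.
n_cells = 200
gene = np.zeros(n_cells)
gene[0] = 50.0

# z-score: subtract the mean and divide by the standard deviation.
z = (gene - gene.mean()) / gene.std(ddof=1)

# Cap values above 10 standard deviations (upper side only in this sketch).
max_value = 10.0
clipped = np.minimum(z, max_value)

print(z[0], clipped[0])  # roughly 14.07 before clipping, 10.0 after
```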

### Step 4: Perform Principal Component Analysis (PCA)

In [None]:
sc.tl.pca(adata)
# visualise the variance contribution by each PC
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
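
To see where the variance ratios in the plot come from, here is a small sketch of PCA via singular value decomposition on an invented matrix. It is a plain NumPy illustration of the technique, and Scanpy may use a different solver under the hood.

```python
import numpy as np

# Toy matrix: 5 cells x 3 genes (made up). Gene 2 is half of gene 0,
# so two components are enough to describe all of the variation.
X = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 1.0, 2.0],
    [6.0, 0.0, 3.0],
    [8.0, 1.0, 4.0],
    [10.0, 0.0, 5.0],
])

# Centre each gene, then decompose.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S  # coordinates of each cell on the principal components
explained_variance = S**2 / (X.shape[0] - 1)
variance_ratio = explained_variance / explained_variance.sum()

print(variance_ratio)
```

The ratios are sorted from largest to smallest and sum to 1, and the third one is essentially zero because the third gene adds no new information.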

### Step 5: Compute neighbourhood graph

In [None]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
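
The neighbourhood graph connects each cell to the cells closest to it in PCA space. The sketch below builds a brute-force k-nearest-neighbour list with NumPy on six invented cells. Real datasets are too large for this approach, so libraries typically rely on approximate search; the code here is only meant to show the idea.

```python
import numpy as np

# Toy PCA coordinates: two well-separated groups of three cells (made up).
pcs = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
k = 2

# Pairwise Euclidean distances between all cells.
diff = pcs[:, None, :] - pcs[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # in this sketch a cell is not its own neighbour

# Indices of the k nearest neighbours of each cell.
neighbours = np.argsort(dist, axis=1)[:, :k]
print(neighbours)
```

Every cell picks its two neighbours from its own group, which is the structure that UMAP and clustering later build on.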

### Step 6: Embed the neighbourhood graph using UMAP

UMAP stands for Uniform Manifold Approximation and Projection. It is a non-linear dimensionality reduction technique that is well-suited for preserving local structure in high-dimensional data.

In [None]:
sc.tl.umap(adata)

#### Visualise UMAP:

In [None]:
sc.pl.umap(adata, color=["patient", "group", "age"])

### Repeat with batch correction!

Harmony is a popular batch correction tool that works in PCA space: it iteratively learns a cell-specific linear correction and applies it to the principal components.

It projects cells into a shared embedding in which they group by cell type rather than by dataset-specific conditions, and it can account for multiple experimental and biological factors simultaneously.

In [None]:
# run Harmony on the PCA embedding, then repeat the neighbours and UMAP steps on the corrected embedding
sc.external.pp.harmony_integrate(adata, key="barcode")
sc.pp.neighbors(adata, n_neighbors=10, use_rep="X_pca_harmony")
sc.tl.umap(adata)
sc.pl.umap(adata, color=["patient"])

### Step 7: Clustering

We will use the `leiden` algorithm to cluster the cells into groups. It is a graph-based clustering algorithm that is very popular in single-cell analysis: it works on the neighbourhood graph computed earlier and detects communities by optimising a modularity function. Its `resolution` parameter controls the granularity of the clustering, with higher values giving more, smaller clusters.

In [None]:
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color="leiden")
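
To give a feel for the modularity function mentioned above, the sketch below scores two candidate partitions of a tiny invented graph. It uses the standard modularity formula with a resolution parameter and only evaluates partitions; it does not implement the Leiden search itself. The `modularity` helper is hypothetical and written just for this example.

```python
from collections import defaultdict


def modularity(edges, membership, resolution=1.0):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1

    # Total degree of the nodes in each community.
    community_degree = defaultdict(int)
    for node, deg in degree.items():
        community_degree[membership[node]] += deg

    return sum(
        internal[c] / m - resolution * (community_degree[c] / (2 * m)) ** 2
        for c in community_degree
    )


# Two triangles joined by a single bridge edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

print(modularity(edges, two_clusters))  # about 0.357
print(modularity(edges, one_cluster))   # 0.0
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of partition a modularity-based method favours.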

### Step 8: Cell type annotation

One of the most important steps for scRNA-seq analysis is to perform cell-type annotation. If we don't do this, we can't interpret the data.

Today, we will use a tool called `CellTypist` to help us automate this process.

[`CellTypist`](https://www.celltypist.org/) is a tool that uses logistic regression models, trained on annotated reference datasets, to predict the cell type of each cell from its gene expression profile.

We already installed `celltypist` in the setup cell, so we only need to prepare the data. CellTypist expects log-normalised expression (10,000 counts per cell), which is what we stashed in `.raw`.

In [None]:
# make a copy of the log1p normalised data for celltypist
for_celltypist = adata.raw.to_adata()

In [None]:
# load up celltypist
import celltypist
from celltypist import models
# download a specific model, for example `Immune_All_Low.pkl`
models.download_models(model="Immune_All_Low.pkl")

Run CellTypist on our data to predict a label for each cell. Setting `majority_voting=True` additionally refines the per-cell labels by assigning each local cluster its most common label.

In [None]:
predictions = celltypist.annotate(for_celltypist, model="Immune_All_Low.pkl", majority_voting=True)
# transfer the predictions back to the original adata
adata.obs["celltypist_majority_voting"] = predictions.predicted_labels.majority_voting
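
The idea behind majority voting can be sketched in a few lines: within each cluster, every cell receives the label that most cells in that cluster were given. The `majority_vote` helper, cluster ids and labels below are all invented for illustration, and CellTypist's own implementation has more options than this.

```python
from collections import Counter


def majority_vote(cluster_ids, cell_labels):
    """Give every cell the most common predicted label of its cluster."""
    votes = {}
    for cluster, label in zip(cluster_ids, cell_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    winner = {cluster: counter.most_common(1)[0][0] for cluster, counter in votes.items()}
    return [winner[cluster] for cluster in cluster_ids]


# Toy per-cell predictions (made up): one odd label in each cluster.
clusters = ["0", "0", "0", "1", "1", "1", "1"]
labels = ["T cells", "T cells", "B cells", "B cells", "B cells", "B cells", "T cells"]

print(majority_vote(clusters, labels))
# ['T cells', 'T cells', 'T cells', 'B cells', 'B cells', 'B cells', 'B cells']
```

The isolated mismatches are smoothed out, which usually gives cleaner labels on the UMAP at the cost of hiding rare cells inside a larger cluster.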

Visualise the data on the UMAP with the new CellTypist labels.

In [None]:
sc.pl.umap(adata, color=["celltypist_majority_voting"])

### Let's examine some genes

In [None]:
# these are T-cell genes
marker_genes = [
    "CD4",
    "CD8B",
    "FOXP3",
    "SELL",
    "CCR7",
    "MKI67",
    "NKG7",
    "GATA3",
    "RORC",
    "CXCR5",
    "CD69",
    "GZMK",
]
sc.pl.umap(adata, color=marker_genes, ncols=3)

In [None]:
sc.pl.dotplot(
    adata, marker_genes, groupby="celltypist_majority_voting", standard_scale="var", color_map="Blues"
)
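
A dot plot encodes two numbers per gene and group: the fraction of cells expressing the gene (dot size) and the mean expression (dot colour). With `standard_scale="var"`, the mean of each gene is rescaled to run from 0 to 1 across groups. The sketch below computes these values by hand for one invented gene and two invented groups.

```python
import numpy as np

# Toy expression of a single gene in 6 cells, with their group labels (made up).
expr = np.array([4.0, 0.0, 2.0, 0.0, 0.0, 1.0])
groups = np.array(["T", "T", "T", "B", "B", "B"])

names = np.unique(groups)  # sorted group names: B, then T

# Dot size: fraction of cells in each group with non-zero expression.
fraction_expressing = np.array([(expr[groups == g] > 0).mean() for g in names])

# Dot colour: mean expression per group, rescaled to [0, 1] across groups.
mean_expression = np.array([expr[groups == g].mean() for g in names])
value_range = mean_expression.max() - mean_expression.min()
scaled = (mean_expression - mean_expression.min()) / value_range

print(names, fraction_expressing, scaled)
```

After scaling, the group with the highest mean always gets 1 and the lowest gets 0, so colours are comparable within a gene but not between genes.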

This marks the end of the demo. Good job!

# Other useful resources
For more details on the dataset that we demoed today, please check out the original publication and data portal:

https://www.prostatecellatlas.org/

<a href="https://doi.org/10.1016/j.celrep.2021.110132"><img src="https://www.prostatecellatlas.org/assets/cover.jpg" alt="prostate_cellrep" width="200"></a>

If you have any questions, email Kelvin at z.tuong@uq.edu.au