analysis/gbcd_hnscc_intro.Rmd

---
title: "Dissecting tumor transcriptional heterogeneity from multi-tumor single-cell RNA-seq data using GBCD"
author: Yusha Liu
output: workflowr::wflow_html
---

Overview
--------

This vignette shows how to apply "generalized binary covariance
decomposition" (GBCD) to jointly analyze single-cell RNA-seq
(scRNA-seq) data from *malignant cells* collected from multiple
patients and/or studies, using a head and neck squamous cell carcinoma
(HNSCC) dataset from [Puram et al. 2017][hnscc].

GBCD can effectively dissect tumor transcriptional heterogeneity into
patient/study-specific and shared gene expression programs
(GEPs). GBCD is "unsupervised" in that, unlike tumor-by-tumor (e.g.,
[Puram et al. (2017)][hnscc]) and many harmonization approaches (e.g.,
[Harmony][harmony], [Liger][liger]), it does not use information about
which cell comes from which tumor or study. Instead, GBCD only
requires the combined scRNA-seq data from all tumors, which are stored
as an $N \times J$ matrix $Y$ of expression values with entries
$y_{ij}$, where $i=1,\dots,N$ indexes malignant cells and
$j=1,\dots,J$ indexes genes. In typical applications, $Y$ contains
log-transformed pseudo-count-modified UMI counts ("log-pc counts").

GBCD ultimately yields a decomposition of the expression data matrix
$Y$ into matrices $L$ and $F$ such that $Y \approx L F^T$, or
equivalently, $$y_{ij} \approx \sum_{k=1}^K l_{ik} f_{jk}.$$ The $K$
components should be interpretable as GEPs, with $l_{ik}$ representing
the membership of cell $i$ in GEP $k$, and $f_{jk}$ representing the
effect of GEP $k$ on the expression of gene $j$. When $y_{ij}$ are
log-pc counts, each $f_{jk}$ approximately represents the log-fold
change (LFC) associated with membership in GEP $k$, so we refer to the
$f_{jk}$ values as LFCs, and to the vector of LFCs $(f_{1k}, \dots,
f_{Jk})^T$ as the "signature" of GEP $k$.

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#", collapse = TRUE, results = "hold",
                      fig.align = "center", dpi = 120)
```

We begin our analysis by loading the R packages used in the analysis,
as well as some custom functions implementing the GBCD model fitting
and interpretation.. Then we show how to apply GBCD to the HNSCC
dataset step-by-step.

```{r load-pkgs, message=FALSE, warning=FALSE}
library(ggplot2)
library(cowplot)
library(RColorBrewer)
library(ggrepel)
library(pheatmap)
library(gridExtra)
library(Seurat)
library(Matrix)
library(ebnm)
library(flashier)
library(magrittr)
library(ashr)
source("../code/fit_cov_ebnmf.R")
```

The HNSCC dataset
-----------------

This data set contains gene expression data for n = 2,176 malignant
cells collected in primary tumors from 10 HNSCC patients, as well as
matching lymph node (LN) metastases from 5 of these
patients. [Puram et al. (2017)][hnscc] found that each of these 10
patients mapped to a molecular subtype of HNSCC, whose
signatures were previously defined by analysis of bulk expression data
of 279 [TCGA][tcga] HNSCC tumors.

Unlike more recently generated data sets, the HNSCC data are not UMI
counts; rather, they are read counts produced by SMART-Seq2. Following
[Puram et al. (2017)][hnscc], we define the transformed counts as
$y_{ij} = \log_2(1 + \mathrm{TPM}_{ij}/10)$, where $\mathrm{TPM}_{ij}$
is the transcript-per-million (TPM) value for gene $j$ in cell $i$.

```{r load-data}
load("../hnscc/hnscc.RData")
dim(Y)
print(head(info),row.names = FALSE)
```

Visualize HNSCC data using *t*-SNE
----------------------------------

We plot the 2-D *t*-SNE embedding of these $2,176$ malignant cells,
which are colored by patient of origin and tumor stage (primary tumor,
LN metastasis), and shaped by tumor molecular subtype. We set seed to
make the *t*-SNE reproducible. 

```{r run-tSNE, warnings=FALSE, messages=FALSE}
set.seed(100)
hnscc <- CreateSeuratObject(counts = t(Y), project = "HNSCC", meta.data = info)
hnscc <- FindVariableFeatures(hnscc, selection.method = "vst",
                              nfeatures = 5000)
all.genes <- rownames(hnscc)
hnscc <- ScaleData(hnscc, features = all.genes)
hnscc <- RunPCA(hnscc, features = VariableFeatures(object = hnscc),
                npcs = 50, verbose = FALSE)
hnscc <- RunTSNE(hnscc, dims = 1:50)
```

Due to the presence of strong inter-tumor heterogeneity, these cells
demonstrate strong patient effects that are typical of cancer data;
the major structure in the *t*-SNE visualization is the clustering of
the cells by patient.

```{r plot-tSNE, fig.height=4, fig.width=5}
DimPlot(hnscc, label = TRUE, repel = TRUE, pt.size = 1, reduction = "tsne",
        group.by = "subject", shape.by = "subtype", cols = subject_col) +
  guides(shape = guide_legend(override.aes = list(size = 3)), ncol = 1) +
  theme(text = element_text(size = 10)) +
  scale_shape_manual(values = c(15,16,18))
```

Apply GBCD to estimate GEP memberships and signatures
---------------------------------

This is accomplished with a call to the custom function
`flash_fit_cov_ebmf`.

To fit GBCD, you need to specify `Kmax`, an upper bound of the number
of components $K$, for initialization; the number of GEPs returned by
GBCD is close to but often not equal to (and can be larger than)
`Kmax`. Generally speaking, a larger `Kmax` allows for identifying
finer structure in tumor transcriptional heterogeneity at the expense
of higher computational cost. A good choice of `Kmax` depends on the
complexity of the analyzed data (e.g., the number of patients and
studies contained), and is often unknown in advance. You1 are
encouraged to explore and compare the results for different values of
`Kmax`. After some initial exploration, we set `Kmax` to 24.

You also need to specify a prior for GEP membership values. GBCD
assigns a "generalized binary" (GB) prior independently to each entry
of $L$, $$l_{ik}\sim (1-\pi_k^l) \delta_0 + \pi_k^l \: N_{+} (\mu_k,
\sigma_k^2),$$ where the ratio $\sigma_k/\mu_k$ is fixed at some
pre-specified small value. In this analysis, we set $\sigma_k/\mu_k
= 0.04$ for all $k$.

```{r fit-gbcd, eval=FALSE}
fit.gbcd <-
  flash_fit_cov_ebnmf(Y = Y, Kmax = 24,
                      prior = flash_ebnm(prior_family = "generalized_binary",
                                         scale = 0.04),
   				      extrapolate = FALSE)
```

Note that you can choose other nonnegative priors for GEP membership
values so long as they are defined in the [ebnm][ebnm] package. For
example, you can chose a point-exponential prior by setting `prior =
ebnm::ebnm_point_exponential`.

The time to fit GBCD depends on the choice of `Kmax` and the size of
the data set being analyzed. It took us a few hours to fit the GBCD
model with `Kmax = 24` to the HNSCC data set.

Since it takes a while to fit GBCD to the HNSCC dataset, we directly
load in the saved output from the above call and show how to interpret
the results.

```{r load-fit}
load("../hnscc/hnscc_gbcd.RData")
```

The `flash_fit_cov_ebnmf` output contains the membership estimates,
stored as an $n \times K$ matrix,

```{r summary-fit-L}
dim(fit.gbcd$L)
fit.gbcd$L[1:4,1:5]
```

and posterior statistics for the LFC estimates, including the
posterior means, stored as an $m \times K$ matrix,

```{r summary-fit-F}
dim(fit.gbcd$F$lfc)
fit.gbcd$F$lfc[1:4,1:5]
```

GEP membership estimates
------------------------

We now examine at the entries of the `L` matrix giving membership
estimates of cell $i$ in GEP $k$. Here we visually compare the
memberships to the patient labels and provided tumor subtype
annotations with a heatmap. 

```{r plot-L, fig.height=4, fig.width=7}
anno <- data.frame(subject = info$subject,
                   subtype = info$subtype)
rownames(anno) <- rownames(fit.gbcd$L)
anno_colors <- list(subject = subject_col,
                    subtype = subtype_col)
cols <- colorRampPalette(c("gray96","red"))(50)
brks <- seq(0,1,0.02)
rows <- order(anno$subject)
p <- pheatmap(fit.gbcd$L[rows, -1], cluster_rows = FALSE, cluster_cols = FALSE,
         show_rownames = FALSE, annotation_row = anno,
         annotation_colors = anno_colors, annotation_names_row = FALSE,
		 angle_col = 45, fontsize = 8, color = cols, breaks = brks,
		 main = "")
print(p)		 
```

This heatmap shows the membership values of the n = 2,176 cells (the
rows in the heatmap) and the 28 GEPs (the columns in the
heatmap). (The membership values for the "baseline" GEP are not
shown.) Cells are arranged top-to-bottom by tumor molecular subtype
and patient of origin. We rescaled the membership values separately for
each GEP so that the maximum membership for each GEP is always 1.

From this heatmap, we see that GEPs 1, 2 and 3 correspond closely to
the molecular subtypes previously defined based on bulk RNA-seq data:
GEP1 is largely active only in cells from the 2 classical patients,
GEP2 is mainly active in cells from the 7 basal patients, and GEP3 is
mainly active in cells from the 1 atypical patient. This demonstrates
that GBCD can reconstruct the molecular subtype information from the
single cell data alone.

Among other GEPs, some are active predominantly in an individual
patient and can thus be interpreted as patient-specific GEPs.  We
interpret GEPs 4, 7, 9, 11, 25, 27 as "patient-specific GEPs"; the
remaining GEPs are shared across multiple patients but are capturing
something different than these previously defined molecular subtypes.

GEP signature estimates
-----------------------

The GEP signature estimates $f_{jk}$ represent approximately the
log-fold change (using t he base-2 logarithm) associated with
membership in GEP $k$. We take the approach commonly used in
differential expression analysis and create a "volcano plot" to
visualize the gene signature for a GEP. As an example, this is the
volcano plot for GEP2:

```{r plot-lfc, fig.height=3.5, fig.width=5}
k <- "GEP2"
pdat <- data.frame(gene = rownames(fit.gbcd$F$lfc), 
                   lfc  = fit.gbcd$F$lfc[,k], 
                   z    = abs(fit.gbcd$F$z_score[,k]), 
                   lfsr = fit.gbcd$F$lfsr[,k],
                   stringsAsFactors = FALSE)
pdat <- transform(pdat,lfsr = cut(lfsr, c(-1,0.001,0.01,0.05,Inf)))
rows  <- with(pdat, which(!(abs(lfc) > quantile(abs(lfc),0.999) | (z > 10))))
pdat[rows, "gene"] <- ""
ggplot(pdat, aes(x = lfc, y = z, color = lfsr, label = gene)) +
  geom_point() + 
  geom_text_repel(color = "black", size = 2.3, segment.color = "black",
                  segment.size = 0.25, min.segment.length = 0,
                  max.overlaps = Inf, na.rm = TRUE) +
  scale_color_manual(values = c("coral","orange","gold","deepskyblue")) +
  labs(x = "log-fold change", y = "|posterior z-score|") + 
  theme_cowplot(font_size = 10)
```

The significance measure here is not a *p*-value, but rather a [local
false sign rate (lfsr)][lfsr].

Session info
------------

This is the version of R and the packages that were used to generate
these results.

```{r session-info}
sessionInfo()
```

[hnscc]: https://doi.org/10.1016/j.cell.2017.10.044
[harmony]: https://portals.broadinstitute.org/harmony
[liger]: https://github.com/welch-lab/liger
[tcga]: https://doi.org/10.1038%2Fnature14129
[ebnm]: https://stephenslab.github.io/ebnm/
[lfsr]: https://doi.org/10.1093/biostatistics/kxw041