vignettes/single_cell_rnaseq_practical.Rmd

---
title: "Analysis of single-cell RNA-seq data using a topic model, Part 2: practical implementation"
author: Peter Carbonetto
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_float: no
    highlight: textmate
    theme: readable
vignette: >
  %\VignetteIndexEntry{Analysis of single-cell RNA-seq data using a topic model, Part 2: practical implementation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette continues from [Part 1][vignette-part-1], where we
introduced the basic steps of a topic modeling analysis for
single-cell data. Here we give more detailed guidance and discuss
complications that may arise.

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
                      fig.align = "center",dpi = 120)
```

We begin our analysis by loading the packages. Then we set the seed so
that the results can be reproduced.

```{r load-pkgs, message=FALSE, warning=FALSE}
library(Matrix)
library(fastTopics)
library(ggplot2)
library(cowplot)
set.seed(1)
```

Now we load the data set and the $K = 6$ pre-fitted topic model.

```{r load-data}
data(pbmc_facs)
counts <- pbmc_facs$counts
fit <- pbmc_facs$fit
```

Assessing convergence of model fitting
--------------------------------------

The topic model was fitted using the `fit_topic_model` function. This
function, as we mentioned, hides most of the complexities of model
fitting. Nevertheless, it is a good idea to check that the model
fitting has converged to a maximum-likelihood solution. This is
easily done with 'fastTopics':

```{r plot-loglik, fig.height=2, fig.width=4}
plot_progress(fit,x = "iter",add.point.every = 10,colors = "black") +
  theme_cowplot(font_size = 10)
```

Internally, `fit_topic_model` first fits a non-negative matrix
factorization (NMF) to the count data, then converts the fitted NMF to
a topic model, so the plot shows the improvement in the model fit over
time. Judging by this plot, the parameter estimates get close to a
maxium-likelihood solution after about 150 updates.

For larger or more complex data sets, more updates may be needed to
obtain a good fit. For guidance, see `help(fit_topic_model)` and
`help(fit_poisson_nmf)`.

Evaluating fit: single-cell likelihoods
---------------------------------------

The topic model can be used to calculate *a likelihood for each cell:*

```{r loglik-1}
loglik <- loglik_multinom_topic_model(counts,fit)
```

This can be used to assess how well the topic model "fits" each cell.

```{r loglik-2, fig.height=2, fig.width=4.5}
pdat <- data.frame(loglik)
ggplot(pdat,aes(loglik)) +
  geom_histogram(bins = 64,color = "white",fill = "black",size = 0.25) +
  labs(y = "number of cells") +
  theme_cowplot(font_size = 10)
```

Most of the poorly fit cells are in the CD34+ subpopulation:

```{r loglik-3, fig.height=2, fig.width=5.5}
subpop_colors <- c("dodgerblue","forestgreen","darkmagenta","skyblue","gold")
pdat <- data.frame(loglik = loglik,subpop = pbmc_facs$samples$subpop)
ggplot(pdat,aes(x = loglik,fill = subpop)) +
  geom_histogram(bins = 64,color = "white",size = 0.25) +
  scale_fill_manual(values = subpop_colors) +
  labs(y = "number of cells") +
  theme_cowplot(font_size = 10)
```

Visualizing the topics without cell labels
------------------------------------------

In [Part 1][single_cell_rnaseq_practical], we made use of additional
information about the cells to help us interpret the topics. Even
without the cell labels, the Structure plot can still be effective:

```{r structure-plot-without-labels, fig.width=6.5, fig.height=1.5, results="hide", message=FALSE}
topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue",
                  "gold","darkorange")
structure_plot(fit,colors = topic_colors)
```

From this Structure plot, the topics clearly distinguish the B cells
(dark blue), CD14+ monocytes (green) and CD34+ cells (purple).  The NK
cells (light blue) and T cells (cells high proportions of yellow and
orange) are harder to distinguish, and this is reflected in sharing of
topics (mostly topic 6) among NK and T cells.

Of course, without the cell labels, we cannot know that the topics
correspond to these cell types without further downstream
analyses---for this, we performed a GoM DE analysis (see
[Part 1][vignette-part-1]).

Identifying clusters from the topic proportions
-----------------------------------------------

In more complex data sets, some structure may not show up well without
additional fine-tuning of the plot.

One simple strategy that often works well is to first identify
clusters in the topic proportions matrix, $L$. Here we use $k$-means
to identify clusters, but other methods could be used.

```{r clustering-1}
set.seed(1)
pca <- prcomp(fit$L)$x
clusters <- kmeans(pca,centers = 6,iter.max = 100)$cluster
summary(factor(clusters))
```

We chose 6 clusters, but to be clear the best number of clusters may
not be the same as the number of topics.

Now we create another Structure plot in which the cells are grouped
according to the clusters:

```{r structure-plot-by-cluster-1, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE}
structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25,
               grouping = clusters) +
  theme(axis.text.x = element_blank())
```

k-means somewhat arbitrarily split the T cells (cells with high
proportions of topics 5 and 6) into clusters 1 and 3. Therefore, we
merge these two clusters:

```{r clustering-2}
clusters[clusters == 3] <- 1
clusters <- factor(clusters)
```

We have found that visual inspection of the clusters followed by
manual refinement is often be needed to get the "right" clustering.

Now we can label the clusters by these cell types in the plot.
(Noting that this labeling is usually not possible until downstream
analyses have been performed.)

```{r structure-plot-by-cluster-2, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE}
levels(clusters) <- c("T","CD14+","B","CD34+","NK")
structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25,
               grouping = clusters)
```

It is also sometimes helpful to visualize the structure in PCA plots
which show the projection of the cells onto PCs of the topic
proportions:

```{r pca-plot-1, fig.width=3.5, fig.height=2.5, message=FALSE, warning=FALSE}
cluster_colors <- c("gold","forestgreen","dodgerblue","darkmagenta","skyblue")
pca_plot(fit,fill = clusters) +
  scale_fill_manual(values = cluster_colors)
```

When there are many overlapping points, plotting the *density* can
sometimes show the clusters more clearly:

```{r pca-plot-2, fig.width=3.75, fig.height=2.5, message=FALSE}
pca_hexbin_plot(fit,bins = 24)
```

More on differential expression analysis
----------------------------------------

In [Part 1][vignette-part-1], we performed a DE analysis using the
topic model.

```{r diff-count-analysis}
de <- pbmc_facs$de
```

When the volcano plot shows many overlapping differentially expressed
genes, it can be helpful to explore the results interactively. The
function `volcano_plotly` creates an *interactive volcano plot* that can
be viewed in a Web browser. For example, here is an interactive
volcano plot for the T cells topic:

```{r volcano-plotly, warning=FALSE}
genes <- pbmc_facs$genes
p <- volcano_plotly(de,k = 6,file = "volcano_plot_t_cells.html",
                    labels = genes$symbol,ymax = 100)
```

This call creates a new file `volcano_plot_t_cells.html`. The
interactive volcano plot can also be viewed [here][volcano-plotly], or
by calling `print(p)` in R.

Session info
------------

This is the version of R and the packages that were used to generate
these results.

```{r session-info}
sessionInfo()
```

[vignette-part-1]: https://stephenslab.github.io/fastTopics/articles/single_cell_rnaseq_basic.html
[volcano-plotly]: https://stephenslab.github.io/fastTopics/articles/volcano_plot_t_cells.html
[zheng-2017]: https://doi.org/10.1038/ncomms14049