# 2018-10-04 Monocle
I looked at several packages for analyzing the single-cell RNA-seq data, and one of the most cited and best documented is an R package called Monocle. Here I want to explore whether this package can be used to perform the requested task.

In [None]:
# load required libraries
library(monocle)
library(biomaRt)

## Loading data

The first thing we'll do is load the data into data structures that R can handle easily. The Monocle package requires three things to be loaded:
- the expression matrix (loaded into a data frame)
- the sample sheet (*phenoData*) which contains the information on all the cells
- the gene annotation data (*featureData*) which contains information on the genes in the expression matrix

The expression matrix has been given to us by the CNAG. Separately I wrote files that describe the characteristics of the cells in each well, as well as very basic information on the genes. We now load the data.

In [None]:
# file names
matrices_dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data/matrices"
sample_name <- "P2449"
sample_sheet_fname <- sprintf("%s/monocle/%s.pd.tsv", matrices_dir, sample_name)
expr_matrix_fname <- sprintf("%s/%s.tsv.gz", matrices_dir, sample_name)

Before loading the data, a few notes on the options given to the `read.delim` and `read.table` functions. I write `header = TRUE` and `row.names = 1` because the first row and the first column contain the names of the genes, the cells, or the column labels. It is very important to give the `check.names = FALSE` option to `read.table`: otherwise R sanitizes the column names and converts each dash into a dot, which makes the cell names in `expr_matrix` inconsistent with those in the `phenoData`.

In [None]:
# load data
sample_sheet <- read.delim(sample_sheet_fname, header = TRUE, row.names = 1)
expr_matrix <- read.table(expr_matrix_fname, header = TRUE, row.names = 1,
                          sep = "\t", check.names = FALSE)
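
To see why `check.names = FALSE` matters, here is a minimal sketch on an in-memory table. The cell names `P2449-A1` and `P2449-A2` and the gene ID are made up for illustration; the real names come from the matrix file.

```r
# Toy tab-separated table; cell names and gene ID are hypothetical
txt <- "gene\tP2449-A1\tP2449-A2\nENSG00000000001.1\t5\t0\n"

# Default behaviour: column names are sanitized, so "-" becomes "."
mangled <- read.table(text = txt, header = TRUE, row.names = 1, sep = "\t")

# With check.names = FALSE the dashes survive
intact <- read.table(text = txt, header = TRUE, row.names = 1, sep = "\t",
                     check.names = FALSE)

print(colnames(mangled))
print(colnames(intact))
```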

For the gene annotations, we extract the gene names from the expression matrix, strip the version suffix (everything from the first dot onwards), then use the biomaRt package to get the symbols associated with those gene IDs.

In [None]:
gene_full_names <- row.names(expr_matrix)
gene_short_names <- gsub("\\..*","",gene_full_names)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
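# restrict the query to our genes; `values` only takes effect together with `filters`
allgenes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                  filters = "ensembl_gene_id",
                  values = gene_short_names,
                  mart = mart)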
allgenes_idx <- match(gene_short_names, allgenes$ensembl_gene_id)
gene_annotations <- data.frame(gene_full_names, gene_short_names,
                               allgenes$hgnc_symbol[allgenes_idx],
                               check.names = FALSE, row.names = 1)
colnames(gene_annotations) <- c("gene_short_name", "symbol")
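
The ID-stripping and matching steps can be checked in isolation, without querying Ensembl. In this sketch the `lookup` data frame is a hand-written stand-in for the table returned by `getBM`, and the version suffixes and the third gene ID are invented.

```r
# Hypothetical versioned Ensembl IDs, as they might appear as row names
gene_full_names <- c("ENSG00000141510.11", "ENSG00000012048.15",
                     "ENSG00000999999.1")

# Drop everything from the first dot onwards
gene_short_names <- gsub("\\..*", "", gene_full_names)

# Stand-in for the getBM() result; note the rows are in a different order
lookup <- data.frame(
  ensembl_gene_id = c("ENSG00000012048", "ENSG00000141510"),
  hgnc_symbol     = c("BRCA1", "TP53"),
  stringsAsFactors = FALSE
)

# match() returns, for each of our genes, its row in the lookup (or NA)
idx <- match(gene_short_names, lookup$ensembl_gene_id)
symbols <- lookup$hgnc_symbol[idx]
print(data.frame(gene_short_names, symbols))
```

Genes that are absent from the lookup end up with an `NA` symbol, which is why the `NA` filter is needed later on.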

We're now ready to put all this data together in the data structures provided by the Monocle package. We pass `expressionFamily = negbinomial.size()` to the `newCellDataSet` function because the matrices contain raw, unnormalized read counts for each gene.

In [None]:
# apply constructs from monocle package
pd <- new("AnnotatedDataFrame", data = sample_sheet)
fd <- new("AnnotatedDataFrame", data = gene_annotations)
HSMM <- newCellDataSet(as.matrix(expr_matrix),
    phenoData = pd, featureData = fd, expressionFamily = negbinomial.size())

`HSMM` is the basic data structure that contains all the information we have on the experiment. Monocle requires us to estimate size factors and dispersions first; later analyses use them to evaluate differences between cells.

In [None]:
# estimate size factors and dispersions
HSMM <- estimateSizeFactors(HSMM)
HSMM <- estimateDispersions(HSMM)
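
As a rough intuition for what a size factor is, the sketch below computes one simple variant: each cell's total count divided by the geometric mean of all totals. I believe this is close to what Monocle does by default, but that is an assumption, and the toy matrix is invented.

```r
# Toy count matrix: genes in rows, cells in columns (made-up values)
counts <- matrix(c(10, 20,  40,
                   30, 60, 120),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"),
                                 c("cell1", "cell2", "cell3")))

# Total counts per cell
cell_total <- colSums(counts)

# Size factor: total count relative to the geometric mean of all totals
size_factors <- cell_total / exp(mean(log(cell_total)))

# Dividing each column by its size factor puts cells on a common scale
normalized <- sweep(counts, 2, size_factors, "/")
print(size_factors)
print(normalized)
```

In this toy case the three cells differ only in sequencing depth, so after normalization every gene has the same value in every cell.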

The `HSMM` object is now ready to be used.

## Filtering out dead cells

The next step in the analysis is to remove cells from the analysis that do not pass quality control. These are already quite evident from the previous plots I made in the Python notebooks. The way we do this in the Monocle package is to add a column to the `phenoData` structure with parameters that allow us to identify cells that don't pass quality control.

We'll add a column that we call `Total_mRNAs` to the `phenoData`.

In [None]:
# add total expression to experiment phenoData
pData(HSMM)$Total_mRNAs <- Matrix::colSums(exprs(HSMM))
print(head(pData(HSMM)))

We now want to eliminate from the data set those cells that have too many or too few reads. We define the lower and upper limits as the mean minus and plus two standard deviations of the log10-transformed total counts, converted back to the count scale. To calculate the mean and standard deviation we only use the class of "control" cells, that is, the Jurkat cells plus the untreated latent cells.

In [None]:
# define the class of control cells
jkt <- row.names(subset(pData(HSMM), label == "Jurkat"))
jlat_untreated <- row.names(subset(pData(HSMM), label == "J-Lat+DMSO"))
controls <- union(jkt, jlat_untreated)

In [None]:
# define lower and upper bound on the total mRNA values
mRNA_mean <- mean(log10(pData(HSMM[,controls])$Total_mRNAs))
mRNA_std  <- sd(log10(pData(HSMM[,controls])$Total_mRNAs))
upper_bound <- 10^(mRNA_mean + 2*mRNA_std)
lower_bound <- 10^(mRNA_mean - 2*mRNA_std)

In [None]:
# remove cells that don't pass the criterion
HSMM <- HSMM[,pData(HSMM)$Total_mRNAs > lower_bound &
              pData(HSMM)$Total_mRNAs < upper_bound]
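
The same filtering rule can be tried on a plain named vector, which makes it easy to check by hand. The cell names and counts below are invented; `c5` and `c6` play the role of a nearly empty well and an overloaded one.

```r
# Hypothetical total counts per cell
total_mRNAs <- c(c1 = 1000, c2 = 1200, c3 = 900, c4 = 1100, c5 = 10, c6 = 1e6)

# Only the control cells are used to estimate the bounds
control_cells <- c("c1", "c2", "c3", "c4")
log_totals <- log10(total_mRNAs[control_cells])

# Mean +/- two standard deviations on the log10 scale, back-transformed
lower <- 10^(mean(log_totals) - 2 * sd(log_totals))
upper <- 10^(mean(log_totals) + 2 * sd(log_totals))

# Keep the cells that fall strictly inside the bounds
keep <- total_mRNAs > lower & total_mRNAs < upper
print(names(total_mRNAs)[keep])
```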

Now the `HSMM` data structure contains the information we want. Let's have a look at the distribution of values of total mRNA counts across the samples we selected.

In [None]:
qplot(Total_mRNAs, data = pData(HSMM), color = label, geom = "density") +
    geom_vline(xintercept = lower_bound) +
    geom_vline(xintercept = upper_bound)

In [None]:
HSMM <- detectGenes(HSMM, min_expr = 0.1)
head(fData(HSMM))
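
My understanding is that `detectGenes` simply counts, for each gene, the cells in which its expression exceeds `min_expr` (stored as `num_cells_expressed`), and for each cell the genes above the same threshold. A base-R sketch of that bookkeeping on an invented matrix:

```r
# Toy expression matrix: genes in rows, cells in columns (made-up values)
expr <- matrix(c(0, 3, 0, 1,
                 2, 2, 5, 0,
                 0, 0, 0, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("g1", "g2", "g3"), paste0("cell", 1:4)))
min_expr <- 0.1

# Per gene: number of cells in which it is detected
num_cells_expressed <- rowSums(expr > min_expr)

# Per cell: number of genes detected
num_genes_expressed <- colSums(expr > min_expr)

print(num_cells_expressed)
print(num_genes_expressed)
```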

Having run `detectGenes`, which records how many cells express each gene, we should verify that the expression values of the expressed genes follow a roughly log-normal distribution. Note that our matrix holds read counts rather than FPKM.

In [None]:
library(reshape2)

# this generates the list of genes that are expressed in at least 10 cells
expressed_genes <- row.names(subset(fData(HSMM),
    num_cells_expressed >= 10))

# Log-transform each value in the expression matrix.
# Zero counts become -Inf here; non-finite values are dropped by the plot.
L <- log(exprs(HSMM[expressed_genes,]))

# Standardize each gene, so that they are all on the same scale,
# then melt the data with reshape2 so we can plot it easily
melted_dens_df <- melt(Matrix::t(scale(Matrix::t(L))))

# Plot the distribution of the standardized gene expression values.
qplot(value, geom = "density", data = melted_dens_df) +
    stat_function(fun = dnorm, size = 0.5, color = 'red') +
    xlab("Standardized log(counts)") +
    ylab("Density")

This seems to be ok.
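
One thing to keep in mind with this check: a zero count becomes `-Inf` under `log`, and after `scale` the whole gene turns into `NaN`. The toy matrix below (invented values) illustrates this, so the density plot effectively reflects only the genes that are detected in every cell.

```r
# Toy counts: geneA is detected everywhere, geneB has one zero
counts <- matrix(c(1, 10, 100,
                   0,  5,  50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"),
                                 c("cell1", "cell2", "cell3")))

# Same recipe as in the notebook: log, then standardize each gene
L <- log(counts)
Z <- t(scale(t(L)))

print(Z)
```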

## Classification of cells
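We initialize a `CellTypeHierarchy` that splits the cells into three classes, based on the expression of `FILIONG01` and on whether the cell belongs to the controls, to then perform differential expression analysis.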

In [None]:
cth <- newCellTypeHierarchy()
cth <- addCellType(cth, "Controls", classify_func =
    function(x) { x["FILIONG01",] < 1 & colnames(x) %in% controls})
cth <- addCellType(cth, "NonResponders", classify_func =
    function(x) { x["FILIONG01",] < 1 & ! colnames(x) %in% controls})
cth <- addCellType(cth, "Responders", classify_func =
    function(x) { x["FILIONG01",] >= 1 })
HSMM <- classifyCells(HSMM, cth, 0.1)
table(pData(HSMM)$CellType)
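
The three classification rules are meant to be mutually exclusive, which is easy to verify outside Monocle. This sketch applies the same logical conditions to an invented matrix; the cell names, the second gene, and the `control_cells` vector are hypothetical.

```r
# Toy matrix with the reporter gene in the first row (made-up values)
x <- matrix(c(0, 0, 3, 0,
              5, 2, 1, 7),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("FILIONG01", "otherGene"),
                            c("cellA", "cellB", "cellC", "cellD")))
control_cells <- c("cellA")

# Same conditions as the classify_func definitions
is_control      <- x["FILIONG01", ] < 1 & colnames(x) %in% control_cells
is_nonresponder <- x["FILIONG01", ] < 1 & !(colnames(x) %in% control_cells)
is_responder    <- x["FILIONG01", ] >= 1

cell_type <- ifelse(is_responder, "Responders",
                    ifelse(is_control, "Controls", "NonResponders"))
print(table(cell_type))
```

Note that, with these rules, a control cell whose `FILIONG01` value is 1 or more would be labelled as a responder.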

## Differential expression analysis
Once we have our classes and our labels, we can proceed with differential expression analysis. Our gene list is very large, and many of the entries are non-coding RNAs.

First, I remove genes whose symbol is missing (`NA`).

In [None]:
marker_genes <- subset(fData(HSMM)[expressed_genes,],
                                 !is.na(symbol))

Next, I'll remove LINC genes.

In [None]:
marker_genes <- subset(marker_genes, !grepl("^LINC", symbol))

Finally, I'll remove genes whose symbol is an empty string.

In [None]:
marker_genes <- subset(marker_genes, symbol != "")
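
The three filters can also be written as a single condition. The sketch below runs both versions on an invented annotation table and checks that they select the same rows; the row names are placeholders.

```r
# Toy annotation table (hypothetical row names)
genes <- data.frame(
  symbol = c("TP53", NA, "LINC00115", "", "GAPDH"),
  row.names = c("g1", "g2", "g3", "g4", "g5"),
  stringsAsFactors = FALSE
)

# Step-by-step version, as in the notebook
stepwise <- subset(genes, !is.na(symbol))
stepwise <- subset(stepwise, !grepl("^LINC", symbol))
stepwise <- subset(stepwise, symbol != "")

# Single-condition version
combined <- subset(genes,
                   !is.na(symbol) & symbol != "" & !grepl("^LINC", symbol))

print(rownames(combined))
```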

Now I have a list of expressed genes that have a symbol and are not LINC non-coding RNAs. Let's see how many there are.

In [None]:
dim(marker_genes)

Now let's try to perform the differential expression analysis with all these genes.

In [None]:
diff_test_res <- differentialGeneTest(HSMM[row.names(marker_genes),],
                                      fullModelFormulaStr = "~CellType")

In [None]:
sig_genes <- subset(diff_test_res, qval < 0.1)
head(sig_genes[,c("symbol", "pval", "qval")])
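
As far as I know, the `qval` column is the Benjamini-Hochberg adjustment of `pval`; treat that as an assumption about Monocle's internals. The adjustment itself is available in base R as `p.adjust`, shown here on invented p-values with the same 0.1 cutoff.

```r
# Hypothetical p-values for four genes
pval <- c(geneA = 0.001, geneB = 0.01, geneC = 0.03, geneD = 0.5)

# Benjamini-Hochberg correction controls the false discovery rate
qval <- p.adjust(pval, method = "BH")

# Same threshold as in the notebook
significant <- names(qval)[qval < 0.1]
print(qval)
print(significant)
```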

How many are there?

In [None]:
dim(sig_genes)

Let's plot something.

In [None]:
testgenes <- HSMM[row.names(subset(fData(HSMM),
              symbol %in% c("UBE3C", "TTC27"))),]
plot_genes_jitter(testgenes, grouping = "label", ncol= 2)

Okay, so these plots are neither very representative nor very nice. The thing is: what should I do now with all these genes? Maybe it's better to go back to the unsupervised classification.