# 2019-04-03 Compositional data analysis

After my talk at the PRBB Computational Genomics Seminar, I was advised to use an R package called `propr`, which implements methods that treat RNA-seq data as *compositional* and thereby avoid the biases that normalization can introduce.

So here it goes.

## A short note about compositional data analysis

Before doing that, I want to make sure that I really understand the point of using compositional data analysis.

Let's start by thinking about two experiments that measure four genes, A, B, C, and D. Let's imagine that the two experiments are strictly identical, except that in the second experiment gene D has an artificially increased number of reads, perhaps due to PCR amplification bias. Let's see what happens when normalizing by library size.

In [None]:
experiments <- data.frame(exp1 = c(10, 40, 20, 5),
                          exp2 = c(10, 40, 20, 25))
rownames(experiments) <- c("A", "B", "C", "D")

# show it
experiments

In [None]:
# library size = total number of reads in each experiment
librarySize <- colSums(experiments)

# divide each column by its library size to get relative abundances
norm.experiments <- sweep(as.matrix(experiments), 2, librarySize, "/")
norm.experiments

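The bottom line is that the normalized expression levels of genes A, B, and C appear to differ between the two experiments even though their read counts are identical: the only thing that changed is the library size, which grew because of gene D.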
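
To convince myself that ratios between genes are the quantity that survives this problem, here is a small base-R check on the same toy counts. It is only a sketch of the idea, not something taken from `propr`; the helper `log_ratios` is a name I made up for illustration.

```r
# same toy counts as above, stored as a matrix with gene names as row names
counts <- cbind(exp1 = c(A = 10, B = 40, C = 20, D = 5),
                exp2 = c(A = 10, B = 40, C = 20, D = 25))

# relative abundances: each column divided by its library size
props <- sweep(counts, 2, colSums(counts), "/")

# matrix of pairwise log-ratios log(x_i / x_j) for one experiment
# (log_ratios is a hypothetical helper, not a propr function)
log_ratios <- function(x) outer(log(x), log(x), "-")

lr1 <- log_ratios(props[, "exp1"])
lr2 <- log_ratios(props[, "exp2"])

# log-ratios among the unchanged genes agree between the two experiments
abc <- c("A", "B", "C")
same_abc <- isTRUE(all.equal(lr1[abc, abc], lr2[abc, abc]))

# only the log-ratios that involve gene D are shifted
d_shift <- lr2["D", abc] - lr1["D", abc]
```

The library size cancels out of every ratio, so `same_abc` is `TRUE` while `d_shift` carries the fivefold change in D. This is the sense in which working with ratios sidesteps the normalization step.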

Let's proceed with using `propr` for our purposes.

## Using propr

Now let's try to start using the package.

In [None]:
# load the library
library(propr)

In [None]:
# load the expression matrix
data.dir <- "/home/rcortini/work/CRG/projects/sc_hiv/data"
matrix.fname <- sprintf('%s/matrices/exprMatrix.tsv', data.dir)
exprMatrix <- read.table(matrix.fname, header = TRUE, row.names = 1,
                         sep = "\t", check.names = FALSE)

# load the sample sheet
sample.sheet.fname <- sprintf("%s/metadata/sampleSheet.tsv", data.dir)
sampleSheet <- read.delim(sample.sheet.fname, header = TRUE, row.names = 1)

# load gene annotations file
gene.annotations <- sprintf("%s/matrices/gene_annotations.tsv", data.dir)
gene.data <- read.delim(gene.annotations, header = TRUE, sep = "\t",
                        row.names = 1, stringsAsFactors = FALSE)
gene.data <- subset(gene.data, rownames(gene.data) %in% rownames(exprMatrix))

The `propr` package wants the data as a matrix with $N$ rows and $D$ columns, where $N$ is the number of observations (in this case: cells) and $D$ is the number of features (in this case: genes) in the data set. My expression matrix has genes as rows and cells as columns, so I transpose it.

In [None]:
jlat.DMSO <- rownames(sampleSheet)[sampleSheet$label == "J-LatA2+DMSO"]
jlat.SAHA <- rownames(sampleSheet)[sampleSheet$label == "J-LatA2+SAHA"]
X <- cbind(exprMatrix[, jlat.DMSO], exprMatrix[, jlat.SAHA])
X <- t(X)

# keep only genes with at least 10 counts in at least 10% of the cells
N <- nrow(X)
keep <- apply(X, 2, function(x) sum(x >= 10) >= N / 10)
X <- X[, keep]
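
As a sanity check of the filtering rule (a gene is kept if at least 10% of the cells have 10 or more counts), here is the same logic on a tiny made-up matrix. The gene names and counts are invented for illustration only.

```r
# toy matrix: 10 cells (rows) x 3 hypothetical genes (columns)
toy <- cbind(geneHigh = rep(50, 10),          # expressed in every cell
             geneRare = c(30, rep(0, 9)),     # expressed in exactly 1 cell of 10
             geneLow  = rep(2, 10))           # never reaches 10 counts

N <- nrow(toy)

# number of cells with >= 10 counts per gene, compared with 10% of the cells
keep <- colSums(toy >= 10) >= N / 10

# drop = FALSE keeps the result a matrix even if only one gene survives
toy.filtered <- toy[, keep, drop = FALSE]
colnames(toy.filtered)
```

Note that the comparison is `>=`, so a gene sitting exactly at the 10% threshold (like `geneRare` here) is retained.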

In [None]:
groups <- c(rep("DMSO", length(jlat.DMSO)), rep("SAHA", length(jlat.SAHA)))

And now I can use the main function that `propr` supplies for comparing groups, `propd`, which looks for gene pairs whose proportionality differs between the two conditions.

In [None]:
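# note: despite the variable name, propd computes the theta statistic, not rho
# p is the number of permutations stored for later significance testing
rho <- propd(counts = X,
             group = groups,
             p = 10)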

In [None]:
rho@results
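
To understand what `propd` is measuring, I find it useful to compute the statistic by hand for a single gene pair. As far as I understand it, the "disjointed" theta is the ratio between the within-group variance of the log-ratio and its total variance, so it is close to 0 when the ratio between the two genes shifts between groups and close to 1 when it does not. The sketch below uses simulated data and a helper `theta_d` that I wrote myself; it is my reading of the definition, not the package's code.

```r
set.seed(1)

# two hypothetical groups of 20 cells each
n1 <- 20
n2 <- 20
group <- c(rep("DMSO", n1), rep("SAHA", n2))

# simulated counts for gene 1
g1 <- rlnorm(n1 + n2, meanlog = 5, sdlog = 0.2)

# gene 2 tracks gene 1 within each group, but the ratio is 4x higher in SAHA
shift <- ifelse(group == "DMSO", 1, 4)
g2.shift <- g1 * shift * rlnorm(n1 + n2, meanlog = 0, sdlog = 0.05)

# gene 3 tracks gene 1 with the same ratio in both groups
g3.null <- g1 * rlnorm(n1 + n2, meanlog = 0, sdlog = 0.05)

# theta_d: within-group over total sum of squares of the log-ratio
# (hypothetical helper, assumes exactly two groups)
theta_d <- function(x, y, group) {
  lr <- log(x / y)
  by.group <- split(lr, group)
  stopifnot(length(by.group) == 2)
  n <- lengths(by.group)
  within <- sum((n - 1) * sapply(by.group, var))
  total <- (length(lr) - 1) * var(lr)
  within / total
}

theta.shift <- theta_d(g1, g2.shift, group)
theta.null <- theta_d(g1, g3.null, group)
c(shift = theta.shift, null = theta.null)
```

Because the within-group sum of squares can never exceed the total, theta always lies between 0 and 1, and the pairs of interest are those with the smallest values.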