# 2019-03-07 Comparing old and new

I processed the data myself, starting from the FASTQ files, and applied the same processing pipeline to both the new and the old data.

## Comparing old and new pipeline

The first thing I'm asking myself is whether the results from the old pipeline (data processed at the CNAG) and the current ones are consistent.

In [None]:
# load old data
old.datadir <- "../data/matrices"
P2449 <- read.table(sprintf('%s/P2449.tsv.gz', old.datadir),
                    header = TRUE, row.names = 1,
                    sep = "\t", check.names = FALSE)
P2458 <- read.table(sprintf('%s/P2458.tsv.gz', old.datadir),
                    header = TRUE, row.names = 1,
                    sep = "\t", check.names = FALSE)
old.data <- cbind(P2449, P2458)

In [None]:
# load new data
new.datadir <- "../data/fastq/postprocess"
new.data <- read.table(sprintf('%s/exprMatrix.tsv', new.datadir),
                    header = TRUE, row.names = 1,
                    sep = "\t", check.names = FALSE)

Also load the sample sheet.

In [None]:
sample.sheet.fname <- sprintf("%s/samplesheet.csv", old.datadir)
sample.sheet <- read.delim(sample.sheet.fname, header = TRUE, row.names = 1)

The first thing I want to look at is whether the HIV expression values are consistent between the two.

**CAREFUL**: when doing the analysis, we must take care when comparing different samples, as the order of the samples is different in the two matrices!

In [None]:
# prepare a data frame that holds only the HIV expression values from the two pipelines,
# with the columns of the new matrix reordered to follow the old one
hiv <- data.frame(old = t(old.data["FILIONG01", ]),
                  new = t(new.data["FILIONG01",match(colnames(old.data), colnames(new.data))]))
colnames(hiv) <- c("old", "new")
hiv$status <- sample.sheet$status
hiv$label <- sample.sheet$label
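The sample-order caveat applies to the sample sheet as well as to the two matrices. Below is a minimal sketch on toy data of aligning everything by sample name rather than by position. The objects `old.toy`, `new.toy` and `sheet.toy`, the sample IDs and the values are all hypothetical, and the sketch assumes the row names of the sample sheet are the sample IDs used as matrix column names, which is not verified here.

```r
# Toy stand-ins for the real objects (hypothetical sample IDs and values).
old.toy <- data.frame(S1 = c(0, 5), S2 = c(3, 1), S3 = c(7, 2),
                      row.names = c("FILIONG01", "GENE_A"))
new.toy <- old.toy[, c("S3", "S1", "S2")]   # same values, shuffled columns
sheet.toy <- data.frame(status = c("treated", "nontreated", "treated"),
                        row.names = c("S2", "S1", "S3"))

# Positions of the old column names inside the new matrix.
idx <- match(colnames(old.toy), colnames(new.toy))
stopifnot(!anyNA(idx))                      # every old sample exists in new

# Reorder the new matrix and the sample sheet to follow the old column order.
new.aligned <- new.toy[, idx]
sheet.aligned <- sheet.toy[colnames(old.toy), , drop = FALSE]

# After alignment, the sample names agree position by position.
stopifnot(identical(colnames(new.aligned), colnames(old.toy)))
stopifnot(identical(rownames(sheet.aligned), colnames(old.toy)))
```

If the assumption about the sample sheet holds, the same `stopifnot` checks on the real objects would catch a silent misalignment before `status` and `label` are attached.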

Let's start with a simple question: what is the mean HIV expression level in the non-treated samples? It should be zero.

In [None]:
mean(hiv[hiv$status == "nontreated", "new"])

Good. Inspecting the data frame, I see that only one non-treated cell has a nonzero value, namely 2.

Let's look at the correlation between the old and the new results for the HIV transcript.

In [None]:
options(repr.plot.width = 4, repr.plot.height = 4)
plot(hiv$old, hiv$new, xlab = "Old Pipeline", ylab = "New Pipeline", main = "HIV")

Very good.
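A scatter plot gives a visual impression, and a correlation coefficient can put a number on it. Here is a minimal sketch with made-up counts (the vectors `old.counts` and `new.counts` are hypothetical, not taken from the data), using base R's `cor`.

```r
# Hypothetical counts for one transcript across six cells.
old.counts <- c(0, 2, 10, 55, 120, 400)
new.counts <- c(0, 3, 9, 60, 115, 410)

# Pearson measures linear agreement, Spearman agreement of the ranks.
pearson  <- cor(old.counts, new.counts, method = "pearson")
spearman <- cor(old.counts, new.counts, method = "spearman")

# Counts are often skewed, so a log scale can be more informative;
# log1p(x) = log(1 + x) keeps the zeros finite.
pearson.log <- cor(log1p(old.counts), log1p(new.counts))

print(c(pearson = pearson, spearman = spearman, pearson.log = pearson.log))
```

On the real data the equivalent call would be `cor(hiv$old, hiv$new)`, possibly with `method = "spearman"` if a few very large values dominate.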

Let's have a look at some other cases. I'll pick one of the highly expressed genes and see whether the two pipelines correlate well.

In [None]:
highly.expressed.genes <- row.names(old.data[rowSums(old.data) > 5000, ])
my.gene <- highly.expressed.genes[[3]]
df <- data.frame(old = t(old.data[my.gene, ]),
                 new = t(new.data[my.gene, match(colnames(old.data), colnames(new.data))]))
colnames(df) <- c("old", "new")

In [None]:
options(repr.plot.width = 4, repr.plot.height = 4)
plot(df$old, df$new, xlab = "Old Pipeline", ylab = "New Pipeline",
     main = my.gene)

Let's do some more global analyses.

Correlation of the expression levels of all the genes.

In [None]:
# this is needed because the gene names in the old data set do not all coincide
# with the gene names in the new data set
mygenes <- intersect(rownames(new.data), rownames(old.data))
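Before restricting to the shared genes, it can be worth checking how many genes each data set has on its own. A minimal sketch with hypothetical gene names (`old.genes` and `new.genes` stand in for the row names of the two matrices):

```r
# Hypothetical gene identifiers from the two pipelines.
old.genes <- c("GENE_A", "GENE_B", "GENE_C", "FILIONG01")
new.genes <- c("GENE_B", "GENE_C", "GENE_D", "FILIONG01")

shared   <- intersect(new.genes, old.genes)  # present in both
only.old <- setdiff(old.genes, new.genes)    # dropped by the new pipeline
only.new <- setdiff(new.genes, old.genes)    # absent from the old pipeline

cat(sprintf("shared: %d, only old: %d, only new: %d\n",
            length(shared), length(only.old), length(only.new)))
```

A large `only.old` or `only.new` set would suggest a difference in annotation or in naming conventions rather than in expression.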

In [None]:
sample.id <- "P2458_N710-S518"
plot(old.data[mygenes, sample.id],
     new.data[mygenes, sample.id],
     xlab = "Old Pipeline",
     ylab = "New Pipeline",
     main = "All genes")
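Points that fall far from the diagonal in a plot like this can also be listed explicitly. The sketch below flags genes whose values differ strongly between the pipelines using a log2 ratio with a pseudocount. The expression values and the cutoff of 2 (a four-fold difference) are arbitrary choices for illustration, not values from the analysis.

```r
# Hypothetical expression values of four genes in one sample.
old.expr <- c(GENE_A = 100, GENE_B = 0, GENE_C = 50, GENE_D = 8)
new.expr <- c(GENE_A = 110, GENE_B = 0, GENE_C = 2,  GENE_D = 9)

# log2 ratio with a pseudocount of 1 so that zeros do not give +/-Inf.
log.ratio <- log2((new.expr + 1) / (old.expr + 1))

# Arbitrary cutoff: flag genes with more than a four-fold difference.
threshold <- 2
discordant <- names(log.ratio)[abs(log.ratio) > threshold]
print(discordant)
```

On the real matrices, `old.expr` and `new.expr` would correspond to `old.data[mygenes, sample.id]` and `new.data[mygenes, sample.id]`, with names set to `mygenes`.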

Okay, so apart from a few weirdos, everything is looking good. Let's look at global gene expression and global number of reads.

In [None]:
# gene sums
gene.sums <- data.frame(old = rowSums(old.data)[mygenes] ,
                        new = rowSums(new.data)[mygenes])
plot(gene.sums$old,
     gene.sums$new, xlab = "Old Pipeline", ylab = "New Pipeline", main = "Gene Sums")

In [None]:
# cell sums
cell.sums <- data.frame(old = colSums(old.data) ,
                        new = colSums(new.data)[match(colnames(old.data), colnames(new.data))])
plot(cell.sums$old,
     cell.sums$new, xlab = "Old Pipeline", ylab = "New Pipeline", main = "Cell Sums")

This is good.