### Running DESeq2

As I went looking for single-cell differential expression packages, I came across [this paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2599-6) and unabashedly skipped to the conclusion, which reports that "methods developed specifically for scRNAseq data do not show significantly better performance compared to the methods designed for bulk RNAseq data; and methods that consider behavior of each individual gene (not all genes) in calling DE genes outperform the other tools."

Luckily, I have already used DESeq2, which is designed for bulk RNA-seq data. According to the same paper, it is also one of the methods that considers the behavior of each individual gene. For those reasons, I'm applying it here.



In [16]:
library("DESeq2")

Define a path prefix:

In [17]:
prefix <- "/data/codec/production.run/mrna/"

Define the directory with the counts and then a results directory.

In [18]:
countsdir <- paste(prefix,"de.csvs/",sep="")
resdir <- paste(prefix,"deseq2.res/",sep="")
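One thing the cell above leaves implicit: `write.csv` does not create missing directories, so the results directory has to exist before anything is written to it. A minimal sketch, using a temporary directory as a stand-in for the real `prefix` (the `demo_` names are hypothetical and not part of the original analysis):

```r
# Stand-in for the real prefix so the sketch can run anywhere
demo_prefix <- file.path(tempdir(), "mrna")

# file.path() joins components with "/", so no manual separators are needed
demo_countsdir <- file.path(demo_prefix, "de.csvs")
demo_resdir <- file.path(demo_prefix, "deseq2.res")

# Create the results directory (and any missing parents) if it is absent
dir.create(demo_resdir, recursive = TRUE, showWarnings = FALSE)

dir.exists(demo_resdir)
```

Note that `file.path()` returns paths without a trailing slash, so any later code that builds file names with `paste(..., sep="")` would need to switch to `file.path()` as well.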

List the files in `countsdir`.

In [19]:
sampleFiles <- list.files(countsdir)

In [20]:
sampleFiles

File names are returned in alphabetical order, so the `col.csv` always comes before the `cts.csv`.
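Relying on position works as long as each condition matches exactly two files, but `grep(cond, ...)` treats `cond` as a regular expression and matches it anywhere in the name, so a single letter could pick up unrelated files. A more defensive sketch selects each file by its full name instead. The file names below are hypothetical, since the real listing is not shown, and `pick_file` is an illustrative helper, not part of the original notebook:

```r
# Hypothetical file names; the real contents of `countsdir` are not shown
demo_files <- c("A.col.csv", "A.cts.csv", "B.col.csv", "B.cts.csv")

# Return the one file named "<cond>.<suffix>", failing loudly otherwise
pick_file <- function(files, cond, suffix) {
    hit <- files[files == paste0(cond, ".", suffix)]
    stopifnot(length(hit) == 1)
    hit
}

cts_file <- pick_file(demo_files, "A", "cts.csv")
col_file <- pick_file(demo_files, "A", "col.csv")
```

With this approach the code no longer depends on the sort order of `list.files`, and a missing or duplicated file stops the run immediately instead of silently reading the wrong table.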

In [21]:
for (cond in c("A","B","G","P","R")) {
    
    cond_files <- grep(cond, sampleFiles, value = TRUE)
    
    cts <- as.matrix(read.csv(paste(countsdir, 
                                    cond_files[2], # using the second position, which is the cts.csv
                                    sep=""
                                   ),row.names=1, check.names = FALSE))
    
    coldata <- read.csv(paste(countsdir, 
                              cond_files[1], # using the first position, which is the col.csv
                              sep=""
                             ), row.names=1)
    
    # convert FID to a factor; read.csv reads it in as an integer
    coldata$FID <- as.factor(coldata$FID)
    
# #     # first, run the time course model outlined in the rnaseqgene vignette
# #     ddsTC <- DESeqDataSetFromMatrix(countData = cts,
# #                     colData = coldata,
# #                                 design = ~ cond + TIME + cond:TIME
# #                                )
# #     ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ cond + TIME)
# #     res <- results(ddsTC)
# #     write.csv(as.data.frame(res), file=paste(resdir,cond,".TC.csv",sep=""))
    
    # then, for each cell type, run standard differential expression between the condition and control ("C")
    for (celltype in c("B_Memory", "B_Naive", "CD4_T_Memory", "CD4_T_Naive", 
                       "CD8_T_Memory_MAIT_GD", "CD8_T_Naive", "HSC", "NK", 
                       "pDC", "cDC", "Mono_cDC_All")) {
        subct <- subset(cts, select=grep(paste(as.character(celltype),"-",sep=""),colnames(cts),value = TRUE))
        subcoldata <- subset(coldata, CT == celltype)
        dds <- DESeqDataSetFromMatrix(countData = subct,
                              colData = subcoldata,
                              design = ~ COND
                             )
        dds <- DESeq(dds, parallel = TRUE)
        res <- results(dds, contrast = c("COND",cond,"C"))
        write.csv(as.data.frame(res), file=paste(resdir, cond,".",as.character(celltype),".csv",sep=""))
    }
    
}

converting counts to integer mode

estimating size factors

estimating dispersions

gene-wise dispersion estimates: 30 workers

mean-dispersion relationship

-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

final dispersion estimates, fitting model and testing: 30 workers

-- replacing outliers and refitting for 27 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)

estimating dispersions

fitting model and testing

converting counts to integer mode

estimating size factors

estimating dispersions

gene-wise dispersion estimates: 30 workers

mean-dispersion relationship

-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify

ERROR: Error in DESeqDataSet(se, design = design, ignoreRank): design has a single variable, with all samples having the same value.
  use instead a design of '~ 1'. estimateSizeFactors, rlog and the VST can then be used


In [9]:
for (cond in c("A","B","G","R")) {
    
    cond_files <- grep(cond, sampleFiles, value = TRUE)
    
    cts <- as.matrix(read.csv(paste(countsdir, 
                                    cond_files[2], # using the second position, which is the cts.csv
                                    sep=""
                                   ),row.names=1, check.names = FALSE))
    
    coldata <- read.csv(paste(countsdir, 
                              cond_files[1], # using the first position, which is the col.csv
                              sep=""
                             ), row.names=1)
    
    # convert FID to a factor; read.csv reads it in as an integer
    coldata$FID <- as.factor(coldata$FID)
    
# #     # first, run the time course model outlined in the rnaseqgene vignette
# #     ddsTC <- DESeqDataSetFromMatrix(countData = cts,
# #                     colData = coldata,
# #                                 design = ~ cond + TIME + cond:TIME
# #                                )
# #     ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ cond + TIME)
# #     res <- results(ddsTC)
# #     write.csv(as.data.frame(res), file=paste(resdir,cond,".TC.csv",sep=""))
    
    # then, for each cell type, run standard differential expression between the condition and control ("C")
    for (celltype in c('Mono_NC', 'Mono_C')) {
        subct <- subset(cts, select=grep(paste(as.character(celltype),"-",sep=""),colnames(cts),value = TRUE))
        subcoldata <- subset(coldata, CT == celltype)
        dds <- DESeqDataSetFromMatrix(countData = subct,
                              colData = subcoldata,
                              design = ~ COND
                             )
        dds <- DESeq(dds, parallel = TRUE)
        res <- results(dds, contrast = c("COND",cond,"C"))
        write.csv(as.data.frame(res), file=paste(resdir, cond,".",as.character(celltype),".csv",sep=""))
    }
    
}

converting counts to integer mode

estimating size factors

estimating dispersions

gene-wise dispersion estimates: 30 workers

mean-dispersion relationship

-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

final dispersion estimates, fitting model and testing: 30 workers

-- replacing outliers and refitting for 12 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)

estimating dispersions

fitting model and testing

converting counts to integer mode

estimating size factors

estimating dispersions

gene-wise dispersion estimates: 30 workers

mean-dispersion relationship

-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify