### Running DESeq2
Here is my attempt at running DESeq2. I wasn't sure exactly how to treat the three variables in our experiment: individual, time, and stimulation. I am therefore generating several results tables, which I then export to `.csv` files so I can import them into Python and generate plots. I run two types of differential expression tests on the samples:

1. Following the recommendation for time course data [here](https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html), I run a likelihood ratio test between two models, one that accounts for the interaction between stimulation and time and one that does not. This is meant to capture a "stimulation-dependent change in time".
2. I run regular differential expression at every timepoint between the stimulation condition and control.

I'm hoping these are sufficient, but I can adjust the models and tests as necessary based on input from others.
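To make the first test more concrete, the sketch below builds the two design matrices that such a likelihood ratio test compares, using base R's `model.matrix` on a made-up sample table. The column names `STIM` and `TIME` mirror the ones used later in this notebook; the toy values (two individuals, `IFNB` versus `Control`, four timepoints) are assumptions for illustration only, and time is treated as a factor here, as in the vignette.

```r
# Toy sample table (hypothetical values, not the real coldata)
toy <- expand.grid(
  IND  = factor(c("A", "B")),
  STIM = factor(c("Control", "IFNB"), levels = c("Control", "IFNB")),
  TIME = factor(c(3, 6, 9, 12))
)

# Full model: main effects plus the stimulation-by-time interaction
full <- model.matrix(~ STIM + TIME + STIM:TIME, data = toy)
# Reduced model: main effects only
reduced <- model.matrix(~ STIM + TIME, data = toy)

# The columns present only in the full model are the interaction terms
# that the likelihood ratio test evaluates jointly
interaction_terms <- setdiff(colnames(full), colnames(reduced))
print(interaction_terms)
```

In DESeq2 the full formula is given as the `design`, and the reduced formula is passed through the `reduced` argument of `DESeq(..., test = "LRT")`.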

In [1]:
library("DESeq2")


Define a path prefix.

In [2]:
prefix <- "/data/codec/bulk.jan20/"

Define the directory with the counts and then a results directory.

In [3]:
countsdir <- paste(prefix,"counts.csvs/",sep="")
resdir <- paste(prefix,"deseq2.res/",sep="")
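`write.csv` does not create missing directories, so it can help to make sure the results directory exists before anything is written to it. A minimal sketch, using a temporary directory as a stand-in for `prefix` (an assumption, so that the example can run anywhere); the `demo_*` names are hypothetical:

```r
# Stand-in for the real prefix; tempdir() is used only for illustration
demo_prefix <- file.path(tempdir(), "bulk.demo")

# file.path() joins components with "/" so no manual separators are needed
demo_countsdir <- file.path(demo_prefix, "counts.csvs")
demo_resdir    <- file.path(demo_prefix, "deseq2.res")

# Create the results directory (and any missing parents) if it is not there yet
if (!dir.exists(demo_resdir)) {
  dir.create(demo_resdir, recursive = TRUE)
}
```

Note that `file.path` does not add a trailing slash, so file names would then be appended with `file.path(demo_resdir, ...)` rather than `paste(..., sep="")`.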

List the files in `countsdir`.

In [4]:
sampleFiles <- list.files(countsdir)

In [5]:
sampleFiles

File names are returned in alphabetical order, so the `col.csv` always comes before the `cts.csv`.
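Relying on position works as long as each stimulation matches exactly two files. A slightly more defensive sketch picks each file by its suffix instead. The file names below are hypothetical, since the real listing is not shown here; only the `col.csv` and `cts.csv` suffixes are taken from the text above, and `pick_file` is an illustrative helper, not part of the original workflow.

```r
# Hypothetical listing; the actual names in countsdir may differ
example_files <- c("IFNB.col.csv", "IFNB.cts.csv",
                   "IFNG.col.csv", "IFNG.cts.csv")

# Return the single file for a stimulation that ends in the given suffix
pick_file <- function(files, stim, suffix) {
  hits <- files[grepl(stim, files, fixed = TRUE) & endsWith(files, suffix)]
  # Fail loudly if the match is missing or ambiguous
  stopifnot(length(hits) == 1)
  hits
}

col_file <- pick_file(example_files, "IFNB", "col.csv")
cts_file <- pick_file(example_files, "IFNB", "cts.csv")
```

This way an extra or missing file for a stimulation stops the run instead of silently swapping the counts and the sample table.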

In [10]:
for (stim in c("IFNB","IFNG","PMAI","R848","TNFa")) {
    
    stim_files <- grep(stim, sampleFiles, value = TRUE)
    
    cts <- as.matrix(read.csv(paste(countsdir, 
                                    stim_files[2], # using the second position, which is the cts.csv
                                    sep=""
                                   ),row.names=1, check.names = FALSE))
    coldata <- read.csv(paste(countsdir, 
                              stim_files[1], # using the first position, which is the col.csv
                              sep=""
                             ), row.names=1)
    
    # reclassify as a factor; it is currently being read as an integer
    coldata$IND <- as.factor(coldata$IND)
    
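    # first, run the time course model outlined in the rnaseqGene vignette:
    # a likelihood ratio test of the full model (with the STIM:TIME interaction)
    # against a reduced model without it
    dds <- DESeqDataSetFromMatrix(countData = cts,
                                  colData = coldata,
                                  design = ~ STIM + TIME + STIM:TIME
                                 )
    # results() requires a fitted object, so DESeq() must be run first
    dds <- DESeq(dds, test = "LRT", reduced = ~ STIM + TIME)
    res <- results(dds)
    write.csv(as.data.frame(res), file=paste(resdir,stim,".csv",sep=""))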
    
    # then for each time point, do just regular differential expression between stimulation and control
    for (time in c(3,6,9,12)) {
        subct <- subset(cts, select=grep(paste("-",as.character(time),sep=""),colnames(cts),value = TRUE))
        subcoldata <- subset(coldata, TIME == time)
        dds <- DESeqDataSetFromMatrix(countData = subct,
                              colData = subcoldata,
                              design = ~ STIM
                             )
        dds <- DESeq(dds)
        res <- results(dds, contrast = c("STIM",stim,"Control"))
        write.csv(as.data.frame(res), file=paste(resdir,stim,".",as.character(time),".csv",sep=""))
    }
    
}

  the design formula contains one or more numeric variables with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function

  the design formula contains one or more numeric variables that have mean or
  standard deviation larger than 5 (an arbitrary threshold to trigger this message).
  it is generally a good idea to center and scale numeric variables in the design
  to improve GLM convergence.

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersi