# Adapted from:  http://www.bioconductor.org/help/workflows/rnaseqGene/

To learn the basics of R, see: https://www.datacamp.com/courses/free-introduction-to-r

# Table of Contents

1. [Import Libraries](#Import-Libraries)
1. [Load BAM files](#Load-BAM-files)
1. [Load GFF File](#Load-GFF-File)
1. [Count Reads](#Count-Reads)

# Import Libraries

In [2]:
suppressPackageStartupMessages(library('Rsamtools'))
suppressPackageStartupMessages(library('GenomicFeatures'))
suppressPackageStartupMessages(library('GenomicAlignments'))
suppressPackageStartupMessages(library('BiocParallel'))
suppressPackageStartupMessages(library('SummarizedExperiment'))

# Load BAM files

The `align_reads.ipynb` script recorded the paths of all our BAM files in `example/aligned_files.csv`.  
**Enter your alignment csv here:**

In [3]:
ALIGNMENT_CSV <- './DF_files.csv'

In [4]:
sampleTable <- read.csv(ALIGNMENT_CSV,row.names=1)
head(sampleTable)

Unnamed: 0,sample_id,R1,R2,organism,BAM,percent.aligned,mean.phred.scores,total.reads,mapped.reads
0,U01_53,ARM_HRS_TCH1516_RPMI_Naf_0ug_mLNafRep3_S28_L001_R1_001.fastq.gz,ARM_HRS_TCH1516_RPMI_Naf_0ug_mLNafRep3_S28_L001_R2_001.fastq.gz,USA300_TCH1516,./bam/U01_53.bam,94.66,38.7,15294662,14477782
1,U01_50,ARM_HRS_TCH1516_RPMI_Naf_PreCulture_S2_L001_R1_001.fastq.gz,ARM_HRS_TCH1516_RPMI_Naf_PreCulture_S2_L001_R2_001.fastq.gz,USA300_TCH1516,./bam/U01_50.bam,98.15,38.8,19890598,19523607
2,U01_56,ARM_HRS_TCH1516_RPMI_Naf_0_031ug_mLNafRep3_S9_L001_R1_001.fastq.gz,ARM_HRS_TCH1516_RPMI_Naf_0_031ug_mLNafRep3_S9_L001_R2_001.fastq.gz,USA300_TCH1516,./bam/U01_56.bam,93.82,38.6,13181966,12366915
3,U01_61,ARM_HRS_TCH1516_CAMHB_Naf_0ug_mLNafRep1_S6_L001_R1_001.fastq.gz,ARM_HRS_TCH1516_CAMHB_Naf_0ug_mLNafRep1_S6_L001_R2_001.fastq.gz,USA300_TCH1516,./bam/U01_61.bam,95.08,38.1,16494676,15683459
4,U01_70,ARM_HRS_TCH1516_CAMHB_Naf_1ug_mLNafRep1_S19_L001_R1_001.fastq.gz,ARM_HRS_TCH1516_CAMHB_Naf_1ug_mLNafRep1_S19_L001_R2_001.fastq.gz,USA300_TCH1516,./bam/U01_70.bam,96.89,38.6,15621372,15135285
5,U01_59,ARM_HRS_TCH1516_RPMI_Naf_0_063ug_mLNafRep3_S14_L001_R1_001.fastq.gz,ARM_HRS_TCH1516_RPMI_Naf_0_063ug_mLNafRep3_S14_L001_R2_001.fastq.gz,USA300_TCH1516,./bam/U01_59.bam,96.02,38.7,14998926,14402693


Put filenames into a character vector and check that they all exist.

In [5]:
filenames <- as.character(sampleTable$BAM)
all(file.exists(filenames))
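
If the check above returns `FALSE`, it helps to know which paths are missing. The following is a minimal sketch in base R; `missing_files` is a hypothetical helper that is not part of the original notebook, and the temporary files stand in for real BAM paths.

```r
# Hypothetical helper (not part of the original notebook):
# return the subset of paths that do not exist on disk
missing_files <- function(paths) {
    paths[!file.exists(paths)]
}

# Illustration with one real temporary file and one path that does not exist
existing <- tempfile(fileext = ".bam")
file.create(existing)
absent <- file.path(tempdir(), "no_such_file.bam")

missing_files(c(existing, absent))
```

In the notebook itself, the call would be `missing_files(filenames)`; an empty result means every BAM file was found.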

The `BamFileList` function prepares the BAM files to be processed. The `yieldSize` argument sets how many reads are processed at a time (here 2,000,000). Increasing it can speed up counting; decreasing it reduces memory load.

In [6]:
(bamfiles <- BamFileList(filenames, yieldSize=2000000))

BamFileList of length 23
names(23): U01_53.bam U01_50.bam U01_56.bam ... U01_52.bam U01_58.bam

# Load GFF File

Set `ORG_ID` using the ID from 0_setup_organism.  
`makeTxDbFromGFF` loads the GFF file into a transcript database (`TxDb`).  
`exonsBy` extracts the exons from that database, grouped by gene.

# Count Reads

`summarizeOverlaps` counts the number of reads that overlap each gene in the GFF file. First, we initialize multiprocessing, using the `workers` argument to set the number of cores. The `summarizeOverlaps` arguments are as follows:
* `features`: The exons grouped by gene, extracted from the GFF file
* `reads`: The BAM files listed above
* `mode`: How to handle reads that overlap more than one feature. See the [HTSeq-count](http://www-huber.embl.de/HTSeq/doc/count.html) documentation.
* `singleEnd`: TRUE if single-end, FALSE if paired-end
* `ignore.strand`: Whether to ignore strand information when counting, depending on the library preparation method
    * TRUE: Standard Illumina
    * FALSE: Directional Illumina (Ligation), Standard SOLiD, dUTP, NSR, NNSR
* `preprocess.reads` (optional): A function that modifies the reads before counting
    * invertStrand: Necessary for dUTP, NSR and NNSR library preparation methods
* `fragments`: Whether to count unpaired reads (only applies to paired-end data)
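
The strand-related choices in the list above can be written down as a small lookup. This is a sketch in base R only: `strand_settings` and its return fields are hypothetical names chosen for illustration, not part of `GenomicAlignments`, and the library preparation labels are copied from the list.

```r
# Hypothetical lookup (illustrative names, not a GenomicAlignments function):
# map a library preparation method to the strand settings described above
strand_settings <- function(library_prep) {
    unstranded <- c("Standard Illumina")
    stranded   <- c("Directional Illumina (Ligation)", "Standard SOLiD")
    inverted   <- c("dUTP", "NSR", "NNSR")
    if (library_prep %in% unstranded) {
        list(ignore.strand = TRUE, invert = FALSE)
    } else if (library_prep %in% stranded) {
        list(ignore.strand = FALSE, invert = FALSE)
    } else if (library_prep %in% inverted) {
        # These protocols also need preprocess.reads = invertStrand
        list(ignore.strand = FALSE, invert = TRUE)
    } else {
        stop("Unknown library preparation method: ", library_prep)
    }
}

settings <- strand_settings("dUTP")
settings$ignore.strand
settings$invert
```

When `invert` is `TRUE`, `preprocess.reads=invertStrand` would be passed to `summarizeOverlaps`; otherwise the argument can be left out.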

In [9]:
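# Make sure the output directory exists before writing to it
dir.create('./counts', showWarnings=FALSE)
for (org_id in unique(sampleTable$organism)){
    # The third line of ref/<org_id>/info.txt holds the annotation file path
    info <- suppressWarnings(readLines(file.path("ref",org_id,"info.txt")))
    gff <- info[3]
    txdb <- makeTxDbFromGFF(gff, format="gtf")
    exons <- exonsBy(txdb, by="gene")
    # Only count the BAM files that belong to this organism
    filenames <- as.character(sampleTable[sampleTable$organism == org_id,'BAM'])
    bamfiles <- BamFileList(filenames, yieldSize=2000000)
    # `workers` sets the number of cores used for counting
    register(MulticoreParam(workers = 8))
    se <- summarizeOverlaps(features=exons, reads=bamfiles,
                        mode="IntersectionStrict",
                        singleEnd=FALSE,
                        ignore.strand=FALSE,
                        preprocess.reads=invertStrand,
                        fragments=FALSE)
    # paste0 joins without separators; paste() would insert spaces into the path
    save(se, file=paste0('./counts/',org_id,'.se'))
    write.csv(assay(se), file=paste0('./counts/',org_id,'_counts.csv'))
}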

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
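
The count table written by `write.csv` can be read back for downstream work. The sketch below uses base R only and a toy matrix in place of `assay(se)`: the gene names and counts are made up, and a temporary file stands in for `./counts/<org_id>_counts.csv`.

```r
# Toy count matrix standing in for assay(se); gene names and values are made up
counts <- matrix(c(10L, 0L, 5L, 7L), nrow = 2,
                 dimnames = list(c("geneA", "geneB"),
                                 c("U01_53.bam", "U01_50.bam")))

# Temporary file standing in for ./counts/<org_id>_counts.csv
counts_path <- tempfile(fileext = "_counts.csv")
write.csv(counts, file = counts_path)

# row.names = 1 restores the gene IDs as row names;
# check.names = FALSE keeps the sample names exactly as written
counts_back <- as.matrix(read.csv(counts_path, row.names = 1, check.names = FALSE))
dim(counts_back)
```

Rows are genes and columns are BAM files, matching the layout of `assay(se)`.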
