
# Adapted from:  http://www.bioconductor.org/help/workflows/rnaseqGene/

To learn the basics of R, see https://www.datacamp.com/courses/free-introduction-to-r

# Table of Contents

1. [Import Libraries](#Import-Libraries)
1. [Load BAM files](#Load-BAM-files)
1. [Load GFF File](#Load-GFF-File)
1. [Count Reads](#Count-Reads)

# Import Libraries

In [1]:
suppressPackageStartupMessages(library('Rsamtools'))
suppressPackageStartupMessages(library('GenomicFeatures'))
suppressPackageStartupMessages(library('GenomicAlignments'))
suppressPackageStartupMessages(library('BiocParallel'))
suppressPackageStartupMessages(library('SummarizedExperiment'))

# Load BAM files

The `align_reads.ipynb` notebook recorded the paths to all our BAM files in `example/aligned_files.csv`.  
**Enter your alignment csv here:**

In [2]:
ALIGNMENT_CSV <- 'example/aligned_files.csv'

In [3]:
sampleTable <- read.csv(ALIGNMENT_CSV,row.names=1)
head(sampleTable)

Unnamed: 0,sample_id,R1,R2,organism,BAM,alignment
0,wt_fe2_1,WT-Fe2-1_S1_L001_R1_001.fastq.trunc.gz,WT-Fe2-1_S1_L001_R2_001.fastq.trunc.gz,MG1655,example/bam/wt_fe2_1.bam,93.43
1,wt_fe2_2,WT-FE2-2_S2_L001_R1_001.fastq.trunc.gz,WT-FE2-2_S2_L001_R2_001.fastq.trunc.gz,MG1655,example/bam/wt_fe2_2.bam,92.38
2,wt_dpd_1,WTDPD1_S1_L001_R1_001.fastq.trunc.gz,WTDPD1_S1_L001_R2_001.fastq.trunc.gz,MG1655,example/bam/wt_dpd_1.bam,98.11
3,wt_dpd_2,WTDPD2_S1_L001_R1_001.fastq.trunc.gz,WTDPD2_S1_L001_R2_001.fastq.trunc.gz,MG1655,example/bam/wt_dpd_2.bam,98.31
4,delfur_fe2_1,del-fur-Fe2-1_S1_L001_R1_001.fastq.trunc.gz,del-fur-Fe2-1_S1_L001_R2_001.fastq.trunc.gz,MG1655,example/bam/delfur_fe2_1.bam,92.82
5,delfur_fe2_2,del-fur-Fe2-2_S2_L001_R1_001.fastq.trunc.gz,del-fur-Fe2-2_S2_L001_R2_001.fastq.trunc.gz,MG1655,example/bam/delfur_fe2_2.bam,93.3
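The later cells rely on the `BAM` and `sample_id` columns of this table. As a minimal sketch, assuming those are the only columns that matter downstream, a small check can stop early when one is missing. The helper name `check_sample_table` and the toy table are hypothetical, used for illustration only.

```r
# Hypothetical helper: stop early if expected columns are absent.
check_sample_table <- function(tbl, required = c("sample_id", "BAM")) {
  missing_cols <- setdiff(required, colnames(tbl))
  if (length(missing_cols) > 0) {
    stop("Sample table is missing column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  invisible(tbl)
}

# Toy table standing in for the real CSV (values are illustrative only).
toy_table <- data.frame(sample_id = c("s1", "s2"),
                        BAM = c("bam/s1.bam", "bam/s2.bam"))
check_sample_table(toy_table)
```

In this notebook the call would be `check_sample_table(sampleTable)`.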


Put filenames into a character vector and check that they all exist.

In [4]:
filenames <- as.character(sampleTable$BAM)
all(file.exists(filenames))
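When `all(file.exists(filenames))` returns `FALSE`, it does not say which files are missing. The following sketch lists the missing paths instead; `missing_files` is a hypothetical helper name, and the temporary files are made up for the demonstration.

```r
# Hypothetical helper: list the paths that do not exist instead of
# returning a single TRUE/FALSE.
missing_files <- function(paths) {
  paths[!file.exists(paths)]
}

# Demonstration with one real temporary file and one made-up path.
real_file <- tempfile(fileext = ".bam")
file.create(real_file)
fake_file <- file.path(tempdir(), "does_not_exist.bam")
missing_files(c(real_file, fake_file))
```

In this notebook the call would be `missing_files(filenames)`, which returns an empty character vector when every BAM file is present.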

The `BamFileList` function prepares the BAM files for processing. The `yieldSize` argument sets how many reads are loaded into memory at a time (2,000,000 here). It can be increased to speed up read counting, or decreased to reduce memory load.

In [5]:
(bamfiles <- BamFileList(filenames, yieldSize=2000000))

BamFileList of length 8
names(8): wt_fe2_1.bam wt_fe2_2.bam ... delfur_dpd_1.bam delfur_dpd_2.bam

# Load GFF File

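Set `ORG_ID` using the ID from `0_setup_organism`.  
The path to the annotation file is read from the third line of `ref/<ORG_ID>/info.txt`.  
`makeTxDbFromGFF` loads the annotation file into a transcript database (`TxDb`).  
`exonsBy` extracts the exons from that database, grouped by gene.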

In [6]:
ORG_ID <- "MG1655"

In [8]:
info <- suppressWarnings(readLines(file.path("ref",ORG_ID,"info.txt")))
gff <- info[3]
txdb <- makeTxDbFromGFF(gff, format="gtf")
exons <- exonsBy(txdb, by="gene")

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
“Named parameters not used in query: gene_id, internal_tx_id”OK
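The cell above takes the annotation path from `info[3]` without checking it. The sketch below shows one way to fail with a clearer message, assuming (as in the cell above) that the path sits on the third line of `info.txt`. The helper name `read_annotation_path` and the temporary files are hypothetical.

```r
# Hypothetical helper: return the annotation path stored on a given line
# of info.txt, failing with a clear message if something is off.
read_annotation_path <- function(info_file, line_no = 3) {
  info <- readLines(info_file, warn = FALSE)
  if (length(info) < line_no) {
    stop("Expected at least ", line_no, " lines in ", info_file)
  }
  path <- trimws(info[line_no])
  if (!file.exists(path)) {
    stop("Annotation file not found: ", path)
  }
  path
}

# Toy demonstration using temporary files (contents are made up).
toy_gff <- tempfile(fileext = ".gff")
file.create(toy_gff)
toy_info <- tempfile(fileext = ".txt")
writeLines(c("line one", "line two", toy_gff), toy_info)
read_annotation_path(toy_info)
```

In this notebook the call would be `read_annotation_path(file.path("ref", ORG_ID, "info.txt"))`.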


# Count Reads

`summarizeOverlaps` counts the number of reads that overlap each gene in the GFF file. First, we initialize multiprocessing with `register(MulticoreParam(...))`, using the `workers` argument to set the number of cores to use. The `summarizeOverlaps` arguments are as follows:
* `features`: The genomic features loaded in the previous code block
* `reads`: The BAM files listed above
* `mode`: How to handle reads that overlap a feature only partially or overlap more than one feature. See [HTSeq-count](http://www-huber.embl.de/HTSeq/doc/count.html) documentation.
* `singleEnd`: TRUE if single-end, FALSE if paired-end
* `ignore.strand`: Whether to ignore strand information when counting, based on library preparation method
    * TRUE: Standard Illumina
    * FALSE: Directional Illumina (Ligation), Standard SOLiD, dUTP, NSR, NNSR
* `preprocess.reads` (optional): A function applied to the reads before counting
    * invertStrand: Necessary for dUTP, NSR and NNSR library preparation methods
* `fragments`: Whether to count unpaired reads (applies to paired-end data only)
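To make the `mode` argument more concrete, here is a toy base-R sketch of how two of the counting modes differ. It is a simplification, not the `GenomicAlignments` implementation: it ignores strand, read pairing and gapped alignments. The function `count_reads` and the coordinates are made up. Under "Union", a read counts if it overlaps exactly one feature; under "IntersectionStrict", a read counts if it lies entirely within exactly one feature.

```r
# Toy illustration of counting modes on plain integer intervals.
# `features` and `reads` are data frames with `start` and `end` columns.
count_reads <- function(reads, features,
                        mode = c("Union", "IntersectionStrict")) {
  mode <- match.arg(mode)
  counts <- setNames(integer(nrow(features)), features$id)
  for (i in seq_len(nrow(reads))) {
    r_start <- reads$start[i]
    r_end <- reads$end[i]
    if (mode == "Union") {
      # Any overlap with the feature qualifies.
      hit <- features$start <= r_end & features$end >= r_start
    } else {
      # The read must fall completely within the feature.
      hit <- features$start <= r_start & features$end >= r_end
    }
    # Reads matching zero or several features are not counted.
    if (sum(hit) == 1) counts[hit] <- counts[hit] + 1L
  }
  counts
}

# Two overlapping toy genes and five toy reads (coordinates are made up).
features <- data.frame(id = c("geneA", "geneB"),
                       start = c(1, 90), end = c(100, 200))
reads <- data.frame(start = c(10, 80, 95, 92, 150),
                    end = c(50, 95, 120, 98, 250))

count_reads(reads, features, "Union")
count_reads(reads, features, "IntersectionStrict")
```

With these toy intervals, "Union" counts one read for each gene, while "IntersectionStrict" counts two reads for geneA and one for geneB: reads that straddle the overlap region are rescued when they fit inside only one gene, and the read that runs past the end of geneB is dropped.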

In [10]:
register(MulticoreParam(workers = 8))
se <- summarizeOverlaps(features=exons, reads=bamfiles,
                        mode="IntersectionStrict",
                        singleEnd=FALSE,
                        ignore.strand=FALSE,
                        preprocess.reads=invertStrand,
                        fragments=FALSE)
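The cell above hard-codes `workers = 8`, which may exceed the cores available on a given machine. As a hedged sketch, the worker count could be derived from the machine with base R's `parallel::detectCores()`. The helper name `choose_workers` and the choice to leave one core free are assumptions, not part of the original workflow.

```r
# Leave one core free and never exceed an upper limit; detectCores() can
# return NA, in which case fall back to a single worker.
choose_workers <- function(max_workers = 8L) {
  available <- parallel::detectCores()
  if (is.na(available)) {
    return(1L)
  }
  as.integer(max(1L, min(max_workers, available - 1L)))
}

n_workers <- choose_workers()
n_workers
# The result could then be passed on, e.g. MulticoreParam(workers = n_workers)
```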

The final counts are stored in a [SummarizedExperiment](https://www.bioconductor.org/help/workflows/rnaseqGene/#summarizedexperiment) object.

In [11]:
se

class: RangedSummarizedExperiment 
dim: 4386 8 
metadata(0):
assays(1): counts
rownames(4386): b0001 b0002 ... b4706 b4708
rowData names(0):
colnames(8): wt_fe2_1.bam wt_fe2_2.bam ... delfur_dpd_1.bam
  delfur_dpd_2.bam
colData names(0):

Let's add the sample metadata to `colData` and set the column names to the sample IDs.

In [12]:
metadata <- sampleTable
colData(se) <- DataFrame(metadata)
colnames(se) <- colData(se)$sample_id

To view the raw counts, use `assay(se)`.

In [13]:
head(assay(se))

Unnamed: 0,wt_fe2_1,wt_fe2_2,wt_dpd_1,wt_dpd_2,delfur_fe2_1,delfur_fe2_2,delfur_dpd_1,delfur_dpd_2
b0001,218,294,29,29,316,367,27,21
b0002,3382,2782,5099,5787,3607,3763,1579,1876
b0003,1027,901,1411,1945,1061,1028,450,642
b0004,1062,977,1038,1366,929,920,326,420
b0005,45,34,18,24,30,25,11,10
b0006,58,53,63,78,64,52,57,63


Finally, save the `SummarizedExperiment` object as a checkpoint.

In [15]:
save(se,file='example/se.rda')
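A checkpoint saved with `save()` is restored with `load()`, which recreates the object under the name it was saved with. The sketch below demonstrates the round trip on a toy object and a temporary file; `se_toy` and the temporary path are stand-ins, not the real `se` or `example/se.rda`.

```r
# Toy object standing in for `se`; a temporary file stands in for the
# real checkpoint path.
se_toy <- list(counts = matrix(1:4, nrow = 2))
checkpoint <- tempfile(fileext = ".rda")
save(se_toy, file = checkpoint)

# load() restores objects under their saved names and returns those names.
restored_env <- new.env()
restored_names <- load(checkpoint, envir = restored_env)
restored_names
identical(restored_env$se_toy, se_toy)
```

In a later notebook, `load('example/se.rda')` would bring `se` back into the workspace.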