# Adapted from:  http://www.bioconductor.org/help/workflows/rnaseqGene/

#### To learn basics of R: https://www.datacamp.com/courses/free-introduction-to-r

# Table of Contents

1. [Setup](#Setup)
1. [Load BAM files](#Load-BAM-files)
1. [Load GFF file](#Load-GFF-file)
1. [Count Reads](#Count-Reads)

# Setup

If this is your first time running this script, uncomment and run the following code block. It installs all necessary R packages and may take 5-10 minutes to complete.

In [1]:
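# BiocManager replaces the retired biocLite() installer
# if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')
# BiocManager::install('Rsamtools')
# BiocManager::install('GenomicFeatures')
# BiocManager::install('GenomicAlignments')
# BiocManager::install('BiocParallel')
# BiocManager::install('SummarizedExperiment')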
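
Before uncommenting the installation block, it can help to see which packages are actually missing. The following is a minimal sketch using base R's `requireNamespace`; the helper name `missing_packages` is hypothetical and not part of the original workflow.

```r
# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("Rsamtools", "GenomicFeatures", "GenomicAlignments",
              "BiocParallel", "SummarizedExperiment")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Only the packages reported as missing need to be installed.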

Now import the libraries. This code block must be run each time.

In [20]:
suppressPackageStartupMessages(library('Rsamtools'))
suppressPackageStartupMessages(library('GenomicFeatures'))
suppressPackageStartupMessages(library('GenomicAlignments'))
suppressPackageStartupMessages(library('BiocParallel'))
suppressPackageStartupMessages(library('SummarizedExperiment'))

# Load BAM files

The `align_reads.ipynb` notebook recorded the paths to all our BAM files in `example/aligned_files.csv`.  
**Enter your alignment csv here:**

In [6]:
ALIGNMENT_CSV <- 'example/aligned_files.csv'

In [7]:
sampleTable <- read.csv(ALIGNMENT_CSV,row.names=1)
head(sampleTable)

Unnamed: 0,sample_id,metadata,R1,R2,BAM,alignment
0,wt_fe2_1,/media/nucleoid/raw_data/dhkim/fur/2016-02-14_RNA-seq/2016-02-14_RNA-seq.csv,/media/nucleoid/raw_data/dhkim/fur/2016-02-14_RNA-seq/WT-Fe2-1_S1_L001_R1_001.fastq.gz,/media/nucleoid/raw_data/dhkim/fur/2016-02-14_RNA-seq/WT-Fe2-1_S1_L001_R2_001.fastq.gz,/home/anand/Documents/github/RNAseq_workflow/example/bam_dup/wt_fe2_1.bam,93.71
1,wt_fe2_2,/media/nucleoid/raw_data/dhkim/fur/2016-02-14_RNA-seq/2016-02-14_RNA-seq.csv,/media/nucleoid/raw_data/dhkim/fur/2016-02-14_RNA-seq/WT-FE2-2_S2_L001_R1_001.fastq.gz,/media/nucleoid/raw_data/dhkim/fur/2016-02-14_RNA-seq/WT-FE2-2_S2_L001_R2_001.fastq.gz,/home/anand/Documents/github/RNAseq_workflow/example/bam_dup/wt_fe2_2.bam,92.75
2,wt_dpd_1,/media/nucleoid/raw_data/dhkim/fur/2016-02-13_RNA-seq/2016-02-13_RNA-seq.csv,/media/nucleoid/raw_data/dhkim/fur/2016-02-13_RNA-seq/WTDPD1_S1_L001_R1_001.fastq.gz,/media/nucleoid/raw_data/dhkim/fur/2016-02-13_RNA-seq/WTDPD1_S1_L001_R2_001.fastq.gz,/home/anand/Documents/github/RNAseq_workflow/example/bam_dup/wt_dpd_1.bam,98.39
3,wt_dpd_2,/media/nucleoid/raw_data/dhkim/fur/2016-02-13_RNA-seq/2016-02-13_RNA-seq.csv,/media/nucleoid/raw_data/dhkim/fur/2016-02-13_RNA-seq/WTDPD2_S1_L001_R1_001.fastq.gz,/media/nucleoid/raw_data/dhkim/fur/2016-02-13_RNA-seq/WTDPD2_S1_L001_R2_001.fastq.gz,/home/anand/Documents/github/RNAseq_workflow/example/bam_dup/wt_dpd_2.bam,98.72
4,delfur_fe2_1,/media/nucleoid/raw_data/dhkim/fur/2016-02-11_RNA-seq/2016-02-11_RNA-seq.csv,/media/nucleoid/raw_data/dhkim/fur/2016-02-11_RNA-seq/del-fur-Fe2-1_S1_L001_R1_001.fastq.gz,/media/nucleoid/raw_data/dhkim/fur/2016-02-11_RNA-seq/del-fur-Fe2-1_S1_L001_R2_001.fastq.gz,/home/anand/Documents/github/RNAseq_workflow/example/bam_dup/delfur_fe2_1.bam,93.23
5,delfur_fe2_2,/media/nucleoid/raw_data/dhkim/fur/2016-02-11_RNA-seq/2016-02-11_RNA-seq.csv,/media/nucleoid/raw_data/dhkim/fur/2016-02-11_RNA-seq/del-fur-Fe2-2_S2_L001_R1_001.fastq.gz,/media/nucleoid/raw_data/dhkim/fur/2016-02-11_RNA-seq/del-fur-Fe2-2_S2_L001_R2_001.fastq.gz,/home/anand/Documents/github/RNAseq_workflow/example/bam_dup/delfur_fe2_2.bam,93.62


Put filenames into a character vector and check that they all exist.

In [8]:
filenames <- as.character(sampleTable$BAM)
all(file.exists(filenames))
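
If the check above returns `FALSE`, the next question is which paths are missing. The following is a small sketch using only base R; `report_missing_files` is an illustrative name, and the temporary files stand in for real BAM paths.

```r
# Illustrative helper: return the paths that do not exist on disk.
report_missing_files <- function(paths) {
  paths[!file.exists(paths)]
}

# Demo with a temporary file standing in for a BAM file.
present <- tempfile(fileext = ".bam")
file.create(present)
absent <- file.path(tempdir(), "does_not_exist.bam")

missing_bams <- report_missing_files(c(present, absent))
print(missing_bams)
```

In this notebook the helper would be called as `report_missing_files(filenames)`.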

The `BamFileList` function prepares the BAM files for processing. The `yieldSize` argument sets how many reads are loaded into memory at a time (2,000,000 here). Increase it to speed up counting, or decrease it to reduce memory use.

In [9]:
(bamfiles <- BamFileList(filenames, yieldSize=2000000))

BamFileList of length 8
names(8): wt_fe2_1.bam wt_fe2_2.bam ... delfur_dpd_1.bam delfur_dpd_2.bam
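
To see what `yieldSize` does, the sketch below reads a BAM file in chunks of 1,000 records. It assumes the small `ex1.bam` example file that ships with Rsamtools rather than the workflow's own data, and the chunk size is chosen only for illustration.

```r
suppressPackageStartupMessages(library(Rsamtools))

# Example BAM bundled with Rsamtools (not one of the workflow's files).
example_bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# With yieldSize set, each scanBam() call on the open file returns
# at most that many records, continuing where the last call stopped.
bf <- open(BamFile(example_bam, yieldSize = 1000))
param <- ScanBamParam(what = "qname")
chunk_sizes <- integer(0)
repeat {
  n <- length(scanBam(bf, param = param)[[1]]$qname)
  if (n == 0) break
  chunk_sizes <- c(chunk_sizes, n)
}
close(bf)

chunk_sizes       # number of records in each chunk
sum(chunk_sizes)  # total records read
```

A smaller `yieldSize` means more, smaller chunks, so less memory is held at any one time.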

# Load GFF file

`makeTxDbFromGFF` loads the GFF file into a transcript database (`TxDb`).  
`exonsBy` extracts the exons from that database, grouped by gene.

In [10]:
GFF_FILE <- "./example/ref/NC_000913.3.gff"
txdb <- makeTxDbFromGFF(GFF_FILE, format="gtf")
exons <- exonsBy(txdb, by="gene")

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
“Named parameters not used in query: gene_id, internal_tx_id”OK


# Count Reads

`summarizeOverlaps` counts the number of reads that overlap each gene in the GFF file. First, we initialize multiprocessing, using the `workers` argument to set the number of cores to use. The `summarizeOverlaps` arguments are as follows:
* `features`: The genomic features loaded in the previous code block
* `reads`: The BAM files listed above
* `mode`: How to deal with reads that overlap more than one feature. See the [HTSeq-count](http://www-huber.embl.de/HTSeq/doc/count.html) documentation.
* `singleEnd`: TRUE if single-end, FALSE if paired-end
* `ignore.strand`: Whether to ignore strand information when counting, based on library preparation method
    * TRUE: Standard Illumina
    * FALSE: Directional Illumina (Ligation), Standard SOLiD, dUTP, NSR, NNSR
* `preprocess.reads` (optional): Function applied to the reads before counting
    * invertStrand: Necessary for dUTP, NSR and NNSR library preparation methods
* `fragments`: Whether to also count unpaired reads (applies to paired-end data only)
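
The value passed to `workers` should not exceed the number of cores available on the machine. The following is a hedged sketch for choosing it with base R's `parallel::detectCores`; leaving one core free and capping at 6 (the value used in this notebook) are assumptions, not requirements.

```r
# detectCores() can return NA on some platforms, so fall back to 1.
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the rest of the system, and cap at 6.
n_workers <- max(1L, min(6L, available - 1L))
n_workers
```

`n_workers` could then be passed as `MulticoreParam(workers = n_workers)`. Note that `MulticoreParam` relies on forking, which is not available on Windows.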

In [11]:
register(MulticoreParam(workers = 6))
se <- summarizeOverlaps(features=exons, reads=bamfiles,
                        mode="Union",
                        singleEnd=FALSE,
                        ignore.strand=FALSE,
                        preprocess.reads=invertStrand,
                        fragments=FALSE)
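
To make the effect of `mode` and `ignore.strand` concrete, here is a toy sketch that counts three made-up reads against two overlapping made-up genes. All coordinates and names (`geneA`, `geneB`, `chr1`) are invented for illustration, and the reads are single-end `GAlignments` built in memory, not the paired-end BAM files used above.

```r
suppressPackageStartupMessages(library(GenomicAlignments))
suppressPackageStartupMessages(library(SummarizedExperiment))

# Two toy genes on the + strand that overlap at positions 90-100.
toy_genes <- GRangesList(
  geneA = GRanges("chr1", IRanges(1, 100), strand = "+"),
  geneB = GRanges("chr1", IRanges(90, 200), strand = "+")
)

# Three toy 20 bp reads: one inside geneA only, one spanning the
# overlap of both genes, and one inside geneB but on the - strand.
toy_reads <- GAlignments(
  seqnames = Rle(factor("chr1"), 3),
  pos      = c(10L, 85L, 150L),
  cigar    = rep("20M", 3),
  strand   = Rle(strand(c("+", "+", "-")))
)

stranded   <- summarizeOverlaps(toy_genes, toy_reads,
                                mode = "Union", ignore.strand = FALSE)
unstranded <- summarizeOverlaps(toy_genes, toy_reads,
                                mode = "Union", ignore.strand = TRUE)

# Compare the per-gene counts under the two settings.
cbind(stranded = assay(stranded)[, 1], unstranded = assay(unstranded)[, 1])
```

Under `Union` mode a read that overlaps more than one gene is treated as ambiguous and is not assigned to either, and with `ignore.strand=FALSE` a read only counts toward a gene on the same strand.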

The final counts are stored in a [SummarizedExperiment](https://www.bioconductor.org/help/workflows/rnaseqGene/#summarizedexperiment) object.

In [12]:
se

class: RangedSummarizedExperiment 
dim: 4319 8 
metadata(0):
assays(1): counts
rownames(4319): b0001 b0002 ... b4706 b4708
rowData names(0):
colnames(8): wt_fe2_1.bam wt_fe2_2.bam ... delfur_dpd_1.bam
  delfur_dpd_2.bam
colData names(0):

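Let's add the sample metadata to `colData` and set the column names. The assignment is positional, so the rows of `example/metadata.csv` must be in the same order as the BAM files.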

In [13]:
metadata <- read.csv('example/metadata.csv')
colData(se) <- DataFrame(metadata)
colnames(se) <- colData(se)$sample_id
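
Because the metadata is attached by position, it is worth confirming that its rows line up with the BAM files first. The following is a base R sketch, assuming (as the code above implies) that the metadata has a `sample_id` column and that each BAM file is named `<sample_id>.bam`; the toy vectors stand in for `colnames(se)` and `metadata`.

```r
# Toy stand-ins for colnames(se) and the metadata table.
bam_names <- c("wt_fe2_1.bam", "wt_fe2_2.bam", "wt_dpd_1.bam")
toy_metadata <- data.frame(
  sample_id = c("wt_dpd_1", "wt_fe2_1", "wt_fe2_2"),
  stringsAsFactors = FALSE
)

# Strip the .bam extension to recover the sample IDs.
bam_ids <- sub("\\.bam$", "", bam_names)

# Reorder the metadata rows to match the BAM order; match() returns NA
# for any BAM file without a metadata row, which stops the script.
idx <- match(bam_ids, toy_metadata$sample_id)
stopifnot(!anyNA(idx))
toy_metadata <- toy_metadata[idx, , drop = FALSE]

stopifnot(identical(toy_metadata$sample_id, bam_ids))
toy_metadata
```

Reordering before the assignment keeps each column of counts paired with the right sample annotations.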

To view raw counts, use `assay(se)`

In [17]:
head(assay(se))

Unnamed: 0,wt_fe2_1,wt_fe2_2,wt_dpd_1,wt_dpd_2,delfur_fe2_1,delfur_fe2_2,delfur_dpd_1,delfur_dpd_2
b0001,2303,2354,311,284,3022,3486,245,232
b0002,37014,25850,58245,61055,40634,38562,17943,23165
b0003,10941,8272,16878,21079,11610,10854,4817,7890
b0004,11987,8493,12182,14567,10102,9566,3738,5112
b0005,473,348,216,178,363,297,104,121
b0006,669,590,703,633,682,653,662,767


Finally, save the `SummarizedExperiment` object as a checkpoint.

In [80]:
save(se,file='example/se.rda')
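
In a later session the checkpoint can be restored with `load`, which recreates the object under its original name. The sketch below shows the round trip on a toy object in a temporary file, so it does not touch `example/se.rda`; the object name `toy_counts` is made up.

```r
# Toy object standing in for `se`.
toy_counts <- matrix(1:4, nrow = 2,
                     dimnames = list(c("b0001", "b0002"), c("s1", "s2")))
expected <- toy_counts

checkpoint <- tempfile(fileext = ".rda")
save(toy_counts, file = checkpoint)

# Remove the object, then restore it from the checkpoint.
rm(toy_counts)
loaded_names <- load(checkpoint)

loaded_names                     # names of the restored objects
identical(toy_counts, expected)  # check the round trip
```

For the real checkpoint, `load('example/se.rda')` would bring `se` back into the workspace.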