<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Locate-data-files" data-toc-modified-id="Locate-data-files-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Locate data files</a></span></li><li><span><a href="#Enter-files-into-a-Tab-separated-file-with-4-columns:" data-toc-modified-id="Enter-files-into-a-Tab-separated-file-with-4-columns:-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Enter files into a Tab-separated file with 4 columns:</a></span></li></ul></li><li><span><a href="#QC" data-toc-modified-id="QC-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>QC</a></span></li><li><span><a href="#Align-Reads" data-toc-modified-id="Align-Reads-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Align Reads</a></span></li></ul></div>

# Setup

In [1]:
from rnaseq import align_reads
import pandas as pd
import os

## Locate data files

To test out the pipeline on an example dataset, download this file and extract it into the `example` folder: [RNAseq_pipeline_example_data.tar.gz](https://www.dropbox.com/s/xvvy18lz5rergvd/RNAseq_pipeline_example_data.tar.gz?dl=0)

In [2]:
# Set data location
DATA_DIR = './fq'

## Enter files into a Tab-separated file with 4 columns:
1. Unique sample identifier
  * Make these easy to read and understand
  * Biological replicates should end with `_1`, `_2`, etc.
1. R1 file location
  * If your fastq files are split across >1 files, separate using semicolons
1. R2 file location
1. Organism ID (from 0_setup_organism)
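
As a minimal sketch of the four-column layout described above, the following writes a small tab-separated sample sheet and reads it back using only the standard library. The sample IDs, fastq file names, and the `sample_sheet.tsv` file name are hypothetical placeholders; the organism ID reuses `USA300_TCH1516` from this notebook.

```python
import os
import tempfile

COLUMNS = ['sample_id', 'R1', 'R2', 'organism']

# Hypothetical rows: IDs end in _1, _2 to mark biological replicates.
rows = [
    ['wt_glc_1', 'wt_glc_1_R1_001.fastq.gz', 'wt_glc_1_R2_001.fastq.gz', 'USA300_TCH1516'],
    ['wt_glc_2', 'wt_glc_2_R1_001.fastq.gz', 'wt_glc_2_R2_001.fastq.gz', 'USA300_TCH1516'],
]

# Write the header and one tab-separated line per sample.
sheet_path = os.path.join(tempfile.mkdtemp(), 'sample_sheet.tsv')
with open(sheet_path, 'w') as handle:
    handle.write('\t'.join(COLUMNS) + '\n')
    for row in rows:
        handle.write('\t'.join(row) + '\n')

# Read the file back into a list of dicts keyed by column name.
with open(sheet_path) as handle:
    header = handle.readline().rstrip('\n').split('\t')
    records = [dict(zip(header, line.rstrip('\n').split('\t'))) for line in handle]

print('%s -> %s' % (records[0]['sample_id'], records[0]['organism']))
```

The same file can be loaded in the notebook with `pd.read_csv(sheet_path, sep='\t')`.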

**NOTE**: To reduce file size and computation time, I have provided truncated versions of the fastq files. The fastq/bam files in the example folder are test files only.

In [3]:
master = pd.read_csv('./master.csv', index_col=0)

In [5]:
# Build the sample sheet: one row per R1 file, with the sample ID looked up in master
rows = []
for fname in os.listdir(DATA_DIR):
    if 'R1' in fname:
        uid = master[master['fastq-read1'] == fname].index[0]
        rows.append({'sample_id': uid,
                     'R1': fname,
                     'R2': fname.replace('_R1_', '_R2_'),
                     'organism': 'USA300_TCH1516'})
DF_files = pd.DataFrame(rows, columns=['sample_id', 'R1', 'R2', 'organism'])
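
The cell above derives each R2 name from its R1 name by swapping `_R1_` for `_R2_`. A small sketch of the same pairing rule that also reports R1 files whose mate is missing is shown here. The helper name `pair_fastq` and the example file names are hypothetical and not part of the `rnaseq` package.

```python
def pair_fastq(filenames, r1_tag='_R1_', r2_tag='_R2_'):
    """Return (pairs, unpaired): R1/R2 name pairs and R1 files lacking a mate."""
    available = set(filenames)
    pairs, unpaired = [], []
    for name in sorted(filenames):
        if r1_tag not in name:
            continue  # not an R1 file
        mate = name.replace(r1_tag, r2_tag)
        if mate in available:
            pairs.append((name, mate))
        else:
            unpaired.append(name)
    return pairs, unpaired


# Hypothetical directory listing: s2 has no R2 file.
names = ['s1_R1_001.fastq.gz', 's1_R2_001.fastq.gz', 's2_R1_001.fastq.gz']
pairs, unpaired = pair_fastq(names)
print(pairs)
print(unpaired)
```

In the notebook, `pair_fastq(os.listdir(DATA_DIR))` would give the pairs to enter into the sample sheet.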

# QC

**Before alignment, run FastQC on your samples to assess the quality of the raw reads.**
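
One way to script this step is to build the FastQC command line in Python. The sketch below only assembles the argument list; it assumes the `fastqc` executable is on the `PATH` and uses its `-o` (output directory) and `-t` (threads) options. The helper name, the `./fastqc` output folder, and the file names are hypothetical.

```python
import os


def fastqc_command(fastq_files, out_dir, threads=1):
    """Build a FastQC argument list: -o sets the report directory, -t the thread count."""
    return ['fastqc', '-o', out_dir, '-t', str(threads)] + list(fastq_files)


# Hypothetical paired-end files inside the data directory used in this notebook.
fastqs = [os.path.join('./fq', 'sample_R1_001.fastq.gz'),
          os.path.join('./fq', 'sample_R2_001.fastq.gz')]
cmd = fastqc_command(fastqs, './fastqc', threads=4)
print(' '.join(cmd))

# To run it (the output directory must already exist):
# import subprocess
# subprocess.check_call(cmd)
```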

In [7]:
DF_files = pd.read_csv('DF_files.csv', index_col=0)

In [8]:
print('Number of unique sample IDs: %d' % len(DF_files.sample_id.unique()))

Number of unique sample IDs: 23


In [9]:
all_R1 = [r1.split(',') for r1 in DF_files.R1.values]
all_R2 = [r2.split(',') for r2 in DF_files.R2.values]
print('Number of unique R1 files: %d' % len(DF_files.R1.unique()))
print('Number of unique R2 lists: %d' % len(DF_files.R2.unique()))

Number of unique R1 files: 23
Number of unique R2 lists: 23
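
The counts above can be complemented by a check that names the offending rows. The following is a sketch only: `check_sample_sheet` is a hypothetical helper, and the separator is a parameter that defaults to the comma used in the cell above. It reports duplicated sample IDs and rows whose R1 and R2 entries list different numbers of files.

```python
from collections import Counter


def check_sample_sheet(sample_ids, r1_entries, r2_entries, sep=','):
    """Return a list of human-readable problems found in the sample sheet."""
    problems = []
    # Every sample ID should appear exactly once.
    for sid, n in sorted(Counter(sample_ids).items()):
        if n > 1:
            problems.append('duplicate sample_id: %s' % sid)
    # Each row should list as many R2 files as R1 files.
    for sid, r1, r2 in zip(sample_ids, r1_entries, r2_entries):
        n1, n2 = len(r1.split(sep)), len(r2.split(sep))
        if n1 != n2:
            problems.append('%s: %d R1 file(s) but %d R2 file(s)' % (sid, n1, n2))
    return problems


# Toy values for illustration only.
ids = ['a_1', 'a_1', 'b_1']
r1s = ['x_R1.fq', 'y_R1.fq', 'p_R1.fq,q_R1.fq']
r2s = ['x_R2.fq', 'y_R2.fq', 'p_R2.fq']
problems = check_sample_sheet(ids, r1s, r2s)
print(problems)
```

With the DataFrame from this notebook, the call would be `check_sample_sheet(DF_files.sample_id, DF_files.R1, DF_files.R2)`; an empty list means no problems were found.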


# Align Reads

The `align_reads` function takes the following required arguments:
* `name`: The unique sample name used to name the output files
* `R1`: Location of the R1 file
* `R2`: Location of the R2 file
* `bt_index`: Location of bowtie index to use for alignment
* `out_dir`: Output directory

Optional arguments:
* `aligner`: 'bowtie' or 'bowtie2' (default 'bowtie')
* `insertsize`: Maximum distance between paired ends (default 1000)
* `cores`: Number of cores to use (default 1)
* `force`: Re-runs alignment even if BAM file already exists
* `verbose`: Update user with current process

`align_reads` performs the following:
1. Unzips .gz files into a temporary folder (if necessary)
2. Uses the bowtie aligner to align reads to a bowtie index:
    * Bowtie: `bowtie -X 1000 -n 2 -p <cores> -3 3 -S -1 <R1_files> -2 <R2_files> <bt_index>`
    * Bowtie2: `bowtie2 -X 1000 -N 1 -p <cores> -3 3 -1 <R1_files> -2 <R2_files> -x <bt_index>`
    * For information about these options, see docs for [bowtie](http://bowtie-bio.sourceforge.net/manual.shtml) and [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)
3. Converts the SAM output of bowtie to BAM
    * `samtools view -b <bowtie_out> -@ <cores> -o <unsorted_bam>`
4. Sorts the resulting BAM file
    * `samtools sort <unsorted_bam> -@ <cores> -o <sorted_bam>`
5. Cleans up intermediate files
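
To make steps 2 to 4 concrete, the sketch below fills in the command templates listed above for a single sample. It is not the `align_reads` implementation: the helper name and the intermediate file names (`.sam`, `.unsorted.bam`, `.bam`) are assumptions, and how the aligner's SAM output is captured into the `.sam` file is not shown in this notebook, so the sketch leaves that part out.

```python
import os


def alignment_commands(name, r1, r2, bt_index, out_dir, cores=1, aligner='bowtie'):
    """Return the align, convert, and sort command strings for one sample."""
    # Hypothetical intermediate and final file names.
    sam = os.path.join(out_dir, name + '.sam')
    unsorted_bam = os.path.join(out_dir, name + '.unsorted.bam')
    sorted_bam = os.path.join(out_dir, name + '.bam')

    if aligner == 'bowtie':
        align = 'bowtie -X 1000 -n 2 -p {c} -3 3 -S -1 {r1} -2 {r2} {idx}'
    elif aligner == 'bowtie2':
        align = 'bowtie2 -X 1000 -N 1 -p {c} -3 3 -1 {r1} -2 {r2} -x {idx}'
    else:
        raise ValueError("aligner must be 'bowtie' or 'bowtie2'")

    return [
        align.format(c=cores, r1=r1, r2=r2, idx=bt_index),
        'samtools view -b {sam} -@ {c} -o {bam}'.format(sam=sam, c=cores, bam=unsorted_bam),
        'samtools sort {bam} -@ {c} -o {out}'.format(bam=unsorted_bam, c=cores, out=sorted_bam),
    ]


# Hypothetical inputs.
cmds = alignment_commands('s1', 'a_R1.fq', 'a_R2.fq', 'idx/ref', './bam', cores=8)
for cmd in cmds:
    print(cmd)
```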

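As the call below shows, `align_reads` returns the location of the final BAM file, the alignment score (%), and a set of BAM statistics (mean Phred score, total reads, and mapped reads).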

In [10]:
OUT_DIR = './bam'

In [11]:
for i, row in DF_files.iterrows():
    # Skip these two samples
    if row.sample_id in ('U01_201', 'U01_199'):
        continue
    bam, score, bam_stats = align_reads(row.sample_id, row.R1, row.R2, row.organism,
                                        DATA_DIR, OUT_DIR, cores=8, verbose=True)

    # Record the alignment results for this sample
    DF_files.loc[i, 'BAM'] = bam
    DF_files.loc[i, 'percent aligned'] = score
    DF_files.loc[i, 'mean phred scores'] = bam_stats[0]
    DF_files.loc[i, 'total reads'] = bam_stats[1]
    DF_files.loc[i, 'mapped reads'] = bam_stats[2]

Processing U01_53
Processing U01_50
Processing U01_56
Processing U01_61
Processing U01_70
Processing U01_59
Processing U01_67
Processing U01_60
Processing U01_55
Processing U01_57
Processing U01_66
Processing U01_62
Processing U01_51
Processing U01_65
Processing U01_63
Processing U01_54
Processing U01_72
Processing U01_64
Processing U01_69
Processing U01_71
Processing U01_68
Processing U01_52
Processing U01_58
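
Once the loop has finished, the recorded statistics can be screened for samples that aligned poorly. The sketch below is illustrative only: the helper names, the 90% threshold, and the read counts are assumptions, and computing the percentage from mapped and total reads is offered as a cross-check rather than as the definition of the score that `align_reads` reports.

```python
def percent_mapped(total_reads, mapped_reads):
    """Percentage of reads that mapped; 0.0 when there are no reads."""
    if not total_reads:
        return 0.0
    return 100.0 * mapped_reads / total_reads


def flag_low_alignment(percent_by_sample, threshold=90.0):
    """Return the sorted sample IDs whose alignment percentage is below the threshold."""
    return sorted(s for s, pct in percent_by_sample.items() if pct < threshold)


# Toy read counts (total, mapped) for illustration only.
counts = {'sample_a': (1000, 980), 'sample_b': (1000, 700)}
percents = dict((s, percent_mapped(t, m)) for s, (t, m) in counts.items())
low = flag_low_alignment(percents)
print(low)
```

With the DataFrame from this notebook, `DF_files.set_index('sample_id')['percent aligned'].to_dict()` would give a suitable input for `flag_low_alignment`.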


In [16]:
DF_files.to_csv('./DF_files.csv')