Script and pipeline for BRAINcode project
Latest commit d711094 Mar 6, 2018


BRAINcode overview

File naming convention

To standardize the process, we strongly suggest starting with a universal file naming format. This pipeline uses the following filename convention: [PD|AD|HD|HC]_[subjectID]_[cellType]_<b#>_<rep#>.[R1|R2].<fq|fastq>.gz where,

  • Start with the patient type as a 2-letter abbreviation: PD="Parkinson's disease", AD="Alzheimer's disease", HD="Huntington's disease", and HC="Healthy controls";
  • Follow with the subjectID (NOT the lane# or sequencing date), which is unique per subject;
  • Follow with the cell type or tissue short name, e.g. SNDA="Substantia nigra dopamine neuron", MCPY="Motor cortex pyramidal neuron" etc.;
  • b# and rep# are optional; if present, use the following formats:
    • batch number: b1, b2 etc.;
    • replicate number (capitalized for a biological replicate), e.g. rep1 for technical replicate 1, Rep1 for biological replicate 1;
  • Use R1 or R2 to distinguish the two mates of paired-end sequencing data;
  • Use gzip-compressed fastq files.
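
The convention above can be checked mechanically before a run. The following is a minimal sketch, not part of the pipeline: the function name is_valid_name is hypothetical, and the pattern assumes that subjectID and cellType contain no underscores or dots.

```shell
# Hypothetical helper (not part of the pipeline): test a file name against
# [PD|AD|HD|HC]_[subjectID]_[cellType]_<b#>_<rep#>.[R1|R2].<fq|fastq>.gz
is_valid_name() {
    name=$(basename "$1")
    # Assumption: subjectID and cellType contain no "_" or "." characters.
    pattern='^(PD|AD|HD|HC)_[^_.]+_[^_.]+(_b[0-9]+)?(_[Rr]ep[0-9]+)?\.R[12]\.(fq|fastq)\.gz$'
    printf '%s\n' "$name" | grep -Eq "$pattern"
}

# Report each example name as OK or BAD.
for f in HC_BN10-39_SNDA_b1_rep1.R1.fastq.gz PD_BN10-39_MCPY.R2.fq.gz sample1_L001.fastq.gz; do
    if is_valid_name "$f"; then echo "OK   $f"; else echo "BAD  $f"; fi
done
```

Because b# and rep# are optional, a pattern like this cannot tell a missing cellType from a present one in every case, so it is a sanity check rather than a full validator.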

Folder structure

Each project (e.g. PDMap) should have a folder named with the project short name followed by the type of data, such as PDMap_RNAseq, HDPredict_smallRNA etc.

  • rawfiles
    • raw fastq files from sequencing;
    • use soft links for files in external locations or with unformatted file names
    • log file (readme.txt or Excel)
  • filtered
    • filtered files (e.g. adaptor removal/clip)
  • run_output
    • sub-folder per sample (e.g. sample1, sample2 etc.)
      • output of Tophat/Cufflinks/htseq-count runs
      • uniq subfolder for runs that use unique reads only
      • status indication file (.status*) for tracking the progress
  • for_display
    • files used for display on UCSC / IGV, such as *.bam, *.bam.bai, *.bw, *.gtf etc.
    • use soft links to the output files
  • results
    • result files for integrative analysis, e.g. differential analysis by combining all samples.
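
As an illustration, the top-level layout above could be created as follows. This is a sketch only: the function name make_project_dirs is hypothetical, and the example writes into a temporary directory rather than a real project location.

```shell
# Sketch: create the top-level folder skeleton for one project.
# make_project_dirs is a hypothetical helper, not a pipeline script.
make_project_dirs() {
    project_dir=$1
    for d in rawfiles filtered run_output for_display results; do
        mkdir -p "$project_dir/$d"
    done
}

# Usage example in a scratch directory; PDMap_RNAseq is the example name from the text.
base=$(mktemp -d)
make_project_dirs "$base/PDMap_RNAseq"
ls "$base/PDMap_RNAseq"
```

The per-sample sub-folders under run_output are left out here, since they depend on the sample names.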

Pipeline requirement

  1. Install programs: tophat, cufflinks, bowtie, bedtools, samtools, fastq-mcf, fastqc, and htseq-count;
  2. Install Jim Kent's executable programs: http://hgdownload.cse.ucsc.edu/admin/exe/;
  3. Install R and Bioconductor packages: DESeq2, MatrixEQTL, SPIA etc.;
  4. Add the paths of the executable programs to $PATH.
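
A quick way to confirm step 4 is to test whether each program resolves on $PATH. The sketch below assumes the executable names match the program names listed above (for example, that bowtie is installed as bowtie rather than bowtie2) and that R provides Rscript; check_tools is a hypothetical helper.

```shell
# Sketch: report which required executables cannot be found on $PATH.
# Assumption: executable names equal the program names in the list above.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools tophat cufflinks bowtie bedtools samtools fastq-mcf fastqc htseq-count Rscript \
    || echo "Install the missing programs or add their folders to \$PATH."
```

This only checks that the programs can be found; it does not check versions or the R packages.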

Pipeline structure

Each sub-module has its own hidden status file (e.g. .status.$modulename.adaptorremoval). To check whether any sample has not completed a given step, use the script below, which prints the number of samples that have completed each step:

$ grep touch ~/pipeline/modules/_RNAseq.sh | grep -v "#" | sed 's/.*modulename\.//g' | while read i; do echo $i $(ls ~/neurogen/rnaseq_PD/run_output/*/.status._RNAseq.sh.$i 2>/dev/null | wc -l); done
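
The one-liner reports counts only. To see which samples are behind, a small function along these lines may help. It is a sketch that assumes the status files are named .status._RNAseq.sh.<step> and sit directly in each sample folder, as in the command above; missing_samples is a hypothetical name.

```shell
# Sketch: list sample folders under run_output that lack the status file for one step.
# Assumption: status files are named .status._RNAseq.sh.<step> inside each sample folder.
missing_samples() {
    run_output=$1
    step=$2
    for sample_dir in "$run_output"/*/; do
        [ -d "$sample_dir" ] || continue
        [ -e "${sample_dir}.status._RNAseq.sh.$step" ] || basename "$sample_dir"
    done
}

# Example call (path taken from the command above):
# missing_samples ~/neurogen/rnaseq_PD/run_output mapping
```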

Main script:

RNAseq.pipeline.sh

  • Usage: RNAseq.pipeline.sh /data/neurogen/rnaseq_PD/rawfiles
  • Function: Main script for submitting RNAseq data analysis jobs in batch to a high-performance computing (HPC) cluster. For now, it is configured to support only HPC environments with the LSF job scheduler.
  • Input: absolute path of the folder containing the raw fastq files
  • Output: Tophat/Cufflinks/htseq-count/callSNP etc. runs for each sample, using both multiple and unique mappers
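
To illustrate the batch idea, the sketch below pairs each R1 file with its R2 mate and prints one submission command per sample without running anything. It is not the actual content of RNAseq.pipeline.sh: the function list_jobs is hypothetical, and the bsub options shown (-J for the job name, -o for the log file) are standard LSF flags chosen here only as an example.

```shell
# Sketch only (dry run): pair R1/R2 files and print the command that could be submitted.
# How RNAseq.pipeline.sh really builds its LSF submission is not shown in this README.
list_jobs() {
    rawdir=$1
    for ext in fastq fq; do
        for r1 in "$rawdir"/*.R1."$ext".gz; do
            [ -e "$r1" ] || continue            # glob matched nothing
            prefix=${r1%.R1.$ext.gz}
            r2=$prefix.R2.$ext.gz
            sample=$(basename "$prefix")
            if [ -e "$r2" ]; then
                echo "bsub -J $sample -o $sample.log _RNAseq.sh $(basename "$r1") $(basename "$r2")"
            else
                echo "WARNING: no R2 mate for $r1" >&2
            fi
        done
    done
}

# Example call with the path from the usage line above; prints nothing if the folder is absent.
list_jobs /data/neurogen/rnaseq_PD/rawfiles
```

Printing the commands first makes it easy to spot unpaired or misnamed files before any job reaches the scheduler.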

Modules:

_RNAseq.sh

  • Usage: _RNAseq.sh HC_BN10-39_2.R1.fastq.gz HC_BN10-39_2.R2.fastq.gz
  • Function: Routine steps for paired-end RNAseq data
  • Input: a pair of fastq files from paired-end RNA sequencing
  • Output: BAM/SAM (from alignment), BED/GTF (from assembly) etc.

List of sub-modules in the core script _RNAseq.sh:

$ grep touch ~/pipeline/modules/_RNAseq.sh | grep -v "#" | sed 's/.*modulename\.//g'
adaptorremoval 
shortReadsExtract 
shortReadsExtract_stemloop
fastqc
mapping
circRNA
sam2bam
bam2stat
sam2bw
rpm_vs_coverage
cufflinks.multi.denovo
cufflinks 
htseqcount
metaexon
uniq
uniq.sam2bw
uniq.bam2stat
cufflinks.denovo
uniq.bam2annotation
uniq.normalize
uniq.cufflinks.rpkm
uniq.cuffquant.rpkm
uniq.htseqcount
uniq.metaexon
makelinks
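
Each name in the list corresponds to a status file that the module creates with touch when the step finishes. The checkpoint pattern this enables can be sketched as follows; run_step is a hypothetical wrapper, and the real _RNAseq.sh may structure its steps differently.

```shell
# Sketch of status-file checkpointing: skip a step whose status file exists,
# and create the status file only if the step's command succeeds.
# run_step is hypothetical; modulename mirrors the variable named in the text.
modulename=_RNAseq.sh

run_step() {
    step=$1
    shift
    status_file=".status.$modulename.$step"
    if [ -e "$status_file" ]; then
        echo "skip $step"
        return 0
    fi
    "$@" && touch "$status_file"
}

# Usage example in a scratch directory; echo stands in for the real command.
cd "$(mktemp -d)" || exit 1
run_step fastqc echo "running fastqc"   # runs and writes .status._RNAseq.sh.fastqc
run_step fastqc echo "running fastqc"   # status file exists, so the step is skipped
```

With this pattern, a failed or interrupted sample can be resubmitted and only the unfinished steps are rerun.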