Stephen Fisher edited this page Sep 15, 2017 · 75 revisions

PennSCAP-T Pipeline

The PennSCAP-T Pipeline is designed to automate the processing of next-generation sequencing data. The master script is used to run the various 'modules' (aka 'commands') that perform individual tasks. Each module should perform a single task. The module name (uppercase) is the command name (lowercase); for example, the '' file contains the code for the 'INIT' module, which performs the 'init' command. While the module files end in '.sh', they are not meant to be run outside of the master script.

Current modules:

  • HELP: expanded help for other modules
  • INIT: prepare read file(s) for processing
  • FASTQC: run FastQC
  • BLAST: run blast on randomly sampled subset of reads
  • BARCODE: select reads with a specific barcode sequence
  • RMDUP: remove duplicates
  • TRIM: trim adapter and poly-A/T contamination (requires Python 2.7 or later)
  • RUM: run RUM on trimmed reads
  • RUMSTATUS: get status of RUM run
  • STAR: run STAR on trimmed reads
  • BOWTIE: run bowtie on trimmed reads
  • SNP: perform SNP calling on bowtie BAM file
  • SPAdes: run SPAdes on trimmed reads
  • HTSEQ: run HTSeq on unique mappers from STAR
  • VERSE: run Verse on unique mappers from STAR
  • POST: clean up trimmed data
  • BLASTDB: create blast database from reads
  • RSYNC: copy data to analyzed directory
  • STATS: print stats from blast, trimming and STAR
  • PIPELINE: run full pipeline
  • VERSION: print version information
  • TEMPLATE: this is an empty module that can be used as an example to build new modules

Using the script, modules can be run on their own, although modules may depend on the output from other modules. For example, the INIT module places the raw read files (uncompressed) in a directory called 'orig'. The TRIM module, by default, looks for the uncompressed read files in the 'orig' directory and places the trimmed read files in a directory called 'trim'. The STAR module uses the reads from the 'trim' directory and hence is expected to run after TRIM.

The PIPELINE module is effectively a meta-module and runs the following modules: init, fastqc, blast, trim, star, post, blastdb, htseq, rsync
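The meta-module behavior can be sketched as follows. This is a hypothetical illustration only, not the actual master script: it simply shows the submodules being dispatched in order for one sample.

```shell
#!/bin/sh
# Hypothetical sketch of a PIPELINE-style meta-module: run each submodule
# in order on a single sample. The real pipeline dispatches each module
# through the master script; here we only echo the sequence.
SAMPLE="sampleID"
for MODULE in init fastqc blast trim star post blastdb htseq rsync; do
    echo "running $MODULE on $SAMPLE"
done
```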

Installation and Running

To install the pipeline, download the files from GitHub and place them in a directory that is in your executable PATH. Make sure the master script and the *.py files are executable. Also be sure you have the required additional programs installed (see Requirements below).

The master script is used to run each of the modules. For example, you can run the HELP module to get expanded help on any other module, documenting the input files, output files, and required programs needed for that module to function. The following command displays documentation on the BLAST module:

  help blast

You can either run the modules manually, one at a time, or all at once with the PIPELINE module. Manually, you would probably start with the INIT module, which copies the raw fastq files out of your data repository and into a subdirectory for processing. The following command runs the INIT module, uncompressing (gunzip) the fastq read files from the directory '/lab/repo/E.43/raw/sampleID'. The fastq file(s) containing the first reads are expected to have 'R1' in the file name(s); the second-read fastq file(s) must have 'R2' in the file name(s). The uncompressed files are put in the subdirectory './sampleID/orig', which will be created if necessary.

  init -i /lab/repo/E.43/raw sampleID

(NOTE: there is a space between 'raw' and 'sampleID'.)

As with the other modules, the PIPELINE module is run via the master script. The following command runs the PIPELINE module using fastq files from '/lab/repo/E.51/raw/sampleID'. When the pipeline completes, the generated files (e.g. alignment files, trimmed read files, etc.) are copied to '/lab/repo/E.51/analyzed/sampleID'.

  pipeline -i /lab/repo/E.51/raw -o /lab/repo/E.51/analyzed -p 8 -s mm10 sampleID

Example Setup and Run

In this example, we assume a data repository located in the directory /NGS_DATA containing three samples named S1, S2, and S3. The unaligned read files, as received from the sequencing center, reside in a subdirectory called 'raw' within the repository. Note that each sample is placed in a separate directory using the sample name as the directory name. This is essential for the pipeline to find the relevant files.


An 'analyzed' subdirectory should exist within the repository to store the aligned output.
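Putting these two points together, the repository layout for this example would be:

```
/NGS_DATA
├── raw
│   ├── S1        # unaligned reads, one directory per sample
│   ├── S2
│   └── S3
└── analyzed      # aligned output is copied here by RSYNC
```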


'raw' and 'analyzed' are the default directory locations for the pipeline. You can change these defaults by updating the settings at the top of the master script.

Since the pipeline involves repeated reading and writing of large data files, it is best to run it from a fast local drive (i.e. not an NFS partition). However, it is typically not a problem for the data repository itself to be accessible via NFS, since relatively little data is written to and read from the repository. Begin by moving to a working directory on the local drive:

  cd /scratch/working_directory

Within the working_directory, create a symbolic link to the raw and analyzed directories.

  ln -s /NGS_DATA/raw raw
  ln -s /NGS_DATA/analyzed analyzed

Now run the pipeline from the working_directory. The INIT module will create a subdirectory in the working_directory called 'S1' and copy the uncompressed reads from /NGS_DATA/raw/S1 to /scratch/working_directory/S1/init:

  init S1

After the INIT module has completed, you can run FastQC as follows. This puts the FastQC output into /scratch/working_directory/S1/fastqc:

  fastqc S1

If you now run the RSYNC module, it will copy everything from /scratch/working_directory/S1/* (excluding the init directory) to /NGS_DATA/analyzed/S1/.



Requirements

  • Only tested on Linux (RHEL 6.x). Will likely work on a Mac. May work on Windows with Cygwin. System specs are dictated by the modules used.

Note that the "-P" flag to grep is used in some cases. This works on RHEL 6.x (i.e. the GNU version of grep). On Mac 10.9 (i.e. the BSD version of grep) the "-P" flag is not required. If a module requires the "-P" flag, it should run grep through the GREPP variable, which includes the "-P" flag on Linux machines and excludes it on Darwin (i.e. Mac) machines. The GREPP variable is set in the master script and can be easily adjusted there for other computer systems.
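A portable GREPP setting could be sketched as follows. This is an assumption about how the variable is defined, mirroring the behavior described above, not the pipeline's exact code:

```shell
#!/bin/sh
# Hypothetical sketch of the GREPP variable: include "-P" for GNU grep on
# Linux, omit it for BSD grep on Darwin (Mac).
if [ "$(uname -s)" = "Darwin" ]; then
    GREPP="grep"
else
    GREPP="grep -P"
fi

# Example use: match a literal tab via a Perl-style escape (Linux/GNU grep).
printf 'id\t42\n' | $GREPP 'id\t42'
```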

External Programs Required per Module

Note that the Python scripts included here are hardcoded to use /usr/bin/python. If Python is not installed in /usr/bin, then these files will need to be updated to point to the appropriate Python. Similarly, if /usr/bin/python is older than version 2.7, the scripts will need to be manually updated to point to a newer version of Python.

Resource Locations

The pipeline requires various genome and transcriptome library files. The location of these library files is currently hardcoded in the script. These library locations may need to be adjusted for your environment. If a module is not used, then that library is not required. For example, if the user only plans to run STAR, skipping Bowtie, RUM, and HTSeq, then only the STAR library files are required.

  • BOWTIE_REPO = /lab/repo/resources/bowtie
    • Location of the Bowtie databases.
  • RUM_REPO = /lab/repo/resources/rum2
    • Location of the RUM (version 2) databases.
  • STAR_REPO = /lab/repo/resources/star
    • Location of the STAR databases.
  • HTSEQ_REPO = /lab/repo/resources/htseq
    • Location of the HTSeq databases.

Additional Documentation

Data Repository

When lots of sequencing runs are being managed and processed, the process of storing the data gets complicated. Data Repository outlines a simple scheme for storing NGS data that works well with this PennSCAP-T Pipeline.

Ancillary Files

The Ancillary directory contains files that are not required by any of the modules provided herein, although they may be helpful for preparing files for use with the pipeline and/or processing files after the pipeline has run.

Version Tracking

When a module is run, a file ("sampleID.version") is created in the module subdirectory. The sampleID.version file is a tab-delimited list containing the external programs used by that module and their respective version numbers. This file also includes the location of the species library used by that module, if relevant. See Data Repository for an understanding of the expected directory structure.
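For illustration only, a sampleID.version file for a STAR run might contain entries like these. The program names, version numbers, and library path shown are hypothetical:

```
STAR	2.5.2a
samtools	1.2
star-library	/lab/repo/resources/star/mm10
```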

Log File

All commands are copied to a sample-module-specific log file ("sampleID/log/date_time.module.log"). The name of the log file includes the date and time the module was run as well as the name of the module that was run. For the PIPELINE module, a single log file is generated.

The commands in the log file include time stamps. The following commands can be used to strip out the time stamps and convert the log file into an executable bash script ("") that will reproduce the analysis. Note that the STATS module is not logged, so users running the STATS module do not need write permission to the repository.

  cat sampleID/myDate_myTime_myModule.log | awk -F\\t '{print $2}' >
  chmod u+x

No Warranty

Unless otherwise noted in the individual applications, the following disclaimer applies to all applications provided herein.

There is no warranty to the extent permitted by applicable law. Except when otherwise stated in writing the copyright holders and/or other parties provide these applications "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of these applications, and data is with you. Should these applications or data prove defective, you assume the cost of all necessary servicing, repair or correction.

In no event unless required by applicable law or agreed to in writing will any copyright holder, or any other party who may modify and/or redistribute these applications as permitted above, be liable to you for damages, including any general, special, incidental or consequential damages arising out of the use or inability to use these applications and data (including but not limited to loss of data or data being rendered inaccurate or losses sustained by you or third parties or a failure of these applications and data to operate with any other programs), even if such holder or other party has been advised of the possibility of such damages.


The pipeline was developed by Stephen Fisher and Junhyong Kim at the University of Pennsylvania. Licensing information can be found in each file or here:
