Workflow for creating a gene expression matrix (GEM) using resources available on the Palmetto Cluster at Clemson University

PBS-GEM

This workflow contains PBS job wrappers, pre-configured software packages (all open source), and Bash scripts that automate the submission of PBS jobs that perform the following tasks:

  • Download RNA sequencing data in FASTQ format using the SRA Toolkit
  • Trim low-quality reads and Illumina adapter sequences from raw FASTQ files using Trimmomatic
  • Map cleaned reads to a reference genome using Hisat2
  • Quantify RNA transcript abundances using StringTie
  • Parse FPKM values from StringTie output into a Gene Expression Matrix (GEM)

This workflow uses genome annotation files in GFF3 format to quantify transcript abundances as described in the following Nature Protocols article:

Pertea, M., et al. (2016). "Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown." Nat. Protocols 11(9): 1650-1667.

Note that PBS-GEM does not perform transcript assembly, and will only quantify abundances of annotated reference transcripts.

Pre-Workflow User Input

Decompress software

This workflow contains pre-configured software packages. To decompress them for use, execute the initiate script:

$ ./initiate

This will submit a PBS job that decompresses the SRA Toolkit, Trimmomatic, Hisat2, and StringTie packages for use.

Download and Index Reference Genome

The reference genome must be indexed using Hisat2. Download a reference genome in FASTA (.fa) format, and place this file in the Reference directory of the workflow. To index this reference genome, execute the Index-Genome.sh script and provide a reference prefix as an argument:

    $ ./Index-Genome.sh $REF_PREFIX

For example:

    $ ./Index-Genome.sh chr21-GRCh38

Please note that only one .fa genome file can be present in the Reference directory. Please remove the example file, "chr21-GRCh38.fa", before using your own data.

Download GFF3 Genome Annotation

A GFF3 file that corresponds to the reference genome must be placed in the Reference directory. Please check that only one GFF3 file is present.
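
Because the workflow expects exactly one genome file and one annotation file in the Reference directory, a quick pre-flight check can catch leftovers such as the example genome. The following is a minimal sketch, not part of PBS-GEM: the function names `count_by_ext` and `check_reference_dir` are hypothetical, and the `.gff3` extension for the annotation file is an assumption (adjust it if your file ends in `.gff`).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not shipped with the workflow.

# Print the number of regular files in DIR whose names end in .EXT.
count_by_ext() {
    local dir="$1" ext="$2"
    find "$dir" -maxdepth 1 -type f -name "*.${ext}" | wc -l | tr -d ' '
}

# Succeed only if DIR holds exactly one .fa file and exactly one .gff3 file.
# The .gff3 extension is an assumption about how the annotation is named.
check_reference_dir() {
    local dir="${1:-Reference}" status=0 ext n
    for ext in fa gff3; do
        n=$(count_by_ext "$dir" "$ext")
        if [ "$n" -ne 1 ]; then
            echo "Expected exactly one .${ext} file in ${dir}, found ${n}" >&2
            status=1
        fi
    done
    return "$status"
}
```

After sourcing these functions, running `check_reference_dir Reference` from the top of the workflow directory reports any extension with zero or multiple matches.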

Identify SRA sample IDs and modify SRAList.txt file

SRA sample IDs must be specified in the "SRAList.txt" file. Please modify this file to specify the samples that you want to process. Each SRA ID must appear on its own line.
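
A malformed list is easier to fix before any jobs are submitted. The sketch below flags lines that do not look like an accession. It assumes the IDs are run accessions of the form SRR, ERR, or DRR followed by digits, which the README does not state; the function name `validate_sra_list` is hypothetical. Blank lines and trailing whitespace are reported as well, since how the workflow treats them is not documented.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not shipped with the workflow.

# Succeed if every line of FILE looks like a run accession (assumed format:
# SRR/ERR/DRR plus digits). Otherwise print the offending lines with their
# line numbers to stderr and fail.
validate_sra_list() {
    local file="${1:-SRAList.txt}" bad
    # -n: prefix line numbers; -v: keep lines that do NOT match; -E: extended regex
    bad=$(grep -n -v -E '^[SED]RR[0-9]+$' "$file" || true)
    if [ -n "$bad" ]; then
        echo "Lines that do not look like SRA run accessions:" >&2
        echo "$bad" >&2
        return 1
    fi
}
```

Running `validate_sra_list SRAList.txt` prints nothing when every line passes.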

Execute the Workflow

The workflow contains a small reference genome for testing. To run the workflow, execute each step of the pipeline as follows:

Download Input Data

$ ./01-Prepare-inputs.sh

Please note that this script sets the "-X 10000" parameter by default, which downloads only the first 10,000 reads from each sample so that the workflow can be tested quickly. When performing your experiment, remove "-X 10000" from the FASTQ-DUMP.template file in the Templates directory.
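
The template can be edited by hand, or the option can be stripped with `sed`. The following is a sketch under assumptions: the contents of FASTQ-DUMP.template are not shown here, so the pattern simply targets the literal option `-X 10000` wherever it appears, and the function name `strip_read_limit` is hypothetical. A `.bak` copy of the template is kept so the change can be undone.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not shipped with the workflow.

# Remove the "-X 10000" option from TEMPLATE, keeping a backup in TEMPLATE.bak.
strip_read_limit() {
    local template="${1:-Templates/FASTQ-DUMP.template}"
    cp "$template" "${template}.bak" || return 1
    # Match the option plus its value only when followed by whitespace or
    # end of line, so that a value such as 100000 is left untouched.
    sed -E 's/[[:space:]]+-X[[:space:]]+10000([[:space:]]|$)/\1/g' \
        "${template}.bak" > "$template"
}
```

Comparing the result with `diff Templates/FASTQ-DUMP.template.bak Templates/FASTQ-DUMP.template` before submitting jobs confirms that only the intended option was removed.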

Trim Reads

$ ./02-Trim-reads.sh

Map Reads to Reference Genome

$ ./03-Map-reads.sh chr21-GRCh38

When using your own data, please replace "chr21-GRCh38" with the appropriate reference prefix (same as the $REF_PREFIX that you chose when indexing the reference genome).

Quantify Transcript Abundances

$ ./04-Count-transcripts.sh

Build Gene Expression Matrix (GEM)

$ ./05-GEM-parse.sh
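
To illustrate what building a GEM involves, the sketch below merges per-sample abundance tables into a single gene-by-sample matrix of FPKM values. It is not the code in 05-GEM-parse.sh. It assumes each sample has a tab-delimited gene abundance table of the kind StringTie writes with its `-A` option (one header line, gene ID in column 1, FPKM in column 8), and that the file name without its extension is the sample name. The function name `build_gem` and the `.tab` file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; illustrates the idea, not the workflow's own parser.

# build_gem FILE... : print a tab-delimited matrix, one row per gene and one
# column per input file. Genes missing from a sample are reported as NA.
build_gem() {
    awk -F'\t' '
        FNR == 1 {                       # header line of each input file
            n = split(FILENAME, parts, "/")
            sample = parts[n]
            sub(/\.[^.]*$/, "", sample)  # drop the extension to get the sample name
            samples[++ns] = sample
            next
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; genes[++ng] = $1 }
            fpkm[$1 SUBSEP sample] = $8  # a repeated gene ID keeps its last value
        }
        END {
            printf "gene_id"
            for (s = 1; s <= ns; s++) printf "\t%s", samples[s]
            printf "\n"
            for (g = 1; g <= ng; g++) {
                printf "%s", genes[g]
                for (s = 1; s <= ns; s++) {
                    key = genes[g] SUBSEP samples[s]
                    value = (key in fpkm) ? fpkm[key] : "NA"
                    printf "\t%s", value
                }
                printf "\n"
            }
        }
    ' "$@"
}
```

For example, `build_gem sampleA.tab sampleB.tab > GEM.txt` would write a matrix with the columns `gene_id`, `sampleA`, and `sampleB`, with rows in the order genes are first encountered.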

Comments/Notes

With full datasets, each step of this workflow can take several hours. Please be sure that all PBS jobs from one step have finished before moving on to the next step. A "Logs" directory will be created upon initiation of the workflow. Please inspect all log files for errors.
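
Both checks can be scripted. The sketch below is an assumption-laden convenience, not part of PBS-GEM: `logs_with_errors` lists log files containing the word "error" in any letter case (real failures may be worded differently, so an empty result does not guarantee success), and `jobs_finished` assumes that `qstat -u` prints nothing when the user has no queued or running jobs. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not shipped with the workflow.

# Print the paths of files under DIR that contain "error" (case-insensitive).
logs_with_errors() {
    local dir="${1:-Logs}"
    grep -r -i -l "error" "$dir" 2>/dev/null | sort
}

# Succeed when qstat lists no jobs for the current user.
# Returns 2 if qstat is not available on this machine.
jobs_finished() {
    command -v qstat >/dev/null 2>&1 || { echo "qstat not found" >&2; return 2; }
    [ -z "$(qstat -u "${USER:-$(id -un)}")" ]
}
```

A typical use between steps would be `jobs_finished && logs_with_errors Logs`, proceeding to the next script only when the first command succeeds and the second prints nothing.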