TACOCAT version 0.0.1 - A pipeline for de novo transcriptome assembly.
TACOCAT - Trim And COnstruct Combined Assembly Transcriptome
TACOCAT was built to streamline the de novo transcriptome assembly process and create a robust reference for downstream analysis. This pipeline starts with raw paired-end Illumina reads in fastq format, quality trims the reads, assembles the reads using multiple assemblers and different k-mer sizes, and outputs a reference classified into primary and alternate sets of transcripts.
TACOCAT runs as a single command that takes as input paired-end Illumina fastq files, two per species, outputs a number of directories containing intermediate files from each step of the pipeline, and provides the final reference transcriptome classified into primary and alternate transcripts. Classification is based on CDS-dna local alignment identity. Perfect fragment CDS are dropped, those with some CDS base differences are kept, with the longest CDS as primary transcript. Alternates are any alternately spliced transcripts.
INPUT - two fastq files. One file for forward reads and one file for reverse reads. If you have multiple forward and reverse read files from replicate conditions or sequencing runs, combine these together before using the pipeline.
If you want to run the pipeline using multiple processors in parallel, modify
config.yaml to specify the number of processors you wish to use.
Installing Programs and Dependencies
TACOCAT pipeline is written to run on linux using the Snakemake workflow management system and requires the following programs and dependencies to be installed and in the system path:
Brief instructions are given below although users may wish to refer to the installation notes provided with these packages for more detailed instructions.
Sourcecode and documentation can be found at https://bitbucket.org/snakemake/snakemake/src.
Binary sourcecode and documentation can be found at https://www.usadellalab.org/cms/?page=trimmomatic. Trimmomatic requires Java to access and run jar files.
Sourcecode and documentation can be found at https://github.com/trinityrnaseq/trinityrnaseq.
Sourcecode and manual can be found at https://www.ebi.ac.uk/~zerbino/velvet/.
Sourcecode and manual can be found at https://www.ebi.ac.uk/~zerbino/oases/.
Sourcecode can be found at https://sourceforge.net/projects/evidentialgene/. Additional documentation about the program can be found at https://arthropods.eugenes.org/EvidentialGene/. TACOCAT uses
trformat.pl (/scripts/rnaseq/) and
tr2aacds.pl (/scripts/prot/) scripts.
Sourcecode and documentation can be found at https://bowtie-bio.sourceforge.net/index.shtml.
Part of the Exonerate package. Sourcecode and documentation can be found at https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate.
Sourcecode and documentation can be found at https://weizhongli-lab.org/cd-hit/.
Executables are found here ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST (instructions are currently at http://www.ncbi.nlm.nih.gov/guide/howto/run-blast-local/ and in more detail in the BLAST+ user manual). As websites can change, an alternative is to search online for "install BLAST+".
- Instructions are provided for installing BLAST+ on various flavors of linux on the 'Standalone BLAST setup for Unix' page of the BLAST+ Help manual currently at http://www.ncbi.nlm.nih.gov/books/NBK1762/.
- Follow the instructions under "Configuration" in the BLAST+ help manual to add BLAST+ to the PATH environment variable.
Setting up and running TACOCAT
Once the required programs and dependencies have been installed, TACOCAT can be setup and run on the small example data included in the package as follows:
- Save TACOCAT_v0.0.1.tar.gz
- Open a terminal and cd to the directory where TACOCAT_v.0.0.1.tar.gz was saved
tar xzf TACOCAT_v.0.0.1
The command for running TACOCAT on the example dataset is: