TransAnnot is a modular toolkit implemented in C++ and licensed under GPL-3.0. It predicts protein functions, orthologous relationships and biological pathways for a whole newly sequenced transcriptome. It uses the high-performance MMseqs2 sequence-profile search to obtain the closest homologs from profile databases and infers protein function, structure and orthologous groups from the identified homologs. Upon user request, it can first assemble raw sequence reads de novo at the protein level using PLASS (Protein-Level ASSembler) before functional annotation.
Compiling from source optimizes TransAnnot for the specific system, which improves its performance. Compilation requires cmake, g++ and git. After compilation, the TransAnnot binary is located in the build/bin directory.
git clone https://github.com/mariia-zelenskaia/transannot.git
cd transannot && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make -j 4
make install
export PATH=$(pwd)/bin/:$PATH
❗️ If you compile from source under macOS, we recommend installing and using gcc instead of clang as the compiler. gcc can be installed with Homebrew. Force cmake to use gcc as the compiler by running:
CC="$(brew --prefix)/bin/gcc-10" \
CXX="$(brew --prefix)/bin/g++-10" \
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
Other dependencies for compiling from source are zlib and bzip2.
- PLASS: must be installed separately; see the corresponding repository. To perform de novo assembly, PLASS must be installed in the current working directory.
The tmp folder keeps temporary files. By default, all intermediate output files from the different modules are kept in this folder. To clear tmp, pass the --remove-tmp-files parameter.
TransAnnot can be run end to end using the easy module:
transannot easytransannot <inputReads.fastq> Pfam-A.full eggNOG UniProtKB/Swiss-Prot <resDB> <tmp> [options]
If one or more of the target databases are already downloaded in MMseqs2 format, provide the paths to them. Otherwise, simply use their names, and the easy module will download them.
Possible inputs are:
- assembled transcriptomes (obtained e.g. using Trinity) or raw transcriptome reads, which will be de novo assembled at protein level using plass
- metatranscriptomes
- single-organism transcriptomes
- assemblereads: de novo assembles raw sequencing reads into large genomic fragments (contigs).
- annotate: clusters the given input to reduce redundancy and runs sequence-profile and sequence-sequence searches to obtain the closest homologs with annotated function. It also retrieves descriptions of orthologous groups and protein families through mapping.
- createquerydb: creates a database from the sequence space (obtained from the downloaddb module) in a memory-efficient MMseqs2 format.
- downloaddb: downloads databases that serve as a search space for homology detection.
- easytransannot: easy module for a quick start; performs assembly, downloads the databases and executes annotation.
PLASS must be installed before running this step; detailed installation instructions can be found in the PLASS repository. Please make sure PLASS is located in the current working directory.
In this step, reads are assembled with the Protein-Level ASSembler PLASS, and an MMseqs2 database is then created from the assembly. You may skip this step if the transcriptome is already assembled. Usage:
transannot assemblereads <inputReads.fastq[.gz|bz]> ... <inputReads.fastq[.gz|bz]> <o: fastaFile with assembly> <o: seqDB> <tmp> [options]
In this step, sequence databases for homology searches will be downloaded.
To see detailed information about databases, please use the following command:
mmseqs databases -h
Then execute the command below to download a database (use the same keyword as listed by mmseqs databases -h):
transannot downloaddb <selection> <outDB> <tmp> [options]
Since transannot runs 3 searches in the annotate module, this step should be repeated 3 times, once per database. Pfam-A.full and eggNOG (profile databases) and UniProtKB/Swiss-Prot (sequence database) are the standard databases for the annotate module, so please download them using this module. For more information, see the MMseqs2 user guide.
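The three downloads can be scripted. The following is a minimal dry-run sketch, not part of TransAnnot itself: it only prints one downloaddb command per standard database. The helper name print_download_cmds, the db/ output directory and the tmp directory name are assumptions chosen for illustration.

```shell
# Print (do not run) one downloaddb command per standard database.
# Hypothetical choices: output paths under db/ and a tmp directory named "tmp".
print_download_cmds() {
  for db in "Pfam-A.full" "eggNOG" "UniProtKB/Swiss-Prot"; do
    # Replace "/" so the database name can be used as a file name.
    out="db/$(printf '%s' "$db" | tr '/' '_')"
    printf 'transannot downloaddb %s %s tmp\n' "$db" "$out"
  done
}

print_download_cmds
```

After checking the printed commands, they could be executed with something like `mkdir -p db && print_download_cmds | sh`.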
In the annotate module, representative sequences are extracted and used as search input to remove redundancy. Three searches (one sequence-sequence and two sequence-profile) are then performed.
To run the annotate module, execute the following command:
transannot annotate <assembledQueryDB> <path to Pfam profileTargetDB> <path to eggNOG profileTargetDB> <path to SwissProt sequenceTargetDB> <o:resTsvFile> <tmp> [options]
The --simple-output parameter produces simplified output, which includes only the query and target IDs, the header from the target database and the E-value. The standard output additionally contains the sequence identity and bit score for each target sequence. Usage:
transannot annotate $1 $2 $3 $4 $5 $6 --simple-output
When this flag is not set, the standard output is produced.
--min-seq-id sets the minimum sequence identity for the searches. The default value is 0.3.
--no-run-clust performs annotation without clustering, so all input sequences undergo the similarity searches.
The output is a tab-separated .tsv file containing the following columns:
queryID targetID description E-value sequenceIdentity bitScore typeOfSearch nameOfDatabase
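Because the result is plain tab-separated text, it can be post-processed with standard tools. The sketch below keeps only hits under an E-value cutoff and above a sequence-identity cutoff. It assumes the standard (not --simple-output) column order listed above, no header row, and sequence identity stored as a fraction between 0 and 1. The helper name filter_hits and the sample rows are hypothetical.

```shell
# Keep rows with E-value (column 4) <= max and sequence identity (column 5) >= min.
# Usage: filter_hits <resTsvFile> <maxEvalue> <minSeqId>
filter_hits() {
  awk -F '\t' -v max_e="$2" -v min_id="$3" \
    '($4 + 0) <= (max_e + 0) && ($5 + 0) >= (min_id + 0)' "$1"
}

# Hypothetical sample data, for illustration only.
sample="$(mktemp)"
printf 'q1\tt1\thypothetical protein A\t1e-30\t0.85\t210\tseq-prof\tPfam\n' > "$sample"
printf 'q2\tt2\thypothetical protein B\t0.5\t0.31\t40\tseq-seq\tSwissProt\n' >> "$sample"

# Prints only the q1 row.
filter_hits "$sample" 1e-5 0.5
rm -f "$sample"
```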
