A tool for discovering transposable elements and describing patterns of genome evolution
What is Tephra?
Tephra is a command line application to annotate transposable elements from a genome assembly. The goal is to provide a high quality set of de novo annotations for all transposon types, describe the structure and evolution of those sequences, and do so without a reference set of transposon sequences (and therefore be as unbiased as possible).
Part of the utility of Tephra is to provide family-level TE classifications and infer patterns of molecular evolution. To be as efficient as possible, these tasks require a few external programs. Specifically, you will need to download MUSCLE and add this program to your system PATH. This program is free, but it has a special license so I cannot distribute it. If you are only interested in TE identification and classification, you can skip the installation of this program (it is only used for calculating the insertion age of transposons).
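Before running any age calculations, it can help to confirm that MUSCLE is visible on your PATH. The following is a minimal sketch, not part of Tephra: `have_tool` is a hypothetical helper, and the executable name `muscle` is an assumption (your download may carry a different name, such as one with a version suffix).

```sh
# have_tool: hypothetical helper; succeeds if the named executable is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# 'muscle' is an assumed executable name; adjust it to match your download.
if have_tool muscle; then
    echo "MUSCLE found at: $(command -v muscle)"
else
    echo "MUSCLE not found on PATH; insertion age calculations will not be available."
fi
```

If the check fails after you have downloaded the program, make sure the directory that contains it has been added to PATH in the current shell.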
The following commands will install the core dependencies for Debian-based systems (e.g., Ubuntu):
sudo apt-get install -y -qq build-essential zlib1g-dev unzip
sudo apt-get install -y -qq libncurses5 libncurses5-dev libdb-dev git cpanminus libexpat1 libexpat1-dev
For RHEL-based systems (e.g., CentOS/Fedora):
sudo yum groupinstall -y "Development Tools"
sudo yum install -y perl-App-cpanminus ncurses ncurses-devel libdb-devel expat expat-devel zlib-devel java-1.7.0-openjdk
The next two commands install BioPerl, and these can be skipped if BioPerl is installed:
cpanm Data::Stag DB_File
echo "n" | cpanm -n Bio::Root::Version
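To decide whether the BioPerl step can be skipped, one option is to ask Perl whether it can already load the module. This is a sketch under the assumption that a loadable `Bio::Root::Version` means BioPerl is installed; `has_perl_module` is a hypothetical helper, not part of Tephra or BioPerl.

```sh
# has_perl_module: hypothetical helper; succeeds if Perl can load the named module.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if has_perl_module Bio::Root::Version; then
    echo "BioPerl appears to be installed; the cpanm commands for BioPerl can be skipped."
else
    echo "BioPerl was not found; run the cpanm commands for BioPerl."
fi
```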
Finally, download the latest release and run the following commands from the root directory:
cpanm --installdeps .
perl Makefile.PL
make test
make install
Please note, the above instructions will install Tephra for a single user. If you would like to configure Tephra to be installed for all users on a cluster, you will need to set the TEPHRA_DIR environment variable. For example,
export TEPHRA_DIR=/usr/local/tephra
perl Makefile.PL
make test
make install
will configure the software for all users. Note that if Tephra is configured in a custom location this way, each user must set this variable before using Tephra so that the configuration can be found. A regular user can do this with a single line (the same command used to install and configure Tephra):
export TEPHRA_DIR=/usr/local/tephra
Now you can type any command to see its usage, for example:
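For installations in a custom location, a quick check that the variable is set and points to a real directory can save some confusion. The sketch below is only an illustration: `check_tephra_dir` is a hypothetical helper, not part of Tephra, and `/usr/local/tephra` is simply the example path used above. It does not apply to the default single-user installation, where the variable is not needed.

```sh
# check_tephra_dir: hypothetical helper, not part of Tephra.
# Succeeds only if TEPHRA_DIR is set and points to an existing directory.
check_tephra_dir() {
    if [ -z "${TEPHRA_DIR:-}" ]; then
        echo "TEPHRA_DIR is not set" >&2
        return 1
    fi
    if [ ! -d "$TEPHRA_DIR" ]; then
        echo "TEPHRA_DIR is set to '$TEPHRA_DIR', which is not a directory" >&2
        return 1
    fi
    echo "Using Tephra configuration in $TEPHRA_DIR"
}

# Report the status without aborting the shell session.
check_tephra_dir || echo "Set it with: export TEPHRA_DIR=/usr/local/tephra"
```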
tephra findltrs -h
For developers, please run the tests with:
export TEPHRA_ENV='development' && make test
Please report any test failures or installation issues with the issue tracker.
SUPPORT AND DOCUMENTATION
You can get usage information at the command line with the following command:
The tephra program will also print a diagnostic help message when executed with no arguments, and display the available subcommands.
You can also look for information at:
Tephra wiki: https://github.com/sestaton/tephra/wiki
Tephra issue tracker: https://github.com/sestaton/tephra/issues
Tephra is a command-line program only for now. The command tephra itself controls all the action of the subcommands, which perform specific tasks. Typing the command tephra will show the available commands. Here is an example:
$ tephra

Tephra version 0.12.0

Copyright (C) 2015-2018 S. Evan Staton
LICENSE -- MIT

Citation: Staton, SE. 2018. https://github.com/sestaton/tephra

Name:
     Tephra - A tool for discovering transposable elements and describing patterns of genome evolution

Description:
     This is an application to find transposable elements based on structural and sequence similarity features,
     group those elements into recognized (superfamilies) and novel (families) taxonomic groups,
     and infer patterns of evolution.

-------------------------------------------------------------------------------------------
USAGE: tephra <command> [options]

Available commands:

    age:           Calculate the age distribution of LTR or TIR transposons.
    all:           Run all subcommands and generate annotations for all transposon types.
    classifyltrs:  Classify LTR retrotransposons into superfamilies and families.
    classifytirs:  Classify TIR transposons into superfamilies.
    findfragments: Search a masked genome with a repeat database to find fragmented elements.
    findhelitrons: Find Helitons in a genome assembly.
    findltrs:      Find LTR retrotransposons in a genome assembly.
    findnonltrs:   Find non-LTR retrotransposons in a genome assembly.
    findtirs:      Find TIR transposons in a genome assembly.
    findtrims:     Find TRIM retrotransposons in a genome assembly.
    illrecomb:     Characterize the distribution of illegitimate recombination in a genome.
    maskref:       Mask a reference genome with transposons.
    reannotate:    Transfer annotations from a reference set of repeats to Tephra annotations.
    sololtr:       Find solo-LTRs in a genome assembly.

Most common usage:

    tephra all -c tephra_config.yml

That will produce a FASTA and GFF3 of all intact and fragmented transposons in the genome,
and generate a table of annotation results.

To get the configuration file, run:

    wget https://raw.githubusercontent.com/sestaton/tephra/master/config/tephra_config.yml

To see information about a subcommand, run:

    tephra <command> --help

To get more detailed information, run:

    tephra <command> --man
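The "most common usage" shown in the help output can be wrapped in a small function that refuses to start when the configuration file is missing. This is a sketch rather than part of Tephra: `run_tephra_all` and the `TEPHRA_CMD` override are hypothetical names introduced here, and the suggestion to edit the configuration before running is an assumption about typical use.

```sh
# run_tephra_all: hypothetical wrapper, not part of Tephra.
# Usage: run_tephra_all [config_file]
run_tephra_all() {
    config=${1:-tephra_config.yml}
    if [ ! -f "$config" ]; then
        echo "Configuration file not found: $config" >&2
        return 1
    fi
    # TEPHRA_CMD is a hypothetical override for dry runs (e.g. TEPHRA_CMD=echo).
    ${TEPHRA_CMD:-tephra} all -c "$config"
}

# Typical sequence, commented out so that pasting this block runs nothing:
# wget https://raw.githubusercontent.com/sestaton/tephra/master/config/tephra_config.yml
# (edit tephra_config.yml for your data; this step is an assumption)
# run_tephra_all tephra_config.yml
```

Setting `TEPHRA_CMD=echo` prints the command that would be run instead of running it, which is a cheap way to check the arguments first.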
Typing a subcommand will show the usage of that command, for example:
$ tephra findnonltrs
[ERROR]: Required arguments not given.

Name:
     tephra findnonltrs - Find non-LTRs retrotransposons in a genome assembly.

Description:
     Find non-LTR retrotransposons in a reference genome, classify them into known superfamilies,
     and generate a GFF file showing their locations and properties.

USAGE: tephra findnonltrs [-h] [-m]
    -m --man      :   Get the manual entry for a command.
    -h --help     :   Print the command usage.

Required:
    -g|genome     :   The genome sequences in FASTA format to search for non-LTR-RTs.
    -o|gff        :   The GFF3 outfile to place the non-LTRs found in <genome>.

Options:
    -r|reference  :   The non-masked reference genome for base correction.
    -d|outdir     :   The location to place the results.
    -p|pdir       :   Location of the HMM models (Default: configured automatically).
    -t|threads    :   The number of threads to use for BLAST searches (Default: 1).
    -v|verbose    :   Display progress for each chromosome (Default: no).
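Based on the usage message above, a run with both required arguments might look like the sketch below. The `-g`, `-o`, and `-t` options come from the help text; the wrapper name `find_nonltrs`, the `TEPHRA_CMD` override, and the file names in the commented example are hypothetical.

```sh
# find_nonltrs: hypothetical wrapper that checks the input before calling Tephra.
# Usage: find_nonltrs genome_fasta output_gff3 [threads]
find_nonltrs() {
    if [ ! -s "$1" ]; then
        echo "Genome FASTA missing or empty: $1" >&2
        return 1
    fi
    # -g and -o are required; -t is optional and defaults to 1 per the help text.
    # TEPHRA_CMD is a hypothetical override for dry runs (e.g. TEPHRA_CMD=echo).
    ${TEPHRA_CMD:-tephra} findnonltrs -g "$1" -o "$2" -t "${3:-1}"
}

# Example with hypothetical file names, commented out so nothing runs on paste:
# find_nonltrs genome.fasta genome_nonltrs.gff3 4
```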
A manuscript is in preparation, which includes a description of all the methods and their uses, a comparison to other programs, and results from model systems. These will be provided in some form ahead of publication, as soon as they are available.
For now, please cite the GitHub URL of this repo if you use Tephra. Thank you.
Please check the wiki for progress updates.
I welcome any comments, bug reports, feature requests, or contributions to the development of the project. Please submit a new issue (preferred) or send me an email and I would be happy to talk about Tephra or transposons.
LICENSE AND COPYRIGHT
Part of this project uses code from MGEScan-nonLTR, which is released under the GPL license. With permission of the authors, this code is packaged with Tephra. Below is the copyright for MGEScan-nonLTR:
Copyright (C) 2015. See the LICENSE file for license rights and limitations (GPL v3).

This program is part of MGEScan.

MGEScan is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
The license for Tephra is below:
Copyright (C) 2015-2018 S. Evan Staton
This program is distributed under the MIT (X11) License, which should be distributed with the package. If not, it can be found here: http://www.opensource.org/licenses/mit-license.php