The first step to using Transposome is to make sure your data is in the expected format and free of contaminants, such as excessive rDNA or organellar sequences, which can inflate estimates of repeat levels in the genome.
Next, get a copy of the Transposome configuration file in the Transposome/config directory. You can also obtain the configuration file from the command line by typing the following:
curl -sL https://git.io/bPVv > transposome_config.yml
Then edit the file transposome_config.yml and pass it to the program.
Here is an example file:
blast_input:
  - sequence_file:      sunflower_500k_interleaved.fasta
  - sequence_format:    fasta
  - thread:             12
  - output_directory:   sunflower_500k_transposome_PID90_COV55
clustering_options:
  - in_memory:          1
  - percent_identity:   90
  - fraction_coverage:  0.55
annotation_input:
  - repeat_database:    RepBase1801_sunflower_repeats.fasta
annotation_options:
  - cluster_size:       100
  - blast_evalue:       10
output:
  - run_log_file:       sunflower_500k_transposome_run_log.txt
  - cluster_log_file:   sunflower_500k_transposome_cluster_log.txt
If we save this as transposome_config.yml, then we would run transposome as follows:
transposome --config transposome_config.yml
All of the results will be in the output directory specified in the configuration file (a more detailed description of the configuration file is on the specifications page). Depending on how distant the reference set of TEs used for annotation is from the species being analyzed, it is advisable to adjust the BLAST e-value threshold accordingly: raise it for more distantly related species and lower it for closely related ones.
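One way to try different e-value thresholds without editing the file by hand is to write a modified copy of the configuration for each run. The following is a minimal sketch, not part of Transposome: the helper name set_config_value is hypothetical, and it assumes each option sits on its own line in the form "- key: value", as in the example file above.

```shell
# Hypothetical helper (not part of Transposome).
# Usage: set_config_value KEY VALUE < in.yml > out.yml
# Rewrites the value on the line "- KEY: ..." and copies every
# other line through unchanged.
set_config_value() {
    sed -E "s|^([[:space:]]*-[[:space:]]*$1:[[:space:]]*).*|\1$2|"
}
```

For example, set_config_value blast_evalue 1e-5 < transposome_config.yml > transposome_config_evalue.yml would write a copy with a lower threshold. Both the value 1e-5 and the output file name are illustrative placeholders, not recommendations.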
ON NAMING RESULTS
Try to give the results and output directory descriptive identifiers that are easy to tell apart between runs. The example above uses a minimal amount of information to name the output: the species name, the number of sequence reads, and the run parameters for calculating pairwise matches.
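The naming pattern used in the example can be built from variables so that it stays consistent between runs. This is only a sketch of one possible convention; the function name make_run_name is hypothetical, and the coverage is written as a percentage (55 for a fraction_coverage of 0.55) to match the example directory name.

```shell
# Hypothetical helper (not part of Transposome).
# Usage: make_run_name SPECIES READS PERCENT_IDENTITY COVERAGE_PERCENT
make_run_name() {
    printf '%s_%s_transposome_PID%s_COV%s\n' "$1" "$2" "$3" "$4"
}

# Build the output directory name used in the example configuration.
run_name=$(make_run_name sunflower 500k 90 55)
echo "output_directory: ${run_name}"
```

The resulting name can then be copied into the output_directory entry of the configuration file.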
ON CHOOSING THE NUMBER OF READS TO ANALYZE
In the publication, we show that very little is gained from sampling over 1 million reads for most species. Users are therefore advised to start with 100,000 reads (or even fewer) and to monitor both the computational resource usage and the annotation results. Transposome runs fairly fast, so it is easy to add more reads and compare how this influences the annotation results. In many cases, I have observed that analyzing many millions of reads adds no more information than analyzing a few hundred thousand; it just makes the analysis take much longer and use more resources. So choose the number of reads carefully; you can always add more. It is not a good idea to start with millions of reads without a good reason to do so.
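A small input file for a first run can be made with standard tools. The sketch below is not part of Transposome, and the function names are hypothetical. It takes the first N records of a FASTA file, which is not a random sample, so it assumes the reads are in no particular order in the file. If the file is interleaved, as the example file name suggests, choose an even N so that read pairs stay together.

```shell
# Hypothetical helpers (not part of Transposome).

# count_reads FILE: print the number of FASTA records in FILE.
count_reads() {
    grep -c '^>' "$1"
}

# take_first_reads N FILE: print the first N FASTA records of FILE.
# Records with sequences wrapped over several lines are handled,
# because only header lines (starting with ">") are counted.
take_first_reads() {
    awk -v n="$1" '/^>/ { if (++count > n) exit } { print }' "$2"
}
```

For example, take_first_reads 100000 sunflower_500k_interleaved.fasta > sunflower_100k_interleaved.fasta would write the first 100,000 reads to a new file (the output file name is illustrative), and count_reads on the new file can confirm the number of reads.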
See the section on setting the appropriate thread level for how to choose the optimal values for the sequence_num entries in the configuration file. Also, be sure to read the section on creating the correct repeat database format for annotation before getting started.
If you are planning to use Transposome for Illumina data (~100 bp reads), then the default settings are fine. However, if you are going to use long-read data, some special considerations apply; see the running Transposome with long read data section for more details.