Quick Start

Evan Staton edited this page Oct 19, 2017 · 29 revisions

The first step in using Transposome is to make sure your data is in the expected format and free of contaminants, such as excessive rDNA or organellar sequences, which can inflate estimates of repeat levels in the genome.

Next, get a copy of the Transposome configuration file from the Transposome/config directory. You can also obtain the configuration file from the command line by typing the following:

curl -sL https://git.io/bPVv > transposome_config.yml

Then, edit the file transposome_config.yml and pass it to the transposome program.

Here is an example file:

blast_input:
  - sequence_file:     sunflower_500k_interleaved.fasta
  - sequence_format:   fasta
  - thread:            12
  - output_directory:  sunflower_500k_transposome_PID90_COV55
clustering_options:
  - in_memory:         1
  - percent_identity:  90
  - fraction_coverage: 0.55
annotation_input:
  - repeat_database:  RepBase1801_sunflower_repeats.fasta
annotation_options:
  - cluster_size:     100
  - blast_evalue:     10
output:
  - run_log_file:       sunflower_500k_transposome_run_log.txt
  - cluster_log_file:   sunflower_500k_transposome_cluster_log.txt

If we save this as transposome_config.yml, then we would run transposome as follows:

transposome --config transposome_config.yml

All of the results will be in the output directory specified in the configuration file (there is a more detailed description of the configuration file on the specifications page). Depending on how distantly related the reference set of TEs used for annotation is to your species, it is advisable to adjust the BLAST e-value threshold accordingly (raise it for more distantly related species and lower it for closely related species).

ON NAMING RESULTS

Try to give the results and output directory descriptive names that are easy to tell apart between runs. In the example above, the output name carries some minimal information: the species name, the number of sequence reads, and the run parameters for calculating pairwise matches (percent identity and fraction coverage).
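As a rough sketch of how such names could be generated consistently, the shell function below assembles a label following the pattern of the example configuration. The function name run_label and its argument order are assumptions made for illustration; they are not part of Transposome.

```shell
# Hypothetical helper (not part of Transposome): build a run label from
# the species name, a read-count tag, percent identity, and fraction coverage.
run_label() (
  species=$1 reads=$2 pid=$3 cov=$4
  # Express fraction coverage as a whole percentage, e.g. 0.55 -> 55.
  cov_pct=$(awk -v c="$cov" 'BEGIN { printf "%d", c * 100 + 0.5 }')
  printf '%s_%s_transposome_PID%s_COV%s\n' "$species" "$reads" "$pid" "$cov_pct"
)

# Example call, reusing the values from the configuration above:
#   run_label sunflower 500k 90 0.55
```

The resulting string can be used for the output_directory entry and as a prefix for the two log file names, so that every file from one run shares the same identifier.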

ON CHOOSING THE NUMBER OF READS TO ANALYZE

In the publication, we show that very little is gained from sampling over 1 million reads for most species. It is therefore advised that users start with 100,000 reads (or even fewer) and monitor both the computational resource usage and the annotation results. Transposome runs quickly, so it is easy to add more reads and compare how this influences the annotation results. In many cases, I have observed that analyzing many millions of reads adds no more information than analyzing a few hundred thousand; it just makes the analysis take much longer and use more resources. So, choose the number of reads carefully; you can always add more. It is not a good idea to start with millions of reads without a good reason to do so.
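One simple way to build a small starting set is to take the first N records from a larger FASTA file, as in the sketch below. The helper names first_n_reads and count_reads and the file names in the comment are hypothetical, not part of Transposome. Note also that this takes reads in file order rather than drawing a random sample, so it assumes the read order in the file is not biased; a random subsampling tool may be preferable.

```shell
# Hypothetical helper (not part of Transposome): print the first N FASTA
# records from a file. Use an even N for interleaved paired-end data so
# that no pair is split.
first_n_reads() (
  n=$1 fasta=$2
  # Count header lines and stop as soon as record N+1 begins, so that
  # wrapped (multi-line) sequences are kept intact.
  awk -v n="$n" '/^>/ { if (++seen > n) exit } { print }' "$fasta"
)

# Hypothetical helper: count the records in a FASTA file.
count_reads() (
  grep -c '^>' "$1"
)

# Example (file names are placeholders):
#   first_n_reads 100000 all_reads_interleaved.fasta > sunflower_100k_interleaved.fasta
#   count_reads sunflower_100k_interleaved.fasta
```

Checking the count before each run makes it easy to record the read number in the output name and to step up gradually (for example, 100,000, then 200,000) while comparing the annotation results.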

FURTHER READING

See the section on setting the appropriate thread level for how to choose the optimal values for the thread, cpu, and sequence_num entries in the configuration file. Also, be sure to read the section on creating the correct repeat database format for annotation before getting started.

If you are planning to use Transposome with Illumina data (~100 bp reads), then the default settings are fine. However, if you are going to use long-read data, some special considerations apply; see the running Transposome with long read data section for more details.
