RetroSeq Tutorial

RetroSeq is a tool for discovery and genotyping of transposable element variants (TEVs) (also known as mobile element insertions) from next-gen sequencing reads aligned to a reference genome in BAM format. The goal is to call TEVs that are not present in the reference genome but present in the sample that has been sequenced. It should be noted that RetroSeq can be used to locate any class of viral insertion in any species where whole-genome sequencing data with a suitable reference genome is available.

If you want to find if a transposable element that is present in the reference genome is not found in your sample (a deletion in structural variation terminology), then you should use a structural variation deletion caller such as Breakdancer.

RetroSeq is a two phase process, the first being the read pair discovery phase where discorandant mate pairs are detected and assigned to a TE class (Alu, SINE, LINE, etc.) by using either the annotated TE elements in the reference and/or aligned with Exonerate to the supplied library of viral sequences.

Software Requirements

RetroSeq requires that samtools, windowBED, and optionally Exonerate are in the path. It will check for these and verify the version is sufficiently recent.

NOTE: RetroSeq has been primarily tested on BAM files produced by MAQ and BWA. There is no guarantee it will work on BAM files from other aligners.

1 Discovery Phase

The goal here is to pass through the BAM and identify discordant read pairs that might support a TE insertion. You can either supply a tab delimited file specifying a set of TE types (e.g. Alu, LINE etc.) and the corresponding BED file of locations where these are in the reference genome (-refTEs parameter). Alternatively, you can provide a tab delimited file specifying a set of viral/TE types and the corresponding fasta file with a set of consensus sequences for these (-eref parameter).

RetroSeq: A tool for discovery and genotyping of transposable elements from short read alignments

Version: 1.2 Author: Thomas Keane (thomas.keane@sanger.ac.uk)

Usage: retroseq.pl -discover -bam -eref -output [-srmode] [-q ] [-id ] [-len -noclean]

-bam        BAM file of paired reads mapped to reference genome
-output     Output file to store candidate supporting reads (required for calling step)
[-eref      Tab file with list of transposon types and the corresponding fasta file of reference sequences (e.g. SINE   /home/me/refs/SINE.fasta). Required when the -align option is used.]
[-refTEs    Tab file with TE type and BED file of reference elements. These will be used to quickly assign discordant reads the TE types and avoid alignment. Using this will speed up discovery dramatically.]
[-srmode   Search for split reads in the BAM file]
[-minclip]  Minimum length of soft clippped portion of read to be considered for split-read analysis. Default is 30bp.
[-noclean   Do not remove intermediate output files. Default is to cleanup.]
[-q         Minimum mapping quality for a read mate that anchors the insertion call. Default is 30. Parameter is optional.]
[-id        Minimum percent ID for a match of a read to the transposon references. Default is 90.]
[-len       Minimum length of a hit to the transposon references. Default is 36bp.]
[-rgs       Comma separated list of readgroups to operate on. Default is all.]
[-exd       Fofn of BED files of regions where discordant mates falling into will be excluded e.g. simple repeats, centromeres, telomeres]
[-align     Do the computational expensive exonerate PE discordant mate alignment step]

The discovery stage will produce an output file (specified by the -output parameter) with lists of read pair names per TE type. Multiple discovery stage outputs can be inputted to the calling phase (i.e. to allow parallelisation of the discovery phase).

2 Calling

The calling phase takes one or more outputs from the discovery phase, clusters the reads, and carries out various checks on the breakpoints to decide if a TEV is present. You can provide a list of locations to ignore per TE type - this would typically be the list of locations of the reference elements of that type (-filter option).

RetroSeq: A tool for discovery and genotyping of transposable elements from short read alignments

Version: 1.2 Author: Thomas Keane (thomas.keane@sanger.ac.uk)

Usage: retroseq.pl -call -bam -input -ref -output [-srinput -filter -cleanup -reads -depth ]

-bam            BAM file OR BAM fofn
-input          Either a single output file from the PE discover stage OR a prefix of a set of files from discovery to be combined for calling OR a fofn of discovery stage output files
-ref            Fasta of reference genome
-output         Output file name (VCF)
[-srinput       Either a single output from split-read discovery stage OR a prefix of a set of files to be combined for calling (will trigger split-read calling to run)]
[-orientate     Attempt to predicate the orientation of the calls. Default is no.]
[-filter        Tab file with TE type and BED file of reference elements. These will be filtered out from the calling.]
[-region        Call a particular chromosome only (chr) OR region (chr:start-end) only]
[-depth         Max average depth of a region to be considered for calling. Default is 200.]
[-reads         It is the minimum number of reads required to make a call. Default is 5. Parameter is optional.]
[-q             Minimum mapping quality for a read mate that anchors the insertion call. Default is 30. Parameter is optional.]
[-ignoreRGs     Single read group name OR a file of readgroups that should be ignored. Default is none.]
[-novel]    Call novel sequence insertions also (from non-TE mate pairs)
[-noclean       Do not remove intermediate output files. Default is to cleanup.]

Interpreting the output

The final TE calls from RetroSeq are in VCF format (http://vcftools.sourceforge.net/). The calls are annotated with information on number of supporting reads (GQ tag). The FL tag ranges from 1-8 and gives information on the breakpoint with 8 being the most confident calls and lower values indicating calls that don’t meet the breakpoint criteria for reasons such as lack of 5’ or 3’ reads. The example in Fig. 1 shows two human Alu calls, one with a GQ of 15 and FL of 5 and the other has a GQ of 37 and FL of 8 (highly confident).

##source=RetroSeq v1.2
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY">
##ALT=<ID=INS:ME,Type=String,Description="Insertion of a mobile element">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype quality">
##FORMAT=<ID=FL,Number=1,Type=Integer,Description="Call Status - for reference calls a flag to say if the call failed a particular filter. Filters are ordered by priority in calling (higher number indicates closer to being called). 1 - depth too high in region, 2 - not enough reads in cluster, 3 - not enough total flanking reads, 4 - not enough inconsistently mapped reads, 5 - neither side passes ratio test, 6 - one side passes ratio test, 7 - distance too large at breakpoint, 8 - PASSED all filters">
##INFO=<ID=NOT_VALIDATED,Number=0,Type=Flag,Description="Not validated experimentally">
##INFO=<ID=1000G,Number=0,Type=Flag,Description="Overlaps with 1000G MEI call">
##INFO=<ID=REPEATMASKER,Number=0,Type=Flag,Description="Overlaps with a reference ME element">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891
10      135035407       .       G       <INS:ME>        15      .       MEINFO=Alu,135035407,135035408,NA;NOT_VALIDATED;SVTYPE=INS      GT:GQ:FL        1/1:15:5
11      98774710        .       A       <INS:ME>        37      .       MEINFO=Alu,98774710,98774711,NA;NOT_VALIDATED;SVTYPE=INS        GT:GQ:FL        1/1:37:8

Example usage on human data

To run RetroSeq on a human sample you will need:

BAM file of you reads mapped to the human reference
BED files of the locations in the reference of the TE types you want to call (1)
Optional file of BED file names with regions to ignore (e.g. telomeres, centromeres, reference gaps etc) (2)
Reference fasta file

(1) The BED files should be listed together in a single tab delimited file giving the TE type and BED location e.g. types.tab

Alu    /home/me/data/Alu.bed
L1    /home/me/data/L1.bed

(2) A file of file names where each file is a BED file with regions to ignore e.g. exclude_regions.fofn

hg19.fa.satellites.bed
hg19.fa.simple_repeat.bed
hg19.fa.centromeres.bed

Command 1 - read pair discovery

perl retroseq.pl -discover -bam mysample.bam -output mysample.candidates.tab -refTEs types.tab -exd exclude_regions.fofn

Command 2 - TE calling

perl retroseq.pl -call -bam mysample.bam -input mysample.candidates.tab -ref hg19.fa -output mysample.out -filter types.tab -reads 10 -depth 400

This will produce a VCF file calls mysample.out.PE.vcf with the final calls. Remember to filter the final VCF file for proximity to reference copies of Alu and L1. It is suggested to filter out any calls less than the average fragment size of the sequencing libraries.

Provide feedback

Saved searches