GitHub - srobb1/RelocaTE: Find the locations of TEs using the TSD in unassembled short reads by comparing to a closely related reference genome assembly

RelocaTE-1-0-5

RelocaTE: is a collection of scripts in which short reads (paired or unpaired), a fasta containing the sequences of transposable elements and a reference genome sequence are the input and the output is a series of files containing the locations (relative to the reference genome) of TE insertions in the reference and short reads

non-reference transposable element insertion events that are present in DNA short read data but absent in the reference genome sequence.
reference transposable element insertions that are present in the reference
shared transposable element insertions that are present in the reference and the reads
reference-only transposable element insertions that are present in the reference and no evidence of the insertion in the reads. This could be due to a lack of data. Future releases of RelocaTE will report evidence based reference-only insertions

CharacTErizer is a companion tool that compares the numbers of reads that flank the TE sequence and contain genomic sequence to the number of reads that span a predicted insertion site with no gaps. These spanners contain no TE sequence. The ratio of spanners to flankers is used to classify the insertion as homozygous, heterozygous, or new (somatic). Somatic excision events can also be predicted.

** Updates **

- 1-0-5: 1. fixed bug: incorrect coordinate positions when using TSD=UNK0 2. added -rr relaxed reference align:allow variable sized fulllen matches of TEs (but must be less than 15kb). 3. allow more relaxed alignment of flankers to reference. Reads that do not map uniquely are now used. Finds more inserts but many may be false positives [option -ra 1] - when -ra 1: "bowtie --sam --sam-nohead --sam-nosq -a -m 1 -v 3" is change to "bowtie --sam --sam-nohead --best -a -v 1" - 1-0-4: 1. fixed bug: missing existingTE.blatout file. 2. fixed bug: convert fq to fa even when only one file needs to be converted 3. fixed bug: qsub -W with multiple dependencies now is functioning 4. TSD=UNK, TSD=UKN, and TSD=UNKNOWN will all function 5. added count of seqs that match to unique regions to output of construcTEr.pl - 1-0-3: now able to use TSD=UNK in TE fasta. RelocaTE is faster if TSDs are known. - 1-0-2: Using "bowtie -a -m 1 -v 3". This decreases the number of false positive insertions identified.

###Table of Contents:
#####Prerequisites
#####RelocaTE Command Line Options #####RelocaTE: Quick Start Guide
#####RelocaTE Input Files #####RelocaTE Output Files #####RelocaTE: TIPs
#####CharacTErizer #####Report an Issue

###Prerequisites: These programs need to be installed and the executable should be in your path.

Blat
Bowtie 1
BioPerl
SAMtools
BWA Recommeded for the creation of the BAM file needed by CharacTErizer
Blast (Legacy) formatdb and fastacmd are used for indexed sequence retrieval in an additional companion tool, ConstrucTEr, more info coming soon.

###RelocaTE Command Line Options:

-t  | --te_fasta       Str: TE FASTA File
-d  | --fq_dir         Str: directory of fq files
-g  | --genome_fasta   Str: reference genome fasta file
-e  | --exper          Str: Sample identifier
-o  | --outdir         Str: output directory name
-1  | --mate_1_id      Str: unique mate/pair 1 string
-2  | --mate_2_id      Str: unique mate/pair 2 string
-u  | --unpaired_id    Str: unique unPaired string
-p  | --parallel       0|1: split into many jobs
-a  | --qsub_array     0|1: create PBS array job script
-w  | --workingdir     Str: working directory name
-l  | --len_cutoff       n: min length cutoff for alignment to reference
-bm | --blat_minScore    n: blat minScore for alignment to TE
-bt | --blat_tileSize    n: blat tileSize for alignmetn to TE
-m  | --mismatch         n: fraction (0 or less ex. 0.1) mismatches allowed in blat alignment to TE
-f  | --flanking_seq_len n: length of the insertion site flanking seq to be returned
-r  | --reference_ins  Str: To identify reference and shared insertions (reference and reads) 
                            choose option-1 or option-2. option-1) use '-r 1' to have RelocaTE 
                            find the location of your TE in the reference. option-2) input the 
                            file name of a tab-delimited file containing the coordinates of TE 
                            insertions pre-existing in the reference sequence. [no default]

####-t | --te_fasta Str: TE FASTA File

Required. No default value.

The file name of the fasta file containing the nucleotide sequence of one or many transposable elements. The sequence should include the complete terminal inverted repeats (TIRs) [or LTR] but not include the target site in the sequence proper. The target site should be provided in the description portion of the fasta file in the following format, TSD=xyz. The TSD will be searched for in the both the forward and reverse strand. [[after testing the use of no TSD, write weather TSD= can be used]]

SAMPLE TE FASTA:

>mping TSD=TTA
GGCCAGTCACAATGGGGGTTTCACTGGTGTGTCATGCACATTTAATAGGGGTAAGACTGAATAAAAA
ATGATTATTTGCATGAAATGGGGATGAGAGAGAAGGAAAGAGTTTCATCCTGGTGAAACTCGTCAGC
GTCGTTTCCAAGTCCTCGGTAACAGAGTGAAACCCCCGTTGAGGCCGATTCGTTTCATTCACCGGAT
CTCTTGCGTCCGCCTCCGCCGTGCGACCTCCGCATTCTCCCGCGCCGCGCCGGATTTTGGGTACAAA
TGATCCCAGCAACTTGTATCAATTAAATGCTTTGCTTAGTCTTGGAAACGTCAAAGTGAAACCCCTC
CACTGTGGGGATTGTTTCATAAAAGATTTCATTTGAGAGAAGATGGTATAATATTTTGGGTAGCCGT
GCAATGACACTAGCCATTGTGACTGGCC

The fasta must contain TSD=
This can be a Perl regular expression.

Example: these exact characters TTA: TSD=TTA
Example: any 4 characters: TSD=....
Example: A or T followed by GCC: TSD=(A|T)GCC
Example: CGA followed by any character then an A then CT or G: TSD=CGA.A(CT|G)

####-d | --fq_dir Str: directory of fq files

Required. No default value.

The name of the directory of paired and unpaired fastq files (paired _p1.fq & _p2.fq). Both the .fq and .fastq extensions are accepted as extensions of fastq files. If something different is used RelocaTE will not recognize those files as being fastq files.

####-g | --genome_fasta Str: reference genome fasta file

Optional, Recommended. No default value.

The file name of the fasta file containing the genome sequence of the reference. If it is provided, the locations of the insertion events will be reported relative to the reference genome coordinates.

If the genome sequence is not provided a series of files will be generated. One set will contain the intact reads that align to the TE. The second and third set of files will be made up of trimmed reads. The second set will be only the trimmed portion of the reads in the first set that align to the TE. The third set will contain the trimmed portion of the reads that do not align to the TE, therefore the portion of the reads that should align to the genome sequence not containing a TE insertion.

####-e | --exper Str: Sample identifier

Optional, Recommended. The default value is not.given

A short string for sample name. This string will be used in the output files to create IDs for the insert (ex. A123)

####-o | --outdir Str: output directory name

Optional, Recommended. The default value is outdir_teSearch

A short string for the output directory name. This string will be used to create a directory to contain the output files and directories in the current working directory. The complete path is not required, only the desired name for the directory.

####-1 | --mate_1_id Str: unique mate/pair 1 string

Optional, Recommended. The default value is _p1

A string to identify mate 1 paired files. Should contain the unique text and any text between the unique text and the fq extension. This string will be used in a regular expression to identify the files as a mate 1 file, so the string should not be found in the mate 2 file or the unpaired files

Example:

If the files are named as such: file_1.fq
The string would be:            _1

File:                           file_1.noNumbers.fq
String:                         _1.noNumbers

File:                           file_1_1.fq (and mate = file_1_2.fq)
String:                         _1

Issue with last scenario:       _1 will recognize both mates.
Suggestion:                     rename files to file_1_p1.fq and file_1_p2.fq. 
                                Now the string _p1 can be used to uniquely 
                                identify all _p1 files and no _p2 files.

####-2 | --mate_2_id Str: unique mate/pair 2 string

Optional, Recommended. The default value is _p2

See -1 for a more in depth explanation.

Example:

File:                            file_p2.fq
String:                          _p2

####-u | --unpaired_id Str: unique unPaired string

Optional, Recommended. The default value is .unPaired.

See -1 for a more in depth explanation.

Example:

File:                           file.unParied.fq
String:                         .unParied

####-p | --parallel 0|1: split into many jobs

Optional. Default value is 1.

n is 0 or 1.

0: means only one large job will be run.
1: many shell scripts will be generated for the user to manually run

Break down the single big job of relocaTE into as many smaller jobs as possible. If selected this option will cause the creation of shell scripts which can be manually ran or submitted to a queue. This enables the jobs to be run in parallel. The folders of shell scripts should be run as ordered. Step_1 needs to run and be complete before Step_2 jobs can be proper started. If the genome fasta had already been split and indexed this job will be skipped.

####-a | --qsub_array 0|1: create PBS array job script

Optional. Default value is 1.

n is 0 or 1.

0: Nothing will be done
1: if -p 1 then array jobs will be created

If -p 1 then -a can be 1, otherwise it will be 0. If 1, in addition to the shell scripts generated from -p 1, qsub PBS array job scripts will be made for easier submittion of the shell scripts to the queue. See man qsub option -t for more information.

Submit each array job one at a time, waiting for the previous job to be completed before submitting the next.

See run_these_jobs.sh for the array jobs.

####-w | --workingdir Str: working directory name

Optional. Default value is the current working directory.

If a directory different form the cwd is given it needs to exist, will not create. Provide the full path.

####-l | --len_cutoff n: min length cutoff for alignment to reference

Optional. Default value is 10.

n is a value for the length cutoff. This is the minimum length that a read needs to be after the removal of TE sequence. This trimmed read will be aligned to the genome. When selecting a custom value consider these points:

value can not be greater than the read length
How many bps are needed to limit false alignments to the reference?
How many bps are needed to recognize the TE?
The answer to the above two questions should not total more than the read length.

####-bm | --blat_minScore n: blat minScore for alignment to TE

Optional. Default value is 10.

n is used for the blat minScore value for the comparison of reads to the TE sequence.

Excerpt directly from Blat manual:

-minScore=N sets minimum score. This is the matches minus the mismatches minus some sort of gap penalty.

####-m | --mismatch (n<=0) mismatches allowed in blat alignment to TE

Optional. Default value is 0.

Any number less than or equal to 0.

Fraction of the bps that aligned to the TE that are allowed to not be an exact match. For example, if 10 bp align to the TE and the allowance is 0.1, 1 bp can be a mismatch.

####-bt | --blat_tileSize n: blat tileSize for alignmetn to TE

Optional. Default value is 7.

n is used for the blat tileSize value for the comparison of reads to the the TE sequence.

Excerpt directly from Blat manual:

-tileSize=N sets the size of match that triggers an alignment.
Usually between 8 and 12 Default is 11 for DNA and 5 for protein.

####-f | --flanking_seq_len n: length of the insertion site flanking seq to be returned

Optional. Default value is 100.

n is the length of the sequence flanking the insertion site that will be returned in an output fasta file and in the output gff file. This sequence is taken from the reference genome.

####-r | --reference_ins [1|File] To identify reference and shared insertions (reference and reads)

Optional. No default value.

If a 1 is provided (-r 1) RelocaTE with identify each location of the provided TE in the reference genome.
If a listing of each location has already been calculated, the file name of a tab-delimited file containing the coordinates of existing TE in the reference.
The coordinates of the TE insertions in the reference will be used to determine if there is evidence of each insertion in the reads. The number of reads supporting the start and the end of the insertion will be reported.

If manaully providing the coordinates the required format is two columns, neither column have any white space. The first colum is the TE name. Then a tab separates the first column from the second column. And the second column contains the reference sequence name as described in the reference fasta, the starting bp, a colon, double periods, and finally the ending bp.

SAMPLE Existing TE File:

mping   Chr12:839604..840033
mping   Chr12:1045463..1045892

###Quick Start Guide:

1. Get the sequence of your TE, including the TIRs. Create a fasta file with your sequence, TE name and the TSD. The TSD= is required. With DNA transposons, by definition, during an insertion event the target site is duplicated. Therefore the target site will be used to identify an insertion event. The reverse complement of each read containing portions of the ends of the provided TE will also be searched for the TSD to identify insertion events. A specific sequence of nucleotides can be used or a perl regular expression. For example:

if the TSD is known to be a specific sequence use that sequence:
TSD=TTA
if the desired TSD is TT followed by an A or G followed by a C or GT the regular expression would be:
TSD=TT[AG](C|GT).
Also a very general pattern can be used:
any 4 bp TSD=....

Example TE Fasta File:

>mping TSD=TTA
GGCCAGTCACAATGGGGGTTTCACTGGTGTGTCATGCACATTTAATAGGGGTAAGACTGAATAAAA
AATGATTATTTGCATGAAATGGGGATGAGAGAGAAGGAAAGAGTTTCATCCTGGTGAAACTCGTCA
GCGTCGTTTCCAAGTCCTCGGTAACAGAGTGAAACCCCCGTTGAGGCCGATTCGTTTCATTCACCG
GATCTCTTGCGTCCGCCTCCGCCGTGCGACCTCCGCATTCTCCCGCGCCGCGCCGGATTTTGGGTA
CAAATGATCCCAGCAACTTGTATCAATTAAATGCTTTGCTTAGTCTTGGAAACGTCAAAGTGAAAC
CCCTCCACTGTGGGGATTGTTTCATAAAAGATTTCATTTGAGAGAAGATGGTATAATATTTTGGGT
AGCCGTGCAATGACACTAGCCATTGTGACTGGCC

2. Get the short reads together.

- Check the name of your files. If they are paired they should have some indication of being pair 1 and pair 2.
- Determine the pattern to indicate the 1st and 2nd pair
Example:

Flowcell25_lane1_pair1.fastq and Flowcell25_lane1_pair2.fastq
_pair1 would match the first paired file
_pair2 would match the second paired file

3. Get the reference genome fasta. This will be one file containing all the reference sequences of your organism.
4. If you want to determine if your short reads contain the same TE insertions as the reference, use the '-r 1' option OR, use -r "filename": you can create a file with the TE name and the coordinates in the reference: SAMPLE:

mping   Chr12:839604..840033
mping   Chr12:1045463..1045892

5. Do you have a queue system available to you, or do you have multiple processors? If so, you can select the -p 1 option. This will create a series of shell scripts that you can submit to the queue or run on multiple processors.
- These should be run in order. For example all of step_1 should be run and completed before the shell scripts of step_2 are run, and so on.

6. Do you have a queue system available and is it PBS? If so, you can select the -a 1 option to generate array job scripts. These jobs can be submitted to the queue instead of submitting hundreds of individual shell scripts. The array job will submit the hundreds of scripts for you.
- Make sure to submit the array jobs in order and wait for the job to complete before submitting the next one.

7. You are now ready to run relocaTE.pl with your data. If you run the program without any command line options, it will print out a list of the options and short descriptions.

###RelocaTE Input Files:

Genome Fasta [FASTA format] (not required, but if not used RelocaTE will only identify and trim the reads that contain the specified TE and will not map the insertion locations)
Transposable element FASTA [FASTA format with TSD= in description] (required)
Paired and/or unpaired Fastq files. (ex: reads_p1.fq, reads_p2.fq, reads_unPaired.fq, reads.fq) (required)
Optional: Tab-delimited file with coordinates of TE insertions in the reference genome (not required)

###RelocaTE Output Files:

sample_name.TE_name.all_inserts.gff           
		GFF3 file containing all reference and non-reference insertions

sample_name.TE_name.all_nonref.txt          
		tab-delimited file containing all potential non-reference (insertions 
                found only in reads, absent from reference) insertion sites

sample_name.TE_name.confident_nonref.txt    
		tab-delimited file containing only the confident non-reference
		insertion sites

sample_name.TE_name.confident_nonref_genomeflanks.fa
		FASTA file containing the genome sequence which 
		flanks each confident non-reference site

sample_name.TE_name.confident_nonref_reads_list.txt
		text file containing a list of the reads used to call
		each confident non-reference insertion

sample_name.TE_name.all_reference.txt
		text file containing counts of reads which overlap the reference insertions.

###RelocaTE Tips: If you have a multi-node cluster you can speed up your RelocaTE run by dividing your fastq files into many smaller files. Dividing your reads files into many files of 1 million reads is a nice way to improve the speed of the analysis. A script included in the package can be used to do this. It is called fastq_split.pl.

perl fastq_split.pl 
 Please Provide: 
 -s int The number of sequences per file [1_000_000]
 -o dir The name of the directory to output the files
 followed by a list of files to be split

 Usage:
 ./fastq_split -o split_fq ~/somedir/someRandom.fq
 ./fastq_split -o split_fq ~/somedir/*fq

###CharacTErizer:

usage:
./characterizer.pl [-s relocaTE table output file][-b bam file or dir of bam files][-g reference genome fasta file][-h] 

options:
-s file         relocaTE output file: YOURSAMPLENAME.mping.all_nonref.txt  [no default]
-b dir/file     bam file of the orginal reads aligned to reference (before TE was trimmed) or directory of bam files [no default]
-g file         reference genome fasta file [no default]
-x int          find excision events that leave a footprint, yes or no (1|0) [0]
-h              this help message

For more information see documentation: http://srobb1.github.com/RelocaTE/

It is suggested that the BAM files are generated by using BWA to align the fastq files to the reference genome download Manual.
Example BWA command line

#create bwa index file
bwa index -a bwtsw MSUr7.sample.fa 

#Align Pair 1 fastq
bwa aln MSUr7.sample.fa fq/sample_p1.fq > sample_p1.sai

#align Pair 2 fastq
bwa aln MSUr7.sample.fa fq/sample_p2.fq > sample_p2.sai

#generate SAM for paired reads
bwa sampe MSUr7.sample.fa sample_p1.sai sample_p2.sai fq/sample_p1.fq fq/sample_p2.fq > sample.paired.sam
<<<<<<< .merge_file_H6775A

#align unparied
bwa aln MSUr7.sample.fa fq/sample.unPaired.fq > sample.unPaired.sai

#generate SAM for unpaired reads
bwa samse MSUr7.sample.fa  sample.unPaired.sai fq/sample.unPaired.fq > sample.unPaired.sam

=======

#align unparied
bwa aln MSUr7.sample.fa fq/sample.unPaired.fq > sample.unPaired.sai

#generate SAM for unpaired reads
bwa samse MSUr7.sample.fa  sample.unPaired.sai fq/sample.unPaired.fq > sample.unPaired.sam

>>>>>>> .merge_file_yAfFKy
#generate BAM with SAMtools
samtools view -h -b -S -T MSUr7.sample.fa sample.paired.sam > sample.paired.bam
samtools view -h -b -S -T MSUr7.sample.fa sample.unPaired.sam > sample.unPaired.bam

#combine BAM
samtools cat -o sample.bam sample.unPaired.bam sample.paired.bam 

#sort BAM with SAMtools
samtools sort sample.bam sample.sorted

#index BAM with SAMtools
samtools index sample.sorted.bam

###What does relocaTE.pl actually do?

if not already done, creates a bowtie index for the complete reference fasta.
if not already done, converts the fq files to fa files.
it splits the supplied TE fasta into individual files, one file for each sequence.
if not already done, runs blat: one job for every read file for every TE file. these jobs can be writted to shell scripts if -p 1, and an array job script will be generated if -a 1
if not already done, runs relocaTE_trim.pl: one job for every blat out file. shell scripts and array jobs will be created if -p 1 and -a 1
runs relocaTE_align.pl: aligns the trimmed reads to the reference fasta. a shell script created if -p 1 and -a 1
runs relocaTE_insertionFinder.pl: one job for every TE for every sequence of the reference fasta. shell scripts and array jobs will be created if -p 1 and -a 1.
concatenates the results of each reference sequence into one file: one job for every TE. shell scripts and array jobs will be created if -p 1 and -a 1.

###Report an Issue: For any of the listed reasons, or anything else, please leave us a message here

Is there a bug?
Did you find confusing documentation or a missing topic?
Do you have an awesome idea to share?
Is an improvement needed?

Leave a message here

Name		Name	Last commit message	Last commit date
Latest commit History 194 Commits
sample_relocaTE_run		sample_relocaTE_run
scripts		scripts
LICENSE		LICENSE
README.md		README.md
TSD.len_table.txt		TSD.len_table.txt
master		master

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Contributors 2

Languages

License

srobb1/RelocaTE

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages