The method in this article is based on the Tutorials: "Ageing LTR insertions"(https://github.com/SIWLab/Lab_Info/wiki/Ageing-LTR-insertions). However, many parts of the Tutorials are brief introductions, and the provided script has also become invalid, so I can only rewrite the script by myself.
First, installing the software that will be used later. apt or conda are recommanded.
sudo apt install bedtools
sudo apt install genometools
Finding full LTR-RT by LTRharvest. Say the genome file is called EGL.fa
#LTRharvest
gt suffixerator \
-db EGL.fa \
-indexname EGL \
-tis -suf -lcp -des -ssp -sds -dna
gt ltrharvest \
-index EGL \
-similar 90 -vic 10 -seed 20 -seqids yes \
-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 \
-motif TGCA -motifmis 1 > EGL.harvest.scn &
After searching for LTR with LTRharvest, get the LTR annotation file EGL.gff Among them, there is the line of long_terminal_repeat where the LTR is annotated. Use this line of information to extract the LTR pair, and generate a file for each pair.
But The result of LTRharvest has one inconvenient feature. It will replace the original sequence name with seq0, seq1, seq2... etc. Before extraction, first restore the seq in first column of gff to the original chromosome name, delete the comment line with # in front of it. Add "###" to the first line for awk recognition For example, the original EGL.gff is something like this:
##gff-version 3
##sequence-region seq0 1 67484389
##sequence-region seq1 1 55950575
##sequence-region seq2 1 55030165
##sequence-region seq3 1 44624142
##sequence-region seq4 1 50342496
##sequence-region seq5 1 42457113
##sequence-region seq6 1 55595191
##sequence-region seq7 1 55557465
##sequence-region seq8 1 54465677
#chr01
#chr02
#chr03
#chr04
#chr05
#chr06
#chr07
#chr08
#chr09
seq0 LTRharvest repeat_region 4380 13346 . ? . ID=repeat_region1
seq0 LTRharvest target_site_duplication 4380 4383 . ? . Parent=repeat_region1
seq0 LTRharvest LTR_retrotransposon 4384 13342 . ? . ID=LTR_retrotransposon1;Parent=repeat_region1;ltr_similarity=86.67;seq_number=0
seq0 LTRharvest long_terminal_repeat 4384 4514 . ? . Parent=LTR_retrotransposon1
seq0 LTRharvest long_terminal_repeat 13208 13342 . ? . Parent=LTR_retrotransposon1
seq0 LTRharvest target_site_duplication 13343 13346 . ? . Parent=repeat_region1
###
seq0 LTRharvest repeat_region 28572 31425 . ? . ID=repeat_region2
seq0 LTRharvest target_site_duplication 28572 28575 . ? . Parent=repeat_region2
seq0 LTRharvest LTR_retrotransposon 28576 31421 . ? . ID=LTR_retrotransposon2;Parent=repeat_region2;ltr_similarity=93.89;seq_number=0
seq0 LTRharvest long_terminal_repeat 28576 28886 . ? . Parent=LTR_retrotransposon2
seq0 LTRharvest long_terminal_repeat 31113 31421 . ? . Parent=LTR_retrotransposon2
seq0 LTRharvest target_site_duplication 31422 31425 . ? . Parent=repeat_region2
Before extraction, first restore the seq in first column of gff to the original chromosome name, delete the comment line with # in front of it. Add "###" to the first line for awk recognition. After modified:
###
chr01 LTRharvest repeat_region 4380 13346 . ? . ID=repeat_region1
chr01 LTRharvest target_site_duplication 4380 4383 . ? . Parent=repeat_region1
chr01 LTRharvest LTR_retrotransposon 4384 13342 . ? . ID=LTR_retrotransposon1;Parent=repeat_region1;ltr_similarity=86.67;seq_number=0
chr01 LTRharvest long_terminal_repeat 4384 4514 . ? . Parent=LTR_retrotransposon1
chr01 LTRharvest long_terminal_repeat 13208 13342 . ? . Parent=LTR_retrotransposon1
chr01 LTRharvest target_site_duplication 13343 13346 . ? . Parent=repeat_region1
###
chr01 LTRharvest repeat_region 28572 31425 . ? . ID=repeat_region2
chr01 LTRharvest target_site_duplication 28572 28575 . ? . Parent=repeat_region2
chr01 LTRharvest LTR_retrotransposon 28576 31421 . ? . ID=LTR_retrotransposon2;Parent=repeat_region2;ltr_similarity=93.89;seq_number=0
chr01 LTRharvest long_terminal_repeat 28576 28886 . ? . Parent=LTR_retrotransposon2
chr01 LTRharvest long_terminal_repeat 31113 31421 . ? . Parent=LTR_retrotransposon2
chr01 LTRharvest target_site_duplication 31422 31425 . ? . Parent=repeat_region2
mkdir ltr.seq
cd ltr.seq
gff_name=AER0102.fasta.mod.harvest.combine.gff ###enter your edited gff name.
cp ../${gff_name} .
ulimit -n 100000 #Lift the limit on the number of files, otherwise an error will be reported
awk '/###/{filename=NR".txt"}; /long_terminal_repeat/{gsub(/Parent=/,"");print $1,$4,$5,$9 >filename}' OFS="\t" ${gff_name} #generate ltr bed files of each LTR-RT
ls | grep txt | sort -n > file.list #generate list of bed file
The goal is to use bedtools to extract each pair of LTR, and then use muscle to align, write python script here to generate bash: ###This script can be run under linux or windows system with python. If you are using windows OS, please remeber to change line endings form "CF LF" to "LF"!!! ##Please replace "ensete_glaucum.assembly.fna" with your own geneme file name.
#!/bin/usr/python
filelistname = 'F:/sequence/EGL_hub/ltr.seq/file.list' #Path to file.list
fileresult = filelistname+".sh"
f = open(fileresult , "w")
with open(filelistname,'r',newline="\n") as r:
lines=r.read().splitlines()
i = 0
for l in lines:
i= i+1
f.write("bedtools getfasta -fi ../genome.fa -bed " + l + " -fo retroltr" + str(i) +".fa -name"+ '\n') #extract LTR fasta file, the genome file name should be change.
f.write("muscle -in retroltr" + str(i) +".fa -out retroltr" + str(i) +".fa.afa" + '\n') #do alignment of each pairs of LTRs
Afer running this, a file name file.list.sh should be generated. And run it by
cd ltr.seq
bash file.list.sh
Put all the alignment results (end with .afa) into a new folder. Here we use the R package ape for calculation:
mkdir ../ltr.alignment
cp *afa ../ltr.alignment
enter the R and install ape package
R
install.packages("ape")
Use ape package to calculate the insetion time based on "T = K/(2r)"
library(ape)
library(xlsx)
setwd('F:/sequence/ltr_time/ltr.alignment') ###The path should be changed
list <- list.files()
fas.F1 = read.FASTA(list[1])
mat1 = dist.dna(fas.F1,as.matrix = T, model = "K80")
merge.data = mat1[1,2]
mutate_rate <- 1.3e-8 #according to rice mutation rate described in Ma, 2004
time1 = merge.data/(2*mutate_rate)
v1.merge = c(list[1],merge.data, time)
dir = paste("./",list,sep="")
n = length(dir)
for (i in 2:n){
fas.F.new = read.FASTA(list[i])
mat.new = dist.dna(fas.F.new, as.matrix = T, model = "K80")
time = mat.new[1,2]/(2*mutate_rate)
v.new = c(list[i], mat.new[1,2], time)
v1.merge = rbind(v1.merge, v.new)
}
write.csv(v1.merge,file = '../Insert_time.csv')