Yoshiki Tanaka wrote the original code.
1.0.0
cycle_finder is a tool for detecting tandem repeat and interspersed repeat from short reads. The execusion procedures are as follows:
- Extracting high frequency k-mer from short reads.
- Finding cycle from de Bruijn graph and detect tandem repeats.
- Clustering and copy estimating from detected tandem repeats.
- Finding path from de Bruijn graph that removed cycle and detect interspersed repeats.
- Clustering and copy estimating from detected interspersed repeats.
conda install -c bioconda -c conda-forge cycle_finder
tar zxfv cycle_finder_<version>.tar.gz
cd cycle_finder_<version>
make
cp cycle_finder <installation_path>
cycle_finder all -f <SHORT_READS>.fastq
cycle_finder all -f1 <SHORT_READS1.fastq> -f2 <SHORT_READS2.fastq>
# Get the test dataset, which includes simulated reads from repeats-inserted E. coli genomes.
wget https://github.com/rkajitani/cycle_finder/releases/download/v1.0.0/test_data.tar.gz
tar xzfv test_data.tar.gz
# Test for tandem repeats.
# The reads, genome, tandem repeat unit (253 bp; mutation rate, 2%), and correct result are
# reads.fq, genome.fa, rep_unit.fa, and result/*, respectively.
cd test_data/tandem/
bash cmd.sh
# Output: out_T.fa out_T.tsv
cd ../..
# Test for interspersed repeats.
# The reads, genome, interspersed repeat unit (253 bp; mutation rate, 2%), and correct result are
# reads.fq, genome.fa, rep_unit.fa, and result/*, respectively.
cd test_data/interspersed/
bash cmd.sh
# Output: out_I.fa out_I.tsv
-
GCC
-
OpenMP
-
Jellyfish (>= 2.2.6)
G. Marçais and C. Kingsford, “A fast, lock-free approach for efficien parallel counting of occurrences of k-mers,” Bioinformatics, vol. 27 no. 6, pp. 764–770, 2011. https://academic.oup.com/bioinformatics/article/27/6/764/234905 -
TRF (Tandem Repeat Finder; >= 4.07b)
G. Benson, “Tandem repeats finder : a program to analyze DNA sequence,” vol. 27, no. 2, pp. 573–580, 1999. -
BLAST+ (>= 2.2.31+)
S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of proten database search programs,” Nucleic Acids Res., vol. 25, no. 17 pp. 3389–3402, 1997. -
CD-HIT (>= 4.6)
W. Li, L. Fu, B. Niu, S. Wu, and J. Wooley, “Ultrafast clustering algorithm for metagenomic sequence analysis,” Brief. Bioinform., vol. 13, no. 6, pp 656–668, 2012.
-t INT : Number of threads (default 1)
-o STR : Prefix of output files (default out)
Extracting high frequency k-mer from short reads.
-f FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-f1 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-f2 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-c1 INT : copy difference for detecting repeat
-c2 INT : copy difference for copy estimation
-jf FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count.
-dm FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump.
-hs FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo.
-jf1 FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count(corresponds to FILE1).
-jf2 FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count(corresponds to FILE2).
-dm1 FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump(corresponds to FILE1).
-dm2 FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump(corresponds to FILE2).
-hs1 FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo(corresponds to FILE1).
-hs2 FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo(corresponds to FILE2).
PREFIX_for_detect${c1}.fa : high frequency k-mer for detecting repeat
PREFIX_for_estimate${c2}.fa : high frequency k-mer for copy estimating
PREFIX_kmer_peak : k-mer peak
PREFIX.jf /PREFIX.jf1 / PREFIX.jf2 : jf file output by jellyfish
PREFIX.dm / PREFIX.dm1 / PREFIX.dm2 : dump file output by jellyfish
PREFIX.hs / PREFIX.hs1 / PREFIX.hs2 : histogram of k-mer frequency
Finding cycle from de Bruijn graph and detect tandem repeats.
-f FILE : high frequency k-mer for detecting repeat
-r FASTQ : short read fastq file(only one file)
-c INT : single mode->1 compare mode->2
-l INT : threshold of detecting repeat length
-n INT : threshold of searching nodes
-d INT : threshold of searching depth
-p INT : k-mer peak (single mode)
-p1 INT : k-mer peak (compare mode corresponds to FILE1)
-p2 INT : k-mer peak (compare mode corresponds to FILE2)
-rc FLOAT : Down sampling coverage (default 0.5)
PREFIX_T_repeat_num : temporal copy estimation
PREFIX_T_repeat_num_min : temporal minimum copy estimation
Clustering and copy estimating from detected repeat.
-f1 FILE : high frequency k-mer for copy estimating
-f2 FILE : temporal copy estimation
-f3 FILE : temporal minimum copy estimation
-c INT : single mode->1 compare mode->2
-p INT : k-mer peak (single mode)
-p1 INT : k-mer peak (compare mode corresponds to FILE1)
-p2 INT : k-mer peak (compare mode corresponds to FILE2)
-m INT : threshold mismatch of kmer alignment for copy estimation (default 0)
PREFIX_T_blst.blastn : k-mer alignment file
PREFIX_T.fa : detected tandem repeat
PREFIX_T.tsv : copy esitimation
PREFIX_T_clst.fa : fasta file of detected tandem repeats
PREFIX_I.fa : detected interspersed repeat
PREFIX_I.tsv : copy estimation
PREFIX_I_clst.fa : fasta file of detected tandem repeats
Finding path from de Bruijn graph that removed cycle and detect interspersed repeats.
-f FILE : high frequency k-mer for detecting repeat
-b FILE : alignment file (output from cycle command)
-r FASTQ : short read fastq file(only one file)
-c INT : single mode->1 compare mode->2
-L INT : threshold of detecting repeat length
-N INT : threshold of searching nodes
-D INT : threshold of searching depth
-p INT : k-mer peak (single mode)
-p1 INT : k-mer peak (compare mode corresponds to FILE1)
-p2 INT : k-mer peak (compare mode corresponds to FILE2)
-rc FLOAT : Down sampling coverage (default 0.5)
PREFIX_for_detect${c1}.fa_no_cycle : high frequency k-mer for detect without cycle
PREFIX_I_repeat_num : temporal copy estimation
PREFIX_I_repeat_num_min : temporal minimum copy estimation
Run whole pipeline:
extract -> cycle -> cluster -> intersperse > cluster
-f FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-f1 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-f2 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-c1 INT : copy difference for detecting repeat
-c2 INT : copy difference for copy estimation
-jf FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count.
-dm FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump.
-hs FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo.
-jf1 FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count(corresponds to FILE1).
-jf2 FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count(corresponds to FILE2).
-dm1 FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump(corresponds to FILE1).
-dm2 FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump(corresponds to FILE2).
-hs1 FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo(corresponds to FILE1).
-hs2 FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo(corresponds to FILE2).
-l INT : threshold of detecting repeat length
-n INT : threshold of searching nodes
-d INT : threshold of searching depth
-L INT : threshold of detecting repeat length
-N INT : threshold of searching nodes
-D INT : threshold of searching depth
-rc FLOAT : Down sampling coverage (default 0.5)
-m INT : threshold mismatch of kmer alignment for copy estimation (default 0)
PREFIX_for_detect${c1}.fa : high frequency k-mer for detecting repeat
PREFIX_for_estimate${c2}.fa : high frequency k-mer for copy estimating
PREFIX_kmer_peak : k-mer peak
PREFIX.jf(PREFIX.jf1, PREFIX.jf2) : jf file output by jellyfish
PREFIX.dm / PREFIX.dm1 / PREFIX.dm2 : dump file output by jellyfish
PREFIX.hs / PREFIX.hs1 / PREFIX.hs2 : histogram of k-mer frequency
[tandem repeat]
PREFIX_T_repeat_num : temporal copy estimation
PREFIX_T_repeat_num_min : temporal minimum copy estimation
PREFIX_T.fa : detected tandem repeat
PREFIX_T.tsv : copy esitimation
[interspersed repeat]
PREFIX_I_repeat_num : temporal copy estimation
PREFIX_I_repeat_num_min : temporal minimum copy estimation
PREFIX_I.fa : detected interspersed repeat
PREFIX_I.tsv : copy estimation
-
column1 : sequence
-
column2 : sequence length
-
column3 : explanation
-
column4 : difference between column5 and column4
-
column5 : the copy number (FILE or FILE1)
-
column6 : the copy number (FILE2 ; in single mode, this column shows 0)
-
column7 : explanation
-
column8 : difference between column9 and column10
-
column9 : the number of bases (FILE or FILE1)
-
column10 : the number of bases (FILE2 ; in single mode, this column shows 0)
-
column1 : "Family" name (the number next to '_' shows "Family number", length, sequence number from left)
-
column2 : difference between column3 and column4
-
column3 : the number of bases consists of the "Family" (FILE or FILE1)
-
column4 : the number of bases consists of the "Family" (FILE2 ; in single mode, this column shows 0)
-
column5 : difference between normalized column3 and normalized column4 (normalized means devided by k-mer peak, so it shows difference of the copy number)
fasta file of detected repeats. The sequence which has "*" is a representative sequence of "family".
-
column1 : sequence name
-
column2 : k-mer aligned to the repeat
-
column3 : the number of match
after ">" shows Family, and other lines shows the "family number".