cycle_finder README

AUTHOR

Yoshiki Tanaka wrote the original code.

VERSION

1.0.0

DESCRIPTION

cycle_finder is a tool for detecting tandem repeats and interspersed repeats from short reads. It runs the following steps:

1. Extract high-frequency k-mers from the short reads.
2. Find cycles in the de Bruijn graph and detect tandem repeats.
3. Cluster the detected tandem repeats and estimate their copy numbers.
4. Find paths in the de Bruijn graph with the cycles removed and detect interspersed repeats.
5. Cluster the detected interspersed repeats and estimate their copy numbers.
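
As a rough illustration of the cycle-finding idea (not the tool's actual C++ implementation), the Python sketch below builds a tiny de Bruijn graph from the k-mers of a made-up tandem array and walks it until it returns to the starting node. The function names, the toy sequence, and the `max_depth` limit are all hypothetical and exist only for this example.

```python
from collections import defaultdict


def build_de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it links to."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def find_cycle(graph, start, max_depth=1000):
    """Depth-first search for a simple path that returns to `start`.

    Returns the nodes on the cycle (with `start` listed once), or None
    if no cycle is found within `max_depth` nodes.
    """
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if len(path) > max_depth:
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt == start:
                return path
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None


def cycle_to_unit(cycle):
    """Spell the repeat unit: each node on the cycle contributes its last base."""
    return "".join(node[-1] for node in cycle)


# Toy input (made up): the unit "ACGTT" repeated four times in tandem.
unit = "ACGTT"
sequence = unit * 4
k = 4
kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

graph = build_de_bruijn(kmers)
cycle = find_cycle(graph, start=sequence[:k - 1])
repeat_unit = cycle_to_unit(cycle)  # a rotation of the original unit
```

The recovered unit is a rotation of the true unit because a cycle has no natural starting point; it depends on which node the search starts from.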

INSTALLATION

Using Bioconda (Linux)

conda install -c bioconda -c conda-forge cycle_finder

From source

tar zxfv cycle_finder_<version>.tar.gz
cd cycle_finder_<version>
make
cp cycle_finder <installation_path>

SYNOPSIS

single mode

cycle_finder all -f <SHORT_READS>.fastq

compare mode

cycle_finder all -f1 <SHORT_READS1.fastq> -f2 <SHORT_READS2.fastq>
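
When the pipeline is driven from a script, the two invocations above can be assembled programmatically. The sketch below only builds the argument list; the helper name `build_all_command` is hypothetical, and it assumes that the `-t` and `-o` flags from COMMON OPTIONS are accepted by `all`.

```python
def build_all_command(reads, reads2=None, threads=1, prefix="out"):
    """Build the argument list for `cycle_finder all`.

    reads  : list of read files (single mode, or sample 1 in compare mode)
    reads2 : list of read files for sample 2; None selects single mode
    """
    cmd = ["cycle_finder", "all"]
    if reads2 is None:
        cmd += ["-f", *reads]                    # single mode
    else:
        cmd += ["-f1", *reads, "-f2", *reads2]   # compare mode
    cmd += ["-t", str(threads), "-o", prefix]
    return cmd


single_cmd = build_all_command(["reads.fastq"])
compare_cmd = build_all_command(["a.fastq"], ["b.fastq"], threads=4, prefix="cmp")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once cycle_finder and its dependencies are on the `PATH`.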

TEST

# Get the test dataset, which includes simulated reads from E. coli genomes with inserted repeats.
wget https://github.com/rkajitani/cycle_finder/releases/download/v1.0.0/test_data.tar.gz
tar xzfv test_data.tar.gz

# Test for tandem repeats.
# The reads, genome, tandem repeat unit (253 bp; mutation rate, 2%), and correct result are
# reads.fq, genome.fa, rep_unit.fa, and result/*, respectively.
cd test_data/tandem/
bash cmd.sh
# Output: out_T.fa out_T.tsv

cd ../..

# Test for interspersed repeats.
# The reads, genome, interspersed repeat unit (253 bp; mutation rate, 2%), and correct result are
# reads.fq, genome.fa, rep_unit.fa, and result/*, respectively.
cd test_data/interspersed/
bash cmd.sh
# Output: out_I.fa out_I.tsv

DEPENDENCY

GCC
OpenMP
Jellyfish (>= 2.2.6)
G. Marçais and C. Kingsford, “A fast, lock-free approach for efficient parallel counting of occurrences of k-mers,” Bioinformatics, vol. 27, no. 6, pp. 764–770, 2011. https://academic.oup.com/bioinformatics/article/27/6/764/234905
TRF (Tandem Repeat Finder; >= 4.07b)
G. Benson, “Tandem repeats finder: a program to analyze DNA sequences,” Nucleic Acids Res., vol. 27, no. 2, pp. 573–580, 1999.
BLAST+ (>= 2.2.31+)
S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucleic Acids Res., vol. 25, no. 17, pp. 3389–3402, 1997.
CD-HIT (>= 4.6)
W. Li, L. Fu, B. Niu, S. Wu, and J. Wooley, “Ultrafast clustering algorithms for metagenomic sequence analysis,” Brief. Bioinform., vol. 13, no. 6, pp. 656–668, 2012.

USAGE

COMMON OPTIONS

-t INT    : Number of threads (default 1)

-o STR    : Prefix of output files (default out)

cycle_finder extract [OPTIONS]

Extract high-frequency k-mers from short reads.
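
To make the idea of a "high-frequency k-mer" concrete, here is a minimal Python sketch that counts k-mers in a few made-up reads and keeps those seen at least `min_count` times. It is only an illustration: cycle_finder delegates counting to Jellyfish, and the function names, the toy reads, and the fixed `min_count` threshold are assumptions of this example (the real cutoff depends on the k-mer peak and the `-c1`/`-c2` options). Reverse complements are ignored here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, min_count):
    """Keep only the k-mers seen at least `min_count` times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy reads (made up): two reads from a repetitive region, one unrelated read.
reads = ["ACGTACGTAC", "CGTACGTACG", "TTGACCA"]
counts = count_kmers(reads, k=4)
frequent = high_frequency_kmers(counts, min_count=3)
```

In this toy input, only the four k-mers from the repetitive reads pass the threshold; the k-mers of the unrelated read each occur once and are dropped.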

INPUT OPTIONS

single mode

-f FILE1 [FILE2 ...]: Reads file (fasta or fastq format)

compare mode

-f1 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-f2 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)

OTHER OPTIONS

-c1 INT  : copy difference for detecting repeat

-c2 INT  : copy difference for copy estimation

single mode

-jf FILE : jf file output by jellyfish count.
            If given, jellyfish count is skipped.

-dm FILE : dump file output by jellyfish dump.
            If given, jellyfish dump is skipped.

-hs FILE : histo file output by jellyfish histo.
            If given, jellyfish histo is skipped.

compare mode

-jf1 FILE : jf file output by jellyfish count (corresponds to FILE1).
            If given, jellyfish count is skipped.

-jf2 FILE : jf file output by jellyfish count (corresponds to FILE2).
            If given, jellyfish count is skipped.

-dm1 FILE : dump file output by jellyfish dump (corresponds to FILE1).
            If given, jellyfish dump is skipped.

-dm2 FILE : dump file output by jellyfish dump (corresponds to FILE2).
            If given, jellyfish dump is skipped.

-hs1 FILE : histo file output by jellyfish histo (corresponds to FILE1).
            If given, jellyfish histo is skipped.

-hs2 FILE : histo file output by jellyfish histo (corresponds to FILE2).
            If given, jellyfish histo is skipped.

OUTPUT FILES

PREFIX_for_detect${c1}.fa           : high frequency k-mer for detecting repeat

PREFIX_for_estimate${c2}.fa         : high frequency k-mer for copy estimating

PREFIX_kmer_peak                    : k-mer peak

PREFIX.jf / PREFIX.jf1 / PREFIX.jf2 : jf file output by jellyfish

PREFIX.dm / PREFIX.dm1 / PREFIX.dm2 : dump file output by jellyfish

PREFIX.hs / PREFIX.hs1 / PREFIX.hs2 : histogram of k-mer frequency

cycle_finder cycle [OPTIONS]

Find cycles in the de Bruijn graph and detect tandem repeats.

INPUT OPTIONS

-f FILE   : high frequency k-mer for detecting repeat

-r FASTQ  : short-read fastq file (only one file)

-c INT    : mode (1: single mode, 2: compare mode)

-l INT    : threshold of detecting repeat length

-n INT    : threshold of searching nodes

-d INT    : threshold of searching depth

-p INT    : k-mer peak (single mode)

-p1 INT   : k-mer peak (compare mode; corresponds to FILE1)

-p2 INT   : k-mer peak (compare mode; corresponds to FILE2)

-rc FLOAT : Downsampling coverage (default 0.5)

OUTPUT FILES

PREFIX_T_repeat_num     : temporary copy estimation

PREFIX_T_repeat_num_min : temporary minimum copy estimation

cycle_finder cluster [OPTIONS]

Cluster the detected repeats and estimate their copy numbers.

INPUT OPTIONS

-f1 FILE : high frequency k-mer for copy estimating

-f2 FILE : temporary copy estimation

-f3 FILE : temporary minimum copy estimation

-c  INT  : mode (1: single mode, 2: compare mode)

-p INT   : k-mer peak (single mode)

-p1 INT  : k-mer peak (compare mode; corresponds to FILE1)

-p2 INT  : k-mer peak (compare mode; corresponds to FILE2)

-m  INT  : mismatch threshold for k-mer alignment in copy estimation (default 0)

OUTPUT FILES

tandem repeat

PREFIX_T_blst.blastn : k-mer alignment file

PREFIX_T.fa          : detected tandem repeat

PREFIX_T.tsv         : copy estimation

PREFIX_T_clst.fa     : fasta file of detected tandem repeats

interspersed repeat

PREFIX_I.fa          : detected interspersed repeat

PREFIX_I.tsv         : copy estimation

PREFIX_I_clst.fa     : fasta file of detected interspersed repeats

cycle_finder intersperse [OPTIONS]

Find paths in the de Bruijn graph with the cycles removed and detect interspersed repeats.

INPUT OPTIONS

-f FILE   : high frequency k-mer for detecting repeat

-b FILE   : alignment file (output from cycle command)

-r FASTQ  : short-read fastq file (only one file)

-c INT    : mode (1: single mode, 2: compare mode)

-L INT    : threshold of detecting repeat length

-N INT    : threshold of searching nodes

-D INT    : threshold of searching depth

-p INT    : k-mer peak (single mode)

-p1 INT   : k-mer peak (compare mode; corresponds to FILE1)

-p2 INT   : k-mer peak (compare mode; corresponds to FILE2)

-rc FLOAT : Downsampling coverage (default 0.5)

OUTPUT FILES

PREFIX_for_detect${c1}.fa_no_cycle : high frequency k-mer for detecting repeat, with cycles removed

PREFIX_I_repeat_num                : temporary copy estimation

PREFIX_I_repeat_num_min            : temporary minimum copy estimation

cycle_finder all [OPTIONS]

Run whole pipeline:
extract -> cycle -> cluster -> intersperse -> cluster

INPUT OPTIONS

single mode

-f FILE1 [FILE2 ...]: Reads file (fasta or fastq format)

compare mode

-f1 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-f2 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)

OTHER OPTIONS

-c1 INT  : copy difference for detecting repeat

-c2 INT  : copy difference for copy estimation

single mode

-jf FILE : jf file output by jellyfish count.
            If given, jellyfish count is skipped.

-dm FILE : dump file output by jellyfish dump.
            If given, jellyfish dump is skipped.

-hs FILE : histo file output by jellyfish histo.
            If given, jellyfish histo is skipped.

compare mode

-jf1 FILE : jf file output by jellyfish count (corresponds to FILE1).
            If given, jellyfish count is skipped.

-jf2 FILE : jf file output by jellyfish count (corresponds to FILE2).
            If given, jellyfish count is skipped.

-dm1 FILE : dump file output by jellyfish dump (corresponds to FILE1).
            If given, jellyfish dump is skipped.

-dm2 FILE : dump file output by jellyfish dump (corresponds to FILE2).
            If given, jellyfish dump is skipped.

-hs1 FILE : histo file output by jellyfish histo (corresponds to FILE1).
            If given, jellyfish histo is skipped.

-hs2 FILE : histo file output by jellyfish histo (corresponds to FILE2).
            If given, jellyfish histo is skipped.

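-l INT    : threshold of detecting repeat length (tandem repeats; see cycle)

-n INT    : threshold of searching nodes (tandem repeats; see cycle)

-d INT    : threshold of searching depth (tandem repeats; see cycle)

-L INT    : threshold of detecting repeat length (interspersed repeats; see intersperse)

-N INT    : threshold of searching nodes (interspersed repeats; see intersperse)

-D INT    : threshold of searching depth (interspersed repeats; see intersperse)

-rc FLOAT : Downsampling coverage (default 0.5)

-m  INT   : mismatch threshold for k-mer alignment in copy estimation (default 0)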

OUTPUT FILES

PREFIX_for_detect${c1}.fa           : high frequency k-mer for detecting repeat

PREFIX_for_estimate${c2}.fa         : high frequency k-mer for copy estimating

PREFIX_kmer_peak                    : k-mer peak

PREFIX.jf / PREFIX.jf1 / PREFIX.jf2 : jf file output by jellyfish

PREFIX.dm / PREFIX.dm1 / PREFIX.dm2 : dump file output by jellyfish

PREFIX.hs / PREFIX.hs1 / PREFIX.hs2 : histogram of k-mer frequency

[tandem repeat]

PREFIX_T_repeat_num     : temporary copy estimation

PREFIX_T_repeat_num_min : temporary minimum copy estimation

PREFIX_T.fa             : detected tandem repeat

PREFIX_T.tsv            : copy estimation

[interspersed repeat]

PREFIX_I_repeat_num     : temporary copy estimation

PREFIX_I_repeat_num_min : temporary minimum copy estimation

PREFIX_I.fa             : detected interspersed repeat

PREFIX_I.tsv            : copy estimation

FILE FORMAT

PREFIX_T_repeat_num / PREFIX_T_repeat_num_min / PREFIX_I_repeat_num / PREFIX_I_repeat_num_min

column1 : sequence
column2 : sequence length
column3 : explanation
column4 : difference between column5 and column6
column5 : the copy number (FILE or FILE1)
column6 : the copy number (FILE2 ; in single mode, this column shows 0)
column7 : explanation
column8 : difference between column9 and column10
column9 : the number of bases (FILE or FILE1)
column10 : the number of bases (FILE2 ; in single mode, this column shows 0)
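
For downstream scripting, the ten columns above can be loaded into a small record type. The sketch below is not part of cycle_finder: it assumes the file is tab-separated, the class and field names are hypothetical, and the example line is made up to show the layout.

```python
from dataclasses import dataclass


@dataclass
class RepeatNumRecord:
    sequence: str      # column1
    length: int        # column2
    copy_label: str    # column3 ("explanation")
    copy_diff: float   # column4
    copies_1: float    # column5 (FILE or FILE1)
    copies_2: float    # column6 (FILE2; 0 in single mode)
    base_label: str    # column7 ("explanation")
    base_diff: float   # column8
    bases_1: float     # column9 (FILE or FILE1)
    bases_2: float     # column10 (FILE2; 0 in single mode)


def parse_repeat_num_line(line, sep="\t"):
    """Parse one line of a *_repeat_num file (separator is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return RepeatNumRecord(
        sequence=fields[0],
        length=int(fields[1]),
        copy_label=fields[2],
        copy_diff=float(fields[3]),
        copies_1=float(fields[4]),
        copies_2=float(fields[5]),
        base_label=fields[6],
        base_diff=float(fields[7]),
        bases_1=float(fields[8]),
        bases_2=float(fields[9]),
    )


# Made-up single-mode line: the FILE2 columns are 0.
example = "ACGTT\t5\tcopy\t12\t12\t0\tbase\t60\t60\t0\n"
record = parse_repeat_num_line(example)
```

If the real files turn out to be space-separated, passing `sep=None` to `split` would handle arbitrary whitespace, provided the label columns contain no spaces.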

PREFIX_T.tsv / PREFIX_I.tsv

column1 : "Family" name (the numbers after '_' are, from left to right, the family number, the length, and the sequence number)
column2 : difference between column3 and column4
column3 : the number of bases that make up the "Family" (FILE or FILE1)
column4 : the number of bases that make up the "Family" (FILE2; in single mode, this column shows 0)
column5 : difference between normalized column3 and normalized column4 (normalized means divided by the k-mer peak, so this is the difference in copy number)
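
The normalization in column5 can be spelled out as a short calculation. The Python sketch below assumes a tab-separated file and that each sample's base count is divided by its own k-mer peak before the difference is taken; the function names, the family name, and the numbers are made up for illustration.

```python
def parse_family_row(line, sep="\t"):
    """Split one row of PREFIX_T.tsv / PREFIX_I.tsv into named fields."""
    name, diff, bases_1, bases_2, norm_diff = line.rstrip("\n").split(sep)
    return {
        "family": name,                       # column1
        "base_diff": float(diff),             # column2
        "bases_1": float(bases_1),            # column3
        "bases_2": float(bases_2),            # column4
        "normalized_diff": float(norm_diff),  # column5
    }


def normalized_difference(bases_1, bases_2, peak_1, peak_2=None):
    """Copy-number difference: each base count divided by its k-mer peak.

    In single mode there is no second sample, so its term is 0.
    """
    norm_1 = bases_1 / peak_1
    norm_2 = bases_2 / peak_2 if peak_2 else 0.0
    return norm_1 - norm_2


# Made-up single-mode row: 6000 bases at a k-mer peak of 30.
row = parse_family_row("X_1_253_1\t6000\t6000\t0\t200\n")
single = normalized_difference(row["bases_1"], row["bases_2"], peak_1=30)

# Made-up compare-mode values: 6000 bases at peak 30 vs 3000 bases at peak 20.
compare = normalized_difference(6000, 3000, peak_1=30, peak_2=20)
```

Here the single-mode value is 200 copies, and the compare-mode value is 200 - 150 = 50 copies.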

PREFIX_T_clst.fa / PREFIX_I_clst.fa

FASTA file of detected repeats. A sequence marked with "*" is the representative sequence of its "family".

PREFIX_T_blst.blastn

column1 : sequence name
column2 : k-mer aligned to the repeat
column3 : the number of match

PREFIX_T.element / PREFIX_I.element

Lines starting with ">" show the Family; the other lines show the "family number".
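
A file with this layout can be read by grouping each non-header line under the most recent ">" line. The sketch below is one way to do that in Python; the function name and the example content are hypothetical, and the exact meaning of the entries is as described above.

```python
def parse_element_file(lines):
    """Group the lines of a .element file under their preceding '>' header."""
    families = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:]
            families[current] = []
        elif current is not None:
            families[current].append(line)
    return families


# Made-up content showing the layout only.
example_lines = [">A", "1", "2", ">B", "3"]
families = parse_element_file(example_lines)
```

With a real file, the same function can be called as `parse_element_file(open("out_T.element"))`, since it only iterates over lines.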
