Skip to content
Go to file

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time


Computes the ratio of exact, paired (and unpaired) repeated reads within a genome.
Inspired by Nava Whiteford work on exact unpaired repetitions (Whiteford et al, 2005).
Details of the algorithm are in Rayan Chikhi's PhD thesis (2012), and the following
R. Chikhi, D. Lavenier. Paired-end read length lower bounds for genome re-sequencing. (Meeting Abstract) BMC Bioinformatics 2009, 10(Suppl 13):O2

It might be a bit tough to compile, so here is a python version (with a more readable algorithm), parameters are hardcoded:
- python 

Compiling the much faster C version:
- download Judy (
  and put it in ./judy-1.0.5/
- type "make judy" to compile judy
- download argtable ( and put it in ./argtable2/ (remove the trailing version, e.g. "-14")
- type "make argtable" to compile argtable
- type "make -C ./my_sarray/" to compile suffix array extension
- type "make" to compile pairs-v7.c
- run ./pairs-v7

./pairs-v7 --help
Usage: pairs-v7 [-v] [-s <int>] [-d <int>] [-f <filename>] [--hd] [--mkesa]
[-l <int>] [--nopairs]
  -s, --sigma=<int>         define sigma value (default is 300)
  -d, --delta=<int>         define delta value (default is 0)
  -f <filename>             sequence to analyze
  --hd                      stores v[] on disk (in ./vfiles/)
  --mkesa                   use mkesa to build the suffix array and lcp (in ./mkesafiles/)
  -l, --length=<int>        set maximum read length (default is 16)
  --nopairs                 do not compute paired uniqueness, only single uniqueness
  -v, --verbose             verbose

To compute paired repetitions for very large (human-sized) genomes:
use a suffix array constructed by mkesa, requires ~60gb ram. feel free to send me a mail 
if you wish to perform this kind of experiment.

Why all the optimizations in this code? (e.g. 33-bits integers, multiple 1-byte judy arrays, 
several auxiliary data structures in pairs-v7) Back in 2008 I only had access to a computer
with 64 GB memory, and the structures supporting paired repetitions computations were too large.
For time efficiency (and lack of public implementations), compressed suffix arrays were not used.


Computes paired and unpaired exact repetitions within a genome sequence.



No releases published
You can’t perform that action at this time.