Computes paired and unpaired exact repetitions within a genome sequence.
my_sarray
33bits.c
33bits.h
Makefile
NC_001416.fna
README
byte2judy.c
byte2judy.h
p-judy.c
p-judy.h
pairs-v7.c
pairs-v7.py
rb-1bit.c
short2judy.c
short2judy.h

README

Computes the ratio of exact, paired (and unpaired) repeated reads within a genome.
Inspired by Nava Whiteford's work on exact unpaired repetitions (Whiteford et al., 2005).
Details of the algorithm are in Rayan Chikhi's PhD thesis (2012) and in the following
abstract: http://hal.inria.fr/docs/00/42/68/56/PDF/1471-2105-10-S13-O2.pdf
Citation: 
R. Chikhi, D. Lavenier. Paired-end read length lower bounds for genome re-sequencing. (Meeting Abstract) BMC Bioinformatics 2009, 10(Suppl 13):O2
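
As a rough illustration of the quantity being computed, here is a minimal brute-force sketch in Python. It is not the algorithm from the thesis or from pairs-v7 (which rely on a suffix array and Judy arrays). The function names are hypothetical, only the forward strand is considered, and a "pair" is assumed to be two reads whose start positions are exactly sigma apart, which may differ from the definition used by the tool.

```python
from collections import Counter


def unpaired_repeat_ratio(genome, length):
    """Fraction of start positions whose read of the given length occurs more than once."""
    n = len(genome) - length + 1
    if n <= 0:
        return 0.0
    counts = Counter(genome[i:i + length] for i in range(n))
    # A position is "repeated" if its read appears at least twice in the genome.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n


def paired_repeat_ratio(genome, length, sigma):
    """Same ratio for pairs of reads starting at i and i + sigma (assumed pair model)."""
    n = len(genome) - sigma - length + 1
    if n <= 0:
        return 0.0
    counts = Counter(
        (genome[i:i + length], genome[i + sigma:i + sigma + length])
        for i in range(n)
    )
    # A pair is "repeated" only if both reads occur together, at the same spacing, elsewhere.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n


if __name__ == "__main__":
    toy = "ACGTACGA"
    print(unpaired_repeat_ratio(toy, 2))
    print(paired_repeat_ratio(toy, 2, 2))
```

On the toy string, several single reads of length 2 are repeated while no pair is, which is the effect that pairing is meant to capture. This sketch uses memory proportional to the genome size times the read length, so it is only suitable for very small inputs.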


The C version can be a bit tough to compile, so a Python version is also provided (with a more readable algorithm; its parameters are hardcoded):
- python pairs-v7.py 

Compiling the much faster C version:
- download Judy (http://sourceforge.net/projects/judy/files/latest/download)
  and put it in ./judy-1.0.5/
- type "make judy" to compile Judy
- download argtable (http://argtable.sourceforge.net/) and put it in ./argtable2/ (remove the trailing version suffix from the directory name, e.g. "-14")
- type "make argtable" to compile argtable
- type "make -C ./my_sarray/" to compile the suffix array extension
- type "make" to compile pairs-v7.c
- run ./pairs-v7

./pairs-v7 --help
Usage: pairs-v7 [-v] [-s <int>] [-d <int>] [-f <filename>] [--hd] [--mkesa]
[-l <int>] [--nopairs]
  -s, --sigma=<int>         define sigma value (default is 300)
  -d, --delta=<int>         define delta value (default is 0)
  -f <filename>             sequence to analyze
  --hd                      stores v[] on disk (in ./vfiles/)
  --mkesa                   use mkesa to build the suffix array and lcp (in ./mkesafiles/)
  -l, --length=<int>        set maximum read length (default is 16)
  --nopairs                 do not compute paired uniqueness, only single uniqueness
  -v, --verbose             verbose
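
For scripted runs, the options above can be assembled from Python. The following is a small sketch, assuming the binary has been built as ./pairs-v7 in the current directory; the helper names build_command and run are hypothetical, and only flags listed in the usage text are used, with the same defaults.

```python
import subprocess


def build_command(fasta, length=16, sigma=300, delta=0, pairs=True, binary="./pairs-v7"):
    """Build the argument list for pairs-v7 from the options shown in --help."""
    cmd = [
        binary,
        "-f", fasta,
        f"--length={length}",
        f"--sigma={sigma}",
        f"--delta={delta}",
    ]
    if not pairs:
        # Only compute single (unpaired) uniqueness.
        cmd.append("--nopairs")
    return cmd


def run(fasta, **options):
    """Run pairs-v7 and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_command(fasta, **options), check=True)
```

For example, run("NC_001416.fna", length=20) would analyze the sequence file shipped in the repository with a maximum read length of 20.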


To compute paired repetitions for very large (human-sized) genomes, use a suffix array
constructed by mkesa (the --mkesa option); this requires ~60 GB of RAM. Feel free to send
me an email if you wish to perform this kind of experiment.

Why all the optimizations in this code (e.g. 33-bit integers, multiple 1-byte Judy arrays,
several auxiliary data structures in pairs-v7)? Back in 2008 I only had access to a computer
with 64 GB of memory, and the structures supporting paired repetition computations were too
large without them. For time efficiency (and for lack of public implementations), compressed
suffix arrays were not used.
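
To give an idea of what a 33-bit integer array can look like, here is a minimal Python sketch. It is an assumption about the general technique, not a description of the actual layout in 33bits.c: the low 32 bits of each value are kept in a flat byte buffer and the 33rd bit in a separate bit array. The class name PackedInts33 is hypothetical.

```python
class PackedInts33:
    """Fixed-size array of integers in [0, 2**33), stored in 33 bits per entry."""

    def __init__(self, size):
        self.size = size
        self.low = bytearray(4 * size)           # low 32 bits, 4 bytes per entry
        self.high = bytearray((size + 7) // 8)   # 33rd bit, 1 bit per entry

    def _check(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)

    def __setitem__(self, i, value):
        self._check(i)
        if not 0 <= value < (1 << 33):
            raise ValueError("value does not fit in 33 bits")
        self.low[4 * i:4 * i + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")
        byte, bit = divmod(i, 8)
        if value >> 32:
            self.high[byte] |= 1 << bit            # set the 33rd bit
        else:
            self.high[byte] &= ~(1 << bit) & 0xFF  # clear the 33rd bit

    def __getitem__(self, i):
        self._check(i)
        low = int.from_bytes(self.low[4 * i:4 * i + 4], "little")
        byte, bit = divmod(i, 8)
        return low | (((self.high[byte] >> bit) & 1) << 32)

    def nbytes(self):
        """Total payload size in bytes."""
        return len(self.low) + len(self.high)
```

Compared with 64-bit entries, this layout takes a little over half the space (33 bits instead of 64 per value), at the cost of two memory accesses per read or write.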