README

Computes the ratio of exact, paired (and unpaired) repeated reads within a genome.
Inspired by Nava Whiteford's work on exact unpaired repetitions (Whiteford et al., 2005).
Details of the algorithm are in Rayan Chikhi's PhD thesis (2012), and the following
abstract: http://hal.inria.fr/docs/00/42/68/56/PDF/1471-2105-10-S13-O2.pdf
Citation: 
R. Chikhi, D. Lavenier. Paired-end read length lower bounds for genome re-sequencing. (Meeting Abstract) BMC Bioinformatics 2009, 10(Suppl 13):O2
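
To make the notion of an unpaired repeated read concrete, here is a minimal Python sketch. It is not the algorithm used by pairs-v7 (which relies on suffix arrays); it simply counts substrings with a hash table. The function name repeated_read_ratio is hypothetical, and the sketch assumes that a "read" is every substring of a fixed length taken from the forward strand only, and that a read is "repeated" when the same substring occurs at more than one position.

```python
from collections import Counter


def repeated_read_ratio(genome: str, length: int) -> float:
    """Fraction of start positions whose read of the given length
    occurs at more than one position in the genome (forward strand only)."""
    n_reads = len(genome) - length + 1
    if n_reads <= 0:
        return 0.0
    # Count every substring of the requested length.
    counts = Counter(genome[i:i + length] for i in range(n_reads))
    # A position is "repeated" if its substring was seen at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n_reads


if __name__ == "__main__":
    # Reads of length 4: ACGT, CGTA, GTAC, TACG, ACGT -> 2 of 5 are repeated.
    print(repeated_read_ratio("ACGTACGT", 4))  # 0.4
```

This brute-force version needs memory proportional to the number of distinct substrings, so it is only practical for small sequences.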


The C version might be a bit tough to compile, so here is a Python version (with a more readable algorithm); its parameters are hardcoded:
- python pairs-v7.py

Compiling the much faster C version:
- download Judy (http://sourceforge.net/projects/judy/files/latest/download)
  and put it in ./judy-1.0.5/
- type "make judy" to compile judy
- download argtable (http://argtable.sourceforge.net/) and put it in ./argtable2/ (remove the trailing version suffix, e.g. "-14", from the directory name)
- type "make argtable" to compile argtable
- type "make -C ./my_sarray/" to compile suffix array extension
- type "make" to compile pairs-v7.c
- run ./pairs-v7

./pairs-v7 --help
Usage: pairs-v7 [-v] [-s <int>] [-d <int>] [-f <filename>] [--hd] [--mkesa]
[-l <int>] [--nopairs]
  -s, --sigma=<int>         define sigma value (default is 300)
  -d, --delta=<int>         define delta value (default is 0)
  -f <filename>             sequence to analyze
  --hd                      stores v[] on disk (in ./vfiles/)
  --mkesa                   use mkesa to build the suffix array and lcp (in ./mkesafiles/)
  -l, --length=<int>        set maximum read length (default is 16)
  --nopairs                 do not compute paired uniqueness, only single uniqueness
  -v, --verbose             verbose
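
The help text above does not define sigma or delta. As a rough illustration of paired uniqueness only, the sketch below assumes that sigma is the distance between the start positions of the two reads of a pair, that delta is 0 (no variation in that distance), and that a pair is "repeated" when the same two substrings occur at that spacing at more than one position. These are assumptions rather than the definitions used by pairs-v7, and the function name repeated_pair_ratio is hypothetical.

```python
from collections import Counter


def repeated_pair_ratio(genome: str, length: int, sigma: int) -> float:
    """Fraction of start positions whose (read, mate) pair occurs more than
    once, where the mate starts exactly `sigma` positions after the read.
    Assumed definition: forward strand only, delta fixed at 0."""
    n_pairs = len(genome) - sigma - length + 1
    if n_pairs <= 0:
        return 0.0
    # Key each position by the two substrings that form its pair.
    counts = Counter(
        (genome[i:i + length], genome[i + sigma:i + sigma + length])
        for i in range(n_pairs)
    )
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n_pairs


if __name__ == "__main__":
    # "AC" occurs twice in ACGACT, but no (read, mate) pair is repeated.
    print(repeated_pair_ratio("ACGACT", 2, 3))    # 0.0
    # In a tandem repeat, the pair (AC, GT) occurs at positions 0 and 4.
    print(repeated_pair_ratio("ACGTACGT", 2, 2))  # 0.4
```

The first example shows why pairing helps: a short read that is ambiguous on its own can become unique once its mate is taken into account.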


To compute paired repetitions for very large (human-sized) genomes,
use a suffix array constructed by mkesa (--mkesa); this requires ~60 GB of RAM. Feel free to send me a mail
if you wish to perform this kind of experiment.

Why all the optimizations in this code (e.g. 33-bit integers, multiple 1-byte Judy arrays,
several auxiliary data structures in pairs-v7)? Back in 2008 I only had access to a computer
with 64 GB of memory, and the structures supporting paired-repetition computations were too large.
For time efficiency (and for lack of public implementations), compressed suffix arrays were not used.
