rchikhi/pairedRepetitions
Computes the ratio of exact paired (and unpaired) repeated reads within a genome. Inspired by Nava Whiteford's work on exact unpaired repetitions (Whiteford et al., 2005). Details of the algorithm are in Rayan Chikhi's PhD thesis (2012) and in the following abstract: http://hal.inria.fr/docs/00/42/68/56/PDF/1471-2105-10-S13-O2.pdf

Citation: R. Chikhi, D. Lavenier. Paired-end read length lower bounds for genome re-sequencing. (Meeting Abstract) BMC Bioinformatics 2009, 10(Suppl 13):O2

The C version might be a bit tough to compile, so here is a Python version (with a more readable algorithm); its parameters are hardcoded:

- python pairs-v7.py

Compiling the much faster C version:

- download Judy (http://sourceforge.net/projects/judy/files/latest/download) and put it in ./judy-1.0.5/
- type "make judy" to compile Judy
- download argtable (http://argtable.sourceforge.net/) and put it in ./argtable2/ (remove the trailing version from the directory name, e.g. "-14")
- type "make argtable" to compile argtable
- type "make -C ./my_sarray/" to compile the suffix array extension
- type "make" to compile pairs-v7.c
- run ./pairs-v7

    ./pairs-v7 --help
    Usage: pairs-v7 [-v] [-s <int>] [-d <int>] [-f <filename>] [--hd] [--mkesa] [-l <int>] [--nopairs]
      -s, --sigma=<int>    define sigma value (default is 300)
      -d, --delta=<int>    define delta value (default is 0)
      -f <filename>        sequence to analyze
      --hd                 stores v[] on disk (in ./vfiles/)
      --mkesa              use mkesa to build the suffix array and lcp (in ./mkesafiles/)
      -l, --length=<int>   set maximum read length (default is 16)
      --nopairs            do not compute paired uniqueness, only single uniqueness
      -v, --verbose        verbose

To compute paired repetitions for very large (human-sized) genomes, use a suffix array constructed by mkesa; this requires ~60 GB of RAM. Feel free to email me if you wish to perform this kind of experiment.

Why all the optimizations in this code? (e.g. 33-bit integers, multiple 1-byte Judy arrays, several auxiliary data structures in pairs-v7)

Back in 2008 I only had access to a computer with 64 GB of memory, and the structures supporting paired repetition computations were too large. For time efficiency (and for lack of public implementations), compressed suffix arrays were not used.
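
For readers who want a feel for the quantity being computed, here is a minimal brute-force sketch in Python. It is not the algorithm used in pairs-v7 (which relies on suffix arrays and Judy arrays); it simply counts substrings with a hash table and only works on small inputs. The function names are hypothetical, and the pairing model is an assumption made for illustration: the mate is taken to start exactly sigma bases after the start of the first read, delta is treated as 0, and only the forward strand is considered.

```python
from collections import Counter


def repeated_ratio(genome, length):
    """Fraction of start positions whose read of `length` bases
    occurs more than once in `genome` (forward strand only)."""
    n = len(genome) - length + 1
    if n <= 0:
        return 0.0
    reads = [genome[i:i + length] for i in range(n)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] > 1) / n


def repeated_pair_ratio(genome, length, sigma):
    """Fraction of start positions whose (read, mate) pair occurs
    more than once.

    Assumption (not taken from pairs-v7): the mate starts exactly
    `sigma` bases after the start of the first read, with delta = 0.
    """
    n = len(genome) - sigma - length + 1
    if n <= 0:
        return 0.0
    pairs = [
        (genome[i:i + length], genome[i + sigma:i + sigma + length])
        for i in range(n)
    ]
    counts = Counter(pairs)
    return sum(1 for p in pairs if counts[p] > 1) / n


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # toy sequence, for illustration only
    print("unpaired:", repeated_ratio(toy, 2))
    print("paired:  ", repeated_pair_ratio(toy, 2, 4))
```

On this toy sequence most 2-base reads are repeated, while none of the pairs are, which is the kind of effect the tool measures at genome scale. Memory grows with the number of distinct reads, so this sketch is only suitable for very short sequences.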