Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
Computes the ratio of exact, paired (and unpaired) repeated reads within a genome. Inspired by Nava Whiteford work on exact unpaired repetitions (Whiteford et al, 2005). Details of the algorithm are in Rayan Chikhi's PhD thesis (2012), and the following abstract: http://hal.inria.fr/docs/00/42/68/56/PDF/1471-2105-10-S13-O2.pdf Citation: R. Chikhi, D. Lavenier. Paired-end read length lower bounds for genome re-sequencing. (Meeting Abstract) BMC Bioinformatics 2009, 10(Suppl 13):O2 It might be a bit tough to compile, so here is a python version (with a more readable algorithm), parameters are hardcoded: - python pairs-v7.py Compiling the much faster C version: - download Judy (http://sourceforge.net/projects/judy/files/latest/download) and put it in ./judy-1.0.5/ - type "make judy" to compile judy - download argtable (http://argtable.sourceforge.net/) and put it in ./argtable2/ (remove the trailing version, e.g. "-14") - type "make argtable" to compile argtable - type "make -C ./my_sarray/" to compile suffix array extension - type "make" to compile pairs-v7.c - run ./pairs-v7 ./pairs-v7 --help Usage: pairs-v7 [-v] [-s <int>] [-d <int>] [-f <filename>] [--hd] [--mkesa] [-l <int>] [--nopairs] -s, --sigma=<int> define sigma value (default is 300) -d, --delta=<int> define delta value (default is 0) -f <filename> sequence to analyze --hd stores v on disk (in ./vfiles/) --mkesa use mkesa to build the suffix array and lcp (in ./mkesafiles/) -l, --length=<int> set maximum read length (default is 16) --nopairs do not compute paired uniqueness, only single uniqueness -v, --verbose verbose To compute paired repetitions for very large (human-sized) genomes: use a suffix array constructed by mkesa, requires ~60gb ram. feel free to send me a mail if you wish to perform this kind of experiment. Why all the optimizations in this code? (e.g. 33-bits integers, multiple 1-byte judy arrays, several auxiliary data structures in pairs-v7) Back in 2008 I only had access to a computer with 64 GB memory, and the structures supporting paired repetitions computations were too large. For time efficiency (and lack of public implementations), compressed suffix arrays were not used.