Note this is in active rapid development at present and not ready for reliable use
Pandora is a tool for bacterial genome analysis without using a reference genome, including genetic variation from SNPs to gene presence/absence across the whole pan-genome. Core ideas are:
- new samples look like recombinants (plus mutations) of things seen before
- we should be analysing nucleotide-level variation everywhere, not just in core genes
- arbitrary reference genomes are unnatural
Pandora works with Illumina or nanopore data, allowing per-sample analysis (sequence inference and SNP/indel/gene-calling) and comparison of multiple samples. To do this it uses population reference graphs (PRG) which have been built for orthologous blocks of interest (e.g. genes and intergenic regions). See https://github.com/rmcolq/make_prg for a pipeline which can construct these PRGs from a set of aligned sequence files.
It can do the following for a single sample (read dataset):
- Output inferred gene sequences for the orthologous chunks (eg genes) in the PRG
- Output a VCF showing the variation found in the pangenome genes which are present, with respect to any reference in the PRG.
For a collection of samples, it can:
Output a matrix showing inferred copy-number of each gene in each sample genome.
Output one VCF per orthologous-chunk, showing how samples which contained this chunk differed in their gene sequence. Variation is shown with respect to the most informative recombinant path in the PRG . Soon, in a galaxy not so far away, it will allow
discovery of new variation not in the PRG
Warning - this code is still in development.
Requires a Unix or Mac OS.
Requires a system install of
zlib. If this is not already installed, this tutorial is helpful.
Requires a system installation of
log(which also depends on
iostreamslibraries. If not already installed use the following or look at this guide.
wget https://sourceforge.net/projects/boost/files/boost/1.62.0/boost_1_62_0.tar.gz --no-check-certificate tar xzf boost_1_62_0.tar.gz cd boost_1_62_0 ./bootstrap.sh [--prefix=/prefix/path] --with-libraries=system,filesystem,iostreams,log,thread,date_time ./b2 install
Download and install
git clone https://github.com/rmcolq/pandora.git cd pandora mkdir build cd build cmake [-DCMAKE_PREFIX_PATH=/prefix/path] .. make ctest -VV cd ..
Instead you can download and use the singularity container:
singularity pull --force --name pandora.simg shub://rmcolq/pandora:pandora singularity exec pandora.simg pandora
Population Reference Graphs
Pandora assumes you have already constructed a fasta-like file of graphs, one entry for each gene/ genome region of interest.
Takes a fasta-like file of PRG sequences and constructs an index, and directory of gfa files to be used by pandora map.
Usage: pandora index [options] <prgs.fa> Options: -h,--help Show this help message -w W Window size for (w,k)-minimizers, default 14 -k K K-mer size for (w,k)-minimizers, default 15
The index stores (w,k)-minimizers for each PRG path found. These parameters can be specified, but default to w=1, k=15.
Map reads to index
This takes a fasta of noisy long read sequence data and compares to the index. It infers which of the PRG genes/elements is present, and for those that are present it outputs the inferred sequence.
Usage: pandora map -p PRG_FILE -r READ_FILE -o OUTDIR <option(s)> Options: -h,--help Show this help message -p,--prg_file PRG_FILE Specify a fasta-style prg file -r,--read_file READ_FILE Specify a file of reads in fasta format -o,--outdir OUTDIR Specify directory of output -w W Window size for (w,k)-minimizers, must be <=k, default 14 -k K K-mer size for (w,k)-minimizers, default 15 -m,--max_diff INT Maximum distance between consecutive hits within a cluster, default 500 (bps) -e,--error_rate FLOAT Estimated error rate for reads, default 0.11 --genome_size NUM_BP Estimated length of genome, used for coverage estimation --output_kg Save kmer graphs with fwd and rev coverage annotations for found localPRGs --output_vcf Save a vcf file for each found localPRG --vcf_refs REF_FASTA A fasta file with an entry for each LocalPRG giving reference sequence for VCF. Must have a perfect match in the graph and the same name as the graph --illumina Data is from illumina rather than nanopore, so is shorter with low error rate --bin Use binomial model for kmer coverages, default is negative binomial --max_covg Maximum average coverage from reads to accept --regenotype Add extra step to carefully genotype SNP sites