Haplotype Sampling

Overview

When a pangenome graph is used as a reference for read mapping, the variants that are included in the graph affect the results. Roughly speaking,

variants that are present both in the graph and in the sequenced sample make mapping faster and more accurate; while
variants that are present in the graph but not in the sample make mapping slower and less accurate.

The usual approach is building a graph with only common variants. Starting with vg version 1.49.0, there is another option: building a personalized reference for each sample. This is achieved by counting kmers in the reads and generating a small number of synthetic haplotypes based on the kmer counts.

Preprint

Jouni Sirén, Parsa Eskandar, Matteo Tommaso Ungaro, Glenn Hickey, Jordan M. Eizenga, Adam M. Novak, Xian Chang, Pi-Chuan Chang, Mikhail Kolmogorov, Andrew Carroll, Jean Monlong, and Benedict Paten: Personalized Pangenome References. bioRxiv, 2023. DOI: 10.1101/2023.12.13.571553

Preprocessing the graph

vg index -j graph.dist graph.gbz
vg gbwt -p --num-threads 16 -r graph.ri -Z graph.gbz
vg haplotypes -v 2 -t 16 -H graph.hapl graph.gbz

We assume that the graph is based on multiple sequence alignment and has a fairly linear high-level structure. Such graphs can be built using the Minigraph-Cactus pipeline (use the default / clip graph). The graph and the haplotypes must be in GBZ format.

The preprocessing step is done once per graph. It transforms each chromosome into a sequence of subchains (blocks) of target length 10 kbp. The idea is that we can choose the most relevant haplotypes independently in each subchain and concatenate them to form synthetic haplotypes. For each subchain, we select some kmers that are specific to the subchain and represent each haplotype by their presence / absence. At the moment, the kmers are minimizers with a single hit in the graph.

Note: As of vg version 1.51.0, preprocessing will fail if the graph contains components (~chromosomes, contigs) with multiple top-level chains in the snarl decomposition. Such components lack the necessary linear structure.

Required indexes

Preprocessing needs three indexes in addition to the GBZ graph: a distance index, an r-index, and a minimizer index.

The distance index can be built using the vg index subcommand.

The r-index can be built using the vg gbwt subcommand. Option -p / --progress prints progress information to stderr, while option --num-threads specifies the number of threads used in index construction.

We build a new minimizer index instead of loading a prebuilt one from disk. Because we do not need the distance index annotations used by vg giraffe, we can save memory by building a smaller index without them.

Generating haplotype information

We generate the haplotype information using the vg haplotypes subcommand. There are two required arguments: the input graph that will be preprocessed, and the output file for the haplotype information, specified with option -H / --haplotype-output.

The following options can be used:

-d / --distance-index: Distance index file. Default: graph.dist if the input graph is graph.gbz.
-r / --r-index: R-index file. Default: graph.ri if the input graph is graph.gbz.
--kmer-length: Kmer length used to be used in the haplotype information (up to 31). Default: 29.
--window-length: Window length used for building the minimizer index. Default: 11.
--subchain-length: Target length (in bp) for subchains. Default: 10000.
-v / --verbosity: Verbosity level (0 = silent, 1 = basic, 2 = detailed). Default: 0.
-t / --threads: Approximate number of threads to be used. Default: depends on the system.

Haplotype sampling

Manual sampling, temporary graph:

export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp sample.fq.gz ${TMPDIR}/sample $TMPDIR
vg haplotypes -v 2 -t 16 --include-reference --diploid-sampling \
    -i graph.hapl -k ${TMPDIR}/sample.kff -g ${TMPDIR}/sampled.gbz graph.gbz
vg giraffe -p -t 16 -Z ${TMPDIR}/sampled.gbz -i -f sample.fq.gz > sample.gam

Giraffe integration (requires vg version 1.51.0):

export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp sample.fq.gz ${TMPDIR}/sample $TMPDIR
vg giraffe -p -t 16 -Z graph.gbz --haplotype-name graph.hapl --kmer-name ${TMPDIR}/sample.kff \
    -N sample -i -f sample.fq.gz > sample.gam

Haplotype sampling is based on kmer counts in the reads. It assumes a diploid sample. Using the counts, the algorithm classifies each kmer used in the haplotype information as absent, heterozygous, present, or frequent in the sequenced sample. It then selects a small number of most relevant haplotypes in each subchain and concatenates them to form synthetic haplotypes.

Note: Read coverage should be at least 20x to tell the difference between heterozygous and homozygous kmers reliably.

Kmer counting

We assume that the kmer counts are in the KFF format. Any kmer counter supporting the format should work, though we have used only KMC so far. Kmers with a single occurrence in the reads can be safely ignored, and the counts for a kmer and its reverse complement can be combined or separate.

It is important to use the same kmer length in kmer counting as in the haplotype information (default: 29).

Sampling the haplotypes

We use the vg haplotypes for sampling the haplotypes. In addition to the input graph, we need the haplotype information and the kmer counts we generated earlier. These can be specified using options -i / --haplotype-input and -k / --kmer-input, respectively. The output is a sampled GBZ graph, which will be written to a file specified by option -g / --gbz-output.

The current algorithm selects the most relevant haplotypes independently in each subchain. It scores each haplotype in the subchain as a sum of kmer scores. A haplotype that gets a present kmer (in the sequenced sample) right/wrong initially gets +1.0/-1.0. The initial scores for absent kmers are +0.8/-0.8. For heterozygous kmers, the initial scores are +0.0/-0.0 if the kmer is present/absent in the haplotype.

The first selected haplotype should be the best approximation for the consensus sequence of the sequenced sample. After selecting a haplotype, the algorithm adjusts the scores for the remaining haplotypes. If the selected haplotype contained a present kmer, the score for that kmer will be discounted by multiplicative factor 0.9. The scores for heterozygous kmers will be adjusted by additive term -0.05/+0.05 if the kmer was present in the selected haplotype, and by +0.05/-0.05 if not. The scores for absent kmers do not change.

The following options can be used with the vg haplotypes command:

--coverage: Kmer coverage in the reads. Default: estimate from kmer counts.
--num-haplotypes: Number of generated synthetic haplotypes, or the number of candidate haplotypes in the diploid sampling mode. Default: 4 generated haplotypes or 32 candidates.
--present-discount: Multiplicative factor for discounting scores for present kmers. Default: 0.9.
--het-adjustment: Additive term for adjusting scores for heterozygous kmers. Default: 0.05.
--absent-score: Score for absent kmers. Default: 0.8.
--diploid-sampling: Enable the diploid sampling mode (see below).
--include-reference: Include reference paths and generic paths from the full graph in the sampled graph.
-v / --verbosity: Verbosity level (0 = silent, 1 = basic, 2 = detailed). Default: 0.
-t / --threads: Approximate number of threads to be used. Default: depends on the system.

Note: Coverage estimation heuristic assumes that the most common kmer count above 1 is close to either the coverage or coverage/2. There will be a warning if the heuristic considers the estimate unreliable.

Note: Option --include-reference is necessary if the sampled graph is intended for a workflow based on a linear reference genome. That includes mapping reads in the BAM format and calling variants relative to a linear reference genome.

Index construction and read mapping

The haplotype sampling pipeline is currently intended for short read mapping with vg giraffe. For that purpose, we need a distance index and a minimizer index. As discussed in Giraffe best practices, if you give vg giraffe a GBZ graph without the indexes, it will build them automatically. While that can cause issues when multiple jobs are using the same graph simultaneously, the sampled graph is only intended for mapping a single set of reads. Hence we can use the automatic behavior safely.

If you want to build the indexes manually, you can use the following commands:

vg index -j sampled.dist sampled.gbz
vg minimizer -p -t 16 -o sampled.min -d sampled.dist sampled.gbz

Note: The sampled graph is a subgraph of the input graph. Any alignments to the sampled graph are also valid alignments to the input graph.

Giraffe integration

Starting with vg version 1.51.0, haplotype sampling is integrated into Giraffe. To use it, run vg giraffe with the following options:

-Z / --gbz-name: Full GBZ graph file. Do not provide distance / minimizer indexes.
--haplotype-name: Haplotype information file.
--kff-name: KFF file with kmer counts.
--index-basename: Name prefix for the sample-specific graph/index files. If omitted, Giraffe will use the path and the base name of the full graph as prefix.
-N / --sample: Sample name. If omitted, Giraffe will use the base name of the KFF file as sample name.

Starting from vg version 1.54.0, this will do diploid sampling (see below) from 32 candidates with default parameters and option --include-reference. Until version 1.53.0, it sampled 4 haplotypes instead. The sample-specific graph/index files will have names of the form prefix.sample.extension. If the name of the full graph is graph.gbz or graph.giraffe.gbz, Giraffe will use graph as the prefix in the usual way. This can be overridden with option --index-basename.

Note: Sample-specific graph/index files will take tens of gigabytes for each human sample. If you are mapping multiple samples, you may want to use option --index-basename to have more control over the location of the files.

Diploid sampling

Starting with vg version 1.51.0, there is an optional diploid sampling mode. The mode is enabled by vg haplotypes option --diploid-sampling. In the diploid sampling mode, we first select the specified number of candidate haplotypes using the usual greedy algorithm. Then we consider all pairs of candidates and choose the highest-scoring pair according to a simple scoring scheme:

Present kmer: -1 / 0 / +1 for 0 / 1 / 2 copies in the haplotypes
Heterozygous kmer: 0 / +1 / 0
Absent kmer: +1 / 0 / -1

As usual, this process is repeated independently in each subchain. The output graph will contain two synthetic haplotypes, plus the possible reference paths if option --include-reference is used.