Skip to content

Giraffe best practices

Jouni Siren edited this page Dec 3, 2023 · 8 revisions

Use a recent version of vg. Pangenome tools are more complex and less mature than the software tools you are likely used to. There is usually a new vg release in every 6 weeks that fixes issues people have identified.

Typical command lines

Paired-end mapping from an interleaved FASTQ file to GAM format using 32 threads:

vg giraffe -p -t 32 -Z graph.gbz -d graph.dist -m graph.min \
    -i -f reads.fq.gz > output.gam

Paired-end mapping from two FASTQ files to GAM format using 32 threads:

vg giraffe -p -t 32 -Z graph.gbz -d graph.dist -m graph.min \
    -f reads.R1.fq.gz -f reads.R2.fq.gz > output.gam

Paired-end mapping from an interleaved FASTQ file to BAM format using 32 threads:

vg giraffe -p -t 32 -Z graph.gbz -d graph.dist -m graph.min -x graph.xg \
    -i -f reads.fq.gz -o BAM > output.bam

Inputs

Graphs

Giraffe uses a GBZ graph that combines the graph with haplotype information. The graph must be provided with option -Z / --gbz-name. Using other graph types with Giraffe is not recommended.

If you start from a graph in GFA format, use W-lines for the haplotype paths. The heuristics vg uses with P-lines are fragile and can lead to incorrect results. In particular, in version 1.52.0 and earlier, vg autoindex does not understand P-lines properly. Starting from version 1.53.0, vg autoindex will recognize haplotypes stored as P-lines, as long as path names follow one of the expected patterns.

Graph construction is a hard problem. A pangenome graph is effectively a multiple sequence alignment. While it is easy to build an alignment, finding a useful alignment may require trial and error. Giraffe prefers graphs that avoid complex structures at every level of detail, while simultaneously minimizing sequence duplication.

The Minigraph–Cactus pipeline builds graphs like that, and the default parameters are suitable for human-like genomes. With a small number of haplotypes (e.g. 10), the default (clip) graph is usually a good choice. With substantially more haplotypes, you may get better results with the filter graph that drops nodes used only by a single haplotype.

Distance index and minimizer index

In addition to the graph, Giraffe also needs a distance index and a minimizer index. These can be specified with options -d / --dist-name and -m / --minimizer-name. Specifying the index files explicitly is recommended in scripts and especially in production use.

If the indexes are not specified and the graph is named graph.gbz or graph.giraffe.gbz, Giraffe will guess that the indexes are in files graph.dist and graph.min. If these files do not exist, Giraffe will try to build the indexes. This can cause issues when running multiple Giraffe jobs using the same indexes.

The distance index is a memory-mapped file. As of vg version 1.48.0, the file will be opened in read+write mode by default. This can cause issues in HPC clusters and other distributed environments, where multiple computers try to access the same distance index file. To avoid this, make the file read-only.

Giraffe relies on distance index annotations in the minimizer index. Without them, mapping speed will be slow. Minimizer indexes built using vg autoindex always contain them. If you build the minimizer index manually using vg minimizer, you must specify the distance index using option -d / --distance-index.

Reads

The reads can be either in FASTQ format (option -f / --fastq-in) or in GAM format (-G / --gam-in). A FASTQ file can be gzip-compressed.

By default, Giraffe does single-end mapping. For paired-end mapping, either specify two FASTQ files or use option -i / --interleaved with a single interleaved input file.

When doing paired-end mapping, Giraffe assumes that the first few thousand reads are a representative sample of the overall file and uses them for estimating fragment length distribution. If the input file is sorted, this does not work, and you must specify the fragment length distribution using options --fragment-mean and --fragment-stdev.

Outputs

Alignments

Giraffe writes the alignments to stdout. The default format is GAM, which is a binary format corresponding to the internal representation of the alignments. Other formats can be specified using option -o / --output-format.

GAF is a text-based format supported by other pangenomics tools. A compressed GAF file is usually smaller than the corresponding GAM file.

Status messages

Giraffe prints status messages to stderr. Additional progress information can be printed using option -p / --progress. This is recommended when troubleshooting or reporting issues.

Some environments offer an option to combine stdout and stderr into a single output stream. Such options must not be used with Giraffe.

SAM / BAM / CRAM output

As of vg version 1.49.0, there are inconsistent reports of poor performance when writing the alignments in SAM / BAM / CRAM format. To avoid that, you can specify a second graph with vg giraffe option -x / --xg-name. The other graph must be compatible with the GBZ graph.

You can convert a GBZ graph into XG format using:

vg convert -x --drop-haplotypes graph.gbz > graph.xg

You may encounter the same issue when converting the alignments to SAM / BAM / CRAM format with vg surject. It may therefore be better to use an XG graph instead of a GBZ graph in vg surject.

Performance

Multithreading

By default, Giraffe uses a single output thread and one mapping thread for every CPU core visible to OpenMP. The number of mapping threads can be adjusted using option -t / --threads or by environment variable OMP_NUM_THREADS.

Expected performance

Mapping a 30x human sample using 32 mapping threads should take 1-10 hours, depending on the computer and the graph. This corresponds roughly to 500-5000 reads per second per thread. After the first few minutes, the size of the output file should grow at a steady pace. If you see substantially lower performance, something is likely wrong.

Clone this wiki locally