Rob Flickenger edited this page Aug 9, 2021 · 1 revision

The BioGraph Variant DataBase (vdb) solves the performance and organizational problems of VCF by leveraging cloud-based storage, distributed queries, and modern design principles.

Major features

  • Organize any number of VCF results for any number of samples.
  • Import, merge, filter, annotate, and export VCF files directly from the command line.
  • Apply bcftools filter syntax to refine variant selection using parallel processing to scale to billions of variants.
  • Merge any number of samples into a project-level VCF.
  • Automatically calculate allele frequencies and other aggregate statistics.
  • Annotate your variants using a variety of popular annotations, or import your own.

Preparing your environment

The vdb uses Amazon S3 and Athena, so it needs network access to AWS. Run it from an AWS EC2 instance, or connect over the Internet, a VPN, or AWS Direct Connect.

To access the biograph vdb commands, set the following environment variables to the values provided by Spiral:

  • VDB_DB: The name of your vdb database
  • VDB_BUCKET: The name of the S3 bucket where your data will be stored

The usual AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION environment variables configure access to AWS services. These will also be provided by Spiral if needed. If you are running biograph vdb commands from an AWS instance deployed with an instance profile with sufficient privileges, these environment variables can be omitted.
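
As a minimal sketch of the setup described above, the two required variables can be exported and checked before running any vdb command. The values shown are placeholders, and check_vdb_env is a hypothetical helper that is not part of biograph.

```shell
# Placeholder values: substitute the names provided by Spiral.
export VDB_DB="my_vdb_database"
export VDB_BUCKET="my-vdb-bucket"

# Hypothetical helper (not part of biograph): report any required
# variable that is unset or empty, and return non-zero if one is missing.
check_vdb_env() {
    missing=0
    for name in VDB_DB VDB_BUCKET; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

check_vdb_env && echo "vdb environment looks complete"
```

The AWS credential variables are deliberately not checked here, since they can be omitted on an instance with a suitable instance profile.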

Quick start

# import some VCFs
$ biograph vdb vcf import HG002.vcf.gz
$ biograph vdb vcf import HG003.vcf.gz
$ biograph vdb vcf import HG004.vcf.gz
...

# show available VCFs
$ biograph vdb vcf list
sample_name  imported_on          refname  variant_count aid                                 
HG002        2021-05-13 14:31:21  grch38   4556521       2e7a0129-13a5-44b6-8594-2fc2e6c80e6c
HG003        2021-05-13 14:32:34  grch38   4350342       a98626d6-758b-468d-add9-fbfbac47d207
HG004        2021-05-13 14:33:43  grch38   4632093       9a174215-8fc5-4c6f-bc4b-134654f65b99

# add samples to a study
$ biograph vdb study create ajtrio
Study 'ajtrio' created

$ biograph vdb study add ajtrio 'HG00*'
Matching VCFs:
  HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c
  HG003: a98626d6-758b-468d-add9-fbfbac47d207
  HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99

Adding 13,538,956 variants from 3 VCFs to study ajtrio

# remove non-PASS variants
$ biograph vdb study filter ajtrio --include 'FILTER = "PASS"'
Applying filter
Study ajtrio:
  variants: 13538956 -> 13264569

# import an annotation
$ biograph vdb anno import -i clinvar.vcf.gz ClinVar 2020-10-03 -r grch38
Importing from clinvar.vcf.gz for build GRCh38
...

# merge and export a project-level VCF with annotations
$ biograph vdb study export ajtrio --anno ClinVar | bgzip > project.vcf.gz
Merging variants for checkpoint 2
Annotating variants with ClinVar
...
Exporting VCF

# project VCF is merged, sorted, annotated, and ready to go
$ ls -lh project.vcf.gz
-rw-rw-r-- 1 ubuntu ubuntu 892M May 13 14:41 project.vcf.gz
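
When there are many VCFs to import, the repeated import commands at the start of the quick start can be wrapped in a loop. The following is a sketch only: import_vcfs is a hypothetical helper, and the BIOGRAPH variable is an assumed override for the executable name that allows a dry run.

```shell
# Hypothetical helper: import each VCF given as an argument,
# stopping at the first failure.
# BIOGRAPH is an assumed override for the executable name;
# setting it to "echo" prints the commands instead of running them.
import_vcfs() {
    for vcf in "$@"; do
        "${BIOGRAPH:-biograph}" vdb vcf import "$vcf" || return 1
    done
}

# Usage:
#   import_vcfs HG002.vcf.gz HG003.vcf.gz HG004.vcf.gz
```

Stopping at the first failure keeps a partial batch easy to diagnose: biograph vdb vcf list then shows which samples were already imported.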

Getting help

The vdb consists of several commands grouped by function. For brief help on any vdb command, run it with no additional parameters. For extended help, include the --help option.

$ biograph vdb
usage: vdb [-h] CMD ...

vdb - the Spiral Variant DataBase

    Subcommands:
        vcf              Import and export VCF files
        study            Gather, filter, and report on variants
        anno             Manage variant annotations

positional arguments:
  CMD         Command to execute
  OPTIONS     Options to pass to the command

optional arguments:
  -h, --help  show this help message and exit

VCF commands

The biograph vdb vcf commands are used to import and manipulate VCF data.

$ biograph vdb vcf
usage: vcf [COMMAND] [options]

Import and export VCF data.

Run any command with --help for additional information.

    import    Import a VCF
    export    Export a VCF

    list      List all available VCFs
    delete    Delete a VCF

    sort      Sort a VCF file

optional arguments:
  -h, --help  show this help message and exit
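
The output of the list command can be post-processed with standard tools. As a sketch, the helper below totals the variant_count column, which is useful for checking the number of variants expected when adding VCFs to a study. It assumes the whitespace-separated layout shown in the quick start, where variant_count is the second-to-last column; total_variants is a hypothetical name.

```shell
# Hypothetical helper: sum the variant_count column of `vcf list` output.
# Assumes one header line and that variant_count is the second-to-last field.
total_variants() {
    awk 'NR > 1 && NF >= 2 { total += $(NF - 1) } END { print total + 0 }'
}

# Demonstration using the listing from the quick start; prints 13538956.
total_variants <<'EOF'
sample_name  imported_on          refname  variant_count aid
HG002        2021-05-13 14:31:21  grch38   4556521       2e7a0129-13a5-44b6-8594-2fc2e6c80e6c
HG003        2021-05-13 14:32:34  grch38   4350342       a98626d6-758b-468d-add9-fbfbac47d207
HG004        2021-05-13 14:33:43  grch38   4632093       9a174215-8fc5-4c6f-bc4b-134654f65b99
EOF

# Usage against a live database:
#   biograph vdb vcf list | total_variants
```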

Study commands

The biograph vdb study commands are used to group VCF variants into a study. Studies allow variants to be filtered or merged, and a checkpoint is created each time a change is made. Merged studies can then be exported back into VCF for further processing by other tools.

$ biograph vdb study
usage: study [COMMAND] [options]

Manage studies in the Spiral Variant DataBase (VDB).

Run any command with --help for additional information.

    create    Create a new study

    list      List all available studies
    show      Show details about a study

    add       Add variants to a study
    filter    Filter variants in a study

    export    Export a study to a VCF file

    freeze    Prevent changes to a study
    unfreeze  Allow changes to a study

    revert    Revert to a previous checkpoint
    delete    Delete a study

optional arguments:
  -h, --help  show this help message and exit

Annotation commands

The biograph vdb anno commands are used to import and maintain variant annotation data in a variety of formats.

$ biograph vdb anno
usage: anno [COMMAND] [options]

Import and export variant annotation data.

Run any command with --help for additional information.

    import    Import an annotation file
    export    Export an annotation

    list      List all available annotations
    delete    Delete an annotation

optional arguments:
  -h, --help  show this help message and exit

Known issues

  • Multiallelic sites should be reported with one alternate allele per line. If your caller reports multiple alleles per line, split them before import with bcftools norm -m-any my.vcf.gz
  • This is a BETA feature under active development. Your feedback is very welcome!
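
To find out whether a VCF needs splitting before import, one option is to count the records whose ALT column lists more than one allele. This is a sketch: count_multiallelic is a hypothetical helper, and it relies only on the VCF convention that ALT is the fifth tab-separated column with alleles separated by commas.

```shell
# Hypothetical helper: read an uncompressed VCF on stdin and print the
# number of records whose ALT column (column 5) contains a comma.
count_multiallelic() {
    awk -F'\t' '!/^#/ && $5 ~ /,/ { n++ } END { print n + 0 }'
}

# Demonstration with two records, one of them multiallelic; prints 1.
printf '%s\n' \
    '##fileformat=VCFv4.2' \
    '#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO' \
    'chr1	100	.	A	C,G	.	PASS	.' \
    'chr1	200	.	T	G	.	PASS	AC=1,2' | count_multiallelic

# Usage on a compressed file:
#   gzip -dc my.vcf.gz | count_multiallelic
```

A result of 0 means the file already has one alternate allele per line. Any other result means the bcftools norm step above is needed first.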

All vdb commands
