VDB
The BioGraph Variant DataBase (vdb) solves the performance and organizational problems of VCF by leveraging cloud-based storage, distributed queries, and modern design principles.
- Organize any number of VCF results for any number of samples.
- Import, merge, filter, annotate, and export VCF files directly from the command line.
- Apply bcftools filter syntax to refine variant selection, using parallel processing to scale to billions of variants.
- Merge any number of samples into a project-level VCF.
- Automatically calculate allele frequencies and other aggregate statistics.
- Annotate your variants using a variety of popular annotations, or import your own.
The vdb uses Amazon S3 and Athena. Using it requires network access to AWS: from an AWS EC2 instance, over the Internet, through a VPN, or via AWS Direct Connect.
To access the biograph vdb commands, set the following environment variables to the values provided by Spiral:
- VDB_DB: The name of your vdb database
- VDB_BUCKET: The name of the S3 bucket where your data will be stored
The usual AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION environment variables configure access to AWS services. These will also be provided by Spiral if needed. If you are running biograph vdb commands from an AWS instance deployed with an instance profile with sufficient privileges, these environment variables can be omitted.
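As a minimal sketch of the setup described above, the two vdb variables can be exported in the shell and checked before running any command. The values shown are placeholders, not real names, and check_vdb_env is a hypothetical helper that is not part of biograph. The AWS credential variables are deliberately not checked, since they can be omitted when an instance profile is in use.

```shell
#!/bin/sh
# Placeholder values: replace with the names provided by Spiral.
export VDB_DB="example_vdb"             # placeholder database name
export VDB_BUCKET="example-vdb-bucket"  # placeholder S3 bucket name

# Hypothetical helper: report any required vdb variable that is unset or empty.
check_vdb_env() {
  missing=""
  for name in VDB_DB VDB_BUCKET; do
    # Look up the variable whose name is stored in $name (POSIX sh compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing required variables:$missing" >&2
    return 1
  fi
  echo "vdb environment looks complete"
}

check_vdb_env
```

Placing the exports in a shell profile or a sourced file keeps them available across sessions.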
# import some VCFs
$ biograph vdb vcf import HG002.vcf.gz
$ biograph vdb vcf import HG003.vcf.gz
$ biograph vdb vcf import HG004.vcf.gz
...
# show available VCFs
$ biograph vdb vcf list
sample_name imported_on refname variant_count aid
HG002 2021-05-13 14:31:21 grch38 4556521 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c
HG003 2021-05-13 14:32:34 grch38 4350342 a98626d6-758b-468d-add9-fbfbac47d207
HG004 2021-05-13 14:33:43 grch38 4632093 9a174215-8fc5-4c6f-bc4b-134654f65b99
# add samples to a study
$ biograph vdb study create ajtrio
Study 'ajtrio' created
$ biograph vdb study add ajtrio 'HG00*'
Matching VCFs:
HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c
HG003: a98626d6-758b-468d-add9-fbfbac47d207
HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
Adding 13,538,956 variants from 3 VCFs to study ajtrio
# remove non-PASS variants
$ biograph vdb study filter ajtrio --include 'FILTER = "PASS"'
Applying filter
Study ajtrio:
variants: 13538956 -> 13264569
# import an annotation
$ biograph vdb anno import -i clinvar.vcf.gz ClinVar 2020-10-03 -r grch38
Importing from clinvar.vcf.gz for build GRCh38
...
# merge and export a project-level VCF with annotations
$ biograph vdb study export ajtrio --anno ClinVar | bgzip > project.vcf.gz
Merging variants for checkpoint 2
Annotating variants with ClinVar
...
Exporting VCF
# project VCF is merged, sorted, annotated, and ready to go
$ ls -lh project.vcf.gz
-rw-rw-r-- 1 ubuntu ubuntu 892M May 13 14:41 project.vcf.gz
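When there are many VCFs to load, the repeated import commands in the walkthrough above can be wrapped in a loop. This is only a sketch: import_all_vcfs is a hypothetical helper name, and it assumes each file can be imported with the same plain biograph vdb vcf import call shown above, with no extra options.

```shell
# Hypothetical helper: import every .vcf.gz file found in a directory.
import_all_vcfs() {
  dir="${1:-.}"
  for vcf in "$dir"/*.vcf.gz; do
    # Skip the unexpanded pattern when the directory has no matching files.
    [ -e "$vcf" ] || continue
    # Stop at the first failed import so problems are not silently skipped.
    biograph vdb vcf import "$vcf" || return 1
  done
}
```

For example, import_all_vcfs ./vcfs would import each matching file in ./vcfs in alphabetical order.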
The vdb consists of several commands grouped by function. For brief help on any vdb command, run it with no additional parameters. For extended help, include the --help option.
$ biograph vdb
usage: vdb [-h] CMD ...
vdb - the Spiral Variant DataBase
Subcommands:
vcf Import and export VCF files
study Gather, filter, and report on variants
anno Manage variant annotations
positional arguments:
CMD Command to execute
OPTIONS Options to pass to the command
optional arguments:
-h, --help show this help message and exit
The biograph vdb vcf commands are used to import and manipulate VCF data.
$ biograph vdb vcf
usage: vcf [COMMAND] [options]
Import and export VCF data.
Run any command with --help for additional information.
import Import a VCF
export Export a VCF
list List all available VCFs
delete Delete a VCF
sort Sort a VCF file
optional arguments:
-h, --help show this help message and exit
The biograph vdb study commands are used to group VCF variants into a study. Studies allow variants to be filtered or merged, and a checkpoint is created each time a change is made. Merged studies can then be exported back into VCF for further processing by other tools.
- study create
- study list
- study show
- study add
- study filter
- study export
- study freeze
- study unfreeze
- study revert
- study delete
$ biograph vdb study
usage: study [COMMAND] [options]
Manage studies in the Spiral Variant DataBase (VDB).
Run any command with --help for additional information.
create Create a new study
list List all available studies
show Show details about a study
add Add variants to a study
filter Filter variants in a study
export Export a study to a VCF file
freeze Prevent changes to a study
unfreeze Allow changes to a study
revert Revert to a previous checkpoint
delete Delete a study
optional arguments:
-h, --help show this help message and exit
The biograph vdb anno commands are used to import and maintain variant annotation data in a variety of formats.
$ biograph vdb anno
usage: anno [COMMAND] [options]
Import and export variant annotation data.
Run any command with --help for additional information.
import Import an annotation file
export Export an annotation
list List all available annotations
delete Delete an annotation
optional arguments:
-h, --help show this help message and exit
- Multiallelic sites should be reported one per line. If your caller reports multiple alleles per line, split them with bcftools norm -m-any my.vcf.gz
- This is a BETA feature under active development. Your feedback is very welcome!
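The multiallelic note above can be combined with an import in one step. The following is a sketch under assumptions: split_and_import is a hypothetical helper name, the output file name is chosen by the caller, and the bcftools -Oz and -o options are used to write a bgzip-compressed file instead of printing to standard output.

```shell
# Hypothetical helper: split multiallelic records, then import the result.
split_and_import() {
  in_vcf="$1"
  out_vcf="$2"
  # -m-any splits multiallelic sites into one record per alternate allele;
  # -Oz writes compressed VCF to the file named by -o.
  bcftools norm -m-any -Oz -o "$out_vcf" "$in_vcf" || return 1
  biograph vdb vcf import "$out_vcf"
}
```

For example, split_and_import my.vcf.gz my.split.vcf.gz leaves the original file untouched and imports only the split copy.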