vcf import


To import a VCF file into the vdb, use biograph vdb vcf import:

$ biograph vdb vcf import HG002.vcf.gz
2e7a0129-13a5-44b6-8594-2fc2e6c80e6c
Importing HG002.vcf.gz
Conversion complete. Verifying parquet.
4556521 variants converted to parquet
Uploading to S3...
Validating upload...
VCF imported, aid=2e7a0129-13a5-44b6-8594-2fc2e6c80e6c

VCF files may optionally be compressed with bgzip or gzip. If the filename starts with s3://, it is automatically downloaded from S3 prior to import.
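For example, a VCF stored in S3 can be imported directly (the bucket and path below are placeholders):

$ biograph vdb vcf import s3://your-bucket/path/to/HG002.vcf.gz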

The import process takes less than a minute for a typical WGS VCF containing about 4.5M variants. The VCF is converted locally to Apache Parquet format, uploaded to S3, and then imported into Athena.

References and chromosome names

The genetic reference used for variant calling must be known prior to import. An attempt is made to derive the reference from the VCF header using biograph refhash. If the reference cannot be determined automatically, it must be provided explicitly with the --refhash (or -r) option.
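If automatic detection fails, name the reference explicitly at import time. GRCh38 is shown here only as an illustrative value; any reference name or hash recognized by biograph refhash should be acceptable:

$ biograph vdb vcf import HG002.vcf.gz --refhash GRCh38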

The vdb uses the genetic reference to enforce rules for data consistency:

  • VCFs may only be added to a study if they use the same reference
  • Annotations may only be applied to a study if it uses a compatible reference from the same family (GRCh37, GRCh38, etc.)
  • Chromosome names are stored internally in EBI format (1, 2, 3) and are automatically translated to/from the native reference format as needed.

Capturing the AID

A random analysis identifier (aid) is generated at import time and is used to uniquely identify entries from each imported VCF. The aid is printed to STDOUT and may be captured for further pipeline use if desired:

$ AID=$(biograph vdb vcf import HG002.vcf.gz)
Importing HG002.vcf.gz
...
$ echo $AID
5c159228-866c-48b0-ab98-348ae6b04196

Importing several samples in parallel

If your working environment has 32 (or more) processors and 64GB (or more) of memory, you can save time by importing several VCFs in parallel using xargs and a simple shell script.

$ echo 'biograph vdb vcf import $*' > go
$ chmod +x go
# six at a time is reasonable on an r5d.8xlarge
$ ls -d /path/to/lots/of/VCFs/* | xargs -n1 -P6 ./go
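Because each import prints its aid to STDOUT (with progress messages displayed separately, as in the capture example above), redirecting the output of a parallel run should collect all of the aids in one place. The filename here is only an example:

$ ls -d /path/to/lots/of/VCFs/* | xargs -n1 -P6 ./go > aids.txt
$ wc -l aids.txt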

Additional import parameters

  • --sample or -s: Change the sample name for this VCF (default: extract the sample name from the VCF file)
  • --description or -d: Add an optional description for this VCF
  • --refhash or -r: Provide an explicit genetic reference name
  • --aid: Use a pre-generated aid instead of generating a random identifier. It must be a version 4 UUID and must be globally unique. Use with caution.
  • --output or -o: The output directory for generating parquet data files. Default is the current directory.
  • --keep-data: The parquet data is normally deleted after upload. Use this switch to retain it for further local processing if desired.
  • --threads: Maximum number of threads (default: use all processors)
  • --tmp: Temporary directory (default: $TMPDIR or /tmp)
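Several of these options may be combined in a single invocation. The sample name, description, and output directory below are illustrative values only:

$ biograph vdb vcf import HG002.vcf.gz -s HG002 -d 'HG002 WGS calls' --keep-data -o ./parquet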

Getting more help

$ biograph vdb vcf import --help
usage: import [-h] [-s SAMPLE] [-d DESCRIPTION] [-r REFHASH] [--aid AID]
                [-o OUTPUT] [--keep-data] [--threads THREADS] [--tmp TMP]
                [input]

Import a VCF file. It may be optionally gzip/bgzip compressed.

positional arguments:
  input                 Input VCF filename

optional arguments:
  -h, --help            show this help message and exit
  -s SAMPLE, --sample SAMPLE
                        Sample name for variants (default: extract from VCF file)
  -d DESCRIPTION, --description DESCRIPTION
                        Free form description of this VCF
  -r REFHASH, --refhash REFHASH
                        Explicit reference name or hash (default: extract from
                        input file)
  --aid AID             Unique GUID (default: autogenerate)
  -o OUTPUT, --output OUTPUT
                        Output directory prefix (.)
  --keep-data           Keep a local copy of the converted import data
                        (default: delete after upload)
  --threads THREADS     Maximum number of threads (auto)
  --tmp TMP             Temporary directory (/tmp)