vcf import
To import a VCF file into the vdb, use biograph vdb vcf import:
$ biograph vdb vcf import HG002.vcf.gz
2e7a0129-13a5-44b6-8594-2fc2e6c80e6c
Importing HG002.vcf.gz
Conversion complete. Verifying parquet.
4556521 variants converted to parquet
Uploading to S3...
Validating upload...
VCF imported, aid=2e7a0129-13a5-44b6-8594-2fc2e6c80e6c
VCF files may optionally be compressed with bgzip or gzip. If the filename starts with s3://, it will automatically be downloaded from S3 prior to import.
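For example, to import directly from S3 (the bucket and path shown here are placeholders):
$ biograph vdb vcf import s3://my-bucket/giab/HG002.vcf.gz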
The import process takes less than a minute for a typical WGS VCF containing about 4.5M variants. The VCF is converted locally to Apache parquet format, uploaded to S3, then imported into Athena.
The genetic reference used for variant calling must be known prior to import. An attempt is made to derive the reference from the VCF header information using biograph refhash. If the reference cannot be determined, it must be provided explicitly using the --refhash (or -r) option.
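For example, if the reference cannot be derived from the header, it may be supplied by name or hash. GRCh38 is shown here as an assumed value; use whatever reference your variants were called against:
$ biograph vdb vcf import --refhash GRCh38 HG002.vcf.gz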
The vdb uses the genetic reference to enforce rules for data consistency:
- VCFs may only be added to a study if they use the same reference
- Annotations may only be applied to a study if it uses a compatible reference from the same family (GRCh37, GRCh38, etc.)
- Chromosome names are stored internally in EBI format (1, 2, 3) and are automatically translated to/from the native reference format as needed.
A random analysis identifier (aid) is generated at import time and is used to uniquely identify entries from each imported VCF. The aid is printed to STDOUT and may be captured for further pipeline use if desired:
$ AID=$(biograph vdb vcf import HG002.vcf.gz)
Importing HG002.vcf.gz
...
$ echo $AID
5c159228-866c-48b0-ab98-348ae6b04196
If your working environment has 32 (or more) processors and 64GB (or more) of memory, you can save time by importing several VCFs in parallel using xargs and a simple shell script.
$ echo 'biograph vdb vcf import $*' > go
$ chmod +x go
# six at a time is reasonable on an r5d.8xlarge
$ ls -d /path/to/lots/of/VCFs/* | xargs -n1 -P6 ./go
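Since the aid is the only output written to STDOUT, a small variation on the same script can also record each aid alongside its source file. This is a sketch only; aids.txt is an arbitrary file name:
# write "<filename> <aid>" to aids.txt for each import
$ echo 'echo "$1 $(biograph vdb vcf import "$1")" >> aids.txt' > go
$ chmod +x go
$ ls -d /path/to/lots/of/VCFs/* | xargs -n1 -P6 ./go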
The following options are available:
- --sample or -s: Change the sample name for this VCF (default: extract the sample name from the VCF file)
- --description or -d: Add an optional description for this VCF
- --refhash or -r: Provide an explicit genetic reference name
- --aid: Use a pre-generated aid instead of generating a random identifier. It must use the UUID4 GUID format and must be globally unique. Use with caution.
- --output or -o: The output directory for generating parquet data files. Default is the current directory.
- --keep-data: The parquet data is normally deleted after upload. Use this switch to retain it for further local processing if desired.
- --threads: Maximum number of threads (default: use all processors)
- --tmp: Temporary directory (default: $TMPDIR or /tmp)
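Several options may be combined in a single invocation. For example (the sample name, description, and output directory below are illustrative):
$ biograph vdb vcf import -s HG002 -d 'GIAB HG002 run' --keep-data -o ./parquet HG002.vcf.gz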
$ biograph vdb vcf import --help
usage: import [-h] [-s SAMPLE] [-d DESCRIPTION] [-r REFHASH] [--aid AID]
[-o OUTPUT] [--keep-data] [--threads THREADS] [--tmp TMP]
[input]
Import a VCF file. It may be optionally gzip/bgzip compressed.
positional arguments:
input Input VCF filename
optional arguments:
-h, --help show this help message and exit
-s SAMPLE, --sample SAMPLE
Sample name for variants (default: extract from VCF file)
-d DESCRIPTION, --description DESCRIPTION
Free form description of this VCF
-r REFHASH, --refhash REFHASH
Explicit reference name or hash (default: extract from
input file)
--aid AID Unique GUID (default: autogenerate)
-o OUTPUT, --output OUTPUT
Output directory prefix (.)
--keep-data Keep a local copy of the converted import data
(default: delete after upload)
--threads THREADS Maximum number of threads (auto)
--tmp TMP Temporary directory (/tmp)