anno import

Rob Flickenger edited this page Aug 9, 2021 · 1 revision

Annotations can be applied to a study when exporting to VCF. Use biograph vdb anno import to import a variant annotation source. Annotation files may be in GFF, GTF, or VCF format. Common annotation sources include ClinVar, dbSNP, and Ensembl gene annotations, but any custom annotation data can be used as long as it includes at least a chromosome name and a range of positions.

This command is similar to vcf import. The annotation data is locally converted to parquet format and then uploaded to S3 and Athena.
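
Because custom annotation data needs at least a chromosome name and a range of positions, it can help to sanity-check a hand-built file before importing it. The following is a minimal sketch, not part of biograph: `check_gff_ranges` is a hypothetical helper, and it assumes an uncompressed file in the standard nine-column, tab-separated GFF/GTF layout (sequence name in column 1, start and end in columns 4 and 5). Whether biograph applies the same checks during import is not stated here.

```shell
# Hypothetical pre-flight check (not a biograph command).
# Reports data lines that lack a sequence name or a valid start..end range.
# Assumes an uncompressed, tab-separated, nine-column GFF/GTF file.
check_gff_ranges() {
  awk -F'\t' '
    /^#/ || NF == 0 { next }   # skip comment and blank lines
    NF < 9 || $1 == "" || $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ || $4 + 0 > $5 + 0 {
      printf "line %d: missing contig or invalid range\n", NR
      bad = 1
    }
    END { exit bad + 0 }       # non-zero exit status if any line failed
  ' "$1"
}
```

The function exits with a non-zero status when any line fails, so it can gate an import, for example `check_gff_ranges custom.gff && biograph vdb anno import -i custom.gff ...` (the file name is a placeholder).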

  • --input / -i: The path to the annotation file.
  • --format / -f: The annotation file format (vcf, gtf, or gff). This is only required if biograph cannot determine it automatically.
  • --refhash / -r: A refhash name or hash. This can often be determined from the annotation file headers, but must be specified if reference contigs are not available. This determines the reference build, which is required to ensure that annotations are only applied to studies using a compatible reference.
  • --description / -d: An arbitrary string description of this annotation.
  • --aid: Use a pre-generated aid instead of generating a random identifier. It must be in UUID4 format and must be globally unique. Use with caution.
  • --output / -o: The output directory for generating parquet data files. Default is the current directory.
  • --keep-data: The parquet data is normally deleted after upload. Use this switch to retain it for further local processing.
  • --threads: Maximum number of threads (default: use all processors).
  • --tmp: Temporary directory (default: $TMPDIR or /tmp).
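
Since --aid only accepts identifiers in UUID4 format, a quick format check before passing a pre-generated value can catch mistakes early. This is a small sketch under stated assumptions: `is_uuid4` is a hypothetical helper rather than a biograph feature, and it only verifies the textual layout (version digit 4, variant digit 8, 9, a, or b). It cannot verify global uniqueness.

```shell
# Hypothetical helper (not a biograph command): succeed only if the argument
# is laid out like a version 4 UUID. Expects a single-line string.
is_uuid4() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$'
}
```

For example, `is_uuid4 "$AID" && biograph vdb anno import --aid "$AID" ...` would skip the import when the shell variable AID (a placeholder) is malformed. One way to generate a suitable value is `python3 -c 'import uuid; print(uuid.uuid4())'`, assuming Python 3 is installed.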

For example, to import ClinVar version 2020-10-03 for GRCh38:

$ biograph vdb anno import -i clinvar38.vcf.gz ClinVar 2020-10-03 --refhash grch38
5669322b-dcc1-4aa6-960a-f68e20c64cf9
Importing from clinvar38.vcf.gz for build GRCh38
Conversion complete. Verifying parquet.
776050 annotations converted to parquet
Uploading to S3...
Validating upload...
Annotation imported, aid=5669322b-dcc1-4aa6-960a-f68e20c64cf9

$ biograph vdb anno list
anno_name    version      imported_on          build      annotations   aid                                  description
ClinVar      2020-10-03   2021-05-27 14:32:58  GRCh38     776050        5669322b-dcc1-4aa6-960a-f68e20c64cf9

Be sure to import annotations for each genetic reference you intend to use. The annotation matching each study's reference build will be applied when running study export.

$ biograph vdb anno import -i clinvar37.vcf.gz ClinVar 2020-10-03 --refhash hs37d5
b480aa60-b81d-4741-8e5a-616cd0f01c5f
Importing from clinvar37.vcf.gz for build GRCh37
Conversion complete. Verifying parquet.
775850 annotations converted to parquet
Uploading to S3...
Validating upload...
Annotation imported, aid=b480aa60-b81d-4741-8e5a-616cd0f01c5f

$ biograph vdb anno list
anno_name    version      imported_on          build      annotations   aid                                  description
ClinVar      2020-10-03   2021-05-27 14:35:56  GRCh37     775850        b480aa60-b81d-4741-8e5a-616cd0f01c5f
ClinVar      2020-10-03   2021-05-27 14:32:58  GRCh38     776050        5669322b-dcc1-4aa6-960a-f68e20c64cf9
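
When the same annotation source has to be imported for several references, the two imports above can be driven from one list. The sketch below is a dry run: `anno_import_cmds` is a hypothetical helper that only prints the commands, pairing each input file with its refhash in a `file:refhash` argument. It assumes file names contain no colons or whitespace.

```shell
# Hypothetical dry-run helper (not a biograph command): print one
# "biograph vdb anno import" command per <file>:<refhash> pair.
# Usage: anno_import_cmds NAME VERSION FILE:REFHASH...
anno_import_cmds() {
  name=$1
  version=$2
  shift 2
  for pair in "$@"; do
    # ${pair%%:*} is the file name, ${pair##*:} is the refhash
    printf 'biograph vdb anno import -i %s %s %s --refhash %s\n' \
      "${pair%%:*}" "$name" "$version" "${pair##*:}"
  done
}

# Prints the two import commands used in the examples above.
anno_import_cmds ClinVar 2020-10-03 clinvar38.vcf.gz:grch38 clinvar37.vcf.gz:hs37d5
```

After reviewing the printed commands, they can be run by piping the output to `sh`. Printing first keeps the sketch safe to try without touching S3 or Athena.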

Getting more help

$ biograph vdb anno import --help
usage: biograph vdb anno import [-h] [-i INPUT] [-f {vcf,gtf,gff}]
                                [-r REFHASH] [-d DESCRIPTION] [-o OUTPUT]
                                [--aid AID] [--keep-data] [--threads THREADS]
                                [--tmp TMP]
                                anno_name version

Import variant annotation data in GFF, GTF, or VCF format.

positional arguments:
  anno_name             Name for this annotation (eg. ClinVar)
  version               Version of this annotation (eg. 2020-10-03)

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input filename (default: STDIN)
  -f {vcf,gtf,gff}, --format {vcf,gtf,gff}
                        Input contains annotations in this format
  -r REFHASH, --refhash REFHASH
                        Explicit reference name or hash (default: extract from
                        input file if possible)
  -d DESCRIPTION, --description DESCRIPTION
                        Free form description of this annotation
  -o OUTPUT, --output OUTPUT
                        Output directory prefix (.)
  --aid AID             Unique GUID (default: autogenerate)
  --keep-data           Keep a local copy of the converted import data
                        (default: delete after upload)
  --threads THREADS     Maximum number of threads (48)
  --tmp TMP             Temporary directory (/raid2/rob/tmp)