study filter

Rob Flickenger edited this page Aug 9, 2021 · 1 revision

The biograph vdb study filter command creates a new study checkpoint by applying a filter to the variants in the most recent checkpoint. Each call must include exactly one --include/-i or --exclude/-e option. With --include, only variants that match the filter are copied to the new checkpoint. With --exclude, variants that match the filter are left out of the new checkpoint.

The vdb study filter syntax is flexible and similar to bcftools filter.

Filter syntax

Filter terms may be combined using (), AND, and OR. Common INFO and FORMAT fields can be referred to by name alone (e.g. GT, SVTYPE), and any field can be looked up explicitly with its prefix (e.g. FMT/GT, INFO/SVTYPE). String values should be enclosed in ' or " quotes. Both are equivalent, so choose whichever quote works best with your shell's quoting rules.
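As a minimal sketch of the quoting choice described above (plain shell, no BioGraph call), the hypothetical helper show_arg prints an argument exactly as a command would receive it. The helper and the svtype variable are illustrative assumptions, not part of the vdb tooling.

```shell
# Hypothetical helper: print one argument exactly as a command would receive it.
show_arg() { printf '%s\n' "$1"; }

# Double quotes outside, single quotes inside: the shell still expands $svtype.
svtype="INS"
show_arg "GT = '0/1' AND SVTYPE = '$svtype'"

# Single quotes outside, double quotes inside: the shell expands nothing.
show_arg 'GT = "0/1" AND SVTYPE = "INS"'
```

Either string could then be passed to --include or --exclude. The outer double-quoted form is the convenient one when part of the filter comes from a shell variable.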

All operators are case insensitive.

Logical operators

 AND && : logical AND
OR , || : logical OR
      ! : NOT (negation)

Comparison operators

= == : equality
  != : inequality
   > : greater than
  >= : greater than or equal to
   < : less than
  <= : less than or equal to

Arithmetic operators

  + : addition
  - : subtraction
  * : multiplication
  / : division
( ) : parenthesis grouping
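The operators above can be combined into longer expressions. The following is a sketch only: join_terms is a hypothetical shell helper (not part of BioGraph) that wraps each term in parentheses before joining, so the result does not depend on any assumption about operator precedence. The field values ('1/1', 'DEL', 30) are illustrative.

```shell
# join_terms OP TERM... : wrap each term in () and join the terms with OP.
# Hypothetical helper for building a filter string; it does not call biograph.
join_terms() {
  op=$1
  shift
  out=""
  for term in "$@"; do
    if [ -z "$out" ]; then
      out="($term)"
    else
      out="$out $op ($term)"
    fi
  done
  printf '%s\n' "$out"
}

filter=$(join_terms AND "GT = '0/1' OR GT = '1/1'" "SVTYPE = 'DEL'" "QUAL >= 30")
printf '%s\n' "$filter"
```

The resulting string could be passed as the argument to --include or --exclude.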

VCF columns

         CHROM : chromosome
           POS : position
            ID : variant ID
           REF : reference sequence
           ALT : alternate sequence
          QUAL : quality
   FILT FILTER : filter
        INFO/? : INFO field lookup, e.g. INFO/SVTYPE
FMT/? FORMAT/? : FORMAT field lookup, e.g. FMT/GT

Other terms

Common fields can be looked up by name, without the INFO/ or FMT/ prefix.

Common FORMAT field integers:
DP DV GQ LAALTSEQLEN LALANCH LARANCH LAREFSPAN LASCORE NUMASM OV PDP PI RC

Common FORMAT field floats:
LAALTGC LAREFGC

Common FORMAT field strings:
AC AD DC DCC DDC DMO DS DXO EC GT MC MO MP NR PAD PG PL UC UCC UDC UMO US UXO XC XO

Common INFO field integers:
SVLEN END

Common INFO field strings:
SVTYPE

See Missingness filters below for examples of how to use these terms:

N_MISS N_MISSING : number of samples missing this variant
F_MISS F_MISSING : fraction of samples missing this variant

S_MISS SAMPLE_MISS
S_MISSING SAMPLE_MISSING : fraction of variants missing per-sample

Filter examples

The following examples are not exhaustive but demonstrate common filter query syntax.

INFO or FORMAT lookups

# Before filtering
$ biograph vdb study show my_study
      study_name: my_study
      created_on: 2021-05-18 13:02:51
           build: GRCh38
         refname: grch38

checkpoints:
   1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99

sample_name      variant_count
HG002            4556521
HG003            4350342
HG004            4632093

# Filter for het only. GT is a common field, short for FMT/GT
$ biograph vdb study filter my_study --include "GT = '0/1'"
Applying filter
Study my_study:
  variants: 13538956 -> 7148030

# After filtering
$ biograph vdb study show my_study
      study_name: my_study
      created_on: 2021-05-18 13:02:51
           build: GRCh38
         refname: grch38

checkpoints:
   1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
   2: include GT = '0/1'

sample_name      variant_count
HG002            2464954
HG003            2183539
HG004            2499537

# Keep only SV inserts. SVTYPE is common, short for INFO/SVTYPE.
# Since we already filtered for het, these are all het SV inserts.
$ biograph vdb study filter my_study --include "SVTYPE = 'INS'"
Applying filter
Study my_study:
  variants: 7148030 -> 50365

$ biograph vdb study show my_study
      study_name: my_study
      created_on: 2021-05-18 13:02:51
           build: GRCh38
         refname: grch38

checkpoints:
   1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
   2: include GT = '0/1'
   3: include SVTYPE = 'INS'

sample_name      variant_count
HG002            16447
HG003            15883
HG004            18035

# Combining filter terms in one expression produces equivalent results, but is
# faster and creates only one new checkpoint.
$ biograph vdb study revert my_study --checkpoint 1
Removing S3 data for 'my_study' at checkpoint 3
Dropping SQL partitions for 'my_study' at checkpoint 3
Removing S3 data for 'my_study' at checkpoint 2
Dropping SQL partitions for 'my_study' at checkpoint 2
Study 'my_study' reverted to checkpoint 1

$ biograph vdb study filter my_study --include "GT = '0/1' AND SVTYPE = 'INS'"
Applying filter
Study my_study:
  variants: 13538956 -> 50365

$ biograph vdb study show my_study
      study_name: my_study
      created_on: 2021-05-18 13:02:51
           build: GRCh38
         refname: grch38

checkpoints:
   1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
   2: include GT = '0/1' AND SVTYPE = 'INS'

sample_name      variant_count
HG002            16447
HG003            15883
HG004            18035
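The equivalence between two successive --include filters and a single filter joined with AND can be illustrated on toy data. This sketch uses awk in place of the vdb, and the records are made up for illustration; it shows only the set logic, not how the vdb evaluates filters.

```shell
# Toy records (sample, GT, SVTYPE); the values are made up for illustration.
records='HG002 0/1 INS
HG002 1/1 INS
HG003 0/1 DEL
HG004 0/1 INS'

# Two successive include filters, like applying GT and then SVTYPE separately.
two_step=$(printf '%s\n' "$records" | awk '$2 == "0/1"' | awk '$3 == "INS" { n++ } END { print n + 0 }')

# One combined filter joined with AND.
one_step=$(printf '%s\n' "$records" | awk '$2 == "0/1" && $3 == "INS" { n++ } END { print n + 0 }')

printf 'two-step: %s, one-step: %s\n' "$two_step" "$one_step"
```

Both approaches keep the same records; the combined form simply does it in one pass.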

VCF FILTER field

# before filtering
$ biograph vdb study show my_study
      study_name: my_study
      created_on: 2021-05-13 14:50:19
           build: GRCh38
         refname: grch38

checkpoints:
   1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99

sample_name      variant_count
HG002            4556521
HG003            4350342
HG004            4632093

# filter for PASS only. This is equivalent to --exclude "FILTER != 'PASS'"
$ biograph vdb study filter my_study --include 'FILTER = "PASS"'
Applying filter
Study my_study:
  variants: 13538956 -> 13264569

# after filtering
$ biograph vdb study show my_study
      study_name: my_study
      created_on: 2021-05-13 14:50:19
           build: GRCh38
         refname: grch38

checkpoints:
   1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
   2: include FILTER = "PASS"

sample_name      variant_count
HG002            4450296
HG003            4271571
HG004            4542702

VCF QUAL field

$ biograph vdb study filter my_study --include 'QUAL > 60'

VCF chromosome and position

The vdb uses EBI notation (1, 2, 3, ... X, Y, MT) for the autosomes, sex chromosomes, and mitochondria, regardless of the naming used in the original VCF. All other chromosome names (alternates, decoys, etc.) are used verbatim. Chromosome names are automatically translated back to their original form when exporting to VCF with study export, but EBI notation should be used in vdb study filter expressions.

$ biograph vdb study filter my_study --include 'chrom = "5" and pos > 20000'
Applying filter
Study ajtrio:
  variants: 12468296 -> 763212

If a filter would exclude all available variants, the study is automatically reverted to the previous checkpoint. This can happen if the filtering criteria are too strict, or if a chromosome was not specified in EBI notation.

$ biograph vdb study filter my_study --include 'chrom = "chr9"'
Applying filter
This filter removed all variants from the study. Rolling back to previous checkpoint.
Removing S3 data for 'my_study' at checkpoint 5
Dropping SQL partitions for 'my_study' at checkpoint 5 
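To avoid the empty result shown above, a chromosome name can be normalized before it is placed in a filter. The following is a minimal sketch, assuming UCSC-style input names (chr1 ... chr22, chrX, chrY, chrM). The to_ebi helper is hypothetical and is not part of BioGraph; the vdb's own translation may handle more cases.

```shell
# to_ebi NAME : convert a UCSC-style primary chromosome name to EBI notation.
# Hypothetical helper; any other name (alternates, decoys, etc.) is left verbatim.
to_ebi() {
  case $1 in
    chrM|chrMT)
      printf 'MT\n' ;;
    chr[1-9]|chr1[0-9]|chr2[0-2]|chrX|chrY)
      printf '%s\n' "${1#chr}" ;;   # strip the leading "chr"
    *)
      printf '%s\n' "$1" ;;
  esac
}

chrom=$(to_ebi chr9)
printf "chrom = '%s'\n" "$chrom"
```

The printed expression uses 9 rather than chr9, which is the form the filter expects.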

The vdb stores reference positions internally as 0-based coordinates, not 1-based. vdb study filter and VCF import / export handle this translation automatically. Specify positions as they appear in VCF, starting at 1.

Missingness filters

When variant calls are observed across all samples in a study, not every variant will be present in every individual.

To filter variants that are missing in a specific number of individuals, use N_MISS.

# only include variants that are missing from five or fewer individuals
$ biograph vdb study filter my_study --include 'N_MISS <= 5'

To filter variants that are missing in a fraction of the total number of individuals, use F_MISS.

# only include variants that are missing from fewer than two thirds of individuals
$ biograph vdb study filter my_study --include 'F_MISS < 0.667'

To filter samples that are missing a fraction of the total number of variants in the study, use SAMPLE_MISS.

# drop samples that are missing more than 33% of all variants
$ biograph vdb study filter my_study --exclude 'SAMPLE_MISS > 0.33'
Applying filter
Study my_study:
  variants: 13538956 -> 4632093
   samples: [HG002, HG003, HG004] -> [HG004]
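The three missingness terms can be worked through on a toy presence matrix. This is a sketch that follows the definitions given under Other terms; how the vdb computes these values internally is not described here, and the sample names, variant names, and values are made up for illustration.

```shell
# Toy presence matrix: one row per variant, one column per sample
# (1 = variant called in that sample, 0 = missing). Illustrative data only.
matrix='variant S1 S2 S3
v1 1 1 1
v2 1 0 0
v3 0 1 1
v4 1 1 0'

# Per variant: N_MISS = number of samples missing it, F_MISS = N_MISS / samples.
variant_miss() {
  printf '%s\n' "$matrix" | awk '
    NR > 1 {
      n = 0
      for (i = 2; i <= NF; i++) if ($i == 0) n++
      printf "%s N_MISS=%d F_MISS=%.3f\n", $1, n, n / (NF - 1)
    }'
}

# Per sample: SAMPLE_MISS = fraction of all variants that the sample is missing.
sample_miss() {
  printf '%s\n' "$matrix" | awk '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; cols = NF; next }
    { rows++; for (i = 2; i <= NF; i++) if ($i == 0) miss[i]++ }
    END { for (i = 2; i <= cols; i++) printf "%s SAMPLE_MISS=%.3f\n", name[i], miss[i] / rows }'
}

variant_miss
sample_miss
```

In this toy data, v2 is missing from two of three samples, and S3 is missing two of the four variants.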

Getting more help

$ biograph vdb study filter --help
usage: biograph vdb study filter [-h] (-i INCLUDE | -e EXCLUDE) study_name

Filter variants in a study

Filter variants in a study using bcftools filter syntax. A new study
checkpoint will be created.

Use --include to include variants that match the filter.

Use --exclude to exclude variants that match the filter.

Examples:

 # PASS only
 $ biograph vdb study filter my_study --exclude "FILTER != 'PASS'"

 # High quality hets on chr16
 $ biograph vdb study filter my_study --include "chrom = '16' AND GT = 0/1 AND qual > 50"

 # Per-variant missingness
 $ biograph vdb study filter my_study --include "F_MISS > 0.2"

 # Per-sample missingness
 $ biograph vdb study filter my_study --exclude "SAMPLE_MISS > 0.1"

positional arguments:
  study_name            Name of the study

optional arguments:
  -h, --help            show this help message and exit
  -i INCLUDE, --include INCLUDE
                        Include only variants that match these criteria
  -e EXCLUDE, --exclude EXCLUDE
                        Exclude all variants that match these criteria