study filter
The biograph vdb study filter command creates a new study checkpoint with filters applied to the variants from the most recent checkpoint. Each call to filter must include exactly one --include/-i or --exclude/-e option. If --include is specified, variants that match the filter are copied to the new checkpoint. If --exclude is specified, variants that match the filter are excluded from the new checkpoint.
The vdb study filter syntax is flexible and similar to bcftools filter.
Filter terms may be combined using (), AND, and OR. Common INFO and FORMAT fields can be referred to directly by name (e.g. GT, SVTYPE), and any field can be looked up by its full path (FMT/GT, INFO/SVTYPE). String values should be enclosed in ' or " quotes (both are equivalent, so choose the quote that most easily accommodates your shell requirements).
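The choice of quote matters mostly when a filter value comes from a shell variable. The following is a minimal sketch, assuming a POSIX-style shell; the variable names `gt` and `expr` are hypothetical and only used for illustration. The outer double quotes let the shell expand the variable, while the inner single quotes are passed through to the filter unchanged.

```shell
# Hypothetical variable holding the genotype to match.
gt='0/1'

# Outer double quotes allow ${gt} to expand; the inner single quotes
# become part of the filter expression itself.
expr="GT = '${gt}'"

# Show the exact string that would be passed to --include.
printf '%s\n' "$expr"   # prints: GT = '0/1'

# The expression could then be used as:
#   biograph vdb study filter my_study --include "$expr"
```

With single outer quotes the variable would not be expanded, so that form is best kept for literal filters such as 'FILTER = "PASS"'.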
All operators are case insensitive.
AND && : logical AND
OR || : logical OR
! : NOT (negation)
= == : equality
!= : inequality
> : greater than
>= : greater or equal to
< : less than
<= : less than or equal to
+ : addition
- : subtraction
* : multiplication
/ : division
( ) : parenthesis grouping
CHROM : chromosome
POS : position
ID : variant ID
REF : reference sequence
ALT : alternate sequence
QUAL : quality
FILT FILTER : filter
INFO/? : INFO field lookup, e.g. INFO/SVTYPE
FMT/? FORMAT/? : FORMAT field lookup, e.g. FMT/GT
Common fields can be looked up automatically without requiring INFO/ or FMT/
Common FORMAT field integers:
DP DV GQ LAALTSEQLEN LALANCH LARANCH LAREFSPAN LASCORE NUMASM OV PDP PI RC
Common FORMAT field floats:
LAALTGC LAREFGC
Common FORMAT field strings:
AC AD DC DCC DDC DMO DS DXO EC GT MC MO MP NR PAD PG PL UC UCC UDC UMO US UXO XC XO
Common INFO field integers:
SVLEN END
Common INFO field strings:
SVTYPE
See Missingness filters below for examples of how to use these terms:
N_MISS N_MISSING : number of samples missing this variant
F_MISS F_MISSING : fraction of samples missing this variant
S_MISS SAMPLE_MISS S_MISSING SAMPLE_MISSING : fraction of variants missing per-sample
The following examples are not exhaustive but demonstrate common filter query syntax.
# Before filtering
$ biograph vdb study show my_study
study_name: my_study
created_on: 2021-05-18 13:02:51
build: GRCh38
refname: grch38
checkpoints:
1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
sample_name variant_count
HG002 4556521
HG003 4350342
HG004 4632093
# Filter for het only. GT is a common field, short for FMT/GT
$ biograph vdb study filter my_study --include "GT = '0/1'"
Applying filter
Study my_study:
variants: 13538956 -> 7148030
# After filtering
$ biograph vdb study show my_study
study_name: my_study
created_on: 2021-05-18 13:02:51
build: GRCh38
refname: grch38
checkpoints:
1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
2: include GT = '0/1'
sample_name variant_count
HG002 2464954
HG003 2183539
HG004 2499537
# Keep only SV inserts. SVTYPE is common, short for INFO/SVTYPE.
# Since we already filtered for het, these are all het SV inserts.
$ biograph vdb study filter my_study --include "SVTYPE = 'INS'"
Applying filter
Study my_study:
variants: 7148030 -> 50365
$ biograph vdb study show my_study
study_name: my_study
created_on: 2021-05-18 13:02:51
build: GRCh38
refname: grch38
checkpoints:
1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
2: include GT = '0/1'
3: include SVTYPE = 'INS'
sample_name variant_count
HG002 16447
HG003 15883
HG004 18035
# Chaining filter terms produces equivalent results but is faster and
# creates only one intermediary checkpoint.
$ biograph vdb study revert my_study --checkpoint 1
Removing S3 data for 'my_study' at checkpoint 3
Dropping SQL partitions for 'my_study' at checkpoint 3
Removing S3 data for 'my_study' at checkpoint 2
Dropping SQL partitions for 'my_study' at checkpoint 2
Study 'my_study' reverted to checkpoint 1
$ biograph vdb study filter my_study --include "GT = '0/1' AND SVTYPE = 'INS'"
Applying filter
Study my_study:
variants: 13538956 -> 50365
$ biograph vdb study show my_study
study_name: my_study
created_on: 2021-05-18 13:02:51
build: GRCh38
refname: grch38
checkpoints:
1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
2: include GT = '0/1' AND SVTYPE = 'INS'
sample_name variant_count
HG002 16447
HG003 15883
HG004 18035
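When several filter terms are collected separately (for example in a script), they can be joined into one chained expression before calling filter once. This is a minimal sketch, assuming a POSIX-style shell; `join_and` is a hypothetical helper, not part of biograph. It wraps each term in parentheses and combines them with AND, both of which appear in the operator list above.

```shell
# Hypothetical helper: join any number of filter terms with AND,
# wrapping each term in parentheses so precedence stays explicit.
join_and() {
  out=""
  for term in "$@"; do
    if [ -z "$out" ]; then
      out="($term)"
    else
      out="$out AND ($term)"
    fi
  done
  printf '%s\n' "$out"
}

expr="$(join_and "GT = '0/1'" "SVTYPE = 'INS'")"
printf '%s\n' "$expr"   # prints: (GT = '0/1') AND (SVTYPE = 'INS')

# A single filter call then creates a single checkpoint:
#   biograph vdb study filter my_study --include "$expr"
```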
# before filtering
$ biograph vdb study show my_study
study_name: my_study
created_on: 2021-05-13 14:50:19
build: GRCh38
refname: grch38
checkpoints:
1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
sample_name variant_count
HG002 4556521
HG003 4350342
HG004 4632093
# filter for PASS only. This is equivalent to --exclude "FILTER != 'PASS'"
$ biograph vdb study filter my_study --include 'FILTER = "PASS"'
Applying filter
Study my_study:
variants: 13538956 -> 13264569
# after filtering
$ biograph vdb study show my_study
study_name: my_study
created_on: 2021-05-13 14:50:19
build: GRCh38
refname: grch38
checkpoints:
1: added HG002: 2e7a0129-13a5-44b6-8594-2fc2e6c80e6c; HG003: a98626d6-758b-468d-add9-fbfbac47d207; HG004: 9a174215-8fc5-4c6f-bc4b-134654f65b99
2: include FILTER = "PASS"
sample_name variant_count
HG002 4450296
HG003 4271571
HG004 4542702
$ biograph vdb study filter my_study --include 'QUAL > 60'
The vdb uses EBI notation (1, 2, 3, ... X, Y, MT) for the autosomes, sex chromosomes, and mitochondria regardless of the original VCF format. All other chromosome names (alternates, decoys, etc.) are used verbatim. Chromosome names are automatically translated back to their original format when exporting to VCF with study export, but EBI notation must be used for the vdb study filter command.
$ biograph vdb study filter my_study --include 'chrom = "5" and pos > 20000'
Applying filter
Study ajtrio:
variants: 12468296 -> 763212
If a filter would exclude all available variants, the study is automatically reverted back to the previous checkpoint. This may happen if the filtering criteria are too strict, or a chromosome was not specified in EBI notation.
$ biograph vdb study filter my_study --include 'chrom = "chr9"'
Applying filter
This filter removed all variants from the study. Rolling back to previous checkpoint.
Removing S3 data for 'my_study' at checkpoint 5
Dropping SQL partitions for 'my_study' at checkpoint 5
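To avoid an accidental rollback like the one above, chromosome names can be normalized before they are placed in a filter. The following is a minimal sketch, assuming a POSIX-style shell and input names that use the chr prefix; `to_ebi` is a hypothetical helper, not part of biograph, and the mapping of chrM to MT is an assumption based on the notation described above.

```shell
# Hypothetical helper: convert chr-prefixed names to EBI notation.
# Assumption: chrM / chrMT correspond to MT. All other names
# (alternates, decoys, etc.) are returned verbatim.
to_ebi() {
  case "$1" in
    chrM|chrMT)
      echo MT ;;
    chr[1-9]|chr1[0-9]|chr2[0-2]|chrX|chrY)
      echo "${1#chr}" ;;   # strip the leading "chr"
    *)
      echo "$1" ;;
  esac
}

chrom="$(to_ebi chr9)"
expr="chrom = \"${chrom}\""
printf '%s\n' "$expr"   # prints: chrom = "9"

# The expression could then be used as:
#   biograph vdb study filter my_study --include "$expr"
```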
The vdb stores reference positions internally starting at position 0, not 1. This translation is automatically handled by vdb study filter and on VCF import / export. Positions should be specified as they would be in VCF, starting with 1.
When variant calls are observed across all samples in a study, not every variant will be present in every individual.
To filter variants that are missing in a specific number of individuals, use N_MISS.
# only include variants that are missing from five or fewer individuals
$ biograph vdb study filter my_study --include 'N_MISS <= 5'
To filter variants that are missing in a fraction of the total number of individuals, use F_MISS.
# only include variants that are missing from fewer than two thirds (66.7%) of individuals
$ biograph vdb study filter my_study --include 'F_MISS < 0.667'
To filter samples that are missing a fraction of the total number of variants in the study, use SAMPLE_MISS.
# drop samples that are missing more than 33% of all variants
$ biograph vdb study filter my_study --exclude 'SAMPLE_MISS > 0.33'
Applying filter
Study my_study:
variants: 13538956 -> 4632093
samples: [HG002, HG003, HG004] -> [HG004]
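The three missingness terms can be illustrated on a toy table. This is only a sketch of the definitions given above, not the vdb's implementation; it assumes a POSIX-style shell with awk, the data and function names are hypothetical, and each variant/sample pair is assumed to appear at most once.

```shell
# Toy presence table (hypothetical data): one "variant sample" pair per line.
calls='v1 HG002
v1 HG003
v1 HG004
v2 HG002
v3 HG004'

# Per variant: print "variant N_MISS F_MISS" given the total sample count.
variant_missingness() {
  printf '%s\n' "$calls" | awk -v total="$1" '
    { seen[$1]++ }
    END {
      for (v in seen)
        printf "%s %d %.3f\n", v, total - seen[v], (total - seen[v]) / total
    }' | sort
}

# Per sample: print "sample SAMPLE_MISS", the fraction of all distinct
# variants in the table that the sample does not carry.
sample_missingness() {
  printf '%s\n' "$calls" | awk '
    {
      if (!($1 in variants)) { variants[$1] = 1; total++ }
      per_sample[$2]++
    }
    END {
      for (s in per_sample)
        printf "%s %.3f\n", s, (total - per_sample[s]) / total
    }' | sort
}

variant_missingness 3
# v1 0 0.000
# v2 2 0.667
# v3 2 0.667

sample_missingness
# HG002 0.333
# HG003 0.667
# HG004 0.333
```

In this toy table, v1 is present in every sample, so its N_MISS is 0, while HG003 carries only one of the three variants and therefore has the highest per-sample missingness.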
$ biograph vdb study filter --help
usage: biograph vdb study filter [-h] (-i INCLUDE | -e EXCLUDE) study_name
Filter variants in a study
Filter variants in a study using bcftools filter syntax. A new study
checkpoint will be created.
Use --include to include variants that match the filter.
Use --exclude to exclude variants that match the filter.
Examples:
# PASS only
$ biograph vdb study filter my_study --exclude "FILTER != 'PASS'"
# High quality hets on chr16
$ biograph vdb study filter my_study --include "chrom = '16' AND GT = 0/1 AND qual > 50"
# Per-variant missingness
$ biograph vdb study filter my_study --include "F_MISS > 0.2"
# Per-sample missingness
$ biograph vdb study filter my_study --exclude "SAMPLE_MISS > 0.1"
positional arguments:
study_name Name of the study
optional arguments:
-h, --help show this help message and exit
-i INCLUDE, --include INCLUDE
Include only variants that match these criteria
-e EXCLUDE, --exclude EXCLUDE
Exclude all variants that match these criteria