Quality control methods for human genomic variants.
Latest commit 878eeb0 Dec 10, 2018



This is not an official Verily product.


This repository contains code to perform cohort-level quality control checks on human genomic variants. Cloud technology is used to perform queries in parallel. For prior work, see Cloud-based interactive analytics for terabytes of genomic variants data.

View output from these queries run on public data

Before running the queries yourself, you can see the results on a few public datasets:

Run these queries on your own data

Load data to BigQuery

The queries in this repository assume that the VCFs were loaded to BigQuery using Variant Transforms with the MOVE_TO_CALLS merge strategy.

With the MOVE_TO_CALLS merge strategy, every table created from VCFs shares a common core set of columns, and all calls for the exact same variant (identical reference_name, start_position, end_position, reference_bases, and all alternate_bases) are grouped together in a single row.
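The grouping described above can be illustrated with a small, self-contained Python sketch. This is not the Variant Transforms implementation; it only mimics the idea on in-memory records. The flat record layout and the `merge_calls` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Key identifying "the exact same" variant: position, reference bases,
# and ALL alternate bases (kept in order as a tuple).
VariantKey = Tuple[str, int, int, str, Tuple[str, ...]]


def merge_calls(records: List[dict]) -> List[dict]:
    """Group calls that share the same variant key into a single row.

    Each input record is a hypothetical flat dict holding one call for
    one sample; the output has one row per distinct variant key, with
    the calls collected under a repeated "call" field.
    """
    grouped: Dict[VariantKey, List[dict]] = defaultdict(list)
    for rec in records:
        key = (
            rec["reference_name"],
            rec["start_position"],
            rec["end_position"],
            rec["reference_bases"],
            tuple(rec["alternate_bases"]),
        )
        grouped[key].append(rec["call"])

    rows = []
    for key in sorted(grouped):  # deterministic output order
        rows.append(
            {
                "reference_name": key[0],
                "start_position": key[1],
                "end_position": key[2],
                "reference_bases": key[3],
                "alternate_bases": list(key[4]),
                "call": grouped[key],
            }
        )
    return rows
```

Two samples with the same site and the same alternate allele end up in one row with two calls, while a sample with a different alternate allele at that site gets its own row.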

We recommend loading the single-sample VCFs into a "genome call table" and the multisample VCF into a "multisample-variants table".

If you do not have a multisample VCF, you could:

Predict ancestry

If your sample information does not already include ancestry, you can predict the ancestry for each genome using Genomic ancestry inference with deep learning.

Run the QC overview reports

Run the parameterized RMarkdown reports to get an overview of your data.

Drill down on results

Drill down further on results by creating additional plots and/or performing additional queries. For example, these queries can be run from Jupyter notebooks, and additional queries can then be used to further explain the results for a particular dataset.
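As a minimal sketch of what a drill-down query from a notebook might look like, the Python below builds a BigQuery Standard SQL statement that counts calls per chromosome. It is not one of the queries shipped in this repository. It assumes a table loaded by Variant Transforms with a `reference_name` column and a repeated `call` record; the table name and both helper functions are hypothetical. Running the query needs the `google-cloud-bigquery` and `pandas` packages plus valid credentials.

```python
def calls_per_chromosome_sql(table: str) -> str:
    """Build a Standard SQL query that counts calls per reference_name.

    `table` must be a fully qualified name: project.dataset.table.
    """
    if table.count(".") != 2 or "`" in table:
        raise ValueError("expected a table name like project.dataset.table")
    return f"""
SELECT
  reference_name,
  COUNT(1) AS number_of_calls
FROM
  `{table}` AS v,
  UNNEST(v.call) AS c  -- one output row per call within each variant row
GROUP BY reference_name
ORDER BY reference_name
""".strip()


def run_query(sql: str, project: str):
    """Run the query and return a pandas DataFrame (requires credentials)."""
    # Imported lazily so the SQL builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    return client.query(sql).to_dataframe()
```

In a notebook cell, something like `run_query(calls_per_chromosome_sql("my-project.my_dataset.genome_calls"), "my-project")` would return a data frame that can be plotted directly; the project and table names here are placeholders.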

Technologies used

The methods make use of:

Each technology has introductory material that may help you when working with the code in this repository.