Add WES and WGS validation data
adthrasher committed Jan 29, 2024
1 parent 677b651 commit fb0e83a
Showing 1 changed file with 35 additions and 0 deletions.
35 changes: 35 additions & 0 deletions text/data_versioning.md
@@ -31,6 +31,41 @@ A data versioning process should help us find a reasonable balance that doesn't

We propose to evaluate whole-genome sequencing (WGS) and whole-exome sequencing (WES) by running well-characterized samples through each iteration of the analysis pipeline. The resulting variant calls (gVCFs) will be compared to existing high-quality variant calls using Illumina's `hap.py` [comparison tool](https://github.com/Illumina/hap.py), as [recommended](https://www.biorxiv.org/content/10.1101/270157v3) by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team. Specifically, we propose to run samples from the National Institute of Standards and Technology (NIST) Genome in a Bottle (GIAB) project: HG002, HG003, HG004, HG006, and HG007 for WGS, and HG002, HG003, HG004, and HG005 for WES. Each sample will be compared twice. In the first comparison, the results from the prior iteration of the pipeline are supplied as the truth set and the variant calls from the new workflow version are treated as the query. In the second comparison, the confident call sets from GIAB are supplied as the truth set. In this way we can track divergences between pipeline versions and also determine the significance of any divergence we find.
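The two comparisons described above could be scripted roughly as follows. This is a minimal sketch, not the project's actual tooling: the helper name `build_happy_command` and every file path are hypothetical placeholders. It assumes `hap.py` is on the `PATH` and uses only its documented positional truth/query arguments and the `-r` (reference FASTA), `-o` (output prefix), and `-f` (confident regions BED) options.

```python
from typing import List, Optional


def build_happy_command(
    truth_vcf: str,
    query_vcf: str,
    reference_fasta: str,
    output_prefix: str,
    confident_bed: Optional[str] = None,
) -> List[str]:
    """Build a hap.py argument list (hypothetical helper, for illustration)."""
    cmd = [
        "hap.py",
        truth_vcf,              # positional: truth call set
        query_vcf,              # positional: query call set
        "-r", reference_fasta,  # reference genome FASTA
        "-o", output_prefix,    # prefix for hap.py output files
    ]
    if confident_bed is not None:
        # Restrict scoring to confident regions; query calls outside
        # them are counted as UNK rather than FP.
        cmd += ["-f", confident_bed]
    return cmd


# Comparison 1: prior pipeline version as truth, new version as query.
# All paths below are placeholders, not real project files.
version_vs_version = build_happy_command(
    truth_vcf="HG003.snapgatk-20180730_1.vcf.gz",
    query_vcf="HG003.snapgatk-20190409_1.vcf.gz",
    reference_fasta="reference.fa",
    output_prefix="HG003.version_vs_version",
)

# Comparison 2: GIAB call set as truth, limited to GIAB confident regions.
version_vs_giab = build_happy_command(
    truth_vcf="HG003.giab.vcf.gz",
    query_vcf="HG003.snapgatk-20190409_1.vcf.gz",
    reference_fasta="reference.fa",
    output_prefix="HG003.version_vs_giab",
    confident_bed="HG003.giab.confident.bed",
)

print(" ".join(version_vs_version))
print(" ".join(version_vs_giab))
```

Returning an argument list rather than a shell string means the command can be passed directly to `subprocess.run(cmd, check=True)` without quoting problems.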

### Validation

#### HG003 `snapgatk-20180730_1` vs. `snapgatk-20190409_1` with GIAB confidence set WGS
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| ----- | ------ | ----------- | --------- | -------- | ----------- | -------- | --------- | ----- | ----- | ------------- | ---------------- | -------------- | --------------- | ---------------------- | ---------------------- | ------------------------- | ------------------------- |
| INDEL | ALL | 544,059 | 543,343 | 716 | 947,277 | 801 | 381,997 | 156 | 176 | 0.999… | 0.999… | 0.403… | 0.999… | | | 1.623… | 1.753… |
| INDEL | PASS | 544,059 | 543,343 | 716 | 947,277 | 801 | 381,997 | 156 | 176 | 0.999… | 0.999… | 0.403… | 0.999… | | | 1.623… | 1.753… |
| SNP | ALL | 3,400,884 | 3,399,899 | 985 | 4,255,179 | 852 | 852,967 | 32 | 105 | 1.000… | 1.000… | 0.200… | 1.000… | 2.087… | 1.890… | 1.590… | 1.676… |
| SNP | PASS | 3,400,884 | 3,399,899 | 985 | 4,255,179 | 852 | 852,967 | 32 | 105 | 1.000… | 1.000… | 0.200… | 1.000… | 2.087… | 1.890… | 1.590… | 1.676… |

#### HG003 `snapgatk-20180730_1` vs. `snapgatk-20190409_1` without GIAB confidence set WGS
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| ----- | ------ | ----------- | --------- | -------- | ----------- | -------- | --------- | ----- | ----- | ------------- | ---------------- | -------------- | --------------- | ---------------------- | ---------------------- | ------------------------- | ------------------------- |
| INDEL | ALL | 910,429 | 904,637 | 5,792 | 947,277 | 6,661 | 0 | 991 | 1,388 | 0.994… | 0.993… | 0 | 0.993… | | | 1.521… | 1.753… |
| INDEL | PASS | 910,429 | 904,637 | 5,792 | 947,277 | 6,661 | 0 | 991 | 1,388 | 0.994… | 0.993… | 0 | 0.993… | | | 1.521… | 1.753… |
| SNP | ALL | 4,250,930 | 4,240,371 | 10,559 | 4,255,179 | 10,107 | 0 | 856 | 1,190 | 0.998… | 0.998… | 0 | 0.998… | 1.891… | 1.890… | 1.674… | 1.676… |
| SNP | PASS | 4,250,930 | 4,240,371 | 10,559 | 4,255,179 | 10,107 | 0 | 856 | 1,190 | 0.998… | 0.998… | 0 | 0.998… | 1.891… | 1.890… | 1.674… | 1.676… |


#### HG003 `snapgatk-20180730_1` vs. `snapgatk-20190409_1` with GIAB confidence set WES
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| ----- | ------ | ----------- | -------- | -------- | ----------- | -------- | --------- | ----- | ----- | ------------- | ---------------- | -------------- | --------------- | ---------------------- | ---------------------- | ------------------------- | ------------------------- |
| INDEL | ALL | 75,154 | 74,802 | 352 | 101,040 | 230 | 25,394 | 35 | 54 | 0.995… | 0.997… | 0.251… | 0.996… | | | 0.432… | 0.471… |
| INDEL | PASS | 75,154 | 74,802 | 352 | 101,040 | 230 | 25,394 | 35 | 54 | 0.995… | 0.997… | 0.251… | 0.996… | | | 0.432… | 0.471… |
| SNP | ALL | 759,249 | 757,903 | 1,346 | 866,321 | 687 | 107,685 | 105 | 19 | 0.998… | 0.999… | 0.124… | 0.999… | 1.922… | 1.867… | 0.427… | 0.459… |
| SNP | PASS | 759,249 | 757,903 | 1,346 | 866,321 | 687 | 107,685 | 105 | 19 | 0.998… | 0.999… | 0.124… | 0.999… | 1.922… | 1.867… | 0.427… | 0.459… |

#### HG003 `snapgatk-20180730_1` vs. `snapgatk-20190409_1` without GIAB confidence set WES
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| ----- | ------ | ----------- | -------- | -------- | ----------- | -------- | --------- | ----- | ----- | ------------- | ---------------- | -------------- | --------------- | ---------------------- | ---------------------- | ------------------------- | ------------------------- |
| INDEL | ALL | 100,236 | 99,522 | 714 | 101,040 | 612 | 0 | 94 | 169 | 0.993… | 0.994… | 0 | 0.993… | | | 0.441… | 0.471… |
| INDEL | PASS | 100,236 | 99,522 | 714 | 101,040 | 612 | 0 | 94 | 169 | 0.993… | 0.994… | 0 | 0.993… | | | 0.441… | 0.471… |
| SNP | ALL | 866,942 | 864,591 | 2,351 | 866,321 | 1,596 | 0 | 250 | 134 | 0.997… | 0.998… | 0 | 0.998… | 1.867… | 1.867… | 0.460… | 0.459… |
| SNP | PASS | 866,942 | 864,591 | 2,351 | 866,321 | 1,596 | 0 | 250 | 134 | 0.997… | 0.998… | 0 | 0.998… | 1.867… | 1.867… | 0.460… | 0.459… |
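To help read the tables above, the summary metrics can be recomputed from the count columns. The sketch below assumes that `QUERY.TOTAL = QUERY.TP + QUERY.FP + QUERY.UNK`, so that query true positives can be derived even though that column is not shown; the class and function names are illustrative and are not part of `hap.py`.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HappyCounts:
    """Count columns from one row of a hap.py summary table."""
    truth_total: int
    truth_tp: int
    query_total: int
    query_fp: int
    query_unk: int


def summarize(c: HappyCounts) -> Dict[str, float]:
    """Recompute Recall, Precision, Frac_NA, and F1 from raw counts."""
    recall = c.truth_tp / c.truth_total
    # Query calls that were actually assessed (inside confident regions).
    assessed = c.query_total - c.query_unk
    # Assumption: assessed query calls are either TP or FP.
    precision = (assessed - c.query_fp) / assessed
    frac_na = c.query_unk / c.query_total
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "frac_na": frac_na,
        "f1": f1,
    }


# INDEL row of the WES table without the GIAB confidence set.
wes_indel = HappyCounts(
    truth_total=100_236,
    truth_tp=99_522,
    query_total=101_040,
    query_fp=612,
    query_unk=0,
)

# INDEL row of the WGS table with the GIAB confidence set.
wgs_indel_giab = HappyCounts(
    truth_total=544_059,
    truth_tp=543_343,
    query_total=947_277,
    query_fp=801,
    query_unk=381_997,
)

for name, row in [("WES INDEL", wes_indel), ("WGS INDEL (GIAB)", wgs_indel_giab)]:
    metrics = summarize(row)
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Rounded to three decimals, the recomputed values agree with the tabulated ones (for example, recall 0.993 and precision 0.994 for the WES INDEL row, and Frac_NA 0.403 for the WGS INDEL row). The large Frac_NA in the comparisons that use the confidence set reflects query calls falling outside the confident regions, which are reported as `QUERY.UNK` rather than as false positives.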

## RNA-Seq

We propose to evaluate RNA sequencing (RNA-Seq) by running well-characterized samples through each iteration of the analysis pipeline. Specifically, we propose to run samples from the National Institute of Standards and Technology (NIST) Genome in a Bottle (GIAB) project. We will use RNA-Seq data from samples HG002, HG004, and HG005. These three samples will be run through new iterations of the [St. Jude Cloud RNA-Seq harmonization workflow](https://stjudecloud.github.io/rfcs/0001-rnaseq-workflow-v2.0.0.html). The pipeline will output aligned BAMs and feature count files for each sample. We will then generate variant calls using the [GATK RNA-Seq short variant discovery best practices workflow](https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-). We will use Illumina's `hap.py` comparison tool, along with the high-confidence variant calls from GIAB, to compare output from prior versions of the RNA-Seq pipeline to the proposed version.