Add WES and WGS validation data

stjudecloud · Jan 29, 2024 · fb0e83a
1 parent 677b651
Showing 1 changed file with 35 additions and 0 deletions.
diff --git a/text/data_versioning.md b/text/data_versioning.md
@@ -31,6 +31,41 @@ A data versioning process should help us find a reasonable balance that doesn't
 
 We propose to evaluate whole-genome (WGS) and whole-exome (WES) sequencing by running well-characterized samples through each iteration of the analysis pipeline. The resulting variant calls (gVCFs) will be compared to existing high-quality variant calls using Illumina's `hap.py` [comparison tool](https://github.com/Illumina/hap.py), as [recommended](https://www.biorxiv.org/content/10.1101/270157v3) by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team. Specifically, we propose to run samples from the National Institute of Standards and Technology (NIST)'s Genome in a Bottle (GIAB) project: HG002, HG003, HG004, HG006, and HG007 for WGS, and HG002, HG003, HG004, and HG005 for WES. Each sample will be compared twice. In the first comparison, the results from the prior iteration of the pipeline will be supplied as the truth set and the variant calls from the new workflow version will be treated as the query. In the second comparison, the confident call sets from GIAB will be provided as the truth dataset. In this way we can track divergences in the pipeline and also determine the significance of any discovered divergence.
 
+### Validation
+
+#### HG003 WGS: `snapgatk-20180730_1` vs. `snapgatk-20190409_1`, with GIAB confidence set
+| Type  | Filter | TRUTH.TOTAL | TRUTH.TP  | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
+| ----- | ------ | ----------- | --------- | -------- | ----------- | -------- | --------- | ----- | ----- | ------------- | ---------------- | -------------- | --------------- | ---------------------- | ---------------------- | ------------------------- | ------------------------- |
+| INDEL | ALL    | 544,059     | 543,343   | 716      | 947,277     | 801      | 381,997   | 156   | 176   | 0.999…        | 0.999…           | 0.403…         | 0.999…          |                        |                        | 1.623…                    | 1.753…                    |
+| INDEL | PASS   | 544,059     | 543,343   | 716      | 947,277     | 801      | 381,997   | 156   | 176   | 0.999…        | 0.999…           | 0.403…         | 0.999…          |                        |                        | 1.623…                    | 1.753…                    |
+| SNP   | ALL    | 3,400,884   | 3,399,899 | 985      | 4,255,179   | 852      | 852,967   | 32    | 105   | 1.000…        | 1.000…           | 0.200…         | 1.000…          | 2.087…                 | 1.890…                 | 1.590…                    | 1.676…                    |
+| SNP   | PASS   | 3,400,884   | 3,399,899 | 985      | 4,255,179   | 852      | 852,967   | 32    | 105   | 1.000…        | 1.000…           | 0.200…         | 1.000…          | 2.087…                 | 1.890…                 | 1.590…                    | 1.676…                    |
+
+#### HG003 WGS: `snapgatk-20180730_1` vs. `snapgatk-20190409_1`, without GIAB confidence set
+| Type  | Filter | TRUTH.TOTAL | TRUTH.TP  | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
+| ----- | ------ | ----------- | --------- | -------- | ----------- | -------- | --------- | ----- | ----- | ------------- | ---------------- | -------------- | --------------- | ---------------------- | ---------------------- | ------------------------- | ------------------------- |
+| INDEL | ALL    | 910,429     | 904,637   | 5,792    | 947,277     | 6,661    | 0         | 991   | 1,388 | 0.994…        | 0.993…           | 0              | 0.993…          |                        |                        | 1.521…                    | 1.753…                    |
+| INDEL | PASS   | 910,429     | 904,637   | 5,792    | 947,277     | 6,661    | 0         | 991   | 1,388 | 0.994…        | 0.993…           | 0              | 0.993…          |                        |                        | 1.521…                    | 1.753…                    |
+| SNP   | ALL    | 4,250,930   | 4,240,371 | 10,559   | 4,255,179   | 10,107   | 0         | 856   | 1,190 | 0.998…        | 0.998…           | 0              | 0.998…          | 1.891…                 | 1.890…                 | 1.674…                    | 1.676…                    |
+| SNP   | PASS   | 4,250,930   | 4,240,371 | 10,559   | 4,255,179   | 10,107   | 0         | 856   | 1,190 | 0.998…        | 0.998…           | 0              | 0.998…          | 1.891…                 | 1.890…                 | 1.674…                    | 1.676…                    |
+
+
+#### HG003 WES: `snapgatk-20180730_1` vs. `snapgatk-20190409_1`, with GIAB confidence set
+| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
+| ----- | ------ | ----------- | -------- | -------- | ----------- | -------- | --------- | ----- | ----- | ------------- | ---------------- | -------------- | --------------- | ---------------------- | ---------------------- | ------------------------- | ------------------------- |
+| INDEL | ALL    | 75,154      | 74,802   | 352      | 101,040     | 230      | 25,394    | 35    | 54    | 0.995…        | 0.997…           | 0.251…         | 0.996…          |                        |                        | 0.432…                    | 0.471…                    |
+| INDEL | PASS   | 75,154      | 74,802   | 352      | 101,040     | 230      | 25,394    | 35    | 54    | 0.995…        | 0.997…           | 0.251…         | 0.996…          |                        |                        | 0.432…                    | 0.471…                    |
+| SNP   | ALL    | 759,249     | 757,903  | 1,346    | 866,321     | 687      | 107,685   | 105   | 19    | 0.998…        | 0.999…           | 0.124…         | 0.999…          | 1.922…                 | 1.867…                 | 0.427…                    | 0.459…                    |
+| SNP   | PASS   | 759,249     | 757,903  | 1,346    | 866,321     | 687      | 107,685   | 105   | 19    | 0.998…        | 0.999…           | 0.124…         | 0.999…          | 1.922…                 | 1.867…                 | 0.427…                    | 0.459…                    |
+
+#### HG003 WES: `snapgatk-20180730_1` vs. `snapgatk-20190409_1`, without GIAB confidence set
+| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
+| ----- | ------ | ----------- | -------- | -------- | ----------- | -------- | --------- | ----- | ----- | ------------- | ---------------- | -------------- | --------------- | ---------------------- | ---------------------- | ------------------------- | ------------------------- |
+| INDEL | ALL    | 100,236     | 99,522   | 714      | 101,040     | 612      | 0         | 94    | 169   | 0.993…        | 0.994…           | 0              | 0.993…          |                        |                        | 0.441…                    | 0.471…                    |
+| INDEL | PASS   | 100,236     | 99,522   | 714      | 101,040     | 612      | 0         | 94    | 169   | 0.993…        | 0.994…           | 0              | 0.993…          |                        |                        | 0.441…                    | 0.471…                    |
+| SNP   | ALL    | 866,942     | 864,591  | 2,351    | 866,321     | 1,596    | 0         | 250   | 134   | 0.997…        | 0.998…           | 0              | 0.998…          | 1.867…                 | 1.867…                 | 0.460…                    | 0.459…                    |
+| SNP   | PASS   | 866,942     | 864,591  | 2,351    | 866,321     | 1,596    | 0         | 250   | 134   | 0.997…        | 0.998…           | 0              | 0.998…          | 1.867…                 | 1.867…                 | 0.460…                    | 0.459…                    |
+
 ## RNA-Seq
 
 We propose to evaluate RNA sequencing (RNA-Seq) by running well-characterized samples through each iteration of the analysis pipeline. Specifically, we propose to run samples from the National Institute of Standards and Technology (NIST)'s Genome in a Bottle (GIAB) project. We will use RNA-Seq data from the HG002, HG004, and HG005 samples. These three samples will be run through new iterations of the [St. Jude Cloud RNA-Seq harmonization workflow](https://stjudecloud.github.io/rfcs/0001-rnaseq-workflow-v2.0.0.html). The pipeline will output aligned BAMs and feature count files for each sample. We will then generate variant calls using the [GATK RNA-Seq short variant discovery best practices workflow](https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-). We will use Illumina's `hap.py` comparison tool, along with the high-confidence variant calls from GIAB, to compare output from prior versions of the RNA-Seq pipeline to the proposed version.
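
The version-to-version comparison described above could be scripted roughly as follows. This is a minimal sketch, not the project's actual tooling: the helper `build_happy_command` and all file names are hypothetical, and it assumes the GIAB data enters as a confident-regions BED file passed through `hap.py`'s `-f` option, with the prior pipeline's calls as truth and the new version's calls as query. The command is only assembled, not executed.

```python
from pathlib import Path
from typing import List, Optional


def build_happy_command(
    truth_vcf: Path,
    query_vcf: Path,
    reference_fasta: Path,
    output_prefix: Path,
    confident_regions_bed: Optional[Path] = None,
) -> List[str]:
    """Assemble a hap.py invocation as an argument list (hypothetical helper)."""
    command = [
        "hap.py",
        str(truth_vcf),   # truth: calls from the prior pipeline version
        str(query_vcf),   # query: calls from the proposed pipeline version
        "-r", str(reference_fasta),
        "-o", str(output_prefix),
    ]
    if confident_regions_bed is not None:
        # Assumption: restrict the comparison to GIAB confident regions.
        command += ["-f", str(confident_regions_bed)]
    return command


# Hypothetical file names, used only for illustration.
previous_calls = Path("HG002.previous.vcf.gz")
proposed_calls = Path("HG002.proposed.vcf.gz")
reference = Path("reference.fa")
giab_regions = Path("HG002.giab_confident.bed")

# Comparison 1: prior version vs. proposed version, genome-wide.
version_vs_version = build_happy_command(
    previous_calls, proposed_calls, reference, Path("HG002_all")
)

# Comparison 2: the same pair, restricted to GIAB confident regions.
giab_restricted = build_happy_command(
    previous_calls, proposed_calls, reference, Path("HG002_giab"), giab_regions
)

print(" ".join(version_vs_version))
print(" ".join(giab_restricted))
```

Returning an argument list rather than a shell string keeps the sketch free of quoting concerns; a caller could hand the list to `subprocess.run` once real paths are substituted.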