Panel of normals
Using a sample from a normal tissue as a baseline for cancer variant calling allows to get rid of pre-existing "germline" calls in the tissue. However due to the following issues this method might still fail to filter out some false positives:
- sequencing or alignment artefacts that failed to be filtered in "tumor",
- real germline variants, not called in "normal" due to insufficient coverage in normal samples in those regions (e.g. due to low GC, unbalanced structural variants).
Applying a set unrelated normal samples to filter false positives might help to reduce the FP number by partly addressing the second factor. The poorly covered regions from the matched normal might turn out to be sufficiently covered in unrelated normals.
We are building a panel from germline calls from normal (non-cancer) samples used in UMCCR. To generate, export the samples doc
to csv, and run
make_normals_csv.R to generate
normals.tsv. Then on Spartan run the following to produce the panel of normal VCFs:
snakemake -s prep_normals.smk -j30
The script recalls variants in normal samples at AF>5%. We are not using ready germline calling VCFs because germline calls can be filtered by running against gnomAD, and we want to cover artefacts that would appear in lower freqs.
Called VCFs are normalized and merged into 2 VCFs, with indels and SNPs corresponsingly. Repetitive variants collased and put into
PoN_CNT field with a number of occurrences of that variant. The resulting merged VCFs are lifted over to hg19 and hg38 with the same snakefile.
For annotation/fitlering your target somatic calls with panel of normals, see the vcf_stuff README.md. The panel is also used by umccrise
Designing the panel
Matt Eldridge used an SNV panel from 149 blood normals from the oesophageal ICGC project to flag variants that are observed in:
- at least 5 unrelated normals
- in each, minimum variant allele fraction of 0.05
- in each, minimum allele count of 3.
Matt has evaluated against two validation truth sets:
- MuTect2 from DREAM challenge synthetic 4 dataset results: precision 0.91 -> 0.97 while the sensitivity is only marginally reduced, remaining at 0.75.
- ICGC benchmark datasets (MB), MuTect2 and Strelka: improves not in quite such spectacular fashion, possibly reflecting the relative similarity of the sequencing carried out for these datasets compared with that run on our oesophageal samples.
Using germline calls
The first iteration is taking the germline VCFs called from normal (non-cancer) samples used in UMCCR.
We want to explore the following things:
- How many occurences in a panel we want to allow to keep variant or filter?
- Samples come from different cohorts, and large cohorts can dominate small ones, bringing a bias. Do we want to collapse cohorts?
- To call a hit, should require the exact allele matching, or just variant location is enough?
- Should we split multiallelics in the panel of normals? Might correllate with the previous point.
VCFs are normalized and merged. Repetitive variants we collapse and put a
PoN_CNT field with a number of occurrences of that variant.
For the testing pusposes, we put separate
PoN_cohorts to explore if we want to collapse cohorts. Also for testing, we make 4 versions panels, using all combinations of the following:
- With merged multiallelics:
- merged with
bcftools merge -m all
- merged with
- With split multiallelics:
- all normals pre-normalized with
norm_vcfwhich splits multiallelics
- merged with
bcftools merge -m none
- all normals pre-normalized with
- With the large A5 cohort (30 samples versus remaining 20) and without.
Plus, we make a panel from normals that don't undergo any normalization (however some of them are normalized in bcbio, which since a recent release started to pre-normalized ensemble calls).
panel_of_normals.vcf.gz # NOT pre-normalized normals, without A5 panel_of_normals.NOT_NORMALIZED.vcf.gz # NOT pre-normalized normals, with A5 panel_of_normals.NEW.vcf.gz # Merged multiallelics, with A5 panel_of_normals.WO_A5.vcf.gz # Merged multiallelics, without A5 panel_of_normals.NO_MULTIALLELIC.vcf.gz # Split multiallelics, with A5 panel_of_normals.NO_MULTIALLELIC.WO_A5.vcf.gz # Split multiallelics, without A5
Eyeballing different versions of PoN
Here we review how 1. pre-normalization of normals; 2. using
bcftools merge -m none vs the default
bcftools merge -m both to skip merging multiallelic; and 3. the presence of the large A5 cohort (30 samples) affects the panel. We eyeball the variant that appears to be as A>G or A>T or multiallelic in different noraml samples:
normal vcfs NOT pre-normalized, no A5: panel_of_normals.vcf.gz:1 15274 rs2758118 A G,T 6947.5 PASS PoN=13 normal vcfs NOT pre-normalized (however the new cohort A5 is pre-normalized and split-multialleic, so it brings a separate A>G): panel_of_normals.NOT_NORMALIZED.vcf.gz:1 15274 rs2758118 A G,T 6947.5 PASS PoN_cohorts=14;PoN_samples=43 panel_of_normals.NOT_NORMALIZED.vcf.gz:1 15274 rs2758118 A G 224 PASS PoN_cohorts=1;PoN_samples=29 14 cohorts have A->G,T, one has A->G. In reality, those 14 are not normalized, and the one is normalized. Thus the one has A>G and A>T separately, whereas the rest have a multiallelic call. normal vcfs normalized, but merged without -m none: panel_of_normals.NEW.vcf.gz:1 15274 rs2758118 A G 6947.5 PASS PoN_cohorts=10;PoN_samples=38 panel_of_normals.NEW.vcf.gz:1 15274 rs2758118 A T 6947.5 PASS PoN_cohorts=14;PoN_samples=43 wc -l 17754387 normal vcfs normalized, but merged without -m none; no A5 panel_of_normals.WO_A5.vcf.gz:1 15274 rs2758118 A T 6947.5 PASS PoN_cohorts=13;PoN_samples=13 panel_of_normals.WO_A5.vcf.gz:1 15274 rs2758118 A G 6947.5 PASS PoN_cohorts=9;PoN_samples=9 wc -l 10857590 normal vcfs normalized, merged with -m none panel_of_normals.NO_MULTIALLELIC.vcf.gz:1 15274 rs2758118 A T 6947.5 PASS PoN_cohorts=14;PoN_samples=43 panel_of_normals.NO_MULTIALLELIC.vcf.gz:1 15274 rs2758118 A G 6947.5 PASS PoN_cohorts=10;PoN_samples=38 wc -l 18664962 normal vcfs normalized, merged with -m none: no A5 panel_of_normals.NO_MULTIALLELIC.WO_A5.vcf.gz:1 15274 rs2758118 A T 6947.5 PASS PoN_cohorts=13;PoN_samples=13 panel_of_normals.NO_MULTIALLELIC.WO_A5.vcf.gz:1 15274 rs2758118 A G 6947.5 PASS PoN_cohorts=9;PoN_samples=9 wc -l 10888111
-m none doesn't affect the result when there are normal VCFs having both variants, e.g.:
17MHP002Bld.clean.norm.1_15274.vcf.gz:1 15274 rs2758118 A G 6947.5 PASS . GT 1|0 17MHP002Bld.clean.norm.1_15274.vcf.gz:1 15274 rs2758118 A T 6947.5 PASS . GT 0|1 17MHP031Bld-germline.clean.norm.1_15274.vcf.gz:1 15274 rs2758118 A G 6239.6 PASS . GT 1|0 17MHP031Bld-germline.clean.norm.1_15274.vcf.gz:1 15274 rs2758118 A T 6239.6 PASS . GT 0|1
However, if we take the following one where each allele appears in different normals:
COLO829_normal.clean.norm.1_28682.vcf.gz:1 28682 . G A 141 PASS . GT 0/1 MH17B001P004-germline.clean.norm.1_28682.vcf.gz:1 28682 . G T 111 PASS . GT 0/1
-m none starts to make the difference:
panel_of_normals.WO_A5.vcf.gz < 1 28682 . G A,T 141 PASS PoN_cohorts=6;PoN_samples=7 panel_of_normals.NO_MULTIALLELIC.WO_A5.vcf.gz > 1 28682 . G A 141 PASS PoN_cohorts=1;PoN_samples=1 > 1 28682 . G T 111 PASS PoN_cohorts=5;PoN_samples=6
We can see here that splitting multiallelic should be done when we compare ALT, and multiallelic should be merged when we compare SITE. Say, G>T is also PoN_cohorts=1;PoN_samples=1 in the PoN, and we require more than 1 hit: In this case, none of the variants will support filtering; however, if we check by ALT, that's be correct. Also for comparing SITE, we don't need to explicitly pre-merge all multiallelics in separate normals, as each variant will contribute to the count. However, we should do normalization in any case, as normalizing indels may benefit both for ALT and SITE.
Evaluate on ICGC-MB
To answer the remaining questions and derive the best parameters, we evaluate filtering with the panels on a tumor VCF from ICGC MB study. We start with 100x tumor / 100x normal, called with ensemble approach in bcbio-nextgen.
cd /data/cephfs/punim0010/extras/panel_of_normals/test cp /data/cephfs/punim0010/projects/Saveliev_ICGC_MB/mb_workflow_bwa/final/2018-03-21_mb_workflow/batch1-ensemble-annotated.vcf.gz mb-ensemble.vcf.gz cp /data/cephfs/punim0010/projects/Saveliev_ICGC_MB/mb_workflow_bwa/final/2018-03-21_mb_workflow/batch1-ensemble-annotated.vcf.gz.tbi mb-ensemble.vcf.gz.tbi
We annotate the VCF in following 8 different ways: with samples and cohorts; requiring ALT matching (against the split-multiallelic panel) or just SITE (against the merged-multialleic panel); and including A5 cohort or not. For that, we prepare 8 tomls for
samples.NO_MULTIALLELIC.toml samples.NO_MULTIALLELIC.WO_A5.toml cohorts.NO_MULTIALLELIC.toml cohorts.NO_MULTIALLELIC.WO_A5.toml samples.toml samples.WO_A5.toml cohorts.toml cohorts.WO_A5.toml vcfanno samples.NO_MULTIALLELIC.toml mb-ensemble.vcf.gz | bgzip -c > mb-ensemble.S_ALT.vcf.gz vcfanno cohorts.NO_MULTIALLELIC.toml mb-ensemble.S_ALT.vcf.gz | bgzip -c > mb-ensemble.ALT.vcf.gz && rm mb-ensemble.S_ALT.vcf.gz vcfanno -permissive-overlap samples.toml mb-ensemble.vcf.gz | bgzip -c > mb-ensemble.S_SITE.vcf.gz vcfanno -permissive-overlap cohorts.toml mb-ensemble.S_SITE.vcf.gz | bgzip -c > mb-ensemble.SITE.vcf.gz && rm mb-ensemble.S_SITE.vcf.gz vcfanno samples.NO_MULTIALLELIC.WO_A5.toml mb-ensemble.vcf.gz | bgzip -c > mb-ensemble.WO_A5.S_ALT.vcf.gz vcfanno cohorts.NO_MULTIALLELIC.WO_A5.toml mb-ensemble.WO_A5.S_ALT.vcf.gz | bgzip -c > mb-ensemble.WO_A5.ALT.vcf.gz && rm mb-ensemble.WO_A5.S_ALT.vcf.gz vcfanno -permissive-overlap samples.WO_A5.toml mb-ensemble.vcf.gz | bgzip -c > mb-ensemble.WO_A5.S_SITE.vcf.gz vcfanno -permissive-overlap cohorts.WO_A5.toml mb-ensemble.WO_A5.S_SITE.vcf.gz | bgzip -c > mb-ensemble.WO_A5.SITE.vcf.gz && rm mb-ensemble.WO_A5.S_SITE.vcf.gz
Now filtering with different thresholds (1, 2, or 3 hits):
bcftools filter -e "INFO/PoN_cohorts>=1" mb-ensemble.SITE.vcf.gz -Oz -o mb-ensemble-SITE-filt-1.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=2" mb-ensemble.SITE.vcf.gz -Oz -o mb-ensemble-SITE-filt-2cohort.vcf.gz bcftools filter -e "INFO/PoN_samples>=2" mb-ensemble.SITE.vcf.gz -Oz -o mb-ensemble-SITE-filt-2sample.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=3" mb-ensemble.SITE.vcf.gz -Oz -o mb-ensemble-SITE-filt-3cohort.vcf.gz bcftools filter -e "INFO/PoN_samples>=3" mb-ensemble.SITE.vcf.gz -Oz -o mb-ensemble-SITE-filt-3sample.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=1" mb-ensemble.ALT.vcf.gz -Oz -o mb-ensemble-ALT-filt-1.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=2" mb-ensemble.ALT.vcf.gz -Oz -o mb-ensemble-ALT-filt-2cohort.vcf.gz bcftools filter -e "INFO/PoN_samples>=2" mb-ensemble.ALT.vcf.gz -Oz -o mb-ensemble-ALT-filt-2sample.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=3" mb-ensemble.ALT.vcf.gz -Oz -o mb-ensemble-ALT-filt-3cohort.vcf.gz bcftools filter -e "INFO/PoN_samples>=3" mb-ensemble.ALT.vcf.gz -Oz -o mb-ensemble-ALT-filt-3sample.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=1" mb-ensemble.WO_A5.SITE.vcf.gz -Oz -o mb-ensemble-WO_A5.SITE-filt-1.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=2" mb-ensemble.WO_A5.SITE.vcf.gz -Oz -o mb-ensemble-WO_A5.SITE-filt-2cohort.vcf.gz bcftools filter -e "INFO/PoN_samples>=2" mb-ensemble.WO_A5.SITE.vcf.gz -Oz -o mb-ensemble-WO_A5.SITE-filt-2sample.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=3" mb-ensemble.WO_A5.SITE.vcf.gz -Oz -o mb-ensemble-WO_A5.SITE-filt-3cohort.vcf.gz bcftools filter -e "INFO/PoN_samples>=3" mb-ensemble.WO_A5.SITE.vcf.gz -Oz -o mb-ensemble-WO_A5.SITE-filt-3sample.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=1" mb-ensemble.WO_A5.ALT.vcf.gz -Oz -o mb-ensemble-WO_A5.ALT-filt-1.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=2" mb-ensemble.WO_A5.ALT.vcf.gz -Oz -o mb-ensemble-WO_A5.ALT-filt-2cohort.vcf.gz bcftools filter -e "INFO/PoN_samples>=2" mb-ensemble.WO_A5.ALT.vcf.gz -Oz -o mb-ensemble-WO_A5.ALT-filt-2sample.vcf.gz bcftools filter -e "INFO/PoN_cohorts>=3" mb-ensemble.WO_A5.ALT.vcf.gz -Oz -o mb-ensemble-WO_A5.ALT-filt-3cohort.vcf.gz bcftools filter -e "INFO/PoN_samples>=3" mb-ensemble.WO_A5.ALT.vcf.gz -Oz -o mb-ensemble-WO_A5.ALT-filt-3sample.vcf.gz
Using vcf_eval, we evaluate all VCFs against the ICGC-MB truth set.
eval_vcf mb mb-ensemble.vcf.gz *filt*.vcf.gz -o eval -j 24
A more interesting example would be a tumor/normal pair with unbalanced tumor/normal coverage, where the normal coverage is significantly lower - in such cases, the panel of normals can resque some artefacts appearing due to gaps in normal.
Tumor 100x vs normal 50x:
cd /data/cephfs/punim0010/extras/panel_of_normals/test_mb_100_50 cp /data/cephfs/punim0010/projects/Saveliev_ICGC_MB/bcbio_300_50/final_af10/2017-11-22_bcbio/MB_300vs50-ensemble-annotated-decomposed.vcf.gz mb-ensemble.vcf.gz ... eval_vcf mb mb-ensemble.vcf.gz *filt*.vcf.gz -o eval -j 24
Tumor 300x vs normal 50x:
cd /data/cephfs/punim0010/extras/panel_of_normals/test_mb_300_50 cp /data/cephfs/punim0010/projects/Saveliev_ICGC_MB/bcbio_300_50/final_af10/2017-11-22_bcbio/MB_300vs50-ensemble-annotated-decomposed.vcf.gz mb-ensemble.vcf.gz ... eval_vcf mb mb-ensemble.vcf.gz *filt*.vcf.gz -o eval -j 24
We also evaluate the panel on a COLO829 tumor/normal pair:
cd /data/cephfs/punim0010/extras/panel_of_normals/test_colo cp /data/cephfs/punim0010/projects/Saveliev_COLO829_Craig/bcbio_bwa/final.beforeStrelkaReportEVSFeatures/2018-03-01_bcbio_bwa/COLO_TGEN_bwa-ensemble-annotated.vcf.gz colo.vcf.gz ... eval_vcf colo colo.vcf.gz *filt*.vcf.gz -o eval -j 24
We can see that adding A5 cohort improves the result, and filtering by ALT seems to make a positive effect overall.
F2 weights recall twice as higher than it does F1, and F3 goes 3 times higher. But we probably don't need to weigh the recall that much, so can look at F2.
Requiring just 1 hit (either cohort or SNP - doesn't matter for 1 hit) reduces the FP rate very significantly and maximizes F1 and F2 for SNPs in all scenarios.
However, for indels, the situation is different for balance and unbalanced coverage cases:
- For a balanced coverage, indels F1 and F2 are highest when we ignore ALT (SITE-* rows) - that might be explained by different indel representation.
- For unbalanced coverage, indels are best when we ignore the large A5 cohort.
Eyeballing those indels that are mistakenly removed and searching against the gnomad database, nearly all of them are homopolymers and LCR, and most likely artefacts in the thuth sets (both ICGC-MB and COLO829).
Homopolymeric artefacts might occur in different representations in VCFs, so comparing ALT will decrease the accuracy. For that reason, we will annotate SNPs by ALT, and indels by SITE.
We will also stick with "1 sample" rule as it gives the highest result, given that most of additional FN are artefacts.
Refined evaluation will look as the following:
cd test pon_anno mb-ensemble.vcf.gz -h 1 -o mb-ensemble.1sample.vcf.gz pon_anno mb-ensemble.vcf.gz -h 2 -o mb-ensemble.2sample.vcf.gz pon_anno mb-ensemble.vcf.gz -h 3 -o mb-ensemble.3sample.vcf.gz eval_vcf mb mb-ensemble.vcf.gz mb-ensemble.1sample.vcf.gz -o eval_new -j 30
And others samples:
cd test_mb_100_50 pon_pipeline mb-ensemble.vcf.gz -o pon_new -j 30 -h1,2,3 eval_vcf pon_new/pon_filter/mb-ensemble-ann-n*.vcf.gz mb-ensemble.vcf.gz -o pon_new/eval cd test_mb_300_50 cd test_colo
Recalling on low frequency
However, this approach is rather a germline filter than an artefact filter. We can cover germline calls by filtering against gnomAD, and indeed they overlap a lot. To cover artefacts, we should try to recall variants from normals allowing for 5% allele frequency (like in Matt's slides).
Using Caveman to generate Matt's panel.
cd /data/cephfs/punim0010/extras/vlad/synced/umccr/vcf_stuff/vcf_stuff/panel_of_normals git clone https://github.com/cancerit/CaVEMan cd CaVEMan ./setup.sh install export PATH=/data/cephfs/punim0010/extras/vlad/synced/umccr/vcf_stuff/vcf_stuff/panel_of_normals/CaVEMan/install/bin:/data/cephfs/punim0010/extras/vlad/synced/umccr/vcf_stuff/vcf_stuff/panel_of_normals/CaVEMan/bin:$PATH
Running following the generation of unmatchedNormal.bed.gz docs:
generateCavemanUMNormVCF\ --out-file play/unmatchedNormal.vcf \ --reference GRCh37.fa \ --split-sections GRCh37_noalt.bed \ --species human \ --spp-vers GRCh37d5 \ --bam-files ../work/bam/*.bam
Trying to call with AF 5% in bcbio:
bcbio_nextgen.py -w template bcbio.yaml bcbio.csv /data/cephfs/punim0010/data/Results/Patients/2019-02-11/SFRC01122/final/PRJ190085_SFRC01122B/PRJ190085_SFRC01122B-ready.bam bcbio_nextgen.py ../config/bcbio.yaml -n 30
Not working: bcbio errors out.
(iii) To create the panel of normals (PoN), call on each normal sample using Mutect2's tumor-only mode and then use GATK4's CreateSomaticPanelOfNormals. This contrasts with the GATK3 workflow, which uses an artifact mode in MuTect2 and CombineVariants for PoN creation. In GATK4, omitting filtering with FilterMutectCalls achieves the same artifact mode.
Single normal sample for panel of normals (PoN) creation To create a panel of normals (PoN), call on each normal sample as if a tumor sample. Then use CreateSomaticPanelOfNormals to output a PoN of germline and artifactual sites. This contrasts with the GATK3 workflow, which uses CombineVariants to retain variant sites called in at least two samples and then uses Picard MakeSitesOnlyVcf to simplify the callset for use as a PoN.
gatk --java-options "-Xmx4g" Mutect2 \ -R /data/cephfs/punim0010/extras/umccrise/genomes/GRCh37/GRCh37.fa \ -I /data/cephfs/punim0010/data/Results/Tothill-A5/2018-08-11/final/PRJ180495_E199-B01-D/PRJ180495_E199-B01-D-ready.bam \ -tumor PRJ180495_E199-B01-D \ -O work/recall/PRJ180495_E199-B01-D.vcf.gz
gatk --java-options "-Xmx4g" CreateSomaticPanelOfNormals \ -vcfs work/recall/PRJ180495_E199-B01-D.vcf.gz \ -O work/gatk_combined_pon.vcf.gz
We can check if the population frequency of the variant matches its frequency in the panel, and only apply it if it is significantly higher. Because otherwise it might be not an artefact, but a common pre-existing variant.
E.g. if we see more than 5 times the dbSNP GMAF or gnomAD AF - then keep it in the panel. For instance, if the GMAF is 0.05 - which means it's common to see it 1 in 20 by chance - we'd expect around 2,5 samples in the panel to have it. If that's the case, we should remove it from the panel. If we see it more often - then it's probably an artefact. Though we are not interested in germline calls as well, so maybe do not need this approach.
2 differrent NA12878 as a T/N pair
Two NA12878 samples running as a T/N pair ideally should yeild zero somatic variants. However, in real world the output will be non-empty. Since all real variants should cancel out, the remainings are assumed to be false positives. We count how many of them are removed by applying the panel of normals, making sure that it doesn't reduce the true positive count significantly.
TP FP FN Precision Sensitivity TP FP FN Precision Sensitivity 1VDvs2VD 217 2,482 1,007,778 8.04% 0.02% 30 858 98,346 3.38% 0.03% 1VDvs2VD, 1 hit in PoN 78 150 1,007,917 34.21% 0.01% 4 14 98,372 22.22% 0.00% 1VDvs2VD, 2 hits in PoN 65 134 1,007,930 32.66% 0.01% 1 11 98,375 8.33% 0.00%
The reduction of false positives is significant.
Just for the demonstration, applying to original variant calls, sensititity drops significancatly because of the large amount of germline variants shared between these samples and the samples in the panel:
1VD 1,007,291 1,948,935 704 34.07% 99.93% 97762 240731 614 28.88% 99.38% 1VD, 2 hits 292,624 73,936 715,371 79.83% 29.03% 42172 87484 56204 32.53% 42.87% 1VD, 1 hit 171,940 37,439 836,055 82.12% 17.06% 32507 84206 65869 27.85% 33.04%