Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GT splitting in bcftools norm #344

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

jpiper
Copy link

@jpiper jpiper commented Oct 28, 2015

Hi All,

Second time lucky (first pull request was polluted with stuff). It appears to me there's a pretty serious flaw with GT flag splitting in the normalisation. This is what I'm seeing:

Splitting multiallelic sites such as

chr1    113938929   .   T   A,C 52.00   LowGQX;HighDPFRatio SNVSB=0.0;SNVHPOL=21    GT:GQ:GQX:DP:DPF:AD 1/2:7:5:5:8:0,4,1

with a GT of 1/2 with bcftools norm -m- <in> incorrectly yields two records with a GT 1/0 and 0/1, which should really be be 1/. and ./1

chr1    113938929   .   T   A   52  LowGQX;HighDPFRatio SNVSB=0;SNVHPOL=21  GT:GQ:GQX:DP:DPF:AD 1/0:7:5:5:8:0,4
chr1    113938929   .   T   C   52  LowGQX;HighDPFRatio SNVSB=0;SNVHPOL=21  GT:GQ:GQX:DP:DPF:AD 0/1:7:5:5:8:0,1

In the integration tests, I also notice sites with a 2 getting split to a 0 and a 1 site...

20  275 .   A   C,G 999 PASS    INDEL;AN=2;AC=0,2   GT:PL:DP:FGF:FGI:FGS:FSTR   2:0,0,0:0:1e+06,2e+06,3e+06:1111,2222,3333:A,BB,CCC:WORD    2:0,0,0:0:1e+06,2e+06,3e+06:1111,2222,3333:A,BB,CCC:WORD

to

20  275 .   A   C   999 PASS    INDEL;AN=2;AC=0 GT:PL:DP:FGF:FGI:FGS:FSTR   0:0,0:0:1e+06,2e+06:1111,2222:A,BB:WORD 0:0,0:0:1e+06,2e+06:1111,2222:A,BB:WORD
20  275 .   A   G   999 PASS    INDEL;AN=2;AC=2 GT:PL:DP:FGF:FGI:FGS:FSTR   1:0,0:0:1e+06,3e+06:1111,3333:A,CCC:WORD    1:0,0:0:1e+06,3e+06:1111,3333:A,CCC:WORD

which should surely be . and 1 (For humans, you wouldn't expect a haploid call in this specific region - the spec says we would only expect this in M or Y for humans)

I have patched here to replace

gt[j] = bcf_gt_unphased(0) | bcf_gt_is_phased(gt[j]); // set to REF

with

gt[j] = bcf_gt_missing | bcf_gt_is_phased(gt[j]); // set to REF

and altered the expected output from the unit tests accordingly, in case you find this is actually a bug rather than intended behaviour.

@mp15
Copy link
Member

mp15 commented Oct 28, 2015

Technically shouldn't it be the symbolic allele <*>, the dot allele being missing data?

@pd3
Copy link
Member

pd3 commented Oct 28, 2015

Not <*>, but maybe *.

@jpiper
Copy link
Author

jpiper commented Oct 28, 2015

Ideally, yes. Unfortunately the VCF specification only allows non-negative integers or . in this field though, so this was the best I could do without breaking spec

@pd3
Copy link
Member

pd3 commented Oct 28, 2015

I think @mp15 meant to use * ALT allele which is a valid way how to specify missing alleles due to conflicting variants. A valid VCF notation then would be:

REF=A  ALT=C,*  GT=0/2

@sachalau
Copy link

sachalau commented Feb 26, 2018

Hello @mp15 @pd3 @jpiper

I am using bcftools norm at the moment and I think I have observed a problematic case during my use of the behaviour of bcftools norm for GT splitting, just as @jpiper described.

I am splitting multi allelic haploid sites to filter on the ratio of alternate allele observation counts over read depth. However when I try merging bi allelic sites into multi allelic sites after the filtering, the conflicting GT are badly resolved for everything but the first allele I think.

Consider this :

NC_000962.3	1849	.	C	A,T	3709.44	PASS	INFO_STUFF	GT:DP:AD:RO:QR:AO:QA:GL	0:247:247,0,0:247:9070:0,0:0,0:0,-816.14,-816.14	1:58:1,57,0:1:29:57,0:2142,0:-190.113,0,-192.721	1:38:18,20,0:0:0:38,0:1380,0:-124.453,0,-124.453	2:42:12,0,30:0:0:0,42:0,1578:-142.262,-142.262,0

(I modified by hand the AD value for my test cases)

If I filter to keep only calls supported by more than 0.75 of the observation with the following command :

bcftools norm -m -any input.vcf | bcftools filter -s "freq" -S "." -e "GT='a' & FORMAT/AD[1]/FORMAT/DP < 0.75" | bcftools norm -m +any

I get the following result:

NC_000962.3	1849	.	C	A,T	3709.44	freq	INFO_STUFF	GT:DP:AD:RO:QR:AO:QA:GL	0:247:247,0,0:247:9070:0,0:0,0:0,-816.14,-816.14	1:58:1,57,0:1:29:57,0:2142,0:-190.113,0,-192.721	.:38:18,20,0:0:0:38,0:1380,0:-124.453,0,-124.453	0:42:12,0,30:0:0:0,42:0,1578:-142.262,-142.262,0	

The third sample was correctly set to GT='.' whether the fourth one is now "0" where I would like him to be "."

By any chance do you think the proposed enhancement here is coming for bcftools ? Or by chance would you know an easier way of filtering each sample based with an ALT call based on the frequency of the most frequent allele observation like this ?

For now the only fix I have thought about is adding an extra filtering for the reference calls that seem to be "erroneous", such as this :

bcftools filter -s "feq" -S "." -e "GT='r' & FORMAT/AD[0]/FORMAT/DP < 1-0.75"

but I think this might be a bit dangerous. Like this, both entries for the fourth sample in the MWE I presented are set to "." and stay "." when merged.

Thanks a lot for your help and thanks a lot for the tool !

@mdkeehan
Copy link

I think its a fundamental problem with representing multi allelics as binary.
Three alleles A,B,C
Gives the following genotypes:

alleles A B C
A A/A A/B A/C
B B/B B/C
C C/C

If we try and convert to two bi-allelics using A as the reference we can represent the following genotypes

multiple biallelics R/R R/A A/A
alt=B A/A A/B B/B
alt=C A/A A/C C/C

We lose the non-reference mutual heterozygotes i.e. the B/C genotype class
plus the AA genotype gets represented twice.

The * allele code is for long deletions causing a third allele.
But what we if we have two distinct long deletions overlapping a SNP site?

Variant graphs will sort this out in the long term.

Short term: Filter them out and hope it's not a signal lost for analysis in GWAS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants