-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix GT splitting in bcftools norm #344
base: develop
Are you sure you want to change the base?
Conversation
Technically shouldn't it be the symbolic allele <*>, the dot allele being missing data? |
Not |
Ideally, yes. Unfortunately the VCF specification only allows non-negative integers or |
I think @mp15 meant to use
|
I am using bcftools norm at the moment and I think I have observed a problematic case during my use of the behaviour of bcftools norm for GT splitting, just as @jpiper described. I am splitting multi allelic haploid sites to filter on the ratio of alternate allele observation counts over read depth. However when I try merging bi allelic sites into multi allelic sites after the filtering, the conflicting GT are badly resolved for everything but the first allele I think. Consider this :
(I modified by hand the AD value for my test cases) If I filter to keep only calls supported by more than 0.75 of the observation with the following command :
I get the following result:
The third sample was correctly set to GT='.' whether the fourth one is now "0" where I would like him to be "." By any chance do you think the proposed enhancement here is coming for bcftools ? Or by chance would you know an easier way of filtering each sample based with an ALT call based on the frequency of the most frequent allele observation like this ? For now the only fix I have thought about is adding an extra filtering for the reference calls that seem to be "erroneous", such as this :
but I think this might be a bit dangerous. Like this, both entries for the fourth sample in the MWE I presented are set to "." and stay "." when merged. Thanks a lot for your help and thanks a lot for the tool ! |
I think its a fundamental problem with representing multi allelics as binary.
If we try and convert to two bi-allelics using A as the reference we can represent the following genotypes
We lose the non-reference mutual heterozygotes i.e. the B/C genotype class The * allele code is for long deletions causing a third allele. Variant graphs will sort this out in the long term. Short term: Filter them out and hope it's not a signal lost for analysis in GWAS. |
Hi All,
Second time lucky (first pull request was polluted with stuff). It appears to me there's a pretty serious flaw with GT flag splitting in the normalisation. This is what I'm seeing:
Splitting multiallelic sites such as
with a GT of
1/2
withbcftools norm -m- <in>
incorrectly yields two records with a GT1/0
and0/1
, which should really be be1/.
and./1
In the integration tests, I also notice sites with a
2
getting split to a0
and a1
site...to
which should surely be
.
and1
(For humans, you wouldn't expect a haploid call in this specific region - the spec says we would only expect this in M or Y for humans)I have patched here to replace
with
and altered the expected output from the unit tests accordingly, in case you find this is actually a bug rather than intended behaviour.