bcftools norm inconsistency with allelic depth (AD) when splitting multi-allelic sites #360
Comments
@freeseek I think I understand what you are asking for, but I'm not sure it would make sense as a core feature. What would you suggest that the option be called, and how could it be made clear that the depths reported don't actually refer to the reference allele? I can understand your frustration with the current output in this particular situation, but I'm not sure there is any better way to represent heterozygous sites where both alleles are non-reference. It seems like the "right" thing to do in this case might be to refuse to split the multiallelic line at all (perhaps on the basis that there is no evidence for it actually being multiallelic except for the fact that the reference contains a different allele than anything that was observed). Wouldn't your suggestion to report all of the AD in the multiallelic input on each output line be even more confusing? In that case it would appear that there was actually evidence for a heterozygous call vs the reference allele, when in fact there was no evidence for the reference at all. Also, it would mean that if you sum up all of the AD for a particular location, there would appear to be twice as much as there actually was in the input data.
@jrandall I totally see how what I am asking for would not make sense to everybody, but there are several situations where it would make sense, especially if allelic depth is being taken into account after the split. As a matter of fact, HAIL does the splitting leaving the sum of AD constant: Again, to reiterate, I would like the following VCF file:
to split as follows:
the same way HAIL does it, even if that requires an additional AD-specific parameter. Thank you! :-)
I wrote the following patch that adds the HAIL-like splitting functionality (using the "--keep-sum-AD" flag):
But I am not sure whether this will cause unintended behaviors in non-diploid cases. Could something like this be added to the main bcftools code?
@freeseek this is a super helpful option! Thanks for generating the nice patch!
Hi, thank you for contributing the patch. However, it would be nicer to have a general way to specify rules for arbitrary tags, something like what
I see your point, but I do not know how people would want to split the other FORMAT fields. Notice that HAIL:
I added this feature. Please check it out and let me know in case of problems. |
I hope this is not something that has already been asked before. Here is an example VCF file:
If I split this file with the command "bcftools norm -m -any" I obtain:
However now I am in the uncomfortable situation where each site is heterozygous despite the allelic depth supporting "1/1" calls rather than "0/1" calls. I am sure people will have different opinions about this, but part of the reason many want to split multi-allelic sites is to consider each alternate allele as an allele to be interpreted as that allele against every other allele. It would be great to have at least an option to properly re-format the AD field so that the total sum of the AD fields is maintained after splitting, so that instead of splitting:
It gets split instead as:
I hope this makes sense.
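To make the two behaviours concrete, here is a minimal Python sketch of the AD arithmetic only. It is not bcftools or HAIL code: the function names and the example depths are hypothetical, it assumes AD is a plain list of integers ordered as reference followed by each alternate allele, and it ignores missing values and non-diploid cases.

```python
from typing import List


def split_ad_default(ad: List[int]) -> List[List[int]]:
    """Per-alt split that keeps only REF depth and that ALT's depth.

    Depths of the other alternate alleles are dropped, so the sum of
    AD on each output record can be smaller than the input sum.
    """
    ref_depth = ad[0]
    return [[ref_depth, alt_depth] for alt_depth in ad[1:]]


def split_ad_keep_sum(ad: List[int]) -> List[List[int]]:
    """Per-alt split that keeps the sum of AD constant on each record.

    Depths of all alleles other than the current ALT (including the
    other ALTs) are folded into the first AD value.
    """
    total = sum(ad)
    return [[total - alt_depth, alt_depth] for alt_depth in ad[1:]]


# Hypothetical AD for a sample with no reads on REF and two ALT alleles.
example_ad = [0, 9, 26]

print(split_ad_default(example_ad))   # [[0, 9], [0, 26]]
print(split_ad_keep_sum(example_ad))  # [[26, 9], [9, 26]]
```

With the sum-preserving split, each output record carries the full input depth (35 in this example), so the first AD value no longer counts reference reads alone. Adding the depths of both records together gives twice the input total, which is the concern raised in the comments.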