VCF: Clarify * as overlapping allele, not overlapping base #437 #464
VCFv4.3.tex
@@ -311,8 +311,7 @@ \subsubsection{Fixed fields}
 \item ALT --- alternate base(s): Comma separated list of alternate non-reference alleles.
 These alleles do not have to be called in any of the samples.
-Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or a MISSING value `.' (no variant) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends.
-The `*' allele is reserved to indicate that the allele is missing due to an overlapping deletion.
+Options are base Strings made up of the bases A,C,G,T,N (case insensitive) or the `*' symbol (allele missing due to overlapping deletion) or a MISSING value `.' (no variant) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends.
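As a rough illustration of the revised wording in this hunk, the following Python sketch sorts a single ALT allele into the five options the sentence lists. The function names `classify_alt_allele` and `classify_alt_field` are hypothetical, the breakend test is a crude bracket check rather than the grammar from the section on breakends, and none of this comes from the spec text or from any VCF library.

```python
def classify_alt_allele(allele: str) -> str:
    """Classify one ALT allele string (hypothetical helper, for illustration only)."""
    if allele == ".":
        return "missing"       # MISSING value: no variant
    if allele == "*":
        return "overlapping"   # '*' stands alone as a whole allele
    if allele.startswith("<") and allele.endswith(">") and len(allele) > 2:
        return "symbolic"      # angle-bracketed ID string
    if "[" in allele or "]" in allele:
        return "breakend"      # assumption: any bracket marks a breakend string
    if allele and all(base in "ACGTN" for base in allele.upper()):
        return "bases"         # case-insensitive base string
    return "invalid"           # e.g. '*' mixed with bases


def classify_alt_field(alt: str) -> list:
    """Classify every allele in a comma-separated ALT field."""
    return [classify_alt_allele(allele) for allele in alt.split(",")]
```

Under this reading, `T,*` is a base allele followed by an overlapping-allele marker, while `A*` is rejected because `*` is no longer treated as a base.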
allele missing due to overlapping deletion
This should be changed to "allele missing due to other overlapping alleles", not limited to deletions. For example
chr1 100 A G
chr1 100 A T,*
should be technically allowed.
This should be changed to "allele missing due to other overlapping alleles", not limited to deletions
I agree with this since it would allow the option of using `0` as an assertion of reference. Possibly replace "missing" with "unspecified" to avoid confusion with `.`?
If the meaning of Standardising this definition of |
I agree with @dancooke that it would be unfortunate to incorporate the ambiguous interpretation of
Perhaps this language could be sharpened just a bit, to acknowledge it as part of the past and present but discourage it for the future. |
@mlin A summary of the previous thread. A
We all think VCF follows def 1, but with that definition, most existing VCFs would be wrong. As @pd3 pointed out, VCF4.1 actually adopted def 2 without us being aware of it. VCF4.2 introduced the Within the scope of v4.3, we can't make def 1 work (and I believe we won't ever make it work in practice, but that is a debate for v4.4). Both def 2 and 3 have already been widely used for years. Even for the sake of backward compatibility, we have to allow them both. This PR just acknowledges this unfortunate fact, not changing the way we use VCF. |
@lh3 I too see no practical alternative to acknowledging the subtle meanings, or lack thereof, of GT=0 at this point. I just don't want to see the spec endorse/recommend/require the plain appearance of contradictory genotypes, the undesirability of which I think is implied by "Due to underspecifying in older versions and for compatibility with existing tools..."; but future developers who haven't had the chance to enjoy #437 might need a bit more context for that lede (I for one certainly did enjoy it, but regrettably did not have bandwidth to participate at the time 😅). FWIW, GLnexus made the tradeoff of sometimes failing to call a subset of reference bases that it should be able to if not for the other overlapping alleles, which I felt preferable to the appearance of contradictory genotypes. There are definitely users who don't want to see the resulting half-calls like
@mlin With v4.3, your GLnexus example is better to add the In the lack of a consensus, this PR does a very good job to acknowledge the dilemma. If we want to go beyond this PR, we need to reach a consensus and phrase it as a recommended practice. |
Perhaps the fairest and most pragmatic way forward is to keep this pull request with the suggested change to
|
@lh3 Yes, currently it's not appropriate for GLnexus to use Regarding I don't agree that "this PR does a very good job to acknowledge the dilemma" with the vague allusion "Due to underspecifying in older versions and for compatibility with existing tools...". I would append something like (very rough) "This is one way of resolving the troubling appearance of contradictory genotypes when overlapping ALT alleles are called with [1] where (i) the input gVCF has called a |
TL;DR: I will make the next changes:
Most people (@lh3, @dancooke, @pd3, @lbergelson, @mlin) seem ok with making In general I agree it's more intuitive than restricting to deletions, but I still think that's almost redundant with the "delta GT=0" interpretation, and it makes all of these similar options valid, with slightly different meanings:
and
and
If anyone wants "

Strictly speaking, those 3 examples wouldn't mean exactly the same, so different syntax can be used for slightly different purposes. If you do haplotype reconstruction, the first record of the first example doesn't specify at all the genotype of s2 from pos 10 to 14, and the first record of the second example says implicitly (because it depends on looking at the other records) that s2 is

Naive-filtering (just removing a record without changing others, as I think most VCF filters work) the first example is ok. Naive-filtering the second and third example can yield incorrect data. For instance, filtering the second record would make the first record mean s2 is

Naive-merging VCFs in the first example is ok. Naive-merging the second and third example could make a given genotype to be

Naive-realigning VCFs in all examples is ok as long as the "delta GT=0" interpretation is used. See "Naive realignment" below for an explanation of this.

In summary, the second and third examples (avoiding * in the first record) convey a bit more information but are more complex to handle (in most cases, see realignment below) because it relies on record interdependence to be correct. As a reminder, all this only applies if you are doing haplotype reconstruction; if you only care about samples "having a variant or not" everything is fine.

Naive realignment

I still don't like "delta GT=0", but we can't remove it if we want to support naive-realignment (this is basically a note to myself: the next example won't work with "GT=0 is true reference" regardless of "
I have never heard of anyone taking into account the surrounding variants to realign (except GATK 4.1, which only left-aligns until the previous variant), so " |
I am ok with the change.
Are you saying the first example is hard to filter, or the second/third? I think we can independently remove any line in the third example, or remove POS12/14 in the second example. The first example is harder to deal with.
"delta GT=0" is often an invariant regardless of the record dependencies. It is arguably the cleanest in theory. In practice, this definition may be inconvenient as we can't count the reference allele from a single record. The problem with "GT=0 is true reference" goes beyond realignment and filtering. The second half of #437 highlights that. The discussion was complicated by base

POS  REF    ALT  INFO  FORMAT  s1   s2
10   GTATA  G,*  .     GT      2/1  ?/?
12   A      C,*  .     GT      1/2  0/1
14   A      T,*  .     GT      2/2  0/1

If you take "GT=0 is true reference", the genotype of s2 at pos 10 should be either "0/2" or "2/2", but you don't know due to unknown phasing. If you take "GT=0 is true reference up to the next record", the s2 genotype is unambiguously "0/0".
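To make the record interdependence in the table above concrete, here is a minimal Python sketch that works out which records sit inside the REF span of an earlier record. It uses REF-span overlap as a simple proxy for "overlapping allele", which is an assumption of this sketch and not a definition from the spec; the helper names `ref_span` and `overlapped_by_earlier` are hypothetical.

```python
# (POS, REF) pairs copied from the three-record table above.
records = [(10, "GTATA"), (12, "A"), (14, "A")]


def ref_span(pos, ref):
    """Inclusive 1-based interval of reference bases covered by a record."""
    return pos, pos + len(ref) - 1


def overlapped_by_earlier(recs):
    """Map each POS to the POS values of earlier records whose REF span covers it."""
    result = {}
    for i, (pos, _ref) in enumerate(recs):
        covering = []
        for prev_pos, prev_ref in recs[:i]:
            start, end = ref_span(prev_pos, prev_ref)
            if start <= pos <= end:
                covering.append(prev_pos)
        result[pos] = covering
    return result


print(overlapped_by_earlier(records))  # {10: [], 12: [10], 14: [10]}
```

The records at POS 12 and 14 both fall inside the span of the POS 10 record, so their genotypes cannot be read in isolation under any of the GT=0 definitions discussed here.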
What I mean is that having the second or third examples (let's use the third):
if you remove the second record it becomes this:
which is incorrect, as now the first record
I don't understand why. In the first example this kind of naive filtering doesn't yield incorrect data. Admittedly it loses a bit more information than needed, but you are filtering data anyway. Related to this, for your second point (the genotype of s2 at pos 10 should be either "0/2" or "2/2" due to unknown phasing), in my understanding, you can say 2/2 anyway, which does not provide all the information that is available to you as the VCF writer, but as @lbergelson pointed out in #437, we could add a different field to add info for haplotype reconstruction. There may be other harder arguments for keeping "delta GT=0" but IMHO this looks like a small inconvenience, not a hard VCF requirement. Small inconvenience like stating genotypes between variants, aka reference blocks. Even if that field is not eventually added, I really really think that providing less information that is correct is better than providing more information in a brittle way (brittle in the sense that only super careful handling preserves correctness).
It is correct if you take "delta GT=0". You have to take the "delta GT=0" definition for the unfiltered VCF in the third example; otherwise that example is incorrect. Actually GT=0 means three different things in your three examples.
In the first example, if you remove POS12, you get

POS  REF    ALT  INFO  FORMAT  s1   s2
10   GTATA  G,*  .     GT      2|1  2/2
14   A      T,*  .     GT      2|2  0/1

which is inconsistent, because s2 should have GT=0/2 at POS10, not 2/2. Its genotype now becomes deterministic.
If you adopt the "delta GT=0" definition or the "GT=0 is reference allele up to the next record" definition, you won't have this problem in the first place. PS: although not clarified in the spec, |
I think there may be a little difference in whether we consider that filtering/deleting a record means the affected bases should revert to (i) reference or (ii) some non-called/half-called state. @lh3's last comment on the first and third examples seems to be using (i), while previously @jmmut seemed to suggest (ii) -- and further, that I think this is interesting and debatable between (i) and (ii) and if (ii), I'm not convinced of the last point.
2614923 to 211d2e9
211d2e9 to f378568
As agreed on the calls, I changed this PR to be only about the rejection of *-base, and work on delta-GT separately. Also, copied the *-base rejection in VCFv4.4. |
can we have a regular expression here? |
f378568 to b76900e
I added the sentence
Below there's an explanation why I didn't include a regex for symbolic alleles and breakends. I have been thinking about including the symbolic allele and the breakend notation also in the regex but I think it's best not to (at least in this PR) because that will delay this issue even more and would be a breaking change, because at the moment the ALT is not exactly defined. My guess is that the symbolic allele regex currently specified would be:
where only comma (
Note that contigs disallow star ('*') and equals ('=') as first characters. I don't know the reason for that and I don't know if it applies to ALT. The breakend regex includes the contig regex:
With all that, the whole ALT regex, including breakends would be:
or, with the contig expanded to try in regexr.com/56mtk :
And that's a JavaScript regex; I haven't gone through the process of making it a portable regex, nor of escaping it for LaTeX. Although I'm hesitant to include a regex for ALT in this PR and for VCFv4.3, I agree that for 4.4 most (all?) fields should specify the valid characters in either a regex or in natural language. Right now I can see restrictions for the data fields (chrom, pos, id, etc.) which implicitly specify most metadata lines, but not custom metadata lines. Related issue about custom fields: #297. At the moment I don't see any other unspecified field.
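For readers who want something to experiment with, the following Python sketch checks only the part of ALT that this PR pins down: base strings, a lone `*`, or a lone `.`. The pattern `SIMPLE_ALT_ALLELE` and the function `is_simple_alt` are assumptions made for this illustration; they are not the regex proposed in this thread, and symbolic alleles and breakends are deliberately left out, as in the comment above.

```python
import re

# Assumed pattern: one or more of A, C, G, T, N in either case,
# or a lone '*', or a lone '.'; '*' never mixes with bases.
SIMPLE_ALT_ALLELE = re.compile(r"[ACGTNacgtn]+|\*|\.")


def is_simple_alt(alt_field: str) -> bool:
    """True if every comma-separated allele matches the assumed pattern in full."""
    return all(
        SIMPLE_ALT_ALLELE.fullmatch(allele) is not None
        for allele in alt_field.split(",")
    )
```

With this pattern, `is_simple_alt("T,*")` holds while `is_simple_alt("A*")` and `is_simple_alt("<DEL>")` do not. Whether `.` may appear alongside other alleles is not checked here.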
@mlin I agree it's important to have a clear notation for overlapping alleles. I removed those changes from this PR but I still intend to improve that. |
This PR aims to clarify some points discussed in #437:
`*` is a complete allele. It must not be mixed with other ACTG bases.
This approach is controversial, as can be seen in the issue, but the disadvantages look less important than the following points:
`*` is more explicit and sometimes clearer but imposes a record interdependency that is not really aligned with VCF in terms of record filtering or VCF merging.
There are some points (raised in the issue) that should be addressed before merging this PR:
`*` as overlapping deletion? With the above definition of genotype 0, changing `*` to be overlapping allele would be redundant. I favour keeping it as overlapping deletion, even if just for the sake of avoiding having multiple ways to do the same thing.