Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF: Clarify * as overlapping allele, not overlapping base #437 #464

Merged
merged 4 commits into from
Jun 22, 2020

Conversation

jmmut
Copy link
Contributor

@jmmut jmmut commented Dec 12, 2019

This PR aims to clarify some points discussed in #437 :

  • * is a complete allele. It must not be mixed with other ACTG bases.
  • Genotypes are meant to be deltas: changes from the reference sequence. Thus, a sample may have a genotype 0 that contradicts another record. Genotype 0 means "no change unless other records say so".

This approach is controversial as can be seen in the issue, but the disadvantages look less important than the following points:

  • The notation with base-* is more explicit and sometimes clearer but imposes a record interdependency that is not really aligned with VCF in terms of record filtering or VCF merging.
  • Following on this independence of records, the genotype 0 has been historically understood by a big part of the community and tools as "this record reports no change for this sample".

There are some points (raised in the issue) that should be addressed before merging this PR:

  • Should we explicitly say that a "true reference" can not be stated in VCF? or should we explicitly explain a way to state a "true reference" with the REF-only blocks? (I'm not sure how the latter can be done)
  • Should we keep * as overlapping deletion? with the above definition of genotype 0, changing * to be overlapping allele would be redundant. I favour keeping as overlapping deletion even if just for the sake of avoiding having multiple ways to do the same.
  • this PR only changes 4.3, which is what I think was agreed regarding version handling, where we try not to modify old versions except in critical cases, but the clarification of genotype 0 looks a bit critical to me.

@hts-specs-bot
Copy link

Changed PDFs as of 2614923: VCFv4.3 (diff).

@lbergelson lbergelson self-assigned this Dec 12, 2019
@lbergelson lbergelson added the vcf label Dec 12, 2019
VCFv4.3.tex Outdated
@@ -311,8 +311,7 @@ \subsubsection{Fixed fields}

\item ALT --- alternate base(s): Comma separated list of alternate non-reference alleles.
These alleles do not have to be called in any of the samples.
Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or a MISSING value `.' (no variant) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends.
The `*' allele is reserved to indicate that the allele is missing due to an overlapping deletion.
Options are base Strings made up of the bases A,C,G,T,N (case insensitive) or the `*' symbol (allele missing due to overlapping deletion) or a MISSING value `.' (no variant) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allele missing due to overlapping deletion

This should be changed to "allele missing due to other overlapping alleles", not limited to deletions. For example

chr1  100  A  G
chr1  100  A  T,*

should be technically allowed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be changed to "allele missing due to other overlapping alleles", not limited to deletions

I agree with this since it would allow the option of using 0 as an assertion of reference. Possibly replace "missing" with "unspecified" to avoid confusion with .?

@dancooke
Copy link

If the meaning of * is changed to indicate "overlap with any allele", as @lh3 advocated, then there is no need to enforce GT being a delta. This is almost surely what the original intent behind * was. The meaning that GT has historically been adopted (by some) is a symptom of the problems with previous versions of VCF and should not be part of the solution; I don't think it can be understated how important it is to have a mechanism to assert the reference allele at an arbitrary haplotype (i.e. not just homref), especially now that we're adding more and more phase information to variant calls and working with non-diploid samples. Any other mechanism than GT=0 would be ad-hoc and confusing.

Standardising this definition of GT=0 would be catastrophic for me as it would render all of my VCF-based code (and output) broken. It would break pysam (sample.alleles) and vcflib (Variant::alleles), since these libraries assume GT=0 implies REF. Please do not do this - I'm certain this will come back to bite us. I'd much rather leaving the spec ambiguous on GT=0 (even mentioning this is the case) but clarify *. Then GT=0 can be specified in the next version of VCF to mean assertion - as it always should have been.

@mlin
Copy link
Member

mlin commented Dec 24, 2019

I agree with @dancooke that it would be unfortunate to incorporate the ambiguous interpretation of GT=0 in a way that'd invite developers to build new tools based on it. I do note the PR here introduces it with,

Due to underspecifying in older versions and for compatibility with existing tools, ...

Perhaps this language could be sharpened just a bit, to acknowledge it as part of the past and present but discourage it for the future.

@lh3
Copy link
Member

lh3 commented Dec 24, 2019

@mlin A summary of the previous thread. A 0 allele could potentially have three meanings:

  1. The REF allele in its entirety.
  2. No change to the REF allele at the current record.
  3. REF allele up to the next overlapping record (i.e. partial REF allele).

We all think VCF follows def 1, but with that definition, most existing VCFs would be wrong. As @pd3 pointed out, VCF4.1 actually adopted def 2 without us being aware of it. VCF4.2 introduced the * allele, which made def 3 an option. GATK practically uses this definition (confirmed by core GATK developers), again without us being aware of it.

Within the scope of v4.3, we can't make def 1 work (and I believe we won't ever make it work in practice, but that is a debate for v4.4). Both def 2 and 3 have already been widely used for years. Even for the sake of backward compatibility, we have to allow them both. This PR just acknowledges this unfortunate fact, not changing the way we use VCF.

@mlin
Copy link
Member

mlin commented Dec 24, 2019

@lh3 I too see no practical alternative to acknowledging the subtle meanings, or lack thereof, of GT=0 at this point. I just don't want the see the spec endorse/recommend/require the plain appearance of contradictory genotypes, the undesirability of which I think is implied by "Due to underspecifying in older versions and for compatibility with existing tools..."; but future developers who haven't had the chance to enjoy #437 might need a bit more context for that lede (I for one certainly did enjoy it, but regrettably did not have bandwidth to participate at the time 😅).

FWIW, GLnexus made the tradeoff of sometimes failing to call a subset of reference bases that it should be able to if not for the other overlapping alleles; which I felt preferable to the appearance of contradictory genotypes. There are definitely users who don't want to see the resulting half-calls like .|1, but (i) some subset of users will have legitimate complaint about any representation here and (ii) this is a relatively mild one of the numerous representation problems in large-cohort gVCF merging. Single-sample or family-level callers can/should make a different tradeoff of course.

@lh3
Copy link
Member

lh3 commented Dec 24, 2019

@mlin With v4.3, your GLnexus example is better to add the * allele and to replace all . with the allele number corresponding to * (def 3); or alternatively to turn all . to 0 (def 2). We had discussed the . representation in that thread, but my impression is this proposal quickly went under as it interprets . differently from the spec and thus adds more confusion.

In the lack of a consensus, this PR does a very good job to acknowledge the dilemma. If we want to go beyond this PR, we need to reach a consensus and phrase it as a recommended practice.

@dancooke
Copy link

Within the scope of v4.3, we can't make def 1 work (and I believe we won't ever make it work in practice, but that is a debate for v4.4). Both def 2 and 3 have already been widely used for years. Even for the sake of backward compatibility, we have to allow them both. This PR just acknowledges this unfortunate fact, not changing the way we use VCF.

Perhaps the fairest and most pragmatic way forward is to keep this pull request with the suggested change to * (i.e. overlap with any allele), but concurrently branch v5.0 with "GT=0 means REF" specified. That way:

  • Anyone using GT=0 in the GATK way is happy, and the GATK developers can move towards the next version at their leisure. The only change that the GATK team will need to make now is to add support for the corrected * definition.
  • Anyone currently using GT=0 means REF will need to switch to VCF v5.0, but at least no major code changes are needed (only dealing with cases where . is currently used rather than * or 0, as @mlin mentioned).
  • Any v5.0 VCF is automatically v4.3 compatible (with the corrected * definition), so anyone switching to v5.0 can still use GATK tools.
  • v4.3+ and v5.x could otherwise be kept in sync up until some predefined time (to give everyone fair chance to make the switch). It should even be fairly straightforward to provide a conversion utility.

@mlin
Copy link
Member

mlin commented Dec 24, 2019

@lh3 Yes, currently it's not appropriate for GLnexus to use * because of how the spec constrains that to represent a deletion, but this PR would make that a viable option so that's exciting. However, there's another case where GLnexus can generate a half-call where . seems to remain appropriate [1], so then I have to consider whether to inflict on our users two different kinds of half-calls, or just continue using .. I'll continue to monitor developments here naturally.

Regarding GT=0, I understand that defs 2/3 are widely used, but I personally will continue to avoid generating the plain appearance of three overlapping alleles at a diploid position, even if the contradiction is technically defined away. I feel a strong enough aversion to that to prefer my current trade-off of failing to call some of the reference bases. But, I acknowledge this calculation would differ in another application (other than gVCF merging) where inclusion/exclusion judgments and missing data weren't so intrinsically pervasive.

I don't agree that "this PR does a very good job to acknowledge the dilemma" with the vague allusion "Due to underspecifying in older versions and for compatibility with existing tools...". I would append something like (very rough) "This is one way of resolving the troubling appearance of contradictory genotypes when overlapping ALT alleles are called with 0/1 genotypes in nearby records." The last edits in the VCF records section get to this, but these edits are spaced widely apart...ironic...


[1] where (i) the input gVCF has called a 0/1 genotype, but (ii) the ALT was not of sufficient quality for inclusion in the output joint pVCF site, and (iii) the output site does exist to represent other high-quality allele(s) at the same position. Most of the time the statistical re-genotyping rounds these down to 0/0, but the in-between case can happen.

@jmmut
Copy link
Contributor Author

jmmut commented Jan 22, 2020

TL;DR: I will make the next changes:

  • keep the statement that GT=0 means reference unless other records say otherwise, rewording the contradiction solution.
  • say that * means overlapping allele.
  • add a small note for the complications of filtering/merging/realigning VCFs before doing haplotype reconstruction in overlapping records (see below).

Most people (@lh3, @dancooke, @pd3, @lbergelson, @mlin) seem ok with making * mean overlapping allele and as I'm going to make that change to the PR, I want to state the consequences I see of it.

In general I agree it's more intuitive than restricting to deletions, but I still think that's almost redundant with the "delta GT=0" interpretation, and it makes all of these similar options valid, with slightly different meanings:

POS  REF   ALT INFO FORMAT s1  s2
10   GTATA G,* .    GT     2|1 2/2
12   A     C,* .    GT     1|2 1/0
14   A     T,* .    GT     2|2 0/1

and

POS  REF   ALT INFO FORMAT s1  s2
10   GTATA G   .    GT     0|1 0/0
12   A     C,* .    GT     1|2 1/0
14   A     T,* .    GT     2|2 0/1

and

POS  REF   ALT INFO FORMAT s1  s2
10   GTATA G   .    GT     0|1 0/0
12   A     C   .    GT     1|0 1/0
14   A     T   .    GT     0|0 0/1

If anyone wants "* is overlapping allele" but doesn't want to allow the first example, I can imagine that they are thinking actually in "* is a downstream overlapping allele"; but I (and others such as @d-cameron) don't think it's a good idea to change the syntax and/or meaning depending on which record appears first, so I will stick to "overlapping allele".

Strictly speaking, those 3 examples wouldn't mean exactly the same, so different syntax can be used for slightly different purposes. If you do haplotype reconstruction, the first record of the first example doesn't specify at all the genotype of s2 from pos 10 to 14, and the first record of the second example says implicitly (because it depends on looking at the other records) that s2 is GT?T?/GT?T?.

Naive-filtering (just removing a record without changing others, as I think most VCF filters work) the first example is ok. Naive-filtering the second and third example can yield incorrect data. For instance, filtering the second record would make the first record mean s2 is GTAT?/GTAT?. A correct filter would need to change the first record (and any overlapping record) to 10 GTATA G,* . GT 2|1 2/2.

Naive-merging VCFs in the first example is ok. Naive-merging the second and third example could make a given genotype to be . or 0 depending on which record you are looking at, but this looks kind of ok to me.

Naive-realigning VCFs in all examples is ok as long as the "delta GT=0" interpretation is used. See "Naive realignment" below for an explanation of this.

In summary, the second and third examples (avoiding * in the first record) convey a bit more information but are more complex to handle (in most cases, see realignment below) because it relies on record interdependence to be correct. As a reminder, all this only applies if you are doing haplotype reconstruction; if you only care about samples "having a variant or not" everything is fine.


Naive realignment

I still don't like "delta GT=0", but we can't remove it if we want to support naive-realignment (this is basically a note to myself: the next example won't work with "GT=0 is true reference" regardless of "* is overlapping allele" or "* is overlapping deletion"):

reference: 123456
           GTATAC
s1:        GTA--C
s2:        GCATAC

original variants:
POS   REF   ALT  s1   s2
2     T     C    0/0  1/1
3     ATA   A    1/1  0/0

realigned to the left (by `vt normalize`):
POS   REF   ALT  s1   s2
1     GTA   G    1/1  0/0    // this GT=0 is NOT true reference! needs to be delta GT=0
2     T     C    0/0  1/1

SPDY realigned (ignoring 1-base/0-base issues):
POS   REF   ALT  s1   s2
1     GTATA GTA  1/1  0/0     // this GT=0 is NOT true reference! needs to be delta GT=0
2     T     C    0/0  1/1

I have never heard of anyone taking into account the surrounding variants to realign (except GATK 4.1, which only left-aligns until the previous variant), so "GT=0 is true reference" could break for anyone doing naive-realignment and haplotype reconstruction. It's arguable for next versions whether it's preferable to prioritise correctness of naive-filtering + haplotype-reconstruction or naive-realignment + haplotype-reconstruction ("GT=0 is true reference" vs "delta GT=0").

@lh3
Copy link
Member

lh3 commented Jan 22, 2020

I am ok with the change.

Naive-filtering the second and third example can yield incorrect data. ... A correct filter would need to change the first record (and any overlapping record) to 10 GTATA G,* . GT 2|1 2/2.

Are you saying the first example hard to filter or the second/third hard to filter? I think we can independently remove any line in the third example, or remove POS12/14 in the second example. The first example is harder to deal with.

I still don't like "delta GT=0", but we can't remove it if we want to support naive-realignment

"delta GT=0" is often an invariant regardless of the record dependencies. It is arguably the cleanest in theory. In practice, this definition may be inconvenient as we can't count the reference allele from a single record. The problem with "GT=0 is true reference" goes beyond realignment and filtering. The second half of #437 highlights that. The discussion was complicated by base * which is not an issue for now with this PR. Use the old example again:

POS  REF   ALT INFO FORMAT s1  s2
10   GTATA G,* .    GT     2/1 ?/?
12   A     C,* .    GT     1/2 0/1
14   A     T,* .    GT     2/2 0/1

If you take "GT=0 is true reference", the genotype of s2 at pos 10 should be either "0/2" or "2/2", but you don't know due to unknown phasing. If you take "GT=0 is true reference up to the next record", the s2 genotype is unambiguously "0/0".

@jmmut
Copy link
Contributor Author

jmmut commented Jan 22, 2020

Are you saying the first example hard to filter or the second/third hard to filter? I think we can independently remove any line in the third example, or remove POS12/14 in the second example. The first example is harder to deal with.

What I mean is that having the second or third examples (let's use the third):

POS  REF   ALT INFO FORMAT s1  s2
10   GTATA G   .    GT     0|1 0/0
12   A     C   .    GT     1|0 1/0
14   A     T   .    GT     0|0 0/1

if you remove the second record it becomes this:

POS  REF   ALT INFO FORMAT s1  s2
10   GTATA G   .    GT     0|1 0/0
14   A     T   .    GT     0|0 0/1

which is incorrect, as now the first record 10 GTATA G . GT 0|1 0/0 is the only one talking about POS 12 and it says that it's A/A for s2. Before filtering, the VCF said A/C. If you remove the second record, you should change the first record to 10 GTATA G,* . GT 2|1 2/2, which now says that s2 has an unspecified genotype. Subsequent haplotype reconstruction would yield incorrect data.


The first example is harder to deal with.

I don't understand why. In the first example this kind of naive filtering doesn't yield incorrect data. Admittedly it loses a bit more information than needed, but you are filtering data anyway.

Related to this, for your second point (the genotype of s2 at pos 10 should be either "0/2" or "2/2" due to unknown phasing), in my understanding, you can say 2/2 anyway, which does not provide all the information that is available to you as the VCF writer, but as @lbergelson pointed in the #437, we could add a different field to add info for haplotype reconstruction. There may be other harder arguments for keeping "delta GT=0" but IMHO this looks like a small inconvenience, not a hard VCF requirement. Small inconvenience like stating genotypes between variants, aka reference blocks.

Even if that field is not eventually added, I really really think that providing less information that is correct, is better than providing more information in a brittle way (brittle in the sense that only super careful handling preserves correctness).

@lh3
Copy link
Member

lh3 commented Jan 22, 2020

if you remove the second record it becomes this, ... which is incorrect

It is correct if you take "delta GT=0". You have to take the "delta GT=0" definition for the unfiltered VCF in the third example; otherwise that example is incorrect. Actually GT=0 means three different things in your three examples.

I don't understand why. In the first example this kind of naive filtering doesn't yield incorrect data.

In the first example, if you remove POS12, you get

POS  REF   ALT INFO FORMAT s1  s2
10   GTATA G,* .    GT     2|1 2/2
14   A     T,* .    GT     2|2 0/1

which is inconsistent, because s2 should have GT=0/2 at POS10, not 2/2. Its genotype now becomes deterministic.

in my understanding, you can say 2/2 anyway, which does not provide all the information that is available to you as the VCF writer

If you adopt the "delta GT=0" definition or the "GT=0 is reference allele up to the next record" definition, you won't have this problem in the first place.

PS: although not clarified in the spec, * practically means an unspecified allele that is not REF or ALT on the current line. I believe everyone using * is implicitly assuming this condition. If you use "2/2" in my example, * means an unspecified allele or the REF allele. It has two meanings and is harder to use properly. Generally, in an unphased VCF, the complete reference allele may be non-deterministic – we don't know if a sample has the complete reference allele or not based on VCF. If we can't determine the complete reference allele, we can't define GT=0 as the complete reference allele. It is ok to define GT=0 as the reference allele up to the next record.

@mlin
Copy link
Member

mlin commented Jan 22, 2020

I think there may be a little difference in whether we consider filtering/deleting a record means the affected bases should revert to (i) reference or (ii) some non-called/half-called state. @lh3 last comment on the first and third examples seems to be using (i), while previously @jmmut seemed to suggest (ii) -- and further, that * be used to represent the state (as opposed to . perhaps).

I think this is interesting and debatable between (i) and (ii) and if (ii), I'm not convinced of the last point.

@hts-specs-bot
Copy link

Changed PDFs as of 211d2e9: VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of f378568: VCFv4.3 (diff), VCFv4.4 (diff).

@jmmut
Copy link
Contributor Author

jmmut commented May 28, 2020

As agreed on the calls, I changed this PR to be only about the rejection of *-base, and work on delta-GT separately.

Also, copied the *-base rejection in VCFv4.4.

@jmmut jmmut added this to the VCF v4.3 final revision milestone May 28, 2020
@yfarjoun
Copy link
Contributor

can we have a regular expression here? ([ACGTacgt]+)|(*)|(.)

@mlin
Copy link
Member

mlin commented May 28, 2020

@jmmut Thanks for your work on this; I was excited about generalizing the star allele beyond deletions, and I hope we won't lose too much momentum around that.

@jmmut jmmut force-pushed the clarify-0-and-overlapping-alleles branch from f378568 to b76900e Compare June 15, 2020 08:25
@hts-specs-bot
Copy link

Changed PDFs as of b76900e: VCFv4.3 (diff), VCFv4.4 (diff).

@jmmut
Copy link
Contributor Author

jmmut commented Jun 15, 2020

I added the sentence

In other words, the ALT field must be a symbolic allele, or a breakend replacement string, or match the regular expression ^([ACGTNacgtn]+|\*|\.)$.

Below there's an explanation why I didn't include a regex for symbolic alleles and breakends.


I have been thinking about including the symbolic allele and the breakend notation also in the regex but I think it's best not to (at least in this PR) because that will delay this issue even more and would be a breaking change, because at the moment the ALT is not exactly defined. My guess is that the symbolic allele regex currently specified would be:

<[0-9A-Za-z!"#$%&'()*+./:;=?@[\\\]^_`{|}~-]+>

where only comma (,) and angle brackets (<, >) are excluded for parsing reasons (stated in Section 1.6.1.5 ALT). It's not mentioned for ALT but it would also be reasonable to exclude quotes (", ', '`'), backslash (\), parenthenesis ((, )), square brackets ([, ]), and braces ({, }). Those are also excluded from the contig regex, matching with SAM. This more constrained regex would be:

<[0-9A-Za-z!#$%&*+./:;=?@^_|~-]+>

Note that contigs disallow star ('*') and equals ('=') as first characters. I don't know the reason for that and I don't know if it applies to ALT.

The breakend regex includes the contig regex:

contig = [0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

([ACGTNacgtn]+\[contig:[0-9]+\[)|([ACGTNacgtn]+\]contig:[0-9]+\])|
(\[contig:[0-9]+\[[ACGTNacgtn]+)|(\]contig:[0-9]+\][ACGTNacgtn]+)

With all that, the whole ALT regex, including breakends would be:

^([ACGTNacgtn]+|\*|\.|<[0-9A-Za-z!#$%&*+./:;=?@^_|~-]+>|([ACGTNacgtn]+\[contig:[0-9]+\[)|([ACGTNacgtn]+\]contig:[0-9]+\])|(\[contig:[0-9]+\[[ACGTNacgtn]+)|(\]contig:[0-9]+\][ACGTNacgtn]+))$

or, with the contig expanded to try in regexr.com/56mtk :

^([ACGTNacgtn]+|\*|\.|<[0-9A-Za-z!#$%&*+./:;=?@^_|~-]+>|([ACGTNacgtn]+\[[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*:[0-9]+\[)|([ACGTNacgtn]+\][0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*:[0-9]+\])|(\[[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*:[0-9]+\[[ACGTNacgtn]+)|(\][0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*:[0-9]+\][ACGTNacgtn]+))$

And that's a javascript regex, I haven't gone through the process of making it a portable regex, nor escaping that for latex.

Although I'm hesitant to include a regex for ALT in this PR and for VCFv4.3, I agree that for 4.4 most (all?) fields should specify the valid characters in either a regex or in natural language. Right now I can see restrictions for the data fields (chrom, pos, id, etc) which implicitly specifies most metadata lines, but not custom metadata lines. Related issue about custom fields: #297

At the moment I don't see any other unspecified field.

@jmmut
Copy link
Contributor Author

jmmut commented Jun 15, 2020

@mlin I agree it's important to have a clear notation for overlapping alleles. I removed those changes from this PR but I still intend to improve that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants