Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF spec allows ALTs that are mixes of * and bases, but doesn't define how to interpret them #437

Open
tfenne opened this issue Aug 15, 2019 · 68 comments
Labels

Comments

@tfenne
Copy link
Member

tfenne commented Aug 15, 2019

The VCF spec contains this text on ALTs:

Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive)
or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends. The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion

I think probably the intention here was to allow either * or [ACGTN]+, but not a mixture of the two. But as written it theoretically allows alleles like *A*C*G*T*, but has nothing to say on how such alleles should be interpreted.

I've recently come across VCFs generated by Octopus that contain alleles that start with a * and end with base sequence, e.g. *AC. This is technically valid VCF, but I suspect most tooling won't support it (GATK tools fail on it). In their case they seem to be using it to indicate that the anchor base in a deletion record is covered by an upstream deletion, but not the whole allele.

I think the spec should either be updated to clarify this kind of usage or if the horse is already out of the gate, prohibit it.

@jmarshall jmarshall added the vcf label Aug 15, 2019
@pd3
Copy link
Member

pd3 commented Aug 15, 2019

Yes, the intention was to allow either * or [ACGTN]+ and the combination should be prohibited.

@tfenne
Copy link
Member Author

tfenne commented Aug 15, 2019

Some related discussion here that I missed previously: #151
And an issue/question in Octopus which uses * followed by bases to represent partially spanned alleles: luntergroup/octopus#75

@tfenne
Copy link
Member Author

tfenne commented Aug 30, 2019

@lbergelson, @pd3 & @jmmut pinging you guys since you are the currently listed VCF maintainers. Is there a path to making decision here in the near future? It's becoming a little awkward since Octopus uses alleles of \**[ACGT]+ and currently HTSJDK cannot read VCFs containing those alleles, which makes it a) impossible to use Picard of GATK tools with Octopus VCFs and b) near-impossible to write custom code that uses HTSJDK to process Octopus VCFs (see samtools/htsjdk#1413).

My sense is that since this isn't explicitly dis-allowed by the spec and that @dancooke has shown a good use case for alleles that start with * and continue with bases, and that it's never been disallowed by the specification, we should probably just formalize that it's ok and give examples of how it can be used to represent partially-spanned alleles. But the worst case is to have uses like this in the wild and reference implementations that don't want to support that syntax even though it's not in violation of the spec.

If folks are on board with formalizing the usage from octopus I'm happy to put together a PR to the spec. And if that's a non-starter could someone else make a PR to clarify the restrictions on use of * please?

@lbergelson
Copy link
Member

@tfenne I'm not totally sold on this. It's not explicitly disallowed by the spec, but it was definitely not intended.

I see how it's useful for figuring out a partially spanned allele, but it seems like it could add a lot of complication to an already error prone and poorly supported feature.

Is the intent to support alleles starting with * or do you want to allow * at any position in the read? We could potentially have very complicated alleles overlapping multiple things specified at other points in the vcf. It seems like supporting this will open up another set of equivalent ways to represent the same haplotypes in a vcf.

@dancooke
Copy link

@lbergelson I would argue that the 'symbolic' interpretation adds even more complexity as it adds a completely distinct concept. The base-*interpretation simply extends the set of allowed ALT bases by one, essentially meaning "already reported upstream".

I think the definition I gave implicitly requires any * to be a the head of the allele - otherwise the allele that is causing the non-head-* would be reported in a downstream record, contradicting the definition.

@tfenne
Copy link
Member Author

tfenne commented Aug 30, 2019

@lbergelson What I really want is for the spec to be 100% clear on what is and isn't supported, and for HTSJDK to be in line with that, and for Octopus to generate VCFs that are spec compliant. I tend to think clarifying the spec to allow one or more *s followed by 0 or more bases is a good approach as it doesn't significantly complicate what's there now, and it doesn't penalize @dancooke for implementing a feature in a spec-compliant (albeit underspecified) way. But mostly I'm just looking for alignment between the spec, Octopus and HTSJDK.

@lh3
Copy link
Member

lh3 commented Aug 30, 2019

I prefer to disallow an ALT like **AAC. Due to the way VCF is designed, we often need to look at the previous records to make the right decision at the current record. Having **AAC doesn't solve this problem and further introduces additional conceptual and technical complexity, and more ways to generate inconsistent VCFs.

@dancooke
Copy link

There is actually a more fatal problem with the symbolic *. Considering again the example given by @tfenne:

# Symbolic * representation
chr1  1000  GTATATA G         ... 0|1
chr1  1005  TAGCGA  T,*       ... 1|2

# Base * representation
chr1  1000  GTATATA G         ... 0|1
chr1  1005  TAGCGA  T,**GCGA  ... 1|2

Both of these examples are actually inconsistent as the first record states the reference (A) on the first haplotype at chr1:1006 while the second states that this base is deleted. Octopus currently deals with this by using .:

chr1  1000  GTATATA G         ... .|1
chr1  1005  TAGCGA  T,**GCGA  ... 1|2

But this isn't particularly satisfactory as it doesn't make a strong claim about the reference between chr1:1000-1004. I don't see a way around this using the symbolic representation. However, it's fairly simple to extend the definition of base-* to something like:

This base is stated by an overlapping record.

Then the output would be

chr1  1000  GTATATA G,GTATA**         ... 2|1
chr1  1005  TAGCGA  T,**GCGA  ... 1|2

Which is consistent and makes it simple to reconstruct haplotypes as there no need to remember position information between overlapping records.

@lh3
Copy link
Member

lh3 commented Sep 1, 2019

The haplotype alignment is:

Pos: 1234567890123456789012345
Ref: XXXXXXXXXGTATATAGCGAXXXXX
H1:  XXXXXXXXXGTATAT-----XXXXX
H2:  XXXXXXXXXG------GCGAXXXXX

Your final VCF is wrong. The first line should be:

chr1  1000  GTATATA G,GTATAT*         ... 2|1

A VCF parser can only detect this error when it comes to the second line. To validate a VCF, you always need to look at multiple lines. Your proposal is not making things simpler. Furthermore, your proposal allows inconsistencies between the two lines. Wrong second lines could look like:

chr1  1005  TAGCGA  T,**GCA  ... 1|2

or

chr1  1005  TAGCGA  T,****CA  ... 1|2

A proper design should try to minimize such errors.

The right way to represent this alignment is:

chr1  1000  GTATATA G,GTATAT  2|1
chr1  1005  TAGCGA  T,*       1|2

It is easy enough to reconstruct the alignment. There are fewer ways to create wrong VCFs.

@dancooke
Copy link

dancooke commented Sep 1, 2019

@lh3

The first line should be:

chr1 1000 GTATATA G,GTATAT* ... 2|1

Nope. The T before the * is (and must be) determined by the next record.

The right way to represent this alignment is:

chr1 1000 GTATATA G,GTATAT 2|1
chr1 1005 TAGCGA T,* 1|2

Completely disagree. You're permitting duplicate information in one direction but not the other. Not only is this inconsistent, it's also very confusing - It looks like the first record is calling a 1bp deletion, but you don't know that this isn't the case until you see the next record, and the first gives no indication you must do this (unlike the way I proposed).

There are fewer ways to create wrong VCFs.

I think this is a poor argument. For a start, the number of tools generating VCFs de novo is small compared with the number that parse/manipulate them, and I'd strongly argue that the base * representation makes parsing VCFs less error prone than symbolic * (or your proposed "right way") as * has clear meaning - it essentially says "ignore this base". Moreover, if you're generating inconsistent VCFs with this representation then probably you have some deeper problem with your algorithms internals. I'd rather have an output format that forces me to address this rather than allowing me to sweep it under the rug.

@dancooke
Copy link

dancooke commented Sep 1, 2019

Would also be good to hear @Lenbok's thoughts on this - he added support for base * to RTG Tools vcfeval.

@lh3
Copy link
Member

lh3 commented Sep 1, 2019

Nope. The T before the * is (and must be) determined by the next record.

Then we are interpreting your * differently (I see your * as padding). This complicates an already complex feature.

It is worth mentioning the freebayes way to encode this VCF. See a more complex example:

Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1:  XXXXXXXXXGTATAT------XXXXX
H2:  XXXXXXXXXG------GCTGAXXXXX

The freebayes way is to use CIGAR for a series of interlocking variants:

chr1  10  GTATATAGCGA GTATAT,GGCTGA  CIGAR=6M5D,1M6D2M1I2M

This is the best way to describe both haplotypes at the same time. On this example, your attempt to describe both haplotypes will become more complex. I guess there should be ways with your method, but I can't think of one for now. With the current *, we can describe one allele on one haplotype:

chr1  10  GTATATA G,GTATAT  2|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

PS: the major problem with the freebayes method is that it becomes awkward when there are a 6kb deletion on one haplotype and multiple small variants on the other haplotype. How do you deal with that in your representation?

The more I think of this problem, the stronger I feel we should forbid **AC.

@jmmut
Copy link
Contributor

jmmut commented Sep 2, 2019

Another issue to take into account is that the current spec doesn't mention that the record where the overlapping deletion is described needs to appear before the current record, so would this be a legal VCF?

chr1  10  GTATATA G,*  2|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

This has less redundancy and looks more straight to the point, but of course makes haplotype reconstruction harder by having to look at later records.

As long as the spec is clear and easy to understand I'm ok with any approach. At the very least I think we should address the current wordings:

  • Ambiguity on the "Strings made up of the bases A,C,G,T,N,*" part.
  • Ambiguity on whether the overlapped allele has to be described before or can appear later.

If we instead adopted the partial spanning notation, it should make clear the meaning of *. From this (to me, surprising) sentence:

The T before the * is (and must be) determined by the next record.

What I understand is that the purpose of each * is there to avoid redundancy, i.e. avoid putting a base that is described in other records, which is yet another different meaning of the already too overloaded *. Maybe I'm wrong, but I don't see any fatal issue with the current notation that the partial spanning notation solves. The partial spanning one may be clearer on a few examples, but the added complexity makes it likely there will be more confusion about it in most cases.

I'm slightly in favour of just fixing the wording of the current simple approach.

@dancooke
Copy link

dancooke commented Sep 2, 2019

The definition I'm suggesting is:

`*` indicates that this reference base is specified by an overlapping record

Note that this does't only apply to deletions, or upstream records (as currently required), in addition, there is no requirement to introduce a new type of 'symbolic' allele.

Using this definition, your example would be reported as:

chr1  10  GTATATA G,GTATA**  2|1
chr1  15  TAGCGA  T,**G*GA  1|2
chr1  18  C  CT,*  2|1

This has several advantages over the symbolic representation you show:

  • It's clear that the first record is not calling a 1bp deletion on the second haplotype.
  • The reference is explicitly being called (it's not clear that this is the case with the current symbolic definition).
  • We know exactly which other positions we need to look at to complete the haplotypes.
  • It doesn't use one set of rules for upstream events and another for downstream events.
  • There's no duplicate information.

Just to stress that last point, let's look how you would represent your example without the first deletion:

chr1  15  TAGCGA  T, TAGCTGA       1|2
chr1  18  C       CT,*      2|1

So not only do you end up completely changing representation (going from symbolic * to an explicit allele), you also end up duplicating the insertion and therefore using the FreeBayes representation anyway. Using my definition this becomes:

chr1  15  TAGCGA  T, TAG*GA       1|2
chr1  18  C       CT,*      2|1

Consistent with the full example. No duplicate information - and therefore less error prone. I really can't understand how you find this more complex or confusing?

You already point out the main reason that the FreeBayes representation isn't viable - it doesn't scale. It's also hard to read a complex cigar string, and some complex substitutions can't be properly represented with a cigar (e.g. micro-inversions).

I'd also suggest that it makes matching and annotating variants from variation databases harder since simple variation gets buried into long haplotypes. Consider one more example, a complex substitution and an SNV. The 'FreeBayes' way would be:

chr1  15  TAGCGA  TCGCTA, TAGTGA       1|2

There's no way to 'decompose' this (without using .) using the existing specification since * only currently applies to deletions. It makes it far from obvious there's a SNV even in this small case. Using my suggested definition, this becomes:

chr1  15  TAGCGA  TCGCTA, TAG*GA       1|2
chr1  18  C       T,*      2|1

Now the SNV is clear, and can easily be annotated.

@lh3
Copy link
Member

lh3 commented Sep 2, 2019

@jmmut I was thinking about this on a flight. Now for my example:

Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1:  XXXXXXXXXGTATAT------XXXXX
H2:  XXXXXXXXXG------GCTGAXXXXX

my preference is:

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

Here 0 means the reference allele until the next record. If I am right, this is how we are actually using 0 in practice. @tfenne's original example follows the same logic. This representation describes one allele at a time. It is succinct and consistent.

[the current * representation] makes haplotype reconstruction harder by having to look at later records.

When there are overlapping variations, you anyway need to read multiple records to reconstruct the haplotypes. The key information base * tells us is that there are overlapping records later on. You can trivially know this by reading the following records.

@dancooke
Copy link

dancooke commented Sep 2, 2019

@lh3 I really don't like this. It's weird to have one rule for one direction and a completely different one for the other. You're suggesting entirely changing the definition of GT and using a symbolic allele (*), but I can still never be sure if a 0 in GT actually means reference just from looking at the record - this will result in confusion and mistakes. It's not even consistent, my 'complex substitution' example would become

chr1  15  TAGCGA  TCGCTA       1|0
chr1  18  C       T      0|1

and we'd be right back to the situation without * in the first place.

@jmmut This is at least consistent, but the advantages of base * over symbolic * remain:

  • I can easily see which bases are called reference; symbolic * must implicitly assume the reference (although this isn't clear from the specification) and forces me to look at other records and calculate overlaps.
  • Haplotype construction is simpler and less error prone. It also helps spot bugs in code generating VCFs as sanity checking VCFs for consistency is simple.
  • Genotype zygosity is better preserved. For example, with symbolic * we easily end up having genotypes like chr1 18 C T,* 1|2|2, making it look like the zygosity is 2, but where the two * symbolic alleles actually refer to different overlapping alleles. This is unlikely to happen with base * (we'd usually have 1|2|3). The same problem for diploid could occur if joint calling multiple samples (e.g. chr1 18 C T,* 2|2 with * referring to different alleles).

The only downside (that I can think of) is that ALT alleles are more verbose than with symbolic *. I believe this is a price worth paying for the advantages. Note that is always possible to convert records from base * to symbolic *, but not the other way around, showing that you do loose information.

@lh3
Copy link
Member

lh3 commented Sep 2, 2019

I believe what I described is just an elaboration of the current approach. It is simple to generate: describe each atomic allele individually. It is also simple to parse: reconstruct the haplotype alignment based on the current record, discard the part of the haplotype beyond the the next record and write a new allele. It is imperfect, but it gets the job done and is compatible with existing ecosystem.

Your complex substitution example is actually more about multi-allelic sites. For example:

chr1  10  A  C,T  1|2

In bgt, I have long been using

chr1  10  A  C,<M>  1|2
chr1  10  A  T,<M>  2|1

Symbolic allele <M> means a different allele. For deletions, it plays the same role as *. If we like (not that I am proposing!), we can extend the definition of * to substitutions. Then * would be the same as <M> in bgt.

On three haplotypes, we can use

chr1  15  A  C,*,*  1|2|3

to indicate genotypes involving two overlapping deletions, if we really want. In practice, it doesn't matter either way. The parser can know the correct haplotypes based on the upstream records.

Several people in this thread thought base * helps to reduce dependencies between records. Not quite. First, a competent VCF parser should validate across records and have to look at down/upstream anyway. Second, see this base * example:

chr1  10  GTATATA G,GTATA**  2|1
chr1  15  TAGCGA  T,**G*GA   1|2
chr1  18  C       CT,*       2|1

At position 15, the parser needs pos 18 to construct the haplotype alignment, and when it comes to pos 18, it needs to memorize the record at pos 15. The is the result of the insertion. Furthermore, the redundancy across records is particularly concerning as it gives us more ways to create wrong records. I designed sam, bam, gfa and bcfv2. Redundancy like this is one of the first things I would strive to remove.

@d-cameron
Copy link
Contributor

I strongly oppose any format (re)definition that requires any information that relies in any way on record ordering or for which the variant call for a single record cannot be determined by reading that record in isolation. Common operations that will break such a definition are:

  • filtering a VCF (!)
  • left/right/centre aligning calls
  • sorting a VCF (due to weakly defined order)
  • taking a positional subset of a VCF (e.g. subset to exonic variants)

There is definitely room for improvement in defining phasing information but breaking steps that are used in almost every downsteam processing pipeline is something I am strongly opposed to.

Somewhat of a pity we didn't get around to fixing this given it was raised this as a problem 5 years ago:
#28

@kevinjacobs-progenity
Copy link

I agree with @d-cameron.

It seems that this discussion is about redefining VCF to incorporate a "graph style" representation of alleles and support breaking large haplotypes into simpler nested loci. If that is the intent, then let's carefully define the semantics AND document them in the specification. e.g. allowing overlapping records in VCF has been a nightmare for tool developers because everyone basically picks the semantics that makes the most sense to them at the time and assumes everyone else in the world should think about it the same way.

If we do wish to expand VCF semantics, then let's consider that most attempts at doing so have redefine either

  • '.' in GT to mean "no allele specified" as opposed to the more traditional interpretation of "no allele called"
  • '0' in GT to mean "maybe reference unless another record suggests otherwise" as opposed to the traditional interpretation of "this allele is reference"

Both of those redefinitions are incompatible with other interpretations of the standard. e.g. why can't '0' actually mean reference? e.g. GT of '1/.' has been used to indicate a "half-call" where ALT 1 is present at a diploid locus and a second allele cannot be determined. This is distinct from using the non-ref allele <*> because the uncalled allele may in fact be reference (just at an insufficient quality level to call).

@dancooke : Just as a data point, I understand that you consider the multi-allelic VCF representation as "unscalable", but I would probably be using Octopus now if it supported that VCF syntax. Having to re-do my entire clinical analysis pipeline for yet another experimental VCF dialect is a non-starter.

@lh3
Copy link
Member

lh3 commented Sep 4, 2019

allowing overlapping records in VCF has been a nightmare for tool developers

Everyone is complaining about this, but no one finds a satisfactory way to eliminate overlapping records. The best attempt is freebayes' CIGAR, but that is close to being useless in the presence of long deletions.

everyone basically picks the semantics that makes the most sense to them at the time and assumes everyone else in the world should think about it the same way.

In my view, the solution to overlapping records is not to eliminate them, but to impose a consistent representation.

why can't '0' actually mean reference? e.g. GT of '1/.' has been used to indicate a "half-call" where ALT 1 is present at a diploid locus and a second allele cannot be determined.

Let's use my example again:

Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1:  XXXXXXXXXGTATAT------XXXXX
H2:  XXXXXXXXXG------GCTGAXXXXX

Since the introduction of *, I have been writing VCF like:

chr1  10  GTATATA G         0|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1

My impression is most tools use the same way? Are you suggesting replacing 0|1 to .|1 at pos 10? That is not technically right, either, because the first allele can be "determined". If this is not your suggestion, how do you write VCF for my example?

EDIT: hmm... Rereading @kevinjacobs-progenity's comment, I realize that he only pointed out the problems with both 0|1 and .|1 but didn't give a solution.

@dancooke
Copy link

dancooke commented Sep 4, 2019

@kevinjacobs-progenity I agree with most of this. I ended up using . as you describe only because the specification doesn't give me another option (if I want a consistent VCF). @lh3 Using 0 in this way is not the approach that I assumed, since the specification does not define it this way. Why then have * at all if we're going to allow reference overlaps with 0 in some places?

Regarding 'multi-allelic'/MNV calls. Scalability is only half the reason; more fundamentally, I think that variant calls should be interpreted as mutation event calls. So If I report chr1 15 TAGCGA TCGCTA, TAGTGA 1|2 then I make the claim that a single mutational event was TAGCGA > TAGTGA. Of course, in most cases, we do not know and cannot confirm which mutational event actually occurred, but we can still make an informed guess and add priors. If you don't care about such details and just want resulting haplotypes then that information is preserved with phase information.

@lh3 There's actually no need to 'look ahead' - you can just keep a record of which reference bases are yet to be determined and fill them in when you see the next record. I agree it's generally a good principle to reduce redundancy, but I think in this case it's worth the relatively small cost in order to avoid inconsistency and potentially reduce parsing errors.

@lh3
Copy link
Member

lh3 commented Sep 4, 2019

At pos 10, neither .|1 nor 0|1 is technically correct. I more like 0|1 and believe most existing tools are using it, but I can live with .|1. What is more important is a consensus to reduce the multiple ways to encode the same set of variants.

To close the loop: at one point, I was thinking to use something like the following at pos 10 (note that this is slightly different from the example above)

chr1  10  GTATATA G,GTATA         2|1

The problem is we normally assume the 2bp del on hap1 is placed at pos 11 (this is another problem with multi-allelic VCF), but here we want to put the 2bp del at pos 15, so we would need a CIGAR field to distinguish the two cases. The price to achieve technical correctness is too high. I wouldn't go this way. Base * can also achieve technical correctness, but its price is even higher.

@lh3
Copy link
Member

lh3 commented Sep 4, 2019

I went back to bgt and noticed that bgt converts my example to the following VCF:

chr1  10  GTATATA G,<M>  2|1
chr1  15  TAGCGA  T,<M>  1|2
chr1  18  C       CT,<M> 2|1

Here <M> is a symbolic allele, standing for "miscellaneous". It plays the same role as * for deletions, but can also refer to other types of alleles. If we redefine * to be a "miscellaneous allele overlapping with the current record", we will get @jmmut's VCF:

chr1  10  GTATATA G,*   2|1
chr1  15  TAGCGA  T,*   1|2
chr1  18  C       CT,*  2|1

which would be technically correct under the new * definition. With the current spec, though, the above is invalid. I am not sure if anyone is using this syntax in practice.

@d-cameron
Copy link
Contributor

Everyone is complaining about this, but no one finds a satisfactory way to eliminate overlapping records. The best attempt is freebayes' CIGAR, but that is close to being useless in the presence of long deletions.

Coming from the SV world, I don't really appreciate why overlapping records are so problematic for indels. Is it because downstream analysis pipeline break on unphased overlapping records, because left-alignment of phased records results in an self-contradictory haplotype, because caller don't reliably report phasing blocks, because VCF phasing doesn't work in the presence of aneuploidy (this is a huge issue for somatic SVs), because the non-uniqueness of variant and haplotype representation makes cohort analysis and variant comparison too complex, because the 'optimal' representation (e.g. 2 SNVs vs 1 MNV call) is different for different downsteam analyses, or something else?

My naive assumption has been that SNV/indel callers determing haplotypes, and reporting the left-aligned set of variants corresponding to the set of SNVs and indels in the haplotype-reference alignment would give a representation without any of the above issues. Why doesn't this work?

@dancooke
Copy link

dancooke commented Sep 5, 2019

@d-cameron It's not really indels specifically. There are three issues in my mind:

  1. There is more than one way to explain the same haplotype. For example, it's always possible to explain a haplotype resulting from a SNV with two indels (an insertion and deletion). This 'issue' is what lead to the development of several tools (e.g. RTG Tools vcfeval) to compare variant calls on a haplotype level - effectively ignoring representation differences. However, if we assume variant calls represent mutation events - which I believe should be the case - then you can begin to talk about an optimal representation. In most cases, we will never be able to determine what that optimal is, but our prior will usually point to the 'simplest' solution (Occam's razor and all). Simplest just means the solution with least assumptions, so if I say the mutation event was ACT > GCC then I assume a single mutation event resulted in this MNV, but arguably the simpler explanation is two separate SNV mutations A > G + T > C. On the other hand, if I need to explain the haplotype ATCGAA > TTCGAT then probably the simplest explanation is a single micro-inversion event. The key point here is that we can argue about these differences in representation with some theoretical underpinning, and are in principle verifiable.
  2. Calling multi-allelic loci is not scalable or practical. Even if we ignore point 1 and decide to conflate mutation calls and haplotype calls, there comes a point where it's just not practical to read and interpret the resulting call set. For example, if I call a heterozygous deletion of chr1:100-201, but also call an SNV at chr1:150, do you really want me to report the SNV as a 100nt string? Sure you can add a CIGAR field, but what help is that really? What unnecessary allele lengths are we willing to tolerate in order to have multi-allelic calls? This is obviously subjective, and will result in different representations - without any theoretically justifiable argument. This makes annotating calls more difficult and less consistent.
  3. Overlapping variant calls result in inconsistent genotype calls. So having accepted either of the two points above, we need to face the reality that we'll end up with individual variant records in our VCF output that describe alleles that overlap with one another. However, the first versions of VCF only allowed genotype calls (i.e. FORMAT/GT) to refer to the REF or ALT alleles, or indicate the allele was 'missing' (.). So let's suppose we call a heterozygous deletion overlapping with a SNV, and the reference elsewhere. How do we report this in VCF, like this?
chr1  10  GTATATA G   0|1
chr1  15  T G   1|0
  • But this is inconsistent. The first record asserts the reference 0 for the first haplotype while the second asserts an SNV. Likewise the first record asserts a deletion for the second haplotype while the second record asserts the reference. Another option is to replace the 0s with .s, but this doesn't accurately describe what we actually called. Later versions of VCF (4.3 I believe), added the * symbol in an attempt to resolve this situation, but the definition was ambiguous and insufficient, leading to the discussion here.

@dancooke
Copy link

dancooke commented Sep 5, 2019

I see three viable solutions:

  1. Clarify that * is symbolic. Extend the definition to specify overlaps with any allele in either direction. Make clear that * implies the reference for non-overlapping bases. This will result in calls like:
chr1  10  GTATATA G,*       2|1
chr1  15  TAGCGA  T,*       1|2
chr1  18  C       CT,*      2|1
  1. * indicates that this reference base is specified by an overlapping record. This will result in calls like:
chr1  10  GTATATA G,GTATA**  2|1
chr1  15  TAGCGA  T,**G*GA  1|2
chr1  18  C  CT,*  2|1
  • It's possible to extend this definition to allow alleles consisting only of * to be replaced with a single *, resulting in identical output as symbolic * for fully overlapped alleles. The difference in parsing is trivial.
  1. An extension to option 2 to include a distinct chapter for downstream overlaps. I realised allowing * to refer to overlaps in both directions means that it's not possible in general to reconstruct haplotypes without position information from the previous records. With distinct characters this wouldn't be required. Resulting output is like:
chr1  10  GTATATA G,GTATA##  2|1
chr1  15  TAGCGA  T,**G#GA  1|2
chr1  18  C  CT,*  2|1
  • This would enable an additional sanity check when parsing if position information from the previous record is provided.

Note that the alignments for the examples are the same as first given by @lh3 here:

Pos: 123456789012345678 9012345
Ref: XXXXXXXXXGTATATAGC-GAXXXXX
H1:  XXXXXXXXXGTATAT------XXXXX
H2:  XXXXXXXXXG------GCTGAXXXXX

My preference is option 2 or 3, with the *-only allele replacement. This results in the same output as option 1 most of the time, but is more explicit in the relatively rare case of partial overlaps - which I consider a positive. It's worth noting that the type of information provided by options 2 or 3 is what a VCF parser would need to figure out for option 1 anyway. So I don't think the argument that there are more ways to create invalid VCF's is a good one because by the same logic you can argue there are more ways to incorrectly parse the VCF in the case of option 1.

@pd3
Copy link
Member

pd3 commented Sep 5, 2019

@dancooke, can you edit your last comment to show the underlying alignments so that people can comment on your proposal?

In general, this problem is too complex and I don't believe it can be solved within VCF. This is in essence an attempt to represent a graph of arbitrary complexity.

@dancooke
Copy link

dancooke commented Sep 6, 2019

@pd3 Are you suggesting that under your proposed re-definition of 0 and * that the these examples are not semantically equivalent? If so I am truly lost.

@lh3
Copy link
Member

lh3 commented Sep 6, 2019

the only two open source variant callers that use * at all are GATK4 and Octopus

Htsbox has been using * since 2015 or 2016. This actually doesn't matter. Even before * was introduced, 0 had never been the REF haplotype in fact.

I would probably favour changing * from "overlapping deletion" to "ovelapping allele", to be able to restrict the meaning of 0 to "true reference", basically replacing just the usage of those 0-but-not-really to *.

This changes the semantics of VCF. A record in today's VCF effectively describes the interval maximally between this and the next record. With your change, a record will describe the interval of the current record. This is a subtle but important difference. Your proposal sounds better in theory, but will have problems in practice. Suppose we have a 100kb deletion in a population. If 0 only means the reference haplotype, we will be unable to determine the number of 0s with short reads because we won't be able to phase through this 100kb region in diploid samples without this deletion. Today's VCF doesn't have this issue because it doesn't require haplotype information or phasing.

Earlier I said "[redefining *] would be the top choice in the early days of VCF". Now I am in favor of the current practice, which is the one you and @pd3 have described.

@jmmut
Copy link
Contributor

jmmut commented Sep 6, 2019

@pd3

the * allele remains to indicate overlapping alleles from different records

the spec says "overlapping deletion", did you mean change to "overlapping allele" or remain as overlapping deletion?

@pd3
Copy link
Member

pd3 commented Sep 6, 2019

@jmmut Yes, I meant "overlapping allele". Can't think of any harm it might cause, happy to be proven wrong though.

@jmmut
Copy link
Contributor

jmmut commented Sep 6, 2019

@pd3 Then I'm lost too because this is the very issue that we were trying to avoid by staying with the "0 means not changing the REF allele at this record". This sounds to me doing both options I described, and having the disadvantages of both. (Do the rest think it's ok to change the spec to both "0 means not changing the REF allele at this record" and * to mean overlapping allele? this would mean that you can not state a "true reference" even if you use *)

@lh3 I'm sorry but I don't understand what problem this scenario represents. Could you elaborate, please? Would this still apply if we did both changes?

Suppose we have a 100kb deletion in a population. If 0 only means the reference haplotype, we will be unable to determine the number of 0s with short reads because we won't be able to phase through this 100kb region in diploid samples without this deletion.

Also, with the "0 means not changing the REF allele at this record" interpretation, I don't understand (just as @dancooke) if these are semantically equivalent to @pd3 or not, and if yes, why * was introduced in the first place. If they are not equivalent, then I don't know which one is the correct one nor why.

@dancooke
Copy link

dancooke commented Sep 6, 2019

@jmmut I believe that @lh3 is describing the following situation. Suppose we have (I'm going to use the most explicit base * form for clarity):

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GTAT#  GT   2|1 2|3
12  A     C,*  GT             1|2 1|2
14  A     T,*  GT             2|2 2|1

Now suppose that we cannot phase the variants at 12 and 14 for sample s2. How do we represent this? We can no longer assert the genotype 2|3 at position 10 for s2 as this implicitly determines the phase of the entire region. I believe that the solution is:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GT#T#  GT:PS   2|1:10 3|3:10
12  A     C,*  GT:PS             1|2:10 1|0:12
14  A     T,*  GT:PS             2|2:10 0|1:14

Which accurately describes the situation. The solution for symbolic * would be

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,*  GT:PS   2|1:10 2|2:10
12  A     C,*  GT:PS   1|2:10 1|0:12
14  A     T,*  GT:PS   2|2:10 0|1:14

@jmmut
Copy link
Contributor

jmmut commented Sep 6, 2019

Ok, that makes sense, thanks. But if you don't know the phase, why not using the unphased separator /?

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,*  GT   2|1 2/2
12  A     C,*  GT   1|2 1/0
14  A     T,*  GT   2|2 0/1

As I understand it, PS is for grouping variants under the same phase when you don't know the absolute phase of that group, right? but s1 can be absolutely phased and there are no phase groups for s2 (in these variants). I'm not aware of any restriction mixing / and | in the same row, or in the same sample.

Also, I still don't see why is this an argument against * meaning "overlapping allele" to allow 0 to mean "true reference".

@dancooke
Copy link

dancooke commented Sep 6, 2019

@jmmut You could do that for this example. More generally, I like to use PS for every record because once we get past these toy examples we'd almost certainty want to give a PS to the records in s1, and then we must specify PS for s2 also. It's really just a matter of personal preference.

I don't think this is an argument against symbolic * meaning that; I believe @lh3 was mistaken, or I interpreted his argument incorrectly.

@dancooke
Copy link

dancooke commented Sep 6, 2019

Here's a summary, how I see it

  • GT=0 must be re-defined for pre VCF v4.2 to mean no REF change at this record (wording suggested by @pd3), since there is no *.

Then, VCF v4.2-3 could either

  1. Keep this definition, rendering * completely redundant, and introducing a type of semantic equivalence that will cause great confusion if not addressed going forward.
  2. Modify the definition of * to allow upstream overlaps with any allele and re-define GT=0 to mean REF allele until the next record (as suggested by @lh3) or re-define GT=0 to mean REF allele unless specified by a downstream record (as suggested by me). The former results in redundancy if we want to assert the reference past the next record. Either way should satisfy GATK, but break Octopus, and any tools using VCF v4.2+ but not using * (e.g. DeepVariant). Technically any VCFs (v4.2-3) not using * would immediately become invalid. It also results in different definitions of GT between VCF versions.

I would advise going with option 1, but having a new version of VCF ready that re-defines GT=0 to mean 'true reference' and gives * a complete definition. This is the least painful option; technically no existing VCFs should break. Parsers dealing with VCFs (up to the new version) containing * essentially just need to read * as 0 (since they would become semantically equivalent) - which should be easy if they're supporting pre-v4.2 VCF anyway. Moreover, * has yet to see widespread uptake, and this way, users and tools that do wish to adopt a meaningful * will need to explicitly opt-in (by using the new VCF version).

@lh3
Copy link
Member

lh3 commented Sep 7, 2019

I was in a full-day retreat.

There are three definitions for the 0 allele number. 0 could mean:

  1. The reference allele in full. We thought we were using this definition, but in fact 0 has never meant this, just as @yfarjoun said. I will show later this definition is actually a bad idea.

  2. The reference allele up to the next record. This is the interpretation in the current spec.

  3. There are no ALT alleles from the current record. This is the interpretation before we introduced the * allele.

I don't understand what problem this scenario represents.

@jmmut let's use your example:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,*  GT   2|1 2/2
12  A     C,*  GT   1|2 1/0
14  A     T,*  GT   2|2 0/1

s2's genotype at pos 10 is in fact not determined. It could be 0/2 or 2/2 depending on the phase at 12 and 14. The root cause of this is in definition 1, 0 needs to be phased throughout, but we often don't have the phasing information. With the current spec (i.e. def 2):

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G    GT   0|1 0/0
12  A     C,*  GT   1|2 1/0
14  A     T,*  GT   2|2 0/1

It is ok. Definition 3 also works. Between 2 and 3, def 2 is more convenient as we can know the REF allele count up to the next record. More importantly, as the current spec has already adopted 2, I see little reason to revoke our earlier decision.

Going forward, we should just follow our current practice: clarify * to be a single allele and redefine 0 to be the reference allele up to the next record. It is actually optimal. (EDIT: I wouldn't say this is optimal, but given that we have decided on 2, reverting back to 3 is worse.)

@dancooke
Copy link

dancooke commented Sep 7, 2019

@lh3 The 'problem' you describe is exactly as I thought, but I explained quite clearly why this is not an actually problem. You are mistaken I'm afraid.

Going forward, we should just follow our current practice: clarify * to be a single allele and redefine 0 to be the reference allele up to the next record. It is actually optimal. (EDIT: I wouldn't say this is optimal, but given that we have decided on 2, reverting back to 3 is worse.)

This is simply unacceptable since it completely ignores the elephant in the room that the working definition of GT=0 (i.e. your option 3) has remained the same, despite * being added to VCF v4.2. This was essentially confirmed by @pd3, but easily demonstrable by the fact that their are notable tools using VCF v4.2-3 without using * (e.g. DeepVariant). By going with option 2, you're condemning an unknowable number of VCF's and tools that use/accept VCF v4.2-3 without using * to become invalid. This would be ridiculously unfair since the definition of GT never changed in the specification, and * was poorly defined to begin with. In reality, your option 2 is only the "current practice" of GATK4 v4.0.9.0+. The only sensible way to address this is to give all existing VCF versions the same definition of GT, and option 3 is the only possible option. Then no existing VCF's are technically invalidated. The only changes required are for the few VCF parsers accepting * - since * then becomes semantically equivalent to GT=0 (a straightforward change).

Going forward, we can choose to define GT and * in another way for the next VCF version, offering a clean slate so to speak:

  1. Depreciate * and just keeping the previous version of GT.
  2. Define GT=0 to actually mean REF + one of the technically sound alternatives of * that I originally proposed.
  3. Your option 2 + some re-definition of *.

Option 1 would be a shame since the reason why * was originally added in the first place is clear. Option 3 is almost certainly the worst option since it has pretty much half the benefits of any choice of option 2 but most of the downsides of option 1 in addition to its own. Problems include:

  • It makes GT=0 sometimes symbolic, sometimes not, making VCF little easier to read than option 1.
  • It's unclear if the reference is actually called past the next record.
  • It's unclear what affect filtering downstream records has on any GT=0.
  • It makes merging a harder since the meaning of 0 for each record could change.
  • It means that GT=0 and * are still semantically coupled - which is basically why we are in such a mess now.

Just to stress the first point. Let's suppose I'm calling germline/somatic variants in a tumour sample and get (using option 3)

POS REF   ALT INFO FORMAT s1
10  GTATA G SOMATIC  GT    0|0|1
12  A     C,* GT    1|1|2

Now I'm only really interested in somatic variants, so I filter my VCF for these (this is a very common thing to do):

POS REF   ALT INFO FORMAT s1
10  GTATA G SOMATIC  GT    0|0|1

So I'm now left believing that my somatic variant occurred in the context of a tandem repeat (the reference), but this is not the case. This is exactly the type of situation that will mislead analysis.

Any of the choices of option 2 have none of these issues. At this point I would settle for the first 'symbolic' * choice just to avoid the catastrophe of option 3, but I still firmly believe either of the base * choices are better.

@lh3
Copy link
Member

lh3 commented Sep 7, 2019

I explained quite clearly why this is not an actually problem.

No, you didn't. Slightly modify your example:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GTAT#,GTCTT  GT   2|1 0|3
12  A     C,*  GT             1|2 0|1
14  A     T,*  GT             2|2 0|1

If phasing is unavailable, do you want to turn 0|3 to 2/2 for s2 at pos 10? If it is, * needs to represent the true REF allele in this case; if * may represent the true REF, any 0/1 genotype may be replaced by 2/1. Or you can say "* may represent the true REF if this can't be determined", but then you need to define "determined".

the working definition of GT=0 (i.e. your option 3) has remained the same, despite * being added to VCF v4.2.

With *, the definition of 0 has been implicitly changed to def 2 accordingly. It doesn't matter what you "think" 0 is. I am good with both def 2 and def 3. I just think it is too late to revert back. We can acknowledge the two meanings of 0 in the spec. We have already been doing that in the SAM spec (e.g. the TLEN field).

@lbergelson
Copy link
Member

I've been following along with this, but I'm on vacation with my family and each time I try to reply I get interrupted part way through, when I return the argument has moved.

I'm unconvinced that there is anything catastrophically wrong with the current use of the * allele.

We do definitely need to clarify what 0 means in the context of a downstream overlapping record.
The reference allele up to the next record. seems like it works for both for a large existing mass of vcfs as well as matches what many people have implicitly decided on.


It does seem like it's currently difficult to specify complex exact matches to reference using the existing * in some of the cases you mentioned. In general VCF hasn't done a very clear job of distinguishing when we believe something that isn't represented is reference vs when it is not defined.

How you interpret unspecified sites in the vcf mean depends on your assumptions about the provenance of the vcf and isn't explicitly encoded.

For instance, in the following, what is the genotype in any spot between 10 and 20?

10  A    C  GT     0|1
20  G    T  GT     0|1

It's unclear if it's a confident 0|0 or if it's unknown, and which it is depends on the provenance of the vcf. If we wanted to explicitly specify it we would need to add a homref block.

10  A     C           GT     0|1
11  G    <*>  END=19  GT     0|0
20  G     T           GT     0|1

Maybe we could clarify some of these examples in a similar way by adding a tag which can be used to specify exactly how long a match to the reference is called. Something like the proposed RBS. Here's an example using Haplotype-Reference-Block-Size (HRBS): a per haplotype integer value that specifies the number of reference bases included in a 0 call for that haplotype.

POS REF   ALT INFO FORMAT s1
10  GTATA G   GT:HRBS 0|1:2,.
12  A     C,* GT    1|2:
13  T     <*>,* GT:HRBS 0|2:2,.

or (somewhat less pleasant I think.)

It seems like you want to be able to explicitly encode this information, but I think doing so in the alleles is not the best option.


@dancooke Your proposed solution using * and # bases in alleles is problematic for processing large cohorts.

Based on this example:

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GTAT#  GT   2|1 2|3
12  A     C,*  GT             1|2 1|2
14  A     T,*  GT             2|2 2|1

For a large deletion in 1 sample, you would need to introduce specific alleles for every other sample which act like a mask for the specific pattern of overlapping alleles in each sample. This prevents finalizing this site until the entirety of the length the deletion has been processed. In a million sample vcf holding all the relevant information in memory becomes tricky with an approach like yours. It would also require you to add up to sample number new unique long alleles which isn't ideal consider current cohort VCF size currently scales super linearly with the number of alleles. Having each sample have unique alleles would also significantly decrease compressibility of GT blocks. Neither of these is necessarily a blocker, but it is an important consideration when adding things to vcf.

The existing * allele instead requires you only to hold the alleles in memory which are still potential overlappers. For a 100kb symbolic deletion it's not clear what you would do with your scheme, just fall back to the existing * allele behavior? Introduce a million new symbolic alleles and record the exact bases somewhere?


As a side note, I would be in favor of redefining * to be equivalent to <M> since there's no reason that I can see that it doesn't work for things like a symbolic inversion that overlaps the site.


Filtering a vcf with *-allele IS hard. There's no question about that. The */# notation would make it a bit easier by identifying which downstream/upstream sites are relevant when filtering, but I don't think that it significantly reduces the processing difficulty in large vcfs. With *-allele you can keep track of relevant alleles and filter in the same way. Either way, it's difficult to decide what the meaning of filtering an allele is when it's overlapping another one and is probably going to application specific. For the example you gave, filtering sites while retaining the information is already supported in vcf with the use of site and genotype level filter fields. I don't see why you can't use those in the somatic/germline example you showed.

@dancooke
Copy link

dancooke commented Sep 7, 2019

@lh3 Your modified example is ill formed because you cannot have the ALT allele GTCTT at position 10.


@lbergelson I don't believe that you can confidently claim that the mass of VCFs are assuming this definition of GT=0, the truth is that we simply don't know whether or not this is the case. To simply enforce this retrospectively because the GATK team have decided upon this interpretation would be deeply unfair to the rest of the community that have formed different interpretations.


Regarding reference calls, while I agree this is a tricky topic, I do believe that the best way forward is in terms of explicitly called alleles. If I call

POS REF   ALT FORMAT s1
10  GTATA G,GT#TA GT  2|1
12  A     C,*       1|2

then I claim to have considered the two called haplotypes within whatever genotype model I'm using. In particular, I claim to have compared the reference base against a deleted base at each position where the reference has been called. From a haplotype perspective, I could just as well have called

POS REF   ALT FORMAT s1
10  GT  G    GT  0|1
11  TA  T,*A GT  2|1
12  A   C,*  GT  1|2
12  AT  A,*T GT  2|1
13  TA  T,*A GT  2|1

since the haplotype calls are the same (I would however argue that the overall interpretation is slightly different). I would then presumably report measures of uncertainty from my model (e.g. QUAL, GQ, GP, etc) for each record, which is by far the best way to report uncertainty in the reference. Moreover, since these two call sets are semantically equivalent from a haplotype perspective, I should be able to generate the exact same statistics for each reference base in the first example.

By adopting the GATK definition for GT=0, you're unwittingly breaking the semantic equivalence between these two call sets, since the former example could only be stated as

POS REF   ALT INFO FORMAT s1
10  GTATA G   GT    0|1
12  A     C,* GT    1|2
13 TA <*>,* GT 0|2

With this set of calls you make an entirely different claim about what the model has considered at positions 13 and 14, and cannot report the same measures of uncertainty as for the explicitly called bases.


You raise good points regarding the practicality of base *. I'll need some time to consider these in more detail. Right now I'm not overly concerned whether base * or symbolic * with explicit REF is adopted - I just want to see GATK-style GT=0 taken off the table.


I don't understand how you would filter my germline/somatic example while retaining the information from the germline. Under the explicit REF with symbolic * my example becomes

POS REF   ALT INFO FORMAT s1
10  GTATA G,* SOMATIC  GT    2|2|1
12  A     C,* GT    1|1|2

so when I filter for SOMATIC (e.g. with bcftools view -i SOMATIC=1) I get

POS REF   ALT INFO FORMAT s1
10  GTATA G,* SOMATIC  GT    2|2|1

which is considerably better than having 0|0|1 in my final somatic call-set. Are you saying that you can achieve this with per-record filtering from the GATK-style calls?

@lh3
Copy link
Member

lh3 commented Sep 7, 2019

you cannot have the ALT allele GTCTT at position 10.

Why? Either way, that allele doesn’t matter. What matters is how you encode the true reference allele without phasing.

@dancooke
Copy link

dancooke commented Sep 7, 2019

@lh3 Because then you're asserting the values of positions at 12 and 14 but re-defining them later.

What matters is how you encode the true reference allele without phasing.

This is an oxymoron. You can't state true reference for this allele if you don't know the phase. It would be like me asking how you can assert the reference in

POS REF   ALT INFO FORMAT s1
10  AA  CC  GT    0/1

if I don't know the phase between the two bases. The correct way to encode the situation is

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GT#T#  GT:PS   2|1:10 3|3:10
12  A     C,*  GT:PS             1|2:10 1|0:12
14  A     T,*  GT:PS             2|2:10 0|1:14

since the allele GT#T# does not require phase information for the variants at positions 12 and 14.

@lh3
Copy link
Member

lh3 commented Sep 7, 2019

I still don’t understand why this is wrong (oh, a typo: it should be 0|4, but my argument is still there: you think * can represent reference allele; then all 0/1 can be written as 2/1)

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GTAT#,GTCTT  GT   2|1 0|4
12  A     C,*  GT             1|2 0|1
14  A     T,*  GT             2|2 0|1

Are you thinking about 4-gamete test?

@dancooke
Copy link

dancooke commented Sep 7, 2019

Ah sorry, I missed that you'd modified the example. The correct way to encode this would be

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GT#T#  GT   2|1 0|3
12  A     C,*  GT             1|2 2|1
14  A     T,*  GT             2|2 2|1

If the region is un-phasable for s2 then it becomes

POS REF   ALT INFO FORMAT s1  s2
10  GTATA G,GT#TA,GT#T#  GT   2|1 3/3
12  A     C,*  GT             1|2 0/1
14  A     T,*  GT             2|2 0/1

@lh3
Copy link
Member

lh3 commented Sep 7, 2019

Then GT#T# and the ref allele may be the same thing. It is indeterministic. The problem is more obvious when you use symbolic *: 0/1 in many of our examples would be equivalent to 2/1 if * can represent ref. Everything is good with def 2 or 3, but with def 1, you need to clarify more things.

@dancooke
Copy link

dancooke commented Sep 7, 2019

@lh3 It is not indeterministic at all - it is precisely determined when the phase is known. # just means base specified by downstream record. In the first case, since we do know the phasing, the #s are uniquely determined. In the second case, when the phasing is not determined, the #s are not uniquely determined either - but the bases they refer to are determined. This is a completely accurate picture of both situations; given the set of REF and explicit ALT alleles these are the only ways to encode the situation, with and without phase information. What it seems like you want is for the phase to be undetermined but the reference allele to be determined, but this would clearly be a contradiction.

@lh3
Copy link
Member

lh3 commented Sep 7, 2019

It depends on how you define indeterministic. A definition of 0 depending on phasing is indeterministic IMO. Think this way. With def 2, the number of 0 alleles exactly corresponds to the number of ref alleles up to the next record. With def 1, the number of 0 alleles doesn’t mean much. Some other ref alleles may be hiding behind *. Def 3 doesn’t say about the ref allele, so it doesn’t matter.

@dancooke
Copy link

dancooke commented Sep 7, 2019

It is not 0 that may be undetermined but #, and neither definition changes; 0 only ever has one definition and meaning for a given record - it will never change no matter what happens around it. However, under your def 2 0s meaning can change since it is determined by what may, or may not, follow it; the meaning changes if you add or remove (or filter?) proceeding records. I believe this is fundamentally why we're having this debate - there is no consensus on what 0 meant in the past, or now, if it depends on its context. This is why we end up with @pd3 saying that he agrees with your definition 2 yet in the same breath asserting a conflicting definition. It's impossible to avoid representation of uncertainty in VCF since fundamentally what is being described is uncertain - I believe it if far preferable for that uncertainty to be encoded in unique symbols (e.g. * & #, base or symbolic) rather than in a range of values (allele indices) where just one of the values (0) is given special meaning.

@dancooke
Copy link

Another data point. The Genome In A Bottle VCFs (v3.3.2) - arguably some of the most important and widely used VCFs publicly available - all specify VCF v4.2 yet do not use the * symbol, indicating that the intended use of 0 was the same as for VCF v4.1 before the * symbol was introduced.

Extract from HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz:

##fileformat=VCFv4.2
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG001
1	63493035	.	CGGA	CA,C	50	PASS	platforms=2;platformnames=Illumina,10X;datasets=2;datasetnames=HiSeqPE300x,10XChromium;callsets=2;callsetnames=HiSeqPE300xGATK,10XGATKhaplo;datasetsmissingcall=CGnormal,HiSeqPE300x,IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;filt=CS_SolidPE50x50GATKHC_filt;difficultregion=AllRepeats_lt51bp_gt95identity_merged_slop5	GT:DP:ADALL:AD:GQ:IGT:IPS:PS	2|1:199:0,119,80:0,119,80:198:1/2:.:PATMAT
1	63493038	rs200371077	AG	A	50	PASS	platforms=1;platformnames=Illumina;datasets=1;datasetnames=HiSeqPE300x;callsets=1;callsetnames=HiSeqPE300xGATK;datasetsmissingcall=CGnormal,HiSeqPE300x,10XChromium,IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;filt=CS_SolidPE50x50GATKHC_filt;difficultregion=AllRepeats_lt51bp_gt95identity_merged_slop5	GT:DP:ADALL:AD:GQ:IGT:IPS:PS	0|1:218:99,119:99,119:99:0/1:.:PATMAT

jmmut added a commit that referenced this issue Jun 22, 2020
- Clarify that * can not be mixed with other bases
- "* is not a base" in VCFv4.4 too
- add regex for ALT
- add ALT regex for VCFv4.4 too

There are 2 other points where we haven't reached a conclusion:
- Clarifying that a GT=0 doesn't mean "true reference", and if it's possible to state "true ref" at all.
- We want to make * to mean "overlapping allele" instead of "overlapping deletion" but the previous point on GT=0 needs a conclusion before we can make a proper update to *. This also will require clarifications in relation to <*>, which now will be even more confusing to users.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests