Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a semantic difference between GT=./. and GT=0/0 + GQ=0 ? #756

Open
yfarjoun opened this issue Feb 7, 2024 · 14 comments
Open

Is there a semantic difference between GT=./. and GT=0/0 + GQ=0 ? #756

yfarjoun opened this issue Feb 7, 2024 · 14 comments
Labels

Comments

@yfarjoun
Copy link
Contributor

yfarjoun commented Feb 7, 2024

A recent GATK change replaces GT ./. due to DP=0 with 0/0 (with GQ=0).

A big argument erupted on the internet regarding the "breaking of the VCF spec"

I was wondering if the spec intends there to be a semantic difference between ./. genotype and 0/0 at GQ=0.

Personally, I am failing to articulate a difference since GQ=0 means no confidence in the genotype -> p(wrong)=1

but I'd love the discussion to happen here rather than twitter/GATK forum.

@pd3
Copy link
Member

pd3 commented Feb 7, 2024

The missing genotype was introduced by the VCF specification specifically for cases with missing data: in the absence of data (./.) we have no supporting evidence for the claim that the genotype is 0/0.

Even if we should entertain such idea, the semantics of ./. and 0/0+GQ=0 still would not be identical: for human samples most of the genotypes in fact are 0/0, therefore the probability of error is certainly not 1.

From the practical perspective, GQ tag is optional and most programs do not look at it at all. They can handle missing genotypes, but seeing 0/0 + GQ=0 they will be mislead to believe the sample has the reference genotype and treat it as such.

Using 0/0 + GQ=0 blindly would be an unfortunate decision.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Feb 7, 2024

I understand that some downstream tools do not look at GQ. That is indeed unfortunate.

But the question here is whether using 0/0 + GQ=0 is breaking the vcf spec and if it means something different from "have no idea what's going on in this genotype"

@jkbonfield
Copy link
Contributor

If (and it's a huge if) we assume GQ is correctly calibrated and that no tool ever outputs GQ=0 for genuine calls, then the official interpretation is that GQ=0 means probability of error is 1. That is not the same as saying there is zero confidence in the call. Quite the opposite infact - it is stating that there is absolute confidence that the call is NOT 0/0 as it is the probability the call listed is correct rather than the probability that any call made is correct.

Regardless, I think changing ./. to 0/0 is a "courageous decision" as there are too many bits of software that either don't look at GQ or will be filtering initially on GT. I don't really see any overarching reason why the change is necessary anyway.

@cjw85
Copy link

cjw85 commented Feb 8, 2024

That is not the same as saying there is zero confidence in the call. Quite the opposite infact

I would tend to agree, particularly given the background here is around merging call sets from different samples. The colloquial "absence of evidence is not evidence of absence" is somewhat apt here.

@jkbonfield
Copy link
Contributor

The spec states this

GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log{10}$ p(genotype call is wrong, conditioned on the site's being variant)._

Hence GT 0/0 and GQ 0 is saying 0/0 is definitely wrong, which means a 100% certainty that if this is a variant, then it is not REF/REF. So changing my above view about GQ 0, perhaps that is technically correct as it's conditional on a variant existing, and if it does it clearly can't be a non-variant so GQ 0 is the only valid value for GT 0/0. However the entire thing feels like attempting to exploit loopholes and something we absolutely shouldn't be exploiting as it'd just be another nail in the VCF coffin.

However I'll also add it then goes on to say:

If a call cannot be made for a sample at a given locus, . must be specified for each missing allele in the GT field

Note "must". If we cannot make a call, exploiting GQ 0 is not an option. It is required to be ..

@d-cameron
Copy link
Contributor

However I'll also add it then goes on to say:
If a call cannot be made for a sample at a given locus, . must be specified for each missing allele in the GT field

That seems pretty unambiguous to me. 0/0 means a the caller has actually made a ref call at this site. The GATK blog post even states:

Decreasing the amount of possible genotype "states" that GenotypeGVCFs needs to differentiate between (ie. "missing" vs. "no-call" genotypes)

0/0 GT=0 is neither "missing" nor is it "no-call". It's a call. It may be a very low confidence call but it's a call. As @jkbonfield has identified, it's also exposed an issue with the definition of GT.

Hence GT 0/0 and GQ 0 is saying 0/0 is definitely wrong, which means a 100% certainty that if this is a variant, then it is not REF/REF. So changing my above view about GQ 0, perhaps that is technically correct as it's conditional on a variant existing, and if it does it clearly can't be a non-variant so GQ 0 is the only valid value for GT 0/0.

Given GT is conditional on a variant, this means that. GT is meaningless for 0/0 calls. It's, by definition 0 and any other value is violates either VCF specifications or traditionally held rules of probability. This I like even less than GATK's change.

What do implementers actually do when calculating GT for 0/0 sites? Is this salvageable without special-casing a GT definition for 0/0?

@pd3
Copy link
Member

pd3 commented Feb 8, 2024

Given GT is conditional on a variant, this means that. GT is meaningless for 0/0 calls. It's, by definition 0 and any other value is violates either VCF specifications or traditionally held rules of probability. This I like even less than GATK's change.
What do implementers actually do when calculating GT for 0/0 sites? Is this salvageable without special-casing a GT definition for 0/0?

@d-cameron technical note: you likely meant GQ, not GT.

The VCF specification is not pedantic enough and leaves a lot to common sense. However, it would be plain mean to interpret it verbatim and assert that GQ is meaningless for 0/0.

Obviously, the same logic should be used for GQ as for the QUAL column, which makes sense for both variant and non-variant sites. The language can be changed very slightly for the definition to start making sense: " p(genotype call is wrong, conditioned on the assertion made in the ALT column"

@jkbonfield Regarding GQ=0 making no sense, we have the same situation with MQ=0, so it is not unimaginable to define it the same way. However, it is a different question if supporting it is desirable.

@yfarjoun It might be good to explain the motivation behind this so that people understand the context. Otherwise, short answers to your questions:

  • does it break the vcf spec: No, the specification does not forbid that
  • does it mean something different from "have no idea what's going on in this genotype": It depends.

And my own question:

  • do we want to support it? No, not unless it's well motivated.

@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 8, 2024

does it break the vcf spec: No, the specification does not forbid that

I'd argue otherwise. The spec explicitly states a non-call must have GT ./. (presumably only if written out). It doesn't explicitly forbid GQ, but the GT bit is also changed by GATK and that is non-compliant surely.

p(genotype call is wrong, conditioned on the assertion made in the ALT column"

That would be one small change, which then means the field makes a bit more sense with GT=0/0. However what happens when ALT is .? It's also a bit unclear when ALT is multiple things, eg A,AC,AAC and we're calling one specific GT from the choices. Then "the assertion made in the ALT column" is confusing - what "the" is it? A specific assertion, which is a circular argument, or that any ALT listed is true? Conditional on there being a variant is easier to understand. Maybe it just needs an exception for the GT=0/0 case to explicitly state what it means there or what values are permitted (if any).

We do get GT=0/0 within gVCF, and it can be critical to distinguish a REF call from a non-call. Hence GT=./. has very valid meaning here and is distinct. The spec is clear and explicitly requires this distinction to be made.

If we asked for GQs to be generated for a gVCF, what do tools currently do for the REF calls? I see bcftools just omits it, but is that universally the case? This change in GATK appears to imply in gVCF they'd have GT 0/0 (REF called, GQ absent) or GT:GQ 0/0:0 (no-called made, previously GT ./.). If so it's even more confounding, with the simple presence of GQ being the disambiguating factor between REF vs unknown.

While calls could output GQ=0 for a REF gVCF entry, equally so a case could be made (changing the spec!) to reverse the conditional meaning so we can still emit a quality value indicating the confidence in the GT call, which is this case is now conditional on not being a variant. Ie how confident we are that a variant doesn't exist. You'd think maybe you don't need GQ then for REF as QUAL encompasses everything you need - there are no choices of which ALT when it's REF. However QUAL has historically been a bit of a basketcase and many tools, GATK included, aren't remotely calibrated and definitely don't use a Phred scale. It even has odd notions like QUAL/DP being used in filtering, implying it's deliberately far from logarithmic in nature. Given that mixing GQ and QUAL together is problematic. So using one single value (GQ) instead feels like it has merits.

That is however, not what the spec says! I think the easiest thing is just clarifying the spec. It already explicitly states a lack of call must be GT ./. (if written out). For gVCF is also needs a statement about what GQ can be when we have a REF call of 0/0.

@pd3
Copy link
Member

pd3 commented Feb 8, 2024

I'd argue otherwise. The spec explicitly states a non-call must have GT ./. (presumably only if written out). It doesn't explicitly forbid GQ, but the GT bit is also changed by GATK and that is non-compliant surely.

The comment #756 (comment) gives a hint about the motivation for GATK's behavior: to distinguish between "missing" vs. "no-call" genotypes (presumably "no data" vs. "the caller can't decide").

The VCF specification does not dictate one cannot have the combination of GT=0/0 + GQ=0. I think it's fair to say it is not currently forbidden, despite the fact its semantics is not well defined.

Then "the assertion made in the ALT column" is confusing - what "the" is it?

I agree that the small change in the wording of the GQ definition I suggested above would not work. The case of ALT=. has to be spelled out separately, similarly to the description of the QUAL column.

If it should be conditioned on something, then probably directly on the observed data. For example, if a site contains a lot of noise and the caller decides on a no variant call, it still makes sense to speak about GQ; in such case one would expect to have a low GQ value, even though the site ultimately ends up with ALT=. and GT=0/0.

The inevitable question is: how is this different from QUAL? In my intuition, one can have varying levels of confidence for different samples being GT=0/0 and the entire site being ALT=.

QUAL has historically been a bit of a basketcase and many tools, GATK included, aren't remotely calibrated and definitely don't use a Phred scale

This is not relevant for the thread and also is not entirely correct. The programs do use phred scale, it's just that the probabilities aren't what you think. They are emitted by mathematical models with specific assumptions, not met in the real life. One should not expect they are the exact real-world probabilities of an error. At least authors of these models certainly don't! If that was the case, the black magic art of variant filtering would not be required, one would just filter by QUAL to obtain a perfect callset for the desired false discovery rate.

@d-cameron
Copy link
Contributor

The VCF specification does not dictate one cannot have the combination of GT=0/0 + GQ=0. I think it's fair to say it is not currently forbidden, despite the fact its semantics is not well defined.

The spec does not disallow this combination but it does require missing alleles be ... By GATK own blog post admission, they merged this missing state with another state which the spec explicitly forbids.

GATK output violates the VCF specifications not because 0/0 GT=0 is not valid in VCF, but because it does not use . for missing alleles.

@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 9, 2024

Answering my own internal question:

If we asked for GQs to be generated for a gVCF, what do tools currently do for the REF calls? I see bcftools just omits it, but is that universally the case?

https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531-GenotypeGVCFs-and-the-death-of-the-dot

The above link contains a before and after example. Both include GQ 0.

Petr writes:

The comment #756 (comment) gives a hint about the motivation for GATK's behavior: to distinguish between "missing" vs. "no-call" genotypes (presumably "no data" vs. "the caller can't decide").

Previously missing (no-data) was ./. and can't decide (no-call) could be GQ close to 0 (with whatever GT we decide is most likely, even though we think it's quite probably not true). Now they look the same, so I don't buy this logic of wanting to distinguish because they've actually collapsed the two together. Infact they claim a different reason (see below). Fortunately, as they note, there now needs to be a third field that is assessed - DP - to disambiguate.

It seems they're making life far harder than is necessary. My bet is this will just spawn a bunch of scripts to convert GATK output back again to the original encoding and fix the brokeness. Sigh.

Their claimed reason for the change is efficiency and sizes of many-sample data sets. However I think this is misplaced, or at least it needed a discussion with hts-specs BEFORE unilaterally deciding to change the format. Really, VCF is inefficient and something needs to change. I do get that. However long term it's probably a different encoding format behind the scenes (as CRAM is to BAM) with the data being exposed in the same model once decoded, rather than this. There are many such alternatives already floated.

@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 9, 2024

the official interpretation is that GQ=0 means probability of error is 1. That is not the same as saying there is zero confidence in the call. Quite the opposite infact - it is stating that there is absolute confidence that the call is NOT 0/0 as it is the probability the call listed is correct rather than the probability that any call made is correct.

I realise now I've been too strong in this assertion. While true that a P(is_error)=1 is an assertion that something is 100% not true, GQ=0 is not P(error)=1. GQ is an integer so there are many things which could get rounded to 0.

Ie 0.9 <= P(error) <= 1, if we round to nearest integer
or 0.8 <= P(error) <= 1, if we round down

(With a liberal avoidance of the "conditional on the site being a variant")

So It does kind of make sense. Sadly it also implies we perhaps should have been using log-odds where GQ=0 is p(err)=0.5 and if we think say P(GT==0/0) is 0.9 then GQ~=-10, and if P(GT==0/0) is 0.1 then GQ~=+10. Far more expressive about the couldn't-call case as it offers us a way to describe how confident we are in the "not-a-variant" call without resorting to a second field (QUAL). But... we are where we are.

QUAL has historically been a bit of a basketcase and many tools, GATK included, aren't remotely calibrated and definitely don't use a Phred scale

This is not relevant for the thread and also is not entirely correct. The programs do use phred scale, it's just that the probabilities aren't what you think.

Personally I don't think that's justifyable, especially if you want to work out the likelihood the specific called genotype is correct as then you have to hop between QUAL and GQ. You can only do that if they're calibrated to the same scale, and they're not (in practice).

It's like saying "sure my thermometer really is measuring in centegrade, but it's just not the same centegrade you're thinking about". Useless!

Phred has a specific meaning. I was in the field when the Phred program arrived, and it wasn't just a bunch of maths with numbers plucked out of thin air. Brent Ewing measured a bunch of stats and trained a model so they could be pulled together to give a calibrated quality value. Without that, it's not Phred - it's just an arbitrary score with no meaning other than the higher it is the better the quality.

I get that calibration is hard. I didn't do it particularly in my consensus algorithm either. However my point is an uncalibrated scale is only useful if you have one scale. If you're mixing data from two different scales (GQ and QUAL), neither of which are calibrated and they have totally different scales, then basically you're stuffed. That's the situation we have right now where GQ is for variants and QUAL is for GQ=0/0.

PS. Don't even get me started on the bits of VCF spec labelled as "Phred scale" where the lower it is the better the quality!

@lh3
Copy link
Member

lh3 commented Feb 9, 2024

the question here is whether using 0/0 + GQ=0 is breaking the vcf spec

The VCF spec is not the Bible. It can be revised. The question is not whether this breaks the spec, but whether this is a good feature. I think we have enough concerns to explicitly ban the use of 0/0 for missing genotype.

@d-cameron
Copy link
Contributor

I think we have enough concerns to explicitly ban the use of 0/0 for missing genotype.

We already do: If a call cannot be made for a sample at a given locus, . must be specified for each missing allele in the GT field.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants