<*> allele vs <NON_REF> #352

Closed
lbergelson opened this issue Oct 15, 2018 · 12 comments · Fixed by #380

Comments

@lbergelson
Member

In the VCF 4.3 spec the <*> is specified as the unspecified alt allele.

GATK uses the <NON_REF> allele instead for this purpose and treats <*> as a deprecated version of the * allele. This clearly does not match the spec. Are there other software suites that generate gVCFs that use <*>?

@pd3 I'm wondering why <*> was selected as the canonical unspecified alt. Do other common callers produce gVCFs with <*>?

@pd3
Member

pd3 commented Oct 16, 2018

Historically, samtools was the first caller to start using the unspecified allele and originally it was *. After this was codified in the specification as <*>, samtools/bcftools started producing <*> instead. I don't know why GATK uses <NON_REF> instead of <*>.

@lbergelson
Member Author

Interesting, the history seems complicated. I did a bit of digging to try and understand it.

Unfortunately GATK never got the message to change to <*>, and it doesn't seem likely to happen due to how many <NON_REF> gvcfs there are out there now.

It might be worth considering adding back <NON_REF> to the spec since practically it's used in petabytes of gvcf files and it's unlikely to go away.

@yfarjoun
Contributor

yfarjoun commented Oct 17, 2018 via email

@nh13
Member

nh13 commented Oct 17, 2018

@yfarjoun my best google-fu could not find a list of supported or unsupported deliverables for GATK, but searching for ‘<NON_REF>’ did yield a non-trivial number of tools that rely on or leverage the allele. So likely there are stored gVCFs that have this allele, waiting to be combined in joint calling mid-project, or otherwise. I just don’t buy your argument.

I’d argue that GATK switch over to the spec and perhaps add an option to output the old allele for a while.

@yfarjoun
Contributor

@nh13 I agree that GATK hasn't been public about what is or isn't a deliverable, but we have said that there should be consistency between the version that created the gVCFs and the one that combines/genotypes them.

Regardless, we agree on the conclusion, that GATK can switch to the spec, rather than the spec changing.

@pd3
Member

pd3 commented Oct 17, 2018

Yes, that would be great if GATK could switch to the spec.

@lbergelson You are right, completely forgot about the X and <X>.

@tfenne
Member

tfenne commented Oct 17, 2018

@yfarjoun & @lbergelson I would think it's probably feasible to have GATK recognize <NON_REF> and <*> as the unspecified non-ref allele, in perpetuity, while switching to producing <*> at some point? There's no reason GATK can't read both of those alleles and treat them the same, no?
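
A minimal sketch of that read-both idea, written in Python for illustration. It is not GATK or htsjdk code, and every name in it (`AltKind`, `classify_alt`, `normalize_alt`) is hypothetical. It assumes the VCF 4.3 reading quoted in this thread: `*` is the spanning deletion, and both `<*>` and `<NON_REF>` stand for the unspecified alternate allele.

```python
from enum import Enum


class AltKind(Enum):
    """Hypothetical categories for an ALT allele string."""
    UNSPECIFIED = "unspecified"              # gVCF placeholder for any other allele
    SPANNING_DELETION = "spanning_deletion"  # allele missing due to an overlapping deletion
    OTHER = "other"                          # bases, other symbolic alleles, breakends


# Assumption: a reader accepts both spellings and treats them identically.
UNSPECIFIED_ALLELES = frozenset({"<*>", "<NON_REF>"})
SPANNING_DELETION_ALLELE = "*"


def classify_alt(allele: str) -> AltKind:
    """Classify one ALT allele string without looking at the rest of the record."""
    if allele in UNSPECIFIED_ALLELES:
        return AltKind.UNSPECIFIED
    if allele == SPANNING_DELETION_ALLELE:
        return AltKind.SPANNING_DELETION
    return AltKind.OTHER


def normalize_alt(allele: str, preferred: str = "<*>") -> str:
    """Rewrite either unspecified spelling to `preferred`; leave other alleles as-is."""
    if classify_alt(allele) is AltKind.UNSPECIFIED:
        return preferred
    return allele
```

With this shape, the choice of which spelling to write is a single parameter on output, while input handling stays the same for files using either spelling.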

@yfarjoun
Contributor

It is, of course, feasible, but I'm actually concerned about * vs. <*>, since we will have a hard time distinguishing (in speech) between "star allele" meaning "spanning deletion" and "star allele" meaning "unspecified allele"...

@tfenne
Member

tfenne commented Oct 17, 2018

Ah, yes, good point.

@lbergelson
Member Author

To make it even more confusing, it seems like at some point we used <*> to mean spanning deletion, and some code accepts either <*> or * as spanning deletions.

@pd3
Member

pd3 commented Oct 18, 2018

@yfarjoun It can't be that hard, you just did it! ;-)

@trutane

trutane commented Feb 4, 2019

I must admit, having <*> and * mean different things is a degree of subtlety that makes me nervous. There is already mounting FUD about it, with someone claiming it indicates homozygous reference sites.

I think there is ample opportunity for ambiguity in the v4.3 spec regarding how to interpret *'s in various contexts, comparing sections 1.6.1 vs 5.5:

1.6.1, Fixed fields, ALT:

...an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends. The ‘*’ allele is reserved to indicate that the allele is missing due to an overlapping deletion.

5.5, Representing unspecified alleles and REF-only blocks (gVCF):

A symbolic alternate allele <*> is used to represent this unspecified alternate allele.

If the spec docs were clearer about this distinction, and GATK and other tools could be updated to be spec-consistent, I guess I could live with it, though it still might make me a bit queasy...

lbergelson added a commit to lbergelson/hts-specs that referenced this issue Feb 6, 2019
* Adding <NON_REF> to the spec and recommending its use over the <*> allele which is easily confused with *.
* closes samtools#352
lbergelson added a commit to lbergelson/hts-specs that referenced this issue May 30, 2019
* Adding <NON_REF> to the spec and recommending its use over the <*> allele which is easily confused with *.
* closes samtools#352