Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for changes to Allele VariantContext, and Genotype Context for upstream deletion compatibility #806

Open
yfarjoun opened this issue Feb 14, 2017 · 8 comments

Comments

@yfarjoun
Copy link
Contributor

The current VCF spec allows for a * allele (no brackets):

"The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion."

Currently, Allele, VariantContext, and GenotypeContext do not have any special consideration towards the star allele, and therefore consider the following to be true:

Alelle
"*" is a SNP and in particular not SYMBOLIC

VariantContext
Ref = A Alt = C,* is of type SNP
Ref = A Alt = AC,* is of type MIXED

GenotypeContext
A*/* (meaning A as reference and star allele) is type HET
A/* (meaning A is alternate) is also type HET

This means that for large cohorts, for which star allele is becoming a fact of life, all variants will be MIXED HET genotypes. This is uninformative, and unacceptable (IMHO).

In particular, this is causing downstream problems, as we can see in the following picard issue:
broadinstitute/picard#555

I propose the following modification:

Alelle
"*" is a new allele type: UPSTREAM_DELETION, not SYMBOLIC, not a SNP

VariantContext
Ref = A Alt = C,* is of type SNP
Ref = A Alt = AC,* is of type INDEL
Ref = A Alt = * is a new type UPSTREAM_DELETION

GenotypeContext
A*/* (meaning A as reference and star allele) is type HOM_REF (since the deletion was already "counted" upstream.
A/* (meaning A is alternate) remains of type HET due to the non-ref A allele.

I'm happy to implement this, but I wanted to get community comments....

@tfenne
Copy link
Member

tfenne commented Feb 14, 2017

@yfarjoun Thanks for tackling this! I'm in agreement with most of what you propose but would like to nit-pick a couple of points if I may!

The latest released version of the spec I could find (4.3) has slightly different language than you quote: "The ‘*’ allele is reserved to indicate that the allele is missing due to a an overlapping deletion". I think that terminology is better because "upstream" in this terminology really just means "upstream in the file", because the deletion in question spans the current context. So I'd suggest calling the new allele type something else, suggestions include: SPANNING_DELETION, OVERLAPPING_DELETION or even something like NULL_ALLELE to indicate there is no allele (there is no spoon?) because the (possibly very local) region is not present on the chromosome in question.

By the same logic I would argue that your case of Ref = A Alt = * is really a NO_VARIATION because there are no alternative alleles present. In fact I think your first two examples point to an easy rule, that the type should simply be the type of the variant ignoring the * allele.

This raises further questions like if you have a SNP variant with four genotyped samples, but one of the samples has A/* as it's genotype - how many genotyped chromosomes do you have?

@lbergelson
Copy link
Member

lbergelson commented Feb 14, 2017

*/* as HOM_REF seems a bit weird to me. If I query a vcf at some position and see */* I don't think I want to be told that that's a HOM_REF site. I'd probably want to know that it's a homozygous deletion at that site. If the problem is that we want to avoid double counting things in the metrics then we should make it a different category entirely I think. Or just treat it as symbolic and let individual tools deal with it. Isn't it sort of just a shorthand for creating a named symbolic deletion and referencing it from somewhere else in the file?

SPANNING_DELETION would be consistent with the language we've been using in GATK already I think. I agree with Tim that UPSTREAM is strange.

@magicDGS
Copy link
Member

magicDGS commented Feb 15, 2017

I have one concern about this: if "the ‘*’ allele is reserved to indicate that the allele is missing due to a an overlapping deletion" it does not mean that this overlapping deletion was called or even present in the file. What I would like to point is that I'm in favor of naming it as an SPANNING_DELETION, but I am not sure about the change in VariantContext or GenotypeContext because they will remove the check for this "symbolic allele" present in the variant.

From my point of view, in GenotypeContext the A*/* (reference/star) is not an HOM_REF nor NO_VARIATION. This is an hemizygous position, so it cannot be called homozygous (only one chromosome is present) and there is variation in the copy number. The same is true for the A/* (alternative/star), that is also an hemizygous.

In the case of the VariantContext, I do not agree with the current behaviour for SNPs (Ref = A Alt = C,*): is it an SNP variation, but also there are the same problem as the genotypes. This is something more like MIXED, but I have also my doubts with that. Otherwise, I like the other changes regarding the INDEL for reference + indel + start (because the star also represents something like a deletion) and the SPANNING_DELETION for reference + star.

As conclusion, I support @lbergelson's idea of treat is just as symbolic and letting the dealing of it to the implementation of each tool/framework. There are too many things that could be confusing with the spanning deletion allele.

@tfenne
Copy link
Member

tfenne commented Feb 15, 2017

@magicDGS's points reminded me of something I've been thinking about, but which might be beyond the scope of this PR. Specifically that with larger callsets and/or more complicated genomic regions the idea of a VariantType is becoming less and less useful, and the GenotypeType is really the critical thing. I'm getting around this a fair amount if code by doing things like ctx.subContextFromSamples(oneSample).getType().

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Feb 15, 2017

The problem with treating star as symbolic, is that I'm considering the case of callsets like ExAC where almost every variant will have a spanning deletion and so all the variant types will become SYMBOLIC...so that's not good either.

Alternatively, we could add a boolean to the VC hasSpanningDeletion() that will inform about that directly, while the getType() method will return a result ignoring the star allele.

Thus we will have:

Alelle
"*" is a new allele type: SPANNING_DELETION, not SYMBOLIC, not a SNP, not an INDEL

VariantContext
Ref = A Alt = C,* is of type SNP, hasSpanningDeletion is true
Ref = A Alt = AC,* is of type INDEL, hasSpanningDeletion is true
Ref = A Alt = * is a new type NO_VARIATION, hasSpanningDeletion is true

GenotypeContext
A*/* (meaning A as reference and star allele) is type HOM_REF, hasSpanningDeletion=true
A/* (meaning A is alternate) remains of type HET due to the non-ref A allele, hasSpanningDeletion=true
*/* (meaning both alleles are spanning deletions) is type NO_CALL, hasSpanningDeletion=true

@magicDGS
Copy link
Member

I see your point with SYMBOLIC, @yfarjoun. The case of VariantContextin combination with the method looks like the proper way of doing it.

Regarding the Genotype, I think that this is still not a good idea to assign HOM_REF or HET to an hemizygous call. What's about a new type in GenotypeType for the hemizygous case? In the case that this is changed, maybe / could be also define a new type or being UNAVAILABLE, but this requires a change in the documentation too.

Anyway, I have several non-related concerns about the design of genotypes in HTSJDK, so maybe my opinion is not that important in that case. I would rather prefer a Genotype design to be independent of the ploidy to avoid this kind of things, although it is true that most of the data is diploid.

@vdauwera
Copy link

I'm delighted to see this being discussed -- but I would point out that it will be crucial to make whatever nomenclature comes out of this agnostic of ploidy. The amount of data being produced for non diploid orgs is growing all the time (from bacterial genomes to agricultural staples; lots of cereals are polyploid, iirc) so we can't ignore it and should plan for it, otherwise we'll just have to revise the system again, which is disruptive.

@magicDGS
Copy link
Member

Apart of the cases that @vdauwera mentioned, it will be also important in pooled samples such cancer, virus or Pool-Seq datasets. Actually I was looking for discussing the diploid design of HTSJDK in a different issue, because this is just focused on the spanning deletion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants