Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Validate VariantContext AC and AF without genotypes #759
Conversation
yfarjoun
was assigned
by ronlevine
Nov 27, 2016
coveralls
commented
Nov 27, 2016
coveralls
commented
Nov 27, 2016
coveralls
commented
Nov 27, 2016
|
@yfarjoun Please review. |
ronlevine
assigned lbergelson and unassigned yfarjoun
Nov 29, 2016
|
@lbergelson Please review. |
lbergelson
requested changes
Nov 30, 2016
@ronlevine I think we can make a few optimizations. Might be worth it if this runs on every variant.
I think the function naming needs clarifying too. Ready to merge once those are addressed
| @@ -1233,7 +1233,21 @@ public void validateAlternateAlleles() { | ||
| throw new TribbleException.InternalCodecException(String.format("one or more of the ALT allele(s) for the record at position %s:%d are not observed at all in the sample genotypes", getContig(), getStart())); | ||
| } | ||
| + private void validateNumberOfItems(final String attributeKey, final int observedSize ) { |
lbergelson
Nov 30, 2016
Contributor
the naming here is not very helpful. Something like: private void validateAttributeIsExpectedSize(final String attributeKey, final int expectedSize) might would be clearer.
The observed vs reported size is very confusing in this context I think, I would definitely prefer actual vs expected.
or if you want to be really specific validateAttributeHasOneEntryForEveryAltAllele and name accordingly.
ronlevine
Nov 30, 2016
•
Contributor
The rest of the code was using statistics lingo, observed/reported but will change to actual/expected. I'd like the keep the name more generic so that it can be used for other annotations.
| @@ -1233,7 +1233,21 @@ public void validateAlternateAlleles() { | ||
| throw new TribbleException.InternalCodecException(String.format("one or more of the ALT allele(s) for the record at position %s:%d are not observed at all in the sample genotypes", getContig(), getStart())); | ||
| } | ||
| + private void validateNumberOfItems(final String attributeKey, final int observedSize ) { | ||
| + if ( hasAttribute(attributeKey) && observedSize > 0) { |
lbergelson
Nov 30, 2016
Contributor
swap the order of these checks here I think, the second is clearly cheaper
| + private void validateNumberOfItems(final String attributeKey, final int observedSize ) { | ||
| + if ( hasAttribute(attributeKey) && observedSize > 0) { | ||
| + Object object = getAttribute(attributeKey); | ||
| + int reportedSize = (object instanceof List ) ? ((List) object).size() : 1; |
lbergelson
Nov 30, 2016
Contributor
Is it safe to assume attributes are never other sorts of collections or arrays?
ronlevine
Nov 30, 2016
Contributor
As far as I know. For AC and AF, they are. If the method is used for other annotations and they are not, data type can be added.
magicDGS
Dec 1, 2016
Contributor
I did some work on getters for attribute lists in the VariantContext (see #712) and I think that this implementation could throw some errors if the AC or AF are set as an array (e.g., int[]). You can use here the method getAttributeAsList(attributeKey) to take into account this implementation. If the attribute is a single value, the returned value is a list with size = 1, so this also will simplify the code here.
magicDGS
Dec 1, 2016
Contributor
This method could be also later or, to avoid the list re-allocation, make the validateNumberOfItems() return the attribute list.
ronlevine
Dec 5, 2016
•
Contributor
@magicDGS Added:
if ( object.getClass().isArray() ) {
throw new TribbleException.InternalCodecException(String.format("the %s tag vallues cannot be an array at position %s:%d,", attributeKey, getContig(), getStart()));
}
| public void validateChromosomeCounts() { | ||
| + final int numberOfAlternateAlleles = alleles.size() - 1; |
lbergelson
Nov 30, 2016
Contributor
you could reuse numberOfAlternateAlleles down below where !getAlternateAlleles().isEmpty() is checked to save an unnecessary List allocation
| public void validateChromosomeCounts() { | ||
| + final int numberOfAlternateAlleles = alleles.size() - 1; | ||
| + validateNumberOfItems(VCFConstants.ALLELE_COUNT_KEY, numberOfAlternateAlleles); |
lbergelson
Nov 30, 2016
Contributor
Doesn't this make the two size checks + throws in the AC block below redundant? Can we remove those?
ronlevine
Dec 1, 2016
•
Contributor
The actualACs.size() != expectedACs.size() and actualACs.size() != 1 checks are now gone.
lbergelson
assigned ronlevine and unassigned lbergelson
Nov 30, 2016
| public void validateChromosomeCounts() { | ||
| + final int numberOfAlternateAlleles = alleles.size() - 1; | ||
| + validateNumberOfItems(VCFConstants.ALLELE_COUNT_KEY, numberOfAlternateAlleles); | ||
| + validateNumberOfItems(VCFConstants.ALLELE_FREQUENCY_KEY, numberOfAlternateAlleles); |
magicDGS
Dec 1, 2016
•
Contributor
The number of items for AF is validated here, but not the contents. Shouldn't this be included too?
coveralls
commented
Dec 1, 2016
| + if ( expectedSize > 0 && hasAttribute(attributeKey) ) { | ||
| + final Object object = getAttribute(attributeKey); | ||
| + if ( object.getClass().isArray() ) { | ||
| + throw new TribbleException.InternalCodecException(String.format("the %s tag vallues cannot be an array at position %s:%d,", attributeKey, getContig(), getStart())); |
magicDGS
Dec 5, 2016
Contributor
Why this should be the case? If someone use the following, it will blow up:
final VariantContext variant = new VariantContextBuilder(variantToCopy)
.setAttribute(VCFConstants.ALLELE_COUNT_KEY, new int[]{1, 3})
.make();But I don't think that it is incorrect to set a key that is expected to have more than one value as an array. Why not using the getter for list here? If it is a single value, it will return a singleton list; if it is a list and/or array, it will be converted to a list. Anyway, using the getter an then the size method will get the actual size independently if it is a list or not.
lbergelson
Dec 6, 2016
Contributor
@magicDGS That's an excellent point. I didn't even realize that getter existed.
coveralls
commented
Dec 5, 2016
| - public void validateReferenceBases(final Allele reportedReference, final Allele observedReference) { | ||
| - if ( reportedReference != null && !reportedReference.basesMatch(observedReference) ) { | ||
| - throw new TribbleException.InternalCodecException(String.format("the REF allele is incorrect for the record at position %s:%d, fasta says %s vs. VCF says %s", getContig(), getStart(), observedReference.getBaseString(), reportedReference.getBaseString())); | ||
| + public void validateReferenceBases(final Allele expectedReference, final Allele actualReference) { |
lbergelson
Dec 6, 2016
Contributor
@ronlevine I wouldn't change the reported/observed terminology in the whole file. Just in that one function I was talking about.
ronlevine
Dec 6, 2016
Contributor
OK. I thought using that terminology throughout the file seemed clearer.
| - throw new TribbleException.InternalCodecException(String.format("the Allele Count (AC) tag doesn't have the correct number of values for the record at position %s:%d, %d vs. %d", getContig(), getStart(), reportedACs.size(), observedACs.size())); | ||
| - for (int i = 0; i < observedACs.size(); i++) { | ||
| + final Object ac = getAttribute(VCFConstants.ALLELE_COUNT_KEY); | ||
| + if ( ac instanceof List ) { |
lbergelson
Dec 6, 2016
Contributor
Can we use that getter here too instead of special casing lists + not list?
|
@ronlevine If you want to get this merged in today, you could move the new AF validation to a second pull request and just do the initial changes. I think in any case now it should switch to using getAttributeAsList where appropriate to simplify the code. |
|
@lbergelson Sounds good. I'll do my best to take care of it tonight. |
|
@lbergelson Please take a look. I removed the validation of AF values, used the attribute getters and reverted the original code to observed/reported terminology. |
coveralls
commented
Dec 7, 2016
coveralls
commented
Dec 7, 2016
| + final List<Object> actualValues = getAttributeAsList(attributeKey); | ||
| + if (!actualValues.isEmpty()) { | ||
| + // always have at least one actual value | ||
| + final int expectedValuesSize = expectedSize > 0 ? expectedSize : expectedSize + 1; |
lbergelson
Dec 7, 2016
Contributor
@ronlevine I'm not sure I understand this. Why do you need to add 1 to it?
ronlevine
Dec 7, 2016
•
Contributor
If no alt alleles (expectSize == 0), and the annotation exists, there will be 1 value. For example, for 0/0, AC=0. The relevant code is line 1273.
lbergelson
Dec 7, 2016
Contributor
Could you change this to expectedSize > 0 ? expectedSize : 1 I think that's more clear.
Since this behavior is pretty specific to these attributes I might change the naming again, i.e. expectedSize -> numAlternateAlleles.
Sorry for the back and forth on this.
|
@ronlevine Looks good to me. Going to merge when tests finish. |
|
Rebasing. |
ronlevine commentedNov 27, 2016
Description
Implements #757.
Need to validate the number of elements in AC and AF without genotypes.
Checklist