Validate VariantContext AC and AF without genotypes #759

Merged
merged 1 commit into from Dec 7, 2016

5 participants
Contributor

ronlevine commented Nov 27, 2016

Description

Implements #757.
Need to validate the number of elements in AC and AF without genotypes.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

yfarjoun was assigned by ronlevine Nov 27, 2016

Coverage Status

Coverage increased (+0.002%) to 70.029% when pulling 7742aca on rhl_validate_ac_af_no_gt_757 into 6e4e875 on master.

Coverage Status

Coverage decreased (-0.001%) to 70.026% when pulling 57f00f3 on rhl_validate_ac_af_no_gt_757 into 6e4e875 on master.

Coverage Status

Coverage increased (+0.005%) to 70.033% when pulling 57f00f3 on rhl_validate_ac_af_no_gt_757 into 6e4e875 on master.

Contributor

ronlevine commented Nov 27, 2016

@yfarjoun Please review.

@ronlevine ronlevine assigned lbergelson and unassigned yfarjoun Nov 29, 2016

Contributor

ronlevine commented Nov 29, 2016

@lbergelson Please review.

@lbergelson

@ronlevine I think we can make a few optimizations. Might be worth it if this runs on every variant.

I think the function naming needs clarifying too. Ready to merge once those are addressed.

@@ -1233,7 +1233,21 @@ public void validateAlternateAlleles() {
throw new TribbleException.InternalCodecException(String.format("one or more of the ALT allele(s) for the record at position %s:%d are not observed at all in the sample genotypes", getContig(), getStart()));
}
+ private void validateNumberOfItems(final String attributeKey, final int observedSize ) {
@lbergelson

lbergelson Nov 30, 2016

Contributor

The naming here is not very helpful. Something like private void validateAttributeIsExpectedSize(final String attributeKey, final int expectedSize) would be clearer.

The observed vs reported size is very confusing in this context I think, I would definitely prefer actual vs expected.

or if you want to be really specific validateAttributeHasOneEntryForEveryAltAllele and name accordingly.

@ronlevine

ronlevine Nov 30, 2016 edited

Contributor

The rest of the code was using statistics lingo, observed/reported, but I will change to actual/expected. I'd like to keep the name more generic so that it can be used for other annotations.

@@ -1233,7 +1233,21 @@ public void validateAlternateAlleles() {
throw new TribbleException.InternalCodecException(String.format("one or more of the ALT allele(s) for the record at position %s:%d are not observed at all in the sample genotypes", getContig(), getStart()));
}
+ private void validateNumberOfItems(final String attributeKey, final int observedSize ) {
+ if ( hasAttribute(attributeKey) && observedSize > 0) {
@lbergelson

lbergelson Nov 30, 2016

Contributor

swap the order of these checks here I think, the second is clearly cheaper

@ronlevine

ronlevine Nov 30, 2016

Contributor

Agreed. Done.
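
The reordering agreed on above relies on `&&` short-circuiting: when the cheap integer comparison fails, the attribute lookup never runs. The following is a minimal sketch of that idea only. `CheckOrderSketch`, its `lookups` counter, and `needsSizeCheck` are hypothetical stand-ins and not part of htsjdk's `VariantContext`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a record with attributes; NOT htsjdk's VariantContext.
final class CheckOrderSketch {
    private final Map<String, Object> attributes = new HashMap<>();
    int lookups = 0; // counts how often the attribute map is consulted

    CheckOrderSketch put(final String key, final Object value) {
        attributes.put(key, value);
        return this;
    }

    boolean hasAttribute(final String key) {
        lookups++;
        return attributes.containsKey(key);
    }

    // Cheap comparison first: && short-circuits, so hasAttribute() is
    // skipped whenever expectedSize is not positive.
    boolean needsSizeCheck(final String key, final int expectedSize) {
        return expectedSize > 0 && hasAttribute(key);
    }

    public static void main(final String[] args) {
        final CheckOrderSketch record = new CheckOrderSketch().put("AC", 3);
        System.out.println(record.needsSizeCheck("AC", 0)); // false
        System.out.println(record.lookups);                 // 0, lookup skipped
        System.out.println(record.needsSizeCheck("AC", 1)); // true
        System.out.println(record.lookups);                 // 1
    }
}
```

Whether the saving is measurable depends on how expensive the real lookup is; the sketch only shows that the second operand is not evaluated.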

+ private void validateNumberOfItems(final String attributeKey, final int observedSize ) {
+ if ( hasAttribute(attributeKey) && observedSize > 0) {
+ Object object = getAttribute(attributeKey);
+ int reportedSize = (object instanceof List ) ? ((List) object).size() : 1;
@lbergelson

lbergelson Nov 30, 2016

Contributor

Is it safe to assume attributes are never other sorts of collections or arrays?

@ronlevine

ronlevine Nov 30, 2016

Contributor

As far as I know. For AC and AF, they are. If the method is used for other annotations and they are not, data type can be added.

@magicDGS

magicDGS Dec 1, 2016

Contributor

I did some work on getters for attribute lists in the VariantContext (see #712), and I think this implementation could throw errors if AC or AF is set as an array (e.g., int[]). You can use the method getAttributeAsList(attributeKey) here to take that into account. If the attribute is a single value, the returned value is a list with size = 1, so this will also simplify the code here.

@magicDGS

magicDGS Dec 1, 2016

Contributor

This method could also be used later or, to avoid the list re-allocation, validateNumberOfItems() could return the attribute list.

@ronlevine

ronlevine Dec 5, 2016 edited

Contributor

@magicDGS Added:

if ( object.getClass().isArray() ) {
    throw new TribbleException.InternalCodecException(String.format("the %s tag vallues cannot be an array at position %s:%d,", attributeKey, getContig(), getStart()));
}
public void validateChromosomeCounts() {
+ final int numberOfAlternateAlleles = alleles.size() - 1;
@lbergelson

lbergelson Nov 30, 2016

Contributor

you could reuse numberOfAlternateAlleles down below where !getAlternateAlleles().isEmpty() is checked to save an unnecessary List allocation

@ronlevine

ronlevine Dec 1, 2016

Contributor

Done.
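
The suggestion above amounts to deriving "are there any alternate alleles?" from a count that is already computed, instead of asking for a list. A minimal sketch follows, assuming the allele list always holds the reference allele first followed by the alternates. `AltAlleleCountSketch` and its methods are hypothetical and use plain strings rather than htsjdk's `Allele` type.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model: index 0 is the reference allele, the rest are alternates.
final class AltAlleleCountSketch {
    private final List<String> alleles;

    AltAlleleCountSketch(final List<String> alleles) {
        if (alleles.isEmpty()) {
            throw new IllegalArgumentException("a reference allele is required");
        }
        this.alleles = alleles;
    }

    // List-based accessor, kept only for comparison.
    List<String> getAlternateAlleles() {
        return alleles.subList(1, alleles.size());
    }

    // Pure arithmetic: no list object is needed to answer the question.
    int numberOfAlternateAlleles() {
        return alleles.size() - 1;
    }

    boolean hasAlternateAlleles() {
        return numberOfAlternateAlleles() > 0;
    }

    public static void main(final String[] args) {
        final AltAlleleCountSketch site = new AltAlleleCountSketch(Arrays.asList("A", "T", "G"));
        System.out.println(site.numberOfAlternateAlleles()); // 2
        System.out.println(site.hasAlternateAlleles());      // true
    }
}
```

Both routes give the same answer; the count-based one simply reuses a value the validation code already holds.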

public void validateChromosomeCounts() {
+ final int numberOfAlternateAlleles = alleles.size() - 1;
+ validateNumberOfItems(VCFConstants.ALLELE_COUNT_KEY, numberOfAlternateAlleles);
@lbergelson

lbergelson Nov 30, 2016

Contributor

Doesn't this make the two size checks + throws in the AC block below redundant? Can we remove those?

@ronlevine

ronlevine Dec 1, 2016 edited

Contributor

The actualACs.size() != expectedACs.size() and actualACs.size() != 1 checks are now gone.

@lbergelson lbergelson assigned ronlevine and unassigned lbergelson Nov 30, 2016

public void validateChromosomeCounts() {
+ final int numberOfAlternateAlleles = alleles.size() - 1;
+ validateNumberOfItems(VCFConstants.ALLELE_COUNT_KEY, numberOfAlternateAlleles);
+ validateNumberOfItems(VCFConstants.ALLELE_FREQUENCY_KEY, numberOfAlternateAlleles);
@magicDGS

magicDGS Dec 1, 2016 edited

Contributor

The number of items for AF is validated here, but not the contents. Shouldn't this be included too?

@ronlevine

ronlevine Dec 5, 2016

Contributor

It's out of scope but while I'm there, I might as well add it.

Coverage Status

Coverage decreased (-0.02%) to 70.009% when pulling 046380a on rhl_validate_ac_af_no_gt_757 into 6e4e875 on master.

+ if ( expectedSize > 0 && hasAttribute(attributeKey) ) {
+ final Object object = getAttribute(attributeKey);
+ if ( object.getClass().isArray() ) {
+ throw new TribbleException.InternalCodecException(String.format("the %s tag vallues cannot be an array at position %s:%d,", attributeKey, getContig(), getStart()));
@magicDGS

magicDGS Dec 5, 2016

Contributor

Why should this be the case? If someone uses the following, it will blow up:

final VariantContext variant = new VariantContextBuilder(variantToCopy)
    .setAttribute(VCFConstants.ALLELE_COUNT_KEY, new int[]{1, 3})
    .make();

But I don't think it is incorrect to set a key that is expected to have more than one value as an array. Why not use the list getter here? If it is a single value, it will return a singleton list; if it is a list or an array, it will be converted to a list. Either way, calling the getter and then size() gives the actual size regardless of whether the attribute is a list.

@lbergelson

lbergelson Dec 6, 2016

Contributor

@magicDGS That's an excellent point. I didn't even realize that getter existed.
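
To illustrate the point made in this thread, here is a minimal sketch contrasting the `instanceof List` size check from the first revision with a normalising conversion of the kind described for the list getter. `AttributeSizeSketch` and `toList` are hypothetical; this is not the implementation of htsjdk's `getAttributeAsList`, and returning an empty list for `null` is an assumption made for the sketch.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AttributeSizeSketch {
    // Size logic from the first revision: anything that is not a List
    // counts as a single value, so an int[] of length 2 reports 1.
    static int sizeViaInstanceOfList(final Object value) {
        return (value instanceof List) ? ((List<?>) value).size() : 1;
    }

    // Hypothetical normaliser: lists are copied, arrays (including
    // primitive arrays) are unpacked, and any other value is wrapped.
    static List<Object> toList(final Object value) {
        if (value == null) {
            return Collections.emptyList(); // assumption: absent means empty
        }
        if (value instanceof List) {
            return new ArrayList<>((List<?>) value);
        }
        if (value.getClass().isArray()) {
            final int length = Array.getLength(value);
            final List<Object> result = new ArrayList<>(length);
            for (int i = 0; i < length; i++) {
                result.add(Array.get(value, i)); // boxes primitive elements
            }
            return result;
        }
        return Collections.singletonList(value);
    }

    public static void main(final String[] args) {
        final Object ac = new int[]{1, 3};
        System.out.println(sizeViaInstanceOfList(ac)); // 1
        System.out.println(toList(ac));                // [1, 3]
    }
}
```

With a conversion like this, the size comparison no longer needs to special-case lists, arrays, or single values.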

Coverage Status

Coverage increased (+0.4%) to 70.397% when pulling 1490f52 on rhl_validate_ac_af_no_gt_757 into 6e4e875 on master.

- public void validateReferenceBases(final Allele reportedReference, final Allele observedReference) {
- if ( reportedReference != null && !reportedReference.basesMatch(observedReference) ) {
- throw new TribbleException.InternalCodecException(String.format("the REF allele is incorrect for the record at position %s:%d, fasta says %s vs. VCF says %s", getContig(), getStart(), observedReference.getBaseString(), reportedReference.getBaseString()));
+ public void validateReferenceBases(final Allele expectedReference, final Allele actualReference) {
@lbergelson

lbergelson Dec 6, 2016

Contributor

@ronlevine I wouldn't change the reported/observed terminology in the whole file. Just in that one function I was talking about.

@ronlevine

ronlevine Dec 6, 2016

Contributor

OK. I thought using that terminology throughout the file seemed clearer.

- throw new TribbleException.InternalCodecException(String.format("the Allele Count (AC) tag doesn't have the correct number of values for the record at position %s:%d, %d vs. %d", getContig(), getStart(), reportedACs.size(), observedACs.size()));
- for (int i = 0; i < observedACs.size(); i++) {
+ final Object ac = getAttribute(VCFConstants.ALLELE_COUNT_KEY);
+ if ( ac instanceof List ) {
@lbergelson

lbergelson Dec 6, 2016

Contributor

Can we use that getter here too instead of special casing lists + not list?

@ronlevine

ronlevine Dec 6, 2016

Contributor

Yes.

Contributor

lbergelson commented Dec 6, 2016

@ronlevine If you want to get this merged in today, you could move the new AF validation to a second pull request and just do the initial changes. I think in any case now it should switch to using getAttributeAsList where appropriate to simplify the code.

Contributor

ronlevine commented Dec 6, 2016

@lbergelson Sounds good. I'll do my best to take care of it tonight.

Contributor

ronlevine commented Dec 7, 2016

@lbergelson Please take a look. I removed the validation of AF values, used the attribute getters and reverted the original code to observed/reported terminology.

Coverage Status

Coverage increased (+0.4%) to 70.386% when pulling 0c0cdea on rhl_validate_ac_af_no_gt_757 into 6e4e875 on master.

Coverage Status

Coverage increased (+0.4%) to 70.386% when pulling 3d23728 on rhl_validate_ac_af_no_gt_757 into 6e4e875 on master.

+ final List<Object> actualValues = getAttributeAsList(attributeKey);
+ if (!actualValues.isEmpty()) {
+ // always have at least one actual value
+ final int expectedValuesSize = expectedSize > 0 ? expectedSize : expectedSize + 1;
@lbergelson

lbergelson Dec 7, 2016

Contributor

@ronlevine I'm not sure I understand this. Why do you need to add 1 to it?

@ronlevine

ronlevine Dec 7, 2016 edited

Contributor

If there are no alt alleles (expectedSize == 0) and the annotation exists, there will be 1 value. For example, for 0/0, AC=0. The relevant code is line 1273.

@lbergelson

lbergelson Dec 7, 2016

Contributor

Could you change this to expectedSize > 0 ? expectedSize : 1? I think that's clearer.

Since this behavior is pretty specific to these attributes I might change the naming again, i.e. expectedSize -> numAlternateAlleles.

Sorry for the back and forth on this.

@ronlevine

ronlevine Dec 7, 2016

Contributor

Done.
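
The rule settled in this thread can be sketched as follows: an attribute that is present must carry one value per alternate allele, or exactly one value when there are no alternate alleles (for example AC=0 at a 0/0 site). `ValueCountSketch` and its method names are hypothetical, and `IllegalStateException` stands in for the `TribbleException.InternalCodecException` used in the diff so that the sketch has no htsjdk dependency.

```java
import java.util.Arrays;
import java.util.List;

final class ValueCountSketch {
    // One value per alternate allele, but never fewer than one.
    static int expectedValueCount(final int numAlternateAlleles) {
        return numAlternateAlleles > 0 ? numAlternateAlleles : 1;
    }

    // An empty list is treated as "attribute absent" and is not validated.
    static void validateValueCount(final String key,
                                   final List<?> actualValues,
                                   final int numAlternateAlleles) {
        if (actualValues.isEmpty()) {
            return;
        }
        final int expected = expectedValueCount(numAlternateAlleles);
        if (actualValues.size() != expected) {
            // Stand-in for TribbleException.InternalCodecException.
            throw new IllegalStateException(String.format(
                    "the %s tag has %d values but %d were expected",
                    key, actualValues.size(), expected));
        }
    }

    public static void main(final String[] args) {
        validateValueCount("AC", Arrays.asList(0), 0);        // passes: 0/0 site
        validateValueCount("AF", Arrays.asList(0.1, 0.2), 2); // passes
        try {
            validateValueCount("AC", Arrays.asList(1, 2, 3), 2);
        } catch (final IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Writing the fallback as a literal 1 rather than `expectedSize + 1` makes the special case visible at a glance, which is the readability point raised in the review.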

Contributor

lbergelson commented Dec 7, 2016

@ronlevine Looks good to me. Going to merge when tests finish.

Contributor

ronlevine commented Dec 7, 2016

Rebasing.

@ronlevine ronlevine Validate VariantContext AC and AF without genotypes
7e91886

Coverage Status

Coverage increased (+0.006%) to 70.386% when pulling 7e91886 on rhl_validate_ac_af_no_gt_757 into 4d0070b on master.

@ronlevine ronlevine merged commit e69aff0 into master Dec 7, 2016

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed
coverage/coveralls Coverage increased (+0.4%) to 70.386%

ronlevine deleted the rhl_validate_ac_af_no_gt_757 branch Dec 7, 2016
