Discounts reads that are non-filter-passing in DuplicateScoringStrategy #745

Merged 4 commits into master on Jan 10, 2017

Conversation

4 participants
Contributor

yfarjoun commented Nov 19, 2016 (edited)

Description

DuplicateScoringStrategy is a class used by Picard's MarkDuplicates to choose the "best" read or read-pair from a duplicate set. It offers multiple strategies, but all of them ignore the vendor filter flag. This means (as mentioned in broadinstitute/picard#690) that in some cases a duplicate set will end up being represented by a filter-failing read, which most downstream tools will then ignore.

To fix this problem, this PR discounts the score of reads that fail the vendor quality-check flag. The other score components are capped to guarantee that the discount is large enough to dominate.
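The cap-and-discount idea can be sketched as follows. This is a minimal illustration, not the actual Picard implementation: the class and method names here are hypothetical, and the cap (Short.MAX_VALUE / 2) and discount (Short.MIN_VALUE / 2) values are taken from the discussion below.

```java
// Sketch of the cap-and-discount scoring idea; names are illustrative,
// not Picard's actual API.
public class ScoreDiscountSketch {
    // Cap the raw score so that the discount below always dominates it.
    static short capScore(int rawScore) {
        return (short) Math.min(rawScore, Short.MAX_VALUE / 2);
    }

    // Apply a large negative discount to reads failing the vendor (PF) flag.
    static short discountIfFailing(short score, boolean failsVendorCheck) {
        return failsVendorCheck ? (short) (score + Short.MIN_VALUE / 2) : score;
    }

    public static void main(String[] args) {
        short passing = discountIfFailing(capScore(40_000), false);
        short failing = discountIfFailing(capScore(40_000), true);
        // Because the score is capped at Short.MAX_VALUE / 2, a filter-failing
        // read always ends up below any filter-passing read.
        System.out.println(passing > failing); // true
    }
}
```

Because scores are capped at Short.MAX_VALUE / 2 before the Short.MIN_VALUE / 2 discount is applied, even a maximal-quality failing read cannot outscore a passing one.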

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

yfarjoun added some commits Nov 19, 2016

@yfarjoun yfarjoun Discounting the score for records that do not pass filters. This will mean that failing records are very unlikely to be chosen as the representative read in a set.
199c6c6
@yfarjoun yfarjoun - fixup
302796d
@yfarjoun yfarjoun - capping score to guarantee that discount works
badb7af

Coverage Status

Coverage decreased (-0.006%) to 69.833% when pulling badb7af on yf_fix_MD_vis-a-vis_PF-Failing-Reads into 70783fc on master.

Coverage Status

Coverage decreased (-0.0003%) to 69.839% when pulling badb7af on yf_fix_MD_vis-a-vis_PF-Failing-Reads into 70783fc on master.

Contributor

yfarjoun commented Nov 19, 2016

@tfenne This changes the heart of MarkDuplicates, so I'd really appreciate a careful look, if you can.

tfenne was assigned by yfarjoun Dec 5, 2016

Contributor

yfarjoun commented Dec 13, 2016

@eitanbanks could you comment on whether you think this could introduce batch effects? The "phenotype" would be a little more depth than otherwise, possibly concentrated in reads that are harder to read.

Contributor

eitanbanks commented Dec 14, 2016

I just don't see this being enough of a phenomenon to induce batch effects. How often does this change the best pair in a random bam? Seems like it would happen a small handful of times, so it shouldn't be enough to affect calls.

Contributor

yfarjoun commented Dec 14, 2016

@tfenne

@yfarjoun Mostly comments on formatting and clarity. The general thrust seems fine to me though.

- private static short getSumOfBaseQualities(final SAMRecord rec) {
- short score = 0;
+ private static int getSumOfBaseQualities(final SAMRecord rec) {
+
@tfenne

tfenne Dec 23, 2016

Owner

Bonus newline

for (final byte b : rec.getBaseQualities()) {
- if (b >= 15) score += b;
+ if (b >= 15 ) score += b;
@tfenne

tfenne Dec 23, 2016

Owner

Extra space before )

@@ -64,6 +65,8 @@ public static short computeDuplicateScore(final SAMRecord record, final ScoringS
/**
* Returns the duplicate score computed from the given fragment.
+ * value should be capped by Short.MAX_VALUE since the score from two reads will be
@tfenne

tfenne Dec 23, 2016

Owner

Capped to Short.MAX_VALUE / 2? Maybe where they are summed it should just use an int?

@yfarjoun

yfarjoun Dec 24, 2016

Contributor

and then what? the sorting collection holds a short and I don't want to pay the memory cost of changing that to an int...
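The trade-off being discussed above can be sketched as: accumulate the sum of base qualities in an int to avoid overflow mid-sum, then clamp before narrowing to short, so the sorting collection can keep storing shorts. This is a hypothetical illustration of that approach, with the >= 15 quality threshold taken from the diff above; the method name mirrors the diff but the code is not the actual implementation.

```java
public class BaseQualitySum {
    // Sum base qualities >= 15 in an int (no overflow during summation),
    // then clamp to Short.MAX_VALUE / 2 so that even the combined score of
    // the two reads of a pair still fits in a short.
    static short sumOfBaseQualities(byte[] quals) {
        int score = 0;
        for (final byte b : quals) {
            if (b >= 15) score += b;
        }
        return (short) Math.min(score, Short.MAX_VALUE / 2);
    }

    public static void main(String[] args) {
        // A long, high-quality read: 1000 bases at Q40 sums to 40,000,
        // which would overflow a short if not clamped.
        byte[] longHighQualityRead = new byte[1000];
        java.util.Arrays.fill(longHighQualityRead, (byte) 40);
        System.out.println(sumOfBaseQualities(longHighQualityRead)); // 16383
    }
}
```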

@@ -87,9 +91,23 @@ public static short computeDuplicateScore(final SAMRecord record, final ScoringS
}
break;
case RANDOM:
- score += (short) (hasher.hashUnencodedChars(record.getReadName()) >> 16);
+ // start with a random number between Short.MIN_VALUE/4 and Short.MAX_VALUE/4
+ score += (short) (hasher.hashUnencodedChars(record.getReadName()) & 0b11_1111_1111_1111);
@tfenne

tfenne Dec 23, 2016

Owner

This seems unnecessarily complicated. Why not just use java.util.Random::nextInt(Short.MAX_VALUE / 2)?

@yfarjoun

yfarjoun Dec 24, 2016

Contributor

I was simply keeping the logic of using the murmur hash, as it's a better source of randomness than java.util.Random.

@yfarjoun

yfarjoun Dec 24, 2016

Contributor

Oh, I remember why you did this 😄: you wanted the same score for both mates of the same query name, so nextInt will not do. I'll add a comment to this effect.
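The point about mates can be sketched as follows: deriving the pseudo-random score from the read name (rather than drawing from java.util.Random) guarantees that both mates of a pair, which share a query name, get the same value. This sketch uses String.hashCode in place of the murmur hash for self-containment; the 14-bit mask is the one from the diff above.

```java
public class MateConsistentRandomScore {
    // Derive a pseudo-random score from the read name so that both mates
    // (which share a query name) receive the same value. Masking to 14 bits
    // keeps the result in [0, Short.MAX_VALUE / 2], leaving headroom for the
    // filter-failing discount. String.hashCode stands in for the murmur hash
    // used in the real code.
    static short randomScore(String readName) {
        return (short) (readName.hashCode() & 0b11_1111_1111_1111);
    }

    public static void main(String[] args) {
        short mate1 = randomScore("HISEQ:42:ABCDE:1:1101:1000:2000");
        short mate2 = randomScore("HISEQ:42:ABCDE:1:1101:1000:2000");
        System.out.println(mate1 == mate2); // true: mates share a score
    }
}
```

A java.util.Random draw would give each mate an independent value, which is exactly what the hash-based approach avoids.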

}
+ // make sure that filter-failing records are heavily discounted. Dividing by 2 to be far from
@tfenne

tfenne Dec 23, 2016

Owner

This comment really applies to line 109, but the overflow handling and a comment about it sit between this comment and line 109, which is confusing. Can you move this comment down to immediately before line 109?

+
+ // a long read with high quality bases may overflow and
+ score = (short) Math.max(Short.MIN_VALUE, score);
+ score += record.getReadFailsVendorQualityCheckFlag() ? (short) (Short.MIN_VALUE / 2) : 0;
@tfenne

tfenne Dec 23, 2016

Owner

Also, from reading the comment I thought you were just going to do score = score / 2, but you're not. Either the comment needs to explain a little better, or the code needs to just divide the score.

@tfenne tfenne assigned yfarjoun and unassigned tfenne Dec 23, 2016

@yfarjoun yfarjoun - responding to review comments.
74e138f

@yfarjoun yfarjoun assigned tfenne and unassigned yfarjoun Dec 24, 2016

Coverage Status

Coverage increased (+0.7%) to 70.552% when pulling 74e138f on yf_fix_MD_vis-a-vis_PF-Failing-Reads into 70783fc on master.

Contributor

yfarjoun commented Jan 8, 2017

Is this better now, @tfenne?

Owner

tfenne commented Jan 10, 2017

Works for me @yfarjoun. Merge away.

@tfenne tfenne assigned yfarjoun and unassigned tfenne Jan 10, 2017

@yfarjoun yfarjoun merged commit fd71c45 into master Jan 10, 2017

3 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed
coverage/coveralls: Coverage decreased (-0.001%) to 69.832%

yfarjoun deleted the yf_fix_MD_vis-a-vis_PF-Failing-Reads branch Jan 10, 2017
