Discounts reads that are non-filter-passing in DuplicateScoringStrategy #745
Conversation
coveralls
commented
Nov 19, 2016
@tfenne This changes the heart of MarkDuplicates, so I'd really appreciate a careful look, if you can.
tfenne
was assigned
by yfarjoun
Dec 5, 2016
@eitanbanks could you comment on whether you think this could introduce batch effects? The "phenotype" will be a little more depth than otherwise, possibly concentrated in reads that are harder to read.
I just don't see this being enough of a phenomenon to induce batch effects. How often does this change the best pair in a random BAM? It seems like it would happen only a small handful of times, so it shouldn't be enough to affect calls.
I tend to agree. Just making sure that there isn't anything I haven't thought about.
tfenne
requested changes
Dec 23, 2016
@yfarjoun Mostly comments on formatting and clarity. The general thrust seems fine to me though.
- private static short getSumOfBaseQualities(final SAMRecord rec) {
-     short score = 0;
+ private static int getSumOfBaseQualities(final SAMRecord rec) {
+     int score = 0;
      for (final byte b : rec.getBaseQualities()) {
-         if (b >= 15) score += b;
+         if (b >= 15 ) score += b;
@@ -64,6 +65,8 @@ public static short computeDuplicateScore(final SAMRecord record, final ScoringS
  /**
   * Returns the duplicate score computed from the given fragment.
+  * value should be capped by Short.MAX_VALUE since the score from two reads will be
tfenne
Dec 23, 2016
Owner
Capped to Short.MAX_VALUE / 2? Maybe where they are summed it should just use an int?
yfarjoun
Dec 24, 2016
Contributor
and then what? the sorting collection holds a short and I don't want to pay the memory cost of changing that to an int...
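The compromise the thread settles on can be sketched as follows: accumulate the quality sum in an `int`, then cap the result at `Short.MAX_VALUE / 2` so that summing the scores of two reads in a pair still fits in the `short` that the sorting collection stores. This is an illustrative sketch, not Picard's actual code; the method name and threshold mirror the diff above.

```java
// Sketch of capping a base-quality sum so two reads' scores fit in a short.
// Hypothetical stand-in for Picard's getSumOfBaseQualities; assumes the >= 15
// quality threshold shown in the diff above.
public class ScoreCapSketch {
    static short cappedQualityScore(final byte[] baseQualities) {
        int score = 0;                        // accumulate in an int to avoid overflow
        for (final byte b : baseQualities) {
            if (b >= 15) score += b;          // ignore low-quality bases
        }
        // Cap at Short.MAX_VALUE / 2 so the sum of two reads' scores fits in a short.
        return (short) Math.min(score, Short.MAX_VALUE / 2);
    }

    public static void main(String[] args) {
        byte[] longHighQual = new byte[1000];
        java.util.Arrays.fill(longHighQual, (byte) 40); // 1000 * 40 = 40000 > Short.MAX_VALUE
        short s = cappedQualityScore(longHighQual);
        System.out.println(s);
        assert s == Short.MAX_VALUE / 2;      // capped
        assert (short) (s + s) > 0;           // two capped scores sum without wrapping
    }
}
```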
@@ -87,9 +91,23 @@ public static short computeDuplicateScore(final SAMRecord record, final ScoringS
      }
      break;
  case RANDOM:
-     score += (short) (hasher.hashUnencodedChars(record.getReadName()) >> 16);
+     // start with a random number between Short.MIN_VALUE/4 and Short.MAX_VALUE/4
+     score += (short) (hasher.hashUnencodedChars(record.getReadName()) & 0b11_1111_1111_1111);
tfenne
Dec 23, 2016
Owner
This seems unnecessarily complicated. Why not just use java.util.Random::nextInt(Short.MAX_VALUE / 2)?
yfarjoun
Dec 24, 2016
Contributor
I was simply keeping the logic of using the murmur hash, as it's a better source of randomness than java.util.Random.
yfarjoun
Dec 24, 2016
Contributor
Oh, I remember why I did this
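The masking trick in the diff above can be illustrated in isolation: keeping only the low 14 bits of a hash yields a value in [0, Short.MAX_VALUE / 2], leaving headroom in the `short` for the later discount. This sketch uses `String.hashCode` as a stand-in for the murmur hash the real code uses, purely for illustration.

```java
// Sketch: masking a hash to 14 bits bounds the RANDOM score component.
// String.hashCode stands in for the murmur hash used in the actual code.
public class RandomScoreSketch {
    static short randomComponent(final String readName) {
        // 0b11_1111_1111_1111 == 16383 == Short.MAX_VALUE / 2, so the
        // result is always in [0, Short.MAX_VALUE / 2].
        return (short) (readName.hashCode() & 0b11_1111_1111_1111);
    }

    public static void main(String[] args) {
        short r = randomComponent("HWI-ST1113:example:read:1234");
        assert r >= 0 && r <= Short.MAX_VALUE / 2;
        System.out.println(r);
    }
}
```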
      }
+     // make sure that filter-failing records are heavily discounted. Dividing by 2 to be far from
tfenne
Dec 23, 2016
Owner
This comment really applies to line 109, but the handling of overflow and a comment about it sit between this comment and line 109, which is confusing. Can you move this comment down to immediately before line 109?
+
+     // a long read with high quality bases may overflow and
+     score = (short) Math.max(Short.MIN_VALUE, score);
+     score += record.getReadFailsVendorQualityCheckFlag() ? (short) (Short.MIN_VALUE / 2) : 0;
tfenne
Dec 23, 2016
Owner
Also, from reading the comment I thought you were just going to do score = score / 2, but you're not. Either the comment needs to explain a little better, or the code needs to just divide the score.
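The interaction of the cap and the penalty being debated here can be shown with a small sketch. Assuming (as the diff above establishes) that positive scores are capped at `Short.MAX_VALUE / 2`, adding `Short.MIN_VALUE / 2` to filter-failing reads pushes even the best failing read below the worst passing read, with no `short` overflow. The method name here is hypothetical.

```java
// Sketch of the vendor-fail discount: a capped score plus Short.MIN_VALUE / 2
// is always negative, so every failing read ranks below every passing read.
public class VendorFailDiscountSketch {
    static short discounted(short cappedScore, boolean failsVendorCheck) {
        // cappedScore is assumed to lie in [0, Short.MAX_VALUE / 2]
        if (failsVendorCheck) cappedScore += Short.MIN_VALUE / 2; // -16384 penalty
        return cappedScore;
    }

    public static void main(String[] args) {
        short bestFailing  = discounted((short) (Short.MAX_VALUE / 2), true);  // 16383 - 16384 = -1
        short worstPassing = discounted((short) 0, false);                     // 0
        assert bestFailing < worstPassing; // failing reads always lose
        System.out.println(bestFailing + " < " + worstPassing);
    }
}
```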
tfenne
assigned yfarjoun and unassigned tfenne
Dec 23, 2016
yfarjoun
assigned tfenne and unassigned yfarjoun
Dec 24, 2016
coveralls
commented
Dec 24, 2016
Is this better now, @tfenne?
Works for me @yfarjoun. Merge away.
yfarjoun commented Nov 19, 2016 (edited)
Description
DuplicateScoringStrategy is a class used by Picard's MarkDuplicates to choose the "best" read or read pair from a duplicate set. It has multiple strategies, but they all ignore the filter flag. This means (as mentioned in broadinstitute/picard#690) that in some cases a duplicate set will end up being represented by a filter-failing read, which will then be ignored by most downstream tools.
To fix this, this PR discounts the score of reads that fail the vendor filter flag. The other scores are capped to guarantee that the discount is large enough.
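The end-to-end effect described above can be sketched as follows: cap the quality score, discount filter-failing reads, and the passing read always wins the duplicate set, even against a failing read with much higher base qualities. Class, record, and method names here are hypothetical illustrations, not Picard's API.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// End-to-end sketch: choosing a duplicate set's representative with the
// cap-then-discount scoring described in this PR. All names are illustrative.
public class BestReadSketch {
    record Read(String name, int qualitySum, boolean failsFilter) {}

    static short score(Read r) {
        short s = (short) Math.min(r.qualitySum(), Short.MAX_VALUE / 2); // cap
        if (r.failsFilter()) s += Short.MIN_VALUE / 2;                   // discount
        return s;
    }

    public static void main(String[] args) {
        List<Read> dupSet = Arrays.asList(
                new Read("failing-high-qual", 40000, true),  // capped to 16383, then -16384 -> -1
                new Read("passing-low-qual", 120, false));   // stays at 120
        Read best = dupSet.stream()
                .max(Comparator.comparingInt(BestReadSketch::score))
                .get();
        System.out.println(best.name()); // the passing read wins despite lower quality
        assert best.name().equals("passing-low-qual");
    }
}
```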
Checklist