Added a "RANDOM" duplicate scoring strategy. #688

Merged
merged 1 commit into from Aug 19, 2016

Conversation

Projects
None yet
3 participants
Owner

tfenne commented Aug 19, 2016

Description

When evaluating error rates in data with UMIs and contrasting different approaches to duplicate marking, it is useful to have an unbiased read picked as the representative for a duplicate set instead of some heuristic for "best".

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)
Owner

tfenne commented Aug 19, 2016

@yfarjoun This is a trivial little addition. There don't appear to be standalone tests for the DuplicateScoringStrategy, and I'm not really up for writing comprehensive tests to enable my ~3 line addition. Can you take a look and let me know what you think?

Coverage Status

Coverage decreased (-0.008%) to 68.799% when pulling 0c6a043 on tf_random_duplicate_scoring into c8202f2 on master.

@yfarjoun yfarjoun and 1 other commented on an outdated diff Aug 19, 2016

...in/java/htsjdk/samtools/DuplicateScoringStrategy.java
@@ -80,6 +81,8 @@ public static short computeDuplicateScore(final SAMRecord record, final ScoringS
score += SAMUtils.getMateCigar(record).getReferenceLength();
}
break;
+ case RANDOM:
+ score += (short) Math.sqrt(record.getReadName().hashCode());
@yfarjoun

yfarjoun Aug 19, 2016

Contributor

Since we already have the murmur3 hash in htsjdk, and given that we already agreed for other parts of the code that it is better than the default String.hashCode implementation, could you use murmur3 instead?

@tfenne

tfenne Aug 19, 2016

Owner

I can make that change if you'd like. I felt a bit funny about it because the enum would have to have an instance of Murmur3, which felt funny to me. But other than for the seed, Murmur3 is stateless so it would work. Thoughts?

@yfarjoun

yfarjoun Aug 19, 2016

Contributor

Since the seed is final I think you can do it however you like, in the
enum, or in DuplicateScoringStrategy. Not sure why it needs to go in the
enum.

On Fri, Aug 19, 2016 at 2:02 PM, Tim Fennell notifications@github.com
wrote:

In src/main/java/htsjdk/samtools/DuplicateScoringStrategy.java
#688 (comment):

@@ -80,6 +81,8 @@ public static short computeDuplicateScore(final SAMRecord record, final ScoringS
score += SAMUtils.getMateCigar(record).getReferenceLength();
}
break;

  •            case RANDOM:
    
  •                score += (short) Math.sqrt(record.getReadName().hashCode());
    

I can make that change if you'd like. I felt a bit funny about it because
the enum would have to have an instance of Murmur3, which felt funny to
me. But other than for the seed, Murmur3 is stateless so it would work.
Thoughts?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/samtools/htsjdk/pull/688/files/0c6a04385f080299424e60cd54ca4bd8d1c20115#r75525818,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ACnk0rUYgxokAty3327sU7gpeDiSL6Cnks5qhe-6gaJpZM4Joo0U
.

@tfenne

tfenne Aug 19, 2016

Owner

Doh - thanks for pointing that out. For some reason I was thinking DuplicateScoringStrategy was the enum. Will fix.

Owner

tfenne commented Aug 19, 2016

Requested change made @yfarjoun. Also since Murmur3 is more random, I switched to just right shifting 16 bits to get a short out, instead of sqrt'ing.

Coverage Status

Coverage decreased (-0.0002%) to 68.807% when pulling 1a56162 on tf_random_duplicate_scoring into c8202f2 on master.

@yfarjoun yfarjoun commented on the diff Aug 19, 2016

...in/java/htsjdk/samtools/DuplicateScoringStrategy.java
@@ -80,6 +86,8 @@ public static short computeDuplicateScore(final SAMRecord record, final ScoringS
score += SAMUtils.getMateCigar(record).getReferenceLength();
}
break;
+ case RANDOM:
+ score += (short) (hasher.hashUnencodedChars(record.getReadName()) >> 16);
@yfarjoun

yfarjoun Aug 19, 2016

Contributor

I'm curious if there's a reason you didn't opt to take the low order bits using & 0xFFFF ? Speed is not an issue here, I presume, but clarity?

@tfenne

tfenne Aug 19, 2016

Owner

No particular reason other than that >> 16 came to mind for me first.

@yfarjoun

yfarjoun Aug 19, 2016

Contributor

I think that there should be no difference in the randomness, due to how
murmur works (lots of >>>'s, which mixes up low and high order bits), so
I'm happy.

On Fri, Aug 19, 2016 at 2:54 PM, Tim Fennell notifications@github.com
wrote:

In src/main/java/htsjdk/samtools/DuplicateScoringStrategy.java
#688 (comment):

@@ -80,6 +86,8 @@ public static short computeDuplicateScore(final SAMRecord record, final ScoringS
score += SAMUtils.getMateCigar(record).getReferenceLength();
}
break;

  •            case RANDOM:
    
  •                score += (short) (hasher.hashUnencodedChars(record.getReadName()) >> 16);
    

No particular reason other than that >> 16 came to mind for me first.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/samtools/htsjdk/pull/688/files/1a5616268a3d8f0a69a734a7ab26a7ca3ef4f0b2#r75534218,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ACnk0lD5n5D0lZ7vnJ2e8oNbksmwEBuhks5qhfvWgaJpZM4Joo0U
.

@yfarjoun yfarjoun and 1 other commented on an outdated diff Aug 19, 2016

...in/java/htsjdk/samtools/DuplicateScoringStrategy.java
}
+ /** Hash used for the RANDOM scoring strategy. */
+ private static Murmur3 hasher = new Murmur3(1);
@yfarjoun

yfarjoun Aug 19, 2016

Contributor

final?

@tfenne

tfenne Aug 19, 2016

Owner

Done.

On Aug 19, 2016, at 2:53 PM, Yossi Farjoun notifications@github.com wrote:

In src/main/java/htsjdk/samtools/DuplicateScoringStrategy.java #688 (comment):

 }
  • /** Hash used for the RANDOM scoring strategy. */
  • private static Murmur3 hasher = new Murmur3(1);
    final?


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub https://github.com/samtools/htsjdk/pull/688/files/1a5616268a3d8f0a69a734a7ab26a7ca3ef4f0b2#r75534108, or mute the thread https://github.com/notifications/unsubscribe-auth/ABiN-kQGAN0-1Z0BymEovtIM2v79A87Uks5qhfurgaJpZM4Joo0U.

@tfenne tfenne Added a "RANDOM" duplicate scoring strategy.
1b0cdd9
Contributor

yfarjoun commented Aug 19, 2016

looks good. merge when tests pass. 👍

Coverage Status

Coverage decreased (-0.0002%) to 68.807% when pulling 1b0cdd9 on tf_random_duplicate_scoring into c8202f2 on master.

@tfenne tfenne merged commit 0e17e07 into master Aug 19, 2016

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.0002%) to 68.807%
Details

tfenne deleted the tf_random_duplicate_scoring branch Aug 19, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment