Adds a function isWithinHammingDistance that checks if two strings ar… #690

Merged
merged 4 commits into from Sep 20, 2016

Conversation

Projects
None yet
7 participants
Contributor

fleharty commented Aug 22, 2016 edited

Description

The motivation for this code is to add functionality to compare two strings and determine
their Hamming distance. Hamming distance is a common metric that is used to compare
strings. The two functions, hammingDistance, and isWithinHammingDistance provide a way
to check the actual Hamming distance, and whether or not two strings are within a given
Hamming distance of each other. isWithinHammingDistance is provided because there are
many cases where it is important to know only if two strings are within a given Hamming distance
and we can terminate early if the two strings are sufficiently dissimilar.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

…e within a given Hamming distance

@fleharty fleharty Adds a function isWithinHammingDistance that checks if two strings ar…
…e within a given Hamming distance
4be2e8e

Coverage Status

Coverage increased (+0.009%) to 68.816% when pulling 4be2e8e on broadinstitute:mf_WithinHammingDistance into b81d036 on samtools:master.

Contributor

magicDGS commented Aug 22, 2016

Why don't extract also the method for computing the distance, @fleharty? It will be useful for other purposes...

@fleharty fleharty Adding hammingDistance function per magicDGS's comment
1991b11
Contributor

fleharty commented Aug 22, 2016

@magicDGS I agree, so I added the hammingDistance function.
I personally need the isWithinHammingDistance to be a little bit faster than calculating the hammingDistance and then checking if the the value is within a particular value, which is why I have that function.

Coverage Status

Coverage increased (+0.003%) to 68.81% when pulling 1991b11 on broadinstitute:mf_WithinHammingDistance into b81d036 on samtools:master.

Contributor

cmnbroad commented Aug 23, 2016

I'm wondering if these belong in htsjdk (I know there are similar things like this in the code, like levenshteinDistance, but...) they're not used by htsjdk anywhere, AFAICT and have no dependence on any htsjdk classes.

Contributor

fleharty commented Aug 23, 2016

@cmnbroad @yfarjoun Picard depends on htsjdk and there is a PR now in Picard that will make use of this function. Since htsjdk already had a stringutils, I was trying to reduce the possibility of code duplication in the future given that things like levenshteinDistance was there, but I see your point that levenshteinDistance isn't ever called within htsjdk.

Contributor

yfarjoun commented Aug 24, 2016

I think it's fine to have general-purpose utility functions like this in
htsjdk. there are many functions in htsjdk that are not called internally
(they should ofcourse be tested...) but they are used from other projects,
e.g. picard and GATK.

On Tue, Aug 23, 2016 at 5:40 PM, Mark Fleharty notifications@github.com
wrote:

@cmnbroad https://github.com/cmnbroad @yfarjoun
https://github.com/yfarjoun Picard depends on htsjdk and there is a PR
now in Picard that will make use of this function. Since htsjdk already had
a stringutils, I was trying to reduce the possibility of code duplication
in the future given that things like levenshteinDistance was there, but I
see your point that levenshteinDistance isn't ever called within htsjdk.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#690 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACnk0oRFVQDvdTao1HDvBg6xMR6g6P4hks5qi2i3gaJpZM4Jp457
.

@lbergelson lbergelson and 1 other commented on an outdated diff Aug 25, 2016

src/main/java/htsjdk/samtools/util/StringUtil.java
@@ -545,4 +545,56 @@ public static int levenshteinDistance(final String string1, final String string2
return i;
}
+
+ /**
+ * Calculates the hamming distance between two strings s1 and s2.
@lbergelson

lbergelson Aug 25, 2016

Contributor

should hamming always be capitalized in the comments? It's inconsistent.

@lbergelson

lbergelson Aug 25, 2016

Contributor

Could you add a brief explanation of what the hamming distance is to the javadoc. I.e. "Calculates the Hamming distance (number of character mismatches) between two strings"

@fleharty

fleharty Aug 25, 2016

Contributor

fixed

@lbergelson lbergelson and 1 other commented on an outdated diff Aug 25, 2016

src/main/java/htsjdk/samtools/util/StringUtil.java
+ */
+ public static int hammingDistance(final String s1, final String s2) {
+ if (s1.length() != s2.length()) {
+ throw new IllegalArgumentException("Attempted to determine Hamming distance of strings with differing lengths.");
+ }
+ int measuredDistance = 0;
+ for (int i = 0;i < s1.length();i++) {
+ if (s1.charAt(i) != s2.charAt(i)) {
+ measuredDistance++;
+ }
+ }
+ return measuredDistance;
+ }
+
+ /**
+ * Determines if two strings s1 and s2 are within maxHammingDistance of ecah other using the Hamming Distance metric.
@lbergelson

lbergelson Aug 25, 2016

Contributor

typo ecah

@fleharty

fleharty Aug 25, 2016

Contributor

fixed

@lbergelson lbergelson and 1 other commented on an outdated diff Aug 25, 2016

src/main/java/htsjdk/samtools/util/StringUtil.java
@@ -545,4 +545,56 @@ public static int levenshteinDistance(final String string1, final String string2
return i;
}
+
+ /**
+ * Calculates the hamming distance between two strings s1 and s2.
+ * Since Hamming distance is not defined for strings of differing lengths, we throw an exception if
+ * the two strings are of different lengths.
+ *
+ * @param s1 The first string to compare
+ * @param s2 The second string to compare, note that if s1 and s2 are swapped the value returned will be identical.
+ * @return Hamming distance between s1 and s2.
+ * @throws IllegalArgumentException If the two strings have differing lengths.
+ */
+ public static int hammingDistance(final String s1, final String s2) {
+ if (s1.length() != s2.length()) {
+ throw new IllegalArgumentException("Attempted to determine Hamming distance of strings with differing lengths.");
@lbergelson

lbergelson Aug 25, 2016

Contributor

Could you include the actual length of the strings in the error message so it's easier to debug what happened?

@fleharty

fleharty Aug 25, 2016

Contributor

fixed

@lbergelson lbergelson and 1 other commented on an outdated diff Aug 25, 2016

src/test/java/htsjdk/samtools/util/StringUtilTest.java
+
+ @DataProvider(name="withinHammingDistanceExceptionProvider")
+ public Object[][] isWithinHammingDistanceException() {
+ return new Object[][] {
+ {"ATAC", "GCT", 3, true},
+ {"ATAC", "AT", 2, false},
+ {"ATAC", "T", 1, false},
+ {"", "GCAT", 0, false}
+ };
+ }
+
+ @Test(dataProvider = "hammingDistanceExceptionProvider", expectedExceptions = IllegalArgumentException.class)
+ public void testIsWithinHammingDistanceExceptions(final String s1, final String s2, final int maxHammingDistance) {
+ // We assert hammingDistance = 0, and isWithinHammingDistance = true because the values don't matter
+ // and we are checking to ensure that the IllegalArgumentException is thrown
+ Assert.assertEquals(StringUtil.hammingDistance(s1, s2), 0);
@lbergelson

lbergelson Aug 25, 2016

Contributor

This won't check the second case if the first one throws an exception. You should either have two different tests one for each case, (and don't use the asserts, just call the method directly), or you should have calls like this one after another.

 Assert.assertThrows(IllegalArgumentException.class,  () -> {StringUtil.hammingDistance(s1, s2);});
@fleharty

fleharty Aug 25, 2016

Contributor

Fixed.

Contributor

lbergelson commented Aug 25, 2016

@yfarjoun @cmnbroad What's the resolution here? Sounds like an ok for this to go forward?
@fleharty I made a few minor comments for you to address.

Contributor

cmnbroad commented Aug 25, 2016

It sounds like I'm the only one who thinks this is out of place here, so I defer to the community sentiment.

@nh13 nh13 commented on the diff Aug 25, 2016

src/main/java/htsjdk/samtools/util/StringUtil.java
@@ -545,4 +545,56 @@ public static int levenshteinDistance(final String string1, final String string2
return i;
}
+
+ /**
+ * Calculates the hamming distance between two strings s1 and s2.
+ * Since Hamming distance is not defined for strings of differing lengths, we throw an exception if
+ * the two strings are of different lengths.
+ *
+ * @param s1 The first string to compare
+ * @param s2 The second string to compare, note that if s1 and s2 are swapped the value returned will be identical.
+ * @return Hamming distance between s1 and s2.
+ * @throws IllegalArgumentException If the two strings have differing lengths.
+ */
+ public static int hammingDistance(final String s1, final String s2) {
@nh13

nh13 Aug 25, 2016

Contributor

Since likely this is going to be used to compare DNA sequences, you probably want in the doc stating that the comparison is case sensitive. Also, do you have a need to count or not count Ns as mismatches?

@fleharty

fleharty Aug 25, 2016

Contributor

@nh13 Added comment about case sensitive.
I'm also adding a note that matching Ns will not be counted as mismatches, since that isn't desired here, but I can see that someone else might want that.

@lbergelson

lbergelson Sep 15, 2016

Contributor

@fleharty Sorry for the very slow response. I think this note about N's is more confusing that it is clarifying. I would say something more like "This implementation is case sensitive and does not have any special treatment for DNA."

@nh13 nh13 commented on the diff Aug 25, 2016

src/main/java/htsjdk/samtools/util/StringUtil.java
+ }
+ }
+ return measuredDistance;
+ }
+
+ /**
+ * Determines if two strings s1 and s2 are within maxHammingDistance of ecah other using the Hamming Distance metric.
+ * Since Hamming distance is not defined for strings of differing lengths, we throw an exception if
+ * the two strings are of different lengths.
+ *
+ * @param s1 The first string to compare
+ * @param s2 The second string to compare, note that if s1 and s2 are swapped the value returned will be identical.
+ * @param maxHammingDistance The largest Hamming distance the strings can have for this function to return true.
+ * @return true if the two strings are within maxHammingDistance of each other, false otherwise.
+ * @throws IllegalArgumentException If the two strings have differing lengths.
+ */
@nh13

nh13 Aug 25, 2016

Contributor

Ditto about the doc usage for DNA bases.

@fleharty

fleharty Aug 25, 2016

Contributor

fixed

@nh13 nh13 and 1 other commented on an outdated diff Aug 25, 2016

src/test/java/htsjdk/samtools/util/StringUtilTest.java
+ }
+
+ @Test(dataProvider = "hammingDistanceExceptionProvider", expectedExceptions = IllegalArgumentException.class)
+ public void testIsWithinHammingDistanceExceptions(final String s1, final String s2, final int maxHammingDistance) {
+ // We assert hammingDistance = 0, and isWithinHammingDistance = true because the values don't matter
+ // and we are checking to ensure that the IllegalArgumentException is thrown
+ Assert.assertEquals(StringUtil.hammingDistance(s1, s2), 0);
+ Assert.assertEquals(StringUtil.isWithinHammingDistance(s1, s2, maxHammingDistance), true);
+ }
+
+ @DataProvider(name="hammingDistanceProvider")
+ public Object[][] hammingDistance() {
+ return new Object[][] {
+ {"ATAC", "GCAT", 3},
+ {"ATAGC", "ATAGC", 0},
+ {"ATAC", "atac", 4}, // Hamming distance is case sensitive
@nh13

nh13 Aug 25, 2016

Contributor

One test for non-DNA bases?

@fleharty

fleharty Aug 25, 2016

Contributor

fixed

@fleharty

fleharty Aug 29, 2016

Contributor

@nh13 I changed this to test for Hamming distance alone and not account for the fact that most the time people are interested in comparing DNA, and we might want to count matching Ns as mismatches. I'd just like to get your opinion on what you think should be done here.

For the UmiAwareMarkDupliates, I don't ever expect to encounter an N, so I think checking only Hamming distance is okay, but I can see that others might not be okay with this. If I set different behavior for matching Ns, I should probably also check to see if it is A, T, C, or G as well and throw an exception if an unexpected character is encountered.

Thoughts?

@nh13

nh13 Aug 29, 2016

Contributor

Lets defer adding support for matching valid bases for now, but just know that it may become useful in the future.

@nh13 nh13 and 1 other commented on an outdated diff Aug 25, 2016

src/test/java/htsjdk/samtools/util/StringUtilTest.java
+ {"ATAC", "GCAT", 0, false}
+ };
+ }
+
+ @Test(dataProvider = "withinHammingDistanceProvider")
+ public void testIsWithinHammingDistance(final String s1, final String s2, final int maxHammingDistance, final boolean expectedResult) {
+ Assert.assertEquals(StringUtil.isWithinHammingDistance(s1, s2, maxHammingDistance), expectedResult);
+ }
+
+ @DataProvider(name="withinHammingDistanceExceptionProvider")
+ public Object[][] isWithinHammingDistanceException() {
+ return new Object[][] {
+ {"ATAC", "GCT", 3, true},
+ {"ATAC", "AT", 2, false},
+ {"ATAC", "T", 1, false},
+ {"", "GCAT", 0, false}
@nh13

nh13 Aug 25, 2016

Contributor

The params are so nicely aligned above, how about here and below ;) ?

@fleharty

fleharty Aug 25, 2016

Contributor

fixed

@fleharty fleharty Addressing comments from Louis and Nils
7713404
Contributor

fleharty commented Aug 25, 2016

@lbergelson @nh13 Thanks for the comments, back to you.

Coverage Status

Coverage increased (+0.1%) to 68.944% when pulling 7713404 on broadinstitute:mf_WithinHammingDistance into b81d036 on samtools:master.

lbergelson was assigned by droazen Sep 13, 2016

@fleharty fleharty Responding to Louis' comment, DNA not treated special
bf1adc5

Coverage Status

Coverage increased (+0.2%) to 69.04% when pulling bf1adc5 on broadinstitute:mf_WithinHammingDistance into b81d036 on samtools:master.

@lbergelson lbergelson merged commit 224cfc1 into samtools:master Sep 20, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.2%) to 69.04%
Details
Contributor

lbergelson commented Sep 20, 2016

👍

fleharty deleted the broadinstitute:mf_WithinHammingDistance branch Sep 20, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment