ENH: hamming distance for string comparison #12027

fabianegli · 2020-05-05T13:19:52Z

DOC: Added language to indicate the new behaviour with strings.
ENH: Changed behaviour for string input. If two strings are supplied, the hamming distance will be computed for a character-wise comparison.
ENH: Better Error message for empty strings and arrays.
TST: Added tests for the new behaviour

Reference issue

Closes #11991.

What does this implement/fix?

The motivation for this PR is to make the hamming distance accept and calculate the hamming distance between strings based on a character-by-character comparison. This is a desired functionality for tasks where sequences are compared that are normally handled as strings, i.e. in biological sequences of proteins and DNA, but also in text processing applications.

This PR includes changes to the docstring of the hamming function to describe the changed behaviour and adds two examples to the doctest.

This PR contains a test for the new behaviour.

This PR contains code to improve the error handling if empty arrays or strings are supplied to the hamming function plus a test to check the improved error message.

The rationale behind changing the behaviour this way is the easier use of hamming for string comparison and not really useful functionality of the current behaviour. i.e. the current behaviour of hamming is as if the strings were directly compared and the result

>>> a="aa"
>>> b="bb"
>>> ab="ab"
>>> hamming(a, b) == int(a!=b)
True
>>> hamming(a, ab) == int(a!=ab)
True
>>> hamming([a], [b]) == int(a!=b)
True
>>> hamming([a], [ab]) == int(a!=ab)
True

The new behaviour will only change the result for not identical multi-letter strings:

>>> a="aa"
>>> b="bb"
>>> ab="ab"
>>> hamming(a, a) == int(a!=a)
True
>>> hamming(a, ab) == int(a!=ab) # This is the new behaviour hamming(a, ab) => 0.5
False
>>> hamming([a], [b]) == int(a!=b)
True
>>> hamming([a], [ab]) == int(a!=ab)
True

In a similar note to the previous argument, the use of hamming for scalars is equally pointless: a=1;b=2;hamming(a, b) == int(a!=b) for any scalar a and b. I think scalar arguments should always result in a ValueError. (Not implemented)

Additional information

The tests run successfully in a local Ddocker development container.

I was not able to build the docs on a Mac dev environment nor in the docker developer environment with the explanations in https://scipy.github.io/devdocs/dev/contributor/rendering_documentation.html#rendering-documentation

I am not sure if the change for string behaviour should be documented similar to the versionadded notation:

.. versionadded:: X.Y.Z

fabianegli · 2020-05-05T13:35:26Z

The error message of the ValueError for strings of unequal length and empty strings are somewhat misleading because they specifically mention the array input. The following adaptions would remove the now wrong specificity from the message:

    if u.shape != v.shape:
        raise ValueError("Both inputs must have equal lengths.")
    if u.shape == (0,):
        raise ValueError("The input must not be of length zero.")

fabianegli · 2020-05-05T14:50:54Z

Force pushed after rebase. Reason: the previous build failed for Arm64, Bionic, Python3.7.

Edit: this is still the case after the force push and is also the case for the master branch from which this branch was created (and rebased).

tylerjereddy · 2021-01-03T01:47:23Z

scipy/spatial/distance.py


    """
+    if isinstance(u, str) and isinstance(v, str):
+        u = list(u)
+        v = list(v)


I see this PR has been sitting for a while. There was some contention in the original issue proposing this change because NumPy by default treats the strings as scalars.

While I agree that the the default NumPy behavior is not "in the spirit" of what the hamming distance is typically used for (how many elementwise differences are there between two sequences of string or other types and not the differences between the composite scalars), we may need to exercise some caution if we want to have a change in behavior.

@WarrenWeckesser what about adding a new argument? I guess the question here is whether the burden on users is really so high that we can't for example just improve the docstring on this matter (i.e., suggest list conversion for elementwise string comparisons) vs. expanding the code/available options?

I absolutely agree with the caution. A deprecation warning to alert the user to the change would definitely be in order. Something like "The hamming distance will compute the character wise comparison of strings in future versions of scipy. To retain the current behavior of the string comparison the strings need to be array elements. For further information about the change see [link to this PR or some or the docs]."

Just because numpy treats strings as scalars, does not mean that putting the string into an array is the right thing to do, because the string argument violates the idea of the hamming distance computation by not providing an array in the first place. So the question is not how numpy treats strings, but how the hamming distance should be recovering from the contract violation. My argument is simply that the current handling is not intuitive and thus a suboptimal API.

Maybe it would be even better to just throw a ValueError if the argument are strings to avoid making assumptions altogether?

BTW: the jaccard distance already has a FutureWarning in the spirit of this PR:

>>> import scipy >>> scipy.__version__ '1.6.2' >>> from scipy.spatial.distance import jaccard >>> jaccard("aa", "ab") FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison 1.0

In the spirit of a uniform user experience within scipy, there probably should be a consistent handling of the case where strings are supplied to a distance metric. In this case this would mean to add a FutureWarning for a while and then change the behavior as proposed here - maybe in concert with all other distance metrics that are commonly used for string comparison.

h-vetinari · 2021-09-09T20:16:53Z

Just noting (in case the original issue gets overlooked once someone picks this PR) that this looks related - though I haven't looked closely - to the normalization issue mentioned here:

hamming doesn't compute the Hamming distance, but scales it dividing by the vector dimension. They are closely related, but are not quite the same. I would like to have matching and sokalmichener return the old value, but make hamming return the proper unscaled Hamming distance. Can this be happily done, or is it best to first add some form of deprecation warning?

h-vetinari · 2021-12-30T09:54:12Z

Just noting that there was some substantial further discussion on this in #11991 that should be taken into account if/once this PR is picked up again (and I'm aware that it's not pending on the PR author).

fabianegli force-pushed the hamming-by-character-str-comparison branch from c1126db to e306364 Compare May 5, 2020 14:44

miladsade96 added enhancement A new feature or improvement scipy.spatial labels May 5, 2020

rebase master

26abf1b

fabianegli force-pushed the hamming-by-character-str-comparison branch from e306364 to 26abf1b Compare May 17, 2020 19:07

Merge branch 'master' into hamming-by-character-str-comparison

cd50f4e

fabianegli mentioned this pull request May 24, 2020

Hamming distance returns unexpected value for string comparisons #11991

Open

tylerjereddy reviewed Jan 3, 2021

View reviewed changes

This was referenced Sep 9, 2021

ENH: add distance gufuncs to scipy.spatial #3163

Closed

Implement all scipy.spatial.distance metrics in _distance_pybind #14077

Open

lucascolley added the needs-decision Items that need further discussion before they are merged or closed label Jan 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: hamming distance for string comparison #12027

ENH: hamming distance for string comparison #12027

fabianegli commented May 5, 2020

fabianegli commented May 5, 2020

fabianegli commented May 5, 2020 •

edited

Loading

tylerjereddy Jan 3, 2021

fabianegli Sep 10, 2021 •

edited

Loading

fabianegli Sep 20, 2021 •

edited

Loading

h-vetinari commented Sep 9, 2021

h-vetinari commented Dec 30, 2021

ENH: hamming distance for string comparison #12027

Are you sure you want to change the base?

ENH: hamming distance for string comparison #12027

Conversation

fabianegli commented May 5, 2020

Reference issue

What does this implement/fix?

Additional information

fabianegli commented May 5, 2020

fabianegli commented May 5, 2020 • edited Loading

tylerjereddy Jan 3, 2021

Choose a reason for hiding this comment

fabianegli Sep 10, 2021 • edited Loading

Choose a reason for hiding this comment

fabianegli Sep 20, 2021 • edited Loading

Choose a reason for hiding this comment

h-vetinari commented Sep 9, 2021

h-vetinari commented Dec 30, 2021

fabianegli commented May 5, 2020 •

edited

Loading

fabianegli Sep 10, 2021 •

edited

Loading

fabianegli Sep 20, 2021 •

edited

Loading