Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: hamming distance for string comparison #12027

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

fabianegli
Copy link
Contributor

  • DOC: Added language to indicate the new behaviour with strings.
  • ENH: Changed behaviour for string input. If two strings are supplied, the hamming distance will be computed for a character-wise comparison.
  • ENH: Better Error message for empty strings and arrays.
  • TST: Added tests for the new behaviour

Reference issue

Closes #11991.

What does this implement/fix?

The motivation for this PR is to make the hamming distance accept and calculate the hamming distance between strings based on a character-by-character comparison. This is a desired functionality for tasks where sequences are compared that are normally handled as strings, i.e. in biological sequences of proteins and DNA, but also in text processing applications.

This PR includes changes to the docstring of the hamming function to describe the changed behaviour and adds two examples to the doctest.

This PR contains a test for the new behaviour.

This PR contains code to improve the error handling if empty arrays or strings are supplied to the hamming function plus a test to check the improved error message.

The rationale behind changing the behaviour this way is the easier use of hamming for string comparison and not really useful functionality of the current behaviour. i.e. the current behaviour of hamming is as if the strings were directly compared and the result

>>> a="aa"
>>> b="bb"
>>> ab="ab"
>>> hamming(a, b) == int(a!=b)
True
>>> hamming(a, ab) == int(a!=ab)
True
>>> hamming([a], [b]) == int(a!=b)
True
>>> hamming([a], [ab]) == int(a!=ab)
True

The new behaviour will only change the result for not identical multi-letter strings:

>>> a="aa"
>>> b="bb"
>>> ab="ab"
>>> hamming(a, a) == int(a!=a)
True
>>> hamming(a, ab) == int(a!=ab) # This is the new behaviour hamming(a, ab) => 0.5
False
>>> hamming([a], [b]) == int(a!=b)
True
>>> hamming([a], [ab]) == int(a!=ab)
True

In a similar note to the previous argument, the use of hamming for scalars is equally pointless: a=1;b=2;hamming(a, b) == int(a!=b) for any scalar a and b. I think scalar arguments should always result in a ValueError. (Not implemented)

Additional information

The tests run successfully in a local Ddocker development container.

I was not able to build the docs on a Mac dev environment nor in the docker developer environment with the explanations in https://scipy.github.io/devdocs/dev/contributor/rendering_documentation.html#rendering-documentation

I am not sure if the change for string behaviour should be documented similar to the versionadded notation:

.. versionadded:: X.Y.Z

@fabianegli
Copy link
Contributor Author

The error message of the ValueError for strings of unequal length and empty strings are somewhat misleading because they specifically mention the array input. The following adaptions would remove the now wrong specificity from the message:

    if u.shape != v.shape:
        raise ValueError("Both inputs must have equal lengths.")
    if u.shape == (0,):
        raise ValueError("The input must not be of length zero.")

@fabianegli fabianegli force-pushed the hamming-by-character-str-comparison branch from c1126db to e306364 Compare May 5, 2020 14:44
@fabianegli
Copy link
Contributor Author

fabianegli commented May 5, 2020

Force pushed after rebase. Reason: the previous build failed for Arm64, Bionic, Python3.7.

Edit: this is still the case after the force push and is also the case for the master branch from which this branch was created (and rebased).

@miladsade96 miladsade96 added enhancement A new feature or improvement scipy.spatial labels May 5, 2020
@fabianegli fabianegli force-pushed the hamming-by-character-str-comparison branch from e306364 to 26abf1b Compare May 17, 2020 19:07

"""
if isinstance(u, str) and isinstance(v, str):
u = list(u)
v = list(v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this PR has been sitting for a while. There was some contention in the original issue proposing this change because NumPy by default treats the strings as scalars.

While I agree that the the default NumPy behavior is not "in the spirit" of what the hamming distance is typically used for (how many elementwise differences are there between two sequences of string or other types and not the differences between the composite scalars), we may need to exercise some caution if we want to have a change in behavior.

@WarrenWeckesser what about adding a new argument? I guess the question here is whether the burden on users is really so high that we can't for example just improve the docstring on this matter (i.e., suggest list conversion for elementwise string comparisons) vs. expanding the code/available options?

Copy link
Contributor Author

@fabianegli fabianegli Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I absolutely agree with the caution. A deprecation warning to alert the user to the change would definitely be in order. Something like "The hamming distance will compute the character wise comparison of strings in future versions of scipy. To retain the current behavior of the string comparison the strings need to be array elements. For further information about the change see [link to this PR or some or the docs]."

Just because numpy treats strings as scalars, does not mean that putting the string into an array is the right thing to do, because the string argument violates the idea of the hamming distance computation by not providing an array in the first place. So the question is not how numpy treats strings, but how the hamming distance should be recovering from the contract violation. My argument is simply that the current handling is not intuitive and thus a suboptimal API.

Maybe it would be even better to just throw a ValueError if the argument are strings to avoid making assumptions altogether?

Copy link
Contributor Author

@fabianegli fabianegli Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: the jaccard distance already has a FutureWarning in the spirit of this PR:

>>> import scipy
>>> scipy.__version__
'1.6.2'
>>> from scipy.spatial.distance import jaccard
>>> jaccard("aa", "ab")
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
1.0

In the spirit of a uniform user experience within scipy, there probably should be a consistent handling of the case where strings are supplied to a distance metric. In this case this would mean to add a FutureWarning for a while and then change the behavior as proposed here - maybe in concert with all other distance metrics that are commonly used for string comparison.

@h-vetinari
Copy link
Member

Just noting (in case the original issue gets overlooked once someone picks this PR) that this looks related - though I haven't looked closely - to the normalization issue mentioned here:

  1. hamming doesn't compute the Hamming distance, but scales it dividing by the vector dimension. They are closely related, but are not quite the same. I would like to have matching and sokalmichener return the old value, but make hamming return the proper unscaled Hamming distance. Can this be happily done, or is it best to first add some form of deprecation warning?

@h-vetinari
Copy link
Member

Just noting that there was some substantial further discussion on this in #11991 that should be taken into account if/once this PR is picked up again (and I'm aware that it's not pending on the PR author).

@lucascolley lucascolley added the needs-decision Items that need further discussion before they are merged or closed label Jan 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement needs-decision Items that need further discussion before they are merged or closed scipy.spatial
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hamming distance returns unexpected value for string comparisons
5 participants