
[difflib] Improve get_close_matches() to better match when casing of words are different #84072

Closed
brnglr mannequin opened this issue Mar 7, 2020 · 7 comments
Labels
3.10 (only security fixes) · stdlib (Python modules in the Lib dir) · type-feature (A feature request or enhancement)

Comments

@brnglr
Mannequin

brnglr mannequin commented Mar 7, 2020

BPO 39891
Nosy @malemburg, @tim-one, @rhettinger, @remilapeyre, @brnglr
PRs
  • bpo-39891: [difflib] add parameter 'ignorecase' to get_close_matches() #18983

    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-06-24.00:23:18.021>
    created_at = <Date 2020-03-07.23:27:55.811>
    labels = ['type-feature', 'library', '3.10']
    title = '[difflib] Improve get_close_matches() to better match when casing of words are different'
    updated_at = <Date 2020-06-24.00:23:18.020>
    user = 'https://github.com/brnglr'

    bugs.python.org fields:

    activity = <Date 2020-06-24.00:23:18.020>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-06-24.00:23:18.021>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2020-03-07.23:27:55.811>
    creator = 'brian.gallagher'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39891
    keywords = ['patch']
    message_count = 7.0
    messages = ['363618', '363626', '363655', '363659', '366026', '372009', '372221']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'tim.peters', 'rhettinger', 'python-dev', 'remi.lapeyre', 'brian.gallagher']
    pr_nums = ['18983']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue39891'
    versions = ['Python 3.10']

    @brnglr
    Mannequin Author

    brnglr mannequin commented Mar 7, 2020

    Currently, difflib's get_close_matches() does a poor job of matching similar words whose casing differs.

    Example:
    user@host:~$ python3
    Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import difflib
    >>> difflib.get_close_matches("apple", ["APPLE"])
    []
    >>> difflib.get_close_matches("apple", ["APpLe"])
    []
    >>>

    These seem like they should be considered close matches for each other, given that the SequenceMatcher used in difflib.py attempts to produce a "human-friendly diff" of two words in order to yield "intuitive difference reports".

    One solution would be for the user of the function to perform their own transformation of the supplied data, such as converting all strings to lower-case. However, this might come as a surprise to a user of the function who isn't aware of the limitation. In my eyes it would be preferable to provide this functionality by default.
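
    For illustration, that caller-side workaround looks roughly like this (lower-casing every candidate before the call, at the cost of losing the original casing in the returned matches):

    >>> import difflib
    >>> difflib.get_close_matches("apple", [s.lower() for s in ["APPLE", "APE", "BANANA"]])
    ['apple', 'ape']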

    If this is an issue the relevant maintainer(s) consider worth pursuing, I'd love to try my hand at preparing a patch for this.

    @tim-one
    Member

    tim-one commented Mar 8, 2020

    If you pursue this, please introduce a new function for it. You immediately have an idea about how to change the current function precisely because it _doesn't_ try to guess what you really wanted. That lack of magic is valuable - you're not actually confused by what it does, because it doesn't do anything to the strings you give it ;-) By the same token, if you have a crisp idea of how it should treat strings instead, it's straightforward to write a wrapper that does so. The existing function won't fight you by trying to impose its own ideas.

    Guessing what people really wanted tends to become a bottomless pit. For example, do you know all the rules for what "case" even means in a Unicode world? What about diacritical marks? And so on. I don't.

    Not saying it shouldn't be pursued. Am saying it may be hard to reach consensus on what's "really" wanted. By some people, some of the time. Never - alas - by all people all the time.
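
    A wrapper of the kind described here might look roughly like the following minimal sketch; it assumes str.casefold() as the meaning of "ignoring case" and silently collapses candidates that fold to the same string:

    >>> import difflib
    >>> def get_close_matches_casefold(word, possibilities, n=3, cutoff=0.6):
    ...     # Compare casefolded strings, but return the original spellings.
    ...     folded = {p.casefold(): p for p in possibilities}
    ...     matches = difflib.get_close_matches(word.casefold(), folded, n, cutoff)
    ...     return [folded[m] for m in matches]
    ...
    >>> get_close_matches_casefold("apple", ["APpLe", "APE", "Banana"])
    ['APpLe', 'APE']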

    @tim-one tim-one added the 3.9 (only security fixes) and type-feature (A feature request or enhancement) labels Mar 8, 2020
    @malemburg
    Member

    It looks like Brian is expecting some kind of normalization of the strings before they enter the function, e.g. convert to lowercase, remove extra whitespace, convert diacritics to regular letters, combinations of such normalizations, etc.

    Since both "word" and "possibilities" would have to be normalized, I think it's better to let the application deal with this efficiently than to try to come up with a new function or add a normalization keyword parameter.

    @brnglr
    Mannequin Author

    brnglr mannequin commented Mar 8, 2020

    I agree that there is an appeal to leaving any normalization to the application, and that trying to guess what people want is a bottomless pit -- I hadn't even considered what casing would mean in a general sense for Unicode.

    I'm not entirely convinced that this should be pursued either, but I'll refine my proposal, provide a little context on where I thought it could be a problem, and see what you guys think.

    1. Some code is written that assumes get_close_matches() will match on a case-insensitive basis. Only a small bit of testing is done, because the functionality is provided by the standard library rather than by the application code, so we throw a few examples like 'apple' and 'ape' at it and decide it is okay. Later on we discover we have a bug because we actually need to match against 'AppLE' too.

    2. The extension I had in mind was to match on a case-insensitive basis only for alphabetic characters. I don't know much about Unicode, but there are definitely gotchas lurking in my previous statement (titlecase vs. uppercase; see the snippet below), so copying the behaviour of string.upper()/string.lower() would seem reasonable to me. The functionality would only be extended to match the same strings it would match anyway, just ignoring casing, so we wouldn't be eliminating any existing matches. I guess this still has the potential to be a breaking change, since someone might indirectly be depending on the current behaviour.
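
    One concrete example of those gotchas: str.lower() and str.casefold() already disagree for some characters, such as the German sharp s:

    >>> 'Straße'.lower()
    'straße'
    >>> 'Straße'.casefold()
    'strasse'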

    For 1., not testing that your code can handle mixed-case comparisons in the way you're assuming it will is probably your own fault. On the other hand, I think it is reasonable to assume that get_close_matches() will match an uppercase/lowercase counterpart, since the function's intent is to provide intuitive matches that "look right" to a human.

    Maybe this is more of a documentation issue than something that needs to be addressed in the code. If a caveat about the case sensitivity of the function is added to the documentation, then a developer can be aware of the limitation in order to provide any normalization they want in the application code.

    Let me know what you guys think.

    @brnglr
    Mannequin Author

    brnglr mannequin commented Apr 8, 2020

    Just giving this a bump, in case it has been forgotten about.

    I've posted a patch at #18983.

    It adds a new parameter "ignorecase" to get_close_matches() that, if set to True, makes the SequenceMatcher treat characters case-insensitively (as determined by str.lower()).

    The benefit of using this keyword, as opposed to letting the application handle the normalization, is that it saves memory. If the application has to normalize and supply a separate list to get_close_matches(), then it ends up having to maintain a mapping between the original strings and the normalized strings. As an example:

    >>> from difflib import get_close_matches
    >>> word = 'apple'
    >>> possibilities = ['apPLE', 'APPLE', 'APE', 'Banana', 'Fruit', 'PEAR', 'CoCoNuT']
    >>> normalized_possibilities = {p.lower(): p for p in possibilities}
    >>> result = get_close_matches(word, normalized_possibilities.keys())
    >>> result
    ['apple', 'ape']
    >>> normalized_result = [normalized_possibilities[r] for r in result]
    >>> normalized_result
    ['APPLE', 'APE']

    By letting the SequenceMatcher handle the casing on the fly, we could potentially save a large amount of memory when someone provides a huge list to get_close_matches().
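
    With the proposed keyword (it was never merged, so this call is hypothetical), the mapping step above would not be needed and the original-cased matches would presumably come back directly:

    >>> get_close_matches(word, possibilities, ignorecase=True)
    ['apPLE', 'APPLE', 'APE']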

    @remilapeyre
    Mannequin

    remilapeyre mannequin commented Jun 21, 2020

    I feel like it's a bit weird to have a new function just for ignoring case; will a new function be required for every possible normalization, like removing accents? One possible way to make the API handle those use cases would be to have a keyword argument for this:

    >>> difflib.get_close_matches('apple', ['APPLE'], normalization=str.lower)
    ['APPLE']

    Then it could work with other normalizations too, without requiring a new function every time:

    >>> difflib.get_close_matches('Remi', ['Rémi'], normalization=remove_accents)
    ['Rémi']
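
    A remove_accents helper along those lines could be sketched with unicodedata (just one possible definition):

    >>> import unicodedata
    >>> def remove_accents(s):
    ...     # Decompose, then drop the combining marks: 'é' -> 'e' + accent -> 'e'.
    ...     return ''.join(c for c in unicodedata.normalize('NFD', s)
    ...                    if not unicodedata.combining(c))
    ...
    >>> remove_accents('Rémi')
    'Remi'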

    @remilapeyre remilapeyre mannequin added the stdlib (Python modules in the Lib dir) and 3.10 (only security fixes) labels and removed the 3.9 (only security fixes) label Jun 21, 2020
    @rhettinger
    Contributor

    I concur with the other respondents that this is best left to the application code.

    Thank you for the suggestion, but I'll mark this as closed. Don't be deterred from making other suggestions :-)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022