Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SequenceMatcher finds suboptimal sequenc #37551

Closed
yeti-dn mannequin opened this issue Nov 29, 2002 · 5 comments
Closed

SequenceMatcher finds suboptimal sequenc #37551

yeti-dn mannequin opened this issue Nov 29, 2002 · 5 comments
Assignees
Labels
stdlib Python modules in the Lib dir

Comments

@yeti-dn
Copy link
Mannequin

yeti-dn mannequin commented Nov 29, 2002

BPO 645629
Nosy @tim-one, @rhettinger

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = 'https://github.com/tim-one'
closed_at = <Date 2003-07-12.01:46:32.000>
created_at = <Date 2002-11-29.10:54:30.000>
labels = ['library']
title = 'SequenceMatcher finds suboptimal sequenc'
updated_at = <Date 2003-07-12.01:46:32.000>
user = 'https://bugs.python.org/yeti-dn'

bugs.python.org fields:

activity = <Date 2003-07-12.01:46:32.000>
actor = 'rhettinger'
assignee = 'tim.peters'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2002-11-29.10:54:30.000>
creator = 'yeti-dn'
dependencies = []
files = []
hgrepos = []
issue_num = 645629
keywords = []
message_count = 5.0
messages = ['13487', '13488', '13489', '13490', '13491']
nosy_count = 3.0
nosy_names = ['tim.peters', 'rhettinger', 'yeti-dn']
pr_nums = []
priority = 'low'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue645629'
versions = ['Python 2.2']

@yeti-dn
Copy link
Mannequin Author

yeti-dn mannequin commented Nov 29, 2002

The algorithm used for approximate string matching
doesn't find the optimal edit sequence (it finds
longest blocks instead).

Example:

>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher()
>>> sm.set_seqs('axfot', 'aoftax')
>>> sm.ratio()
0.36363636363636365
>>> sm.get_matching_blocks()
[(0, 4, 2), (5, 6, 0)]
>>> sm.get_opcodes()
[('insert', 0, 0, 0, 4), ('equal', 0, 2, 4, 6),
('delete', 2, 5, 6, 6)]

What's wrong?

Levenshtein distance with weight 2 for item replacement
is only 5 (the weight 2 corresponds to what ratio() is
supposed to compute, the classic Levenshtein distance
is 4), so one would expect to get similarity (i.e.
ratio()) (11-5)/11 = 6/11 = 0.545454545454..., and not
only 4/11.

And really, the maximal matching blocks are:
[(0, 0, 1), (2, 2, 1), (4, 3, 1)]
and the minimal edit sequence is:
[('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 2),
('equal', 2, 3, 2, 3), ('delete', 3, 4, 3, 3),
('equal', 4, 5, 3, 4), ('insert', 5, 5, 4, 6)]

The impact of this ``feature'' on diff-like
applications may be even positive, beause the edit
sequence then consists of smaller number of operations
on lager chunks. Thus I'm not sure if this is
something which should be fixed. However, it should be
at least noted in the documentation the ratio()
function gives only a lower bound of the string
similarity (so people like me won't be tempted to use
it to check results of their own Levenshtein
distance/string similarity implementation).

@yeti-dn yeti-dn mannequin closed this as completed Nov 29, 2002
@yeti-dn yeti-dn mannequin assigned tim-one Nov 29, 2002
@yeti-dn yeti-dn mannequin added the stdlib Python modules in the Lib dir label Nov 29, 2002
@yeti-dn yeti-dn mannequin closed this as completed Nov 29, 2002
@yeti-dn yeti-dn mannequin assigned tim-one Nov 29, 2002
@yeti-dn yeti-dn mannequin added the stdlib Python modules in the Lib dir label Nov 29, 2002
@yeti-dn
Copy link
Mannequin Author

yeti-dn mannequin commented Nov 29, 2002

Logged In: YES
user_id=658986

Sorry, I've changed my mind. This definitely should be
fixed. In following strings finding `Observation'
effectively inhibits finding the much better match:

sm.set_seqs('Observation: What seems as a small glitch at
the first sight may have large impact',
'What-seems-as-a-small-glitch-at-the-first-sight-may-have-large-impact
(Observation)')

Unfortunately this probably means complete rewrite, I can't
see how the current algorithm could be changed to work in
this case (but I don't understand it 100%, so maybe...).

@tim-one
Copy link
Member

tim-one commented Nov 29, 2002

Logged In: YES
user_id=31435

Please read the docs first. This isn't the Levenshtein
algorithm, and it doesn't care about finding minimal edit
distance.

@yeti-dn
Copy link
Mannequin Author

yeti-dn mannequin commented Nov 30, 2002

Logged In: YES
user_id=658986

OK, I know it's not Levenshtein because I've read the
source, and I agree I should really read the docs first and
the source after, it's actually written in the docs.

However, the docs say

`This does not yield minimal edit sequences, but does tend
to yield matches that ``look right'' to people.'

This is not true -- see my last posting. Or perhaps you
really think matching `Observation' looks better to people
than matching the smaller parts from the rest of the string
(which I strongly doubt). Then please add a note the
matches it finds can be arbitrarily bad, at least (e.g., the
ratio between optimal and difflib's best match can be as big
as (N-1)^2/4N, where N is sequence length (of both sequecnes)).

Thank you.

@rhettinger
Copy link
Contributor

Logged In: YES
user_id=80475

I looked at possible wording changes but prefer the docs as
they stand. Others have found the docs to be perfectly
clear. Any more text about worse case ratios to minimal edit
distances would, IMO, be a distractor and make the docs less
clear.

Marking this one as won't fix and closing.

We do appreciate the report. Since this is an area of interest
for you, consider designing a plug-in replacement using the
Levenshtein algorithm and post it to the Vaults of Parnassus
or in ActiveState's cookbook. That would provide some
benefit to the occassional seeker of a minimal edit sequence.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir
Projects
None yet
Development

No branches or pull requests

2 participants