-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking βSign up for GitHubβ, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up the generation of no-member suggestions #10277
Speed up the generation of no-member suggestions #10277
Conversation
Previously, edit distances were calculated for strings that had length differences greater than the maximum suggestion threshold.
Codecov ReportAll modified and coverable lines are covered by tests β
Additional details and impacted files@@ Coverage Diff @@
## main #10277 +/- ##
=======================================
Coverage 95.89% 95.89%
=======================================
Files 175 175
Lines 19073 19082 +9
=======================================
+ Hits 18290 18299 +9
Misses 783 783
π New features to boost your workflow:
|
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, a 5+% improvement ! That's nice.
It seems like a well known problems that we should not have to optimize ourselves though.
Would you mind comparing this with using difflib.SequenceMatcher
? I suggested something but it might not directly work. And I don't like adding dependencies but maybe using https://pypi.org/project/edlib/ would be worth it (it seems it's faster than difflib)..
One drawback is that it may be hard to map existing import difflib
difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.75)
-> ['buffer']
difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.80)
-> ['buffer']
difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.80000001)
-> [] Here's a separate case where I would expect a match but don't see one: import difflib
difflib.get_close_matches("x", ["z"], n=3, cutoff=0.00001)
-> [] Python's own attribute suggestion code does not seem to use difflib: Here's the patch I used:diff --git a/pylint/checkers/typecheck.py b/pylint/checkers/typecheck.py
index edef5188b..ab280e38b 100644
--- a/pylint/checkers/typecheck.py
+++ b/pylint/checkers/typecheck.py
@@ -6,9 +6,8 @@
from __future__ import annotations
-import heapq
+import difflib
import itertools
-import operator
import re
import shlex
import sys
@@ -170,7 +169,7 @@ def _string_distance(seq1: str, seq2: str) -> int:
def _similar_names(
owner: SuccessfulInferenceResult,
- attrname: str | None,
+ attrname: str,
distance_threshold: int,
max_choices: int,
) -> list[str]:
@@ -179,31 +178,24 @@ def _similar_names(
The similar names are searched given a distance metric and only
a given number of choices will be returned.
"""
- possible_names: list[tuple[str, int]] = []
- names = _node_names(owner)
+ if not attrname:
+ return []
- for name in names:
- if name == attrname:
- continue
+ names = _node_names(owner)
+ possible_names = [name for name in names if name != attrname]
- distance = _string_distance(attrname or "", name)
- if distance <= distance_threshold:
- possible_names.append((name, distance))
+ ratio = distance_threshold / len(attrname)
+ if distance_threshold == 1 and ratio == 1:
+ ratio = 0.5
- # Now get back the values with a minimum, up to the given
- # limit or choices.
- picked = [
- name
- for (name, _) in heapq.nsmallest(
- max_choices, possible_names, key=operator.itemgetter(1)
- )
- ]
- return sorted(picked)
+ threshold = 1 - ratio - 0.0001
+ choices = difflib.get_close_matches(attrname, possible_names, max_choices, threshold)
+ return sorted(choices)
def _missing_member_hint(
owner: SuccessfulInferenceResult,
- attrname: str | None,
+ attrname: str,
distance_threshold: int,
max_choices: int,
) -> str:
@@ -1209,7 +1201,7 @@ accessed. Python regular expressions are accepted.",
if self.linter.config.missing_member_hint:
hint = _missing_member_hint(
owner,
- node.attrname,
+ node.attrname or "",
self.linter.config.missing_member_hint_distance,
self.linter.config.missing_member_max_choices,
) |
Neat. It seems traceback use levenshtein distance while difflib use an algo that "does not yield minimal edit sequences, but does tend to yield matches that βlook rightβ to people." (can't link the part of the doc specifically; https://docs.python.org/3/library/difflib.html) Which would make https://pypi.org/project/edlib/ a better candidate as it's an optimized levenshtein distance ? Or we can copy paste the traceback code and not deal with a dependency if there isn't a big difference in performance ? |
I re-ran the benchmarks on an updated system (which is a bit slower now): Baseline (main)
Current PR
traceback._levenshtein_distance
edlib
My preference is to move forward with the PR branch for these reasons:
Patches testedtraceback._levenshtein_distance patchdiff --git a/pylint/checkers/typecheck.py b/pylint/checkers/typecheck.py
index edef5188b..dddfa7da0 100644
--- a/pylint/checkers/typecheck.py
+++ b/pylint/checkers/typecheck.py
@@ -12,6 +12,7 @@ import operator
import re
import shlex
import sys
+import traceback
from collections.abc import Callable, Iterable
from functools import cached_property, singledispatch
from re import Pattern
@@ -170,7 +171,7 @@ def _string_distance(seq1: str, seq2: str) -> int:
def _similar_names(
owner: SuccessfulInferenceResult,
- attrname: str | None,
+ attrname: str,
distance_threshold: int,
max_choices: int,
) -> list[str]:
@@ -182,12 +183,15 @@ def _similar_names(
possible_names: list[tuple[str, int]] = []
names = _node_names(owner)
+ # Multiply by two because _levenshtein_distance considers move changes to have a cost of 2
+ max_cost = distance_threshold * 2
+
for name in names:
if name == attrname:
continue
- distance = _string_distance(attrname or "", name)
- if distance <= distance_threshold:
+ distance = traceback._levenshtein_distance(attrname, name, max_cost)
+ if distance <= max_cost:
possible_names.append((name, distance))
# Now get back the values with a minimum, up to the given
@@ -203,7 +207,7 @@ def _similar_names(
def _missing_member_hint(
owner: SuccessfulInferenceResult,
- attrname: str | None,
+ attrname: str,
distance_threshold: int,
max_choices: int,
) -> str:
@@ -1209,7 +1213,7 @@ accessed. Python regular expressions are accepted.",
if self.linter.config.missing_member_hint:
hint = _missing_member_hint(
owner,
- node.attrname,
+ node.attrname or "",
self.linter.config.missing_member_hint_distance,
self.linter.config.missing_member_max_choices,
) edlib patchdiff --git a/pylint/checkers/typecheck.py b/pylint/checkers/typecheck.py
index edef5188b..594d16b56 100644
--- a/pylint/checkers/typecheck.py
+++ b/pylint/checkers/typecheck.py
@@ -20,6 +20,7 @@ from typing import TYPE_CHECKING, Any, Literal, Union
import astroid
import astroid.exceptions
import astroid.helpers
+import edlib
from astroid import arguments, bases, nodes, util
from astroid.nodes import _base_nodes
from astroid.typing import InferenceResult, SuccessfulInferenceResult
@@ -170,7 +171,7 @@ def _string_distance(seq1: str, seq2: str) -> int:
def _similar_names(
owner: SuccessfulInferenceResult,
- attrname: str | None,
+ attrname: str,
distance_threshold: int,
max_choices: int,
) -> list[str]:
@@ -186,7 +187,7 @@ def _similar_names(
if name == attrname:
continue
- distance = _string_distance(attrname or "", name)
+ distance = edlib.align(attrname, name)["editDistance"]
if distance <= distance_threshold:
possible_names.append((name, distance))
@@ -203,7 +204,7 @@ def _similar_names(
def _missing_member_hint(
owner: SuccessfulInferenceResult,
- attrname: str | None,
+ attrname: str,
distance_threshold: int,
max_choices: int,
) -> str:
@@ -1209,7 +1210,7 @@ accessed. Python regular expressions are accepted.",
if self.linter.config.missing_member_hint:
hint = _missing_member_hint(
owner,
- node.attrname,
+ node.attrname or "",
self.linter.config.missing_member_hint_distance,
self.linter.config.missing_member_max_choices,
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the benchmarks ! I agree with your analysis.
This comment has been minimized.
This comment has been minimized.
ab59393
Previously, edit distances were calculated for strings that had length differences greater than the maximum suggestion threshold. Remove handling of missing-member-hint-distance = 0 Co-authored-by: Pierre Sassoulas <pierre.sassoulas@gmail.com> (cherry picked from commit 9a6168d)
π€ According to the primer, this change has no effect on the checked open source code. π€π This comment was generated for commit ab59393 |
Previously, edit distances were calculated for strings that had length differences greater than the maximum suggestion threshold. Remove handling of missing-member-hint-distance = 0 Co-authored-by: Pierre Sassoulas <pierre.sassoulas@gmail.com> (cherry picked from commit 9a6168d) Co-authored-by: correctmost <134317971+correctmost@users.noreply.github.com>
Type of Changes
Description
Previously, edit distances were calculated for strings that had length differences greater than the maximum suggestion threshold.
Related PR:
no-member
/c-extension-no-member
#9962Stats
I profiled yt-dlp (commit bd0a6681) with the following
.pylintrc
file:Before
pylint --recursive=y .
After
pylint --recursive=y .