Performance improvement of the Similarity checker #4565
Conversation
…is a performance bottleneck especially when comparing two big files. Let's try a more efficient one...
…order to avoid modification of the same object when removing successive common lines (in the remove_successive method).
…string for the same function
…lows to define the __add__ dunder method to make operations clearer
…empty or is empty but corresponds to a docstring, then the hash is the classical one. Otherwise the hash is randomized, in order to be sure that two empty lines corresponding to import lines are not considered equal
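To illustrate the idea behind that randomized hash, here is a minimal sketch with hypothetical names, not pylint's actual implementation:

```python
import random


class LineHash:
    """Sketch: meaningful lines get a deterministic hash; lines emptied by
    stripping (e.g. removed imports) get a random one, so two stripped-out
    lines are never considered equal to each other."""

    def __init__(self, text: str, is_docstring: bool = False) -> None:
        if text or is_docstring:
            self._value = hash(text)  # classical, reproducible hash
        else:
            self._value = random.getrandbits(64)  # unique per instance

    def __hash__(self) -> int:
        return self._value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, LineHash) and self._value == other._value
```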
Great work you did here @hippo91 🚀 I'm not sure I fully understand it yet, but those docstrings really helped.
During testing, I have seen a reduction from 6-8min down to 5:30min (multiprocessing with 2 jobs). So it's definitely an improvement.
I left a few comments with suggestions which I think would make some parts even better. Mostly minor things, except for the return type of Similar._find_common: I would suggest using NamedTuples there, as it otherwise becomes quite easy to accidentally mix up the Tuple parameters.
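For context, a minimal sketch of the NamedTuple suggestion (the field names here are illustrative, not the ones ultimately adopted):

```python
from typing import NamedTuple


class CommonChunk(NamedTuple):
    """Illustrative alternative to a bare tuple: each field is self-describing."""

    start_line_1: int  # first matching line in the first file
    start_line_2: int  # first matching line in the second file
    length: int        # number of successive common lines


chunk = CommonChunk(start_line_1=10, start_line_2=42, length=4)
chunk.start_line_2  # no risk of mixing up the two start indices
```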
pylint/checkers/similar.py (Outdated)

```diff
@@ -93,15 +376,32 @@ def run(self):
     def _compute_sims(self):
         """compute similarities in appended files"""
         no_duplicates = defaultdict(list)
```
Can you add a type annotation for no_duplicates?
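For illustration, such an annotation could look like the following; the value type is a placeholder, since it depends on what _compute_sims actually stores per duplication count:

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

# Placeholder alias: (lineset index, start line, end line) of one duplicate.
ChunkLimits = Tuple[int, int, int]

no_duplicates: DefaultDict[int, List[Set[ChunkLimits]]] = defaultdict(list)
```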
@cdce8p I did it but I would like to have your opinion on it.
… in the __init__ method
Left a few more comments. Might be the last real changes needed before we can ship this. Really good work so far @hippo91 🚀
LGTM! Thanks @hippo91 🐬
@Pierre-Sassoulas What do you think about doing the v2.10 release next, so we can ship this one?
Well, 2.9.4 is not out yet. I thought about releasing it when astroid 2.6.3 is out, then pylint 2.10 (with the xdg home changes and similarity checker performance). There are still blocker issues not yet fixed for 2.9.4 and I don't have much time to work on it this weekend.
Steps
doc/whatsnew/<current release.rst>.

Description
This PR is an answer to #4120. The poor performance mentioned in the issue was mainly due to the Similarity checker, and more specifically to the Similar class.
The checker has been deeply reworked and the algorithm has been changed. To compare two files, the first step is to remove from those files the comments, docstrings, imports and function signatures, according to the selected options.
The class LineSet, which already existed, has undergone only minor changes. Its main purpose is to hold the real lines of the corresponding file and the stripped lines of the same file, that is to say all the lines except those that have been removed. Thus the comparison algorithm focuses on the stripped line collections of both files.
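A minimal sketch of that two-view idea (simplified relative to pylint's actual LineSet):

```python
from typing import List, Tuple


class SimpleLineSet:
    """Simplified stand-in for LineSet: holds every real line of a file for
    reporting, plus the stripped lines that survive the removal of comments,
    docstrings, imports and signatures."""

    def __init__(self, name: str, real_lines: List[str],
                 stripped_lines: List[Tuple[int, str]]) -> None:
        self.name = name
        self.real_lines = real_lines          # all lines, used for reporting
        self.stripped_lines = stripped_lines  # (original line number, text)

    def __len__(self) -> int:
        # The comparison cost depends only on the stripped view.
        return len(self.stripped_lines)
```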
For each stripped line, the hash of that line and the N-1 successive stripped lines is computed, where N is the minimum duplicated lines option. Each hash is associated with its starting index in each collection. Thus two collections of hashes are obtained, one for each file; computing such hashes is the purpose of the hash_lineset function. Equal hashes between the two collections mean there are at least N common successive lines in both files.
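A hedged sketch of that hashing step (hash_windows is a hypothetical stand-in, not pylint's hash_lineset): every window of N successive stripped lines is hashed and mapped to its starting indices, so candidate duplications are found by set intersection instead of pairwise line comparison.

```python
from collections import defaultdict
from typing import DefaultDict, List


def hash_windows(stripped_lines: List[str], min_lines: int) -> DefaultDict[int, List[int]]:
    """Map the hash of each chunk of min_lines successive stripped lines
    to the indices at which that chunk starts."""
    index: DefaultDict[int, List[int]] = defaultdict(list)
    for start in range(len(stripped_lines) - min_lines + 1):
        chunk = tuple(stripped_lines[start : start + min_lines])
        index[hash(chunk)].append(start)
    return index


file_1 = ["a = 1", "b = 2", "c = a + b", "print(c)", "d = 0"]
file_2 = ["x = 9", "a = 1", "b = 2", "c = a + b", "print(c)"]
common = set(hash_windows(file_1, 4)) & set(hash_windows(file_2, 4))
# Each shared hash marks at least 4 common successive lines in both files.
```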
If matching hashes are found for two (or more) successive line numbers, it means that there are in fact N+1 (or more) common successive lines. The remove_successives function deals with this situation (a sketch follows below). Other changes are mainly adaptations of the new algorithm to legacy code.
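A hedged sketch of that collapsing step (the real remove_successives works on richer objects): a match starting at (i + 1, j + 1) is redundant because it is already covered by extending the match starting at (i, j).

```python
from typing import List, Tuple


def collapse_successive(couples: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only the first couple of each diagonal run of matches."""
    seen = set(couples)
    return [(i, j) for i, j in sorted(couples) if (i - 1, j - 1) not in seen]


# (5, 12) and (6, 13) merely extend the match starting at (4, 11):
collapse_successive([(4, 11), (5, 12), (6, 13), (20, 3)])
# -> [(4, 11), (20, 3)]
```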
The execution time can be reduced by a factor as high as 60.
Type of Changes
Related Issue
Closes #4120