
Performance improvement of the Similarity checker #4565

Merged

merged 84 commits on Jul 28, 2021

Conversation

Contributor

@hippo91 hippo91 commented Jun 12, 2021

Steps

  • Add a ChangeLog entry describing what your PR does.
  • If it's a new feature or an important bug fix, add a What's New entry in
    doc/whatsnew/<current release.rst>.
  • Write a good description on what the PR does.

Description

This PR is an answer to #4120. The poor performance mentioned in the issue was due mainly to the Similarity checker and more specifically to the class Similar.

The checker has been deeply reworked and its algorithm changed. To compare two files, the first step is to remove from them the comments, docstrings, imports and function signatures, according to the options selected.
The class LineSet, which already existed, is mostly unchanged (only minor modifications). Its main purpose is to hold both the real lines of the corresponding file and the stripped lines of that file, that is to say all the lines except those that have been removed.
The comparison algorithm thus focuses on the stripped line collections of both files.
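As an illustration of the stripping step, here is a minimal sketch. The function name `strip_lines` and the naive comment splitting are hypothetical; the real checker also strips docstrings, imports and signatures depending on its options, which this toy version does not attempt:

```python
def strip_lines(lines):
    """Return (original_index, text) pairs for lines surviving stripping.

    Illustrative only: drops full-line and trailing comments and blank
    lines, while remembering each kept line's position in the real file.
    """
    stripped = []
    for index, line in enumerate(lines):
        text = line.split("#", 1)[0].strip()  # naive comment removal
        if text:  # keep only non-empty code lines
            stripped.append((index, text))
    return stripped

source = [
    "import os  # the real checker can also skip imports",
    "",
    "x = 1  # trailing comment",
    "# full-line comment",
    "y = x + 1",
]
print(strip_lines(source))  # [(0, 'import os'), (2, 'x = 1'), (4, 'y = x + 1')]
```

Keeping the original index alongside each stripped line is what lets the checker report duplicate blocks at their real positions in the file.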

For each stripped line, the hash of that line and the N-1 following stripped lines is computed, where N is the minimal duplicated lines option. Each hash is associated with the index at which its window starts, so one collection of hashes is obtained for each file. Computing these hashes is the purpose of the hash_lineset function.
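The windowed hashing can be sketched as follows. `hash_windows` is an illustrative stand-in for the hash_lineset function described above, not its actual implementation:

```python
def hash_windows(stripped_lines, min_common_lines):
    """Map the hash of each window of N successive lines to the
    start indices where that window occurs."""
    index = {}
    n = min_common_lines
    for start in range(len(stripped_lines) - n + 1):
        window = tuple(stripped_lines[start:start + n])
        index.setdefault(hash(window), []).append(start)
    return index

file1 = ["a = 1", "b = 2", "c = 3", "d = 4"]
file2 = ["x = 0", "b = 2", "c = 3", "d = 4"]
h1 = hash_windows(file1, 2)
h2 = hash_windows(file2, 2)
# A hash present in both dicts marks at least N common successive lines.
common = set(h1) & set(h2)
```

With N = 2, both files share the windows ("b = 2", "c = 3") and ("c = 3", "d = 4"), so `common` contains two hashes, and looking up a common hash in `h1` and `h2` yields the matching start index in each file.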

A hash present in both collections means there are at least N common successive lines in the two files.
If matching hashes are found at successive line numbers, it means there are in fact N+1 (or more) common successive lines. Dealing with this situation is the purpose of the remove_successives function.
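The idea behind remove_successives can be sketched like this (the name `merge_successive` and the pair-based representation are illustrative, not pylint's actual code): when windows of length N match at successive start indices in both files, they describe one longer common block, so only the earliest start of each run needs to be kept.

```python
def merge_successive(matches):
    """Collapse successive (start_in_file1, start_in_file2) match pairs.

    Returns only the pairs that begin a maximal run of successive
    matches, i.e. one entry per common block instead of one per window.
    """
    kept = []
    previous = None
    for pair in sorted(matches):
        if previous is not None and pair == (previous[0] + 1, previous[1] + 1):
            previous = pair  # extends the current common block; drop it
            continue
        kept.append(pair)  # starts a new common block
        previous = pair
    return kept

print(merge_successive([(1, 1), (2, 2), (5, 9)]))  # [(1, 1), (5, 9)]
```

Here the matches at (1, 1) and (2, 2) are successive in both files, so they collapse into a single block starting at (1, 1), which with N = 2 corresponds to 3 common successive lines.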

Other changes are mainly to adapt the new algorithm to legacy code.

The execution time can be reduced by a factor of up to 60.

Type of Changes

Type
🐛 Bug fix

Related Issue

Closes #4120

…is a performance bottleneck, especially when comparing two big files. Let's try a more efficient one...
…order to avoid modification of the same object when removing successive common lines (in the remove_successive method).
…lows to define __add__ dunder method to make operations clearer
…empty, or is empty but corresponds to a docstring, then the hash is the classical one. Otherwise the hash is randomized in order to be sure that two empty lines corresponding to import lines are not considered equal
Member

@cdce8p cdce8p left a comment

Great work you did here @hippo91 🚀 I'm not sure I fully understand it yet, but those docstrings really helped.
During testing, I have seen a reduction from 6-8min down to 5:30min (multiprocessing with 2 jobs). So it's definitely an improvement.

I left a few comments with suggestions which I think would make some parts even better. Mostly minor things, except for the return type of Similar._find_common: I would suggest to use NamedTuples there, as it otherwise becomes quite difficult not to accidentally mix up the Tuple parameters.

@@ -93,15 +376,32 @@ def run(self):
def _compute_sims(self):
"""compute similarities in appended files"""
no_duplicates = defaultdict(list)
Member

Can you add a type annotation for no_duplicates?

Contributor Author

@cdce8p I did it, but I would like to have your opinion on it.

Member

@cdce8p cdce8p left a comment

Left a few more comments. Might be the last real changes needed before we can ship this. Really good work so far @hippo91 🚀

Member

@cdce8p cdce8p left a comment

LGTM! Thanks @hippo91 🐬

@Pierre-Sassoulas What do you think about doing the v2.10 release next, so we can ship this one?

@Pierre-Sassoulas
Member

Well, 2.9.4 is not out yet. I thought about releasing it when astroid 2.6.3 is out, then pylint 2.10 (with the xdg home changes and the similarity checker performance). There are still blocker issues not yet fixed for 2.9.4 and I don't have much time to work on it this weekend.

Labels
duplicate-code (Related to code duplication checker), Enhancement ✨ (Improvement to a component), performance
Development

Successfully merging this pull request may close these issues.

Pylint 2.7 and 2.8 are very slow
5 participants