Performance improvement of the Similarity checker #4565
Conversation
…is a performance bottleneck especially when comparing two big files. Let's try a more efficient one...
…order to avoid modification of the same object when removing successive common lines (in the remove_successive method).
…string for the same function
…lows to define the __add__ dunder method to make operations clearer
…empty or is empty but corresponds to a docstring, then the hash is the classical one. Otherwise the hash is randomized, in order to be sure that two empty lines corresponding to import lines are not considered equal
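To illustrate the idea behind that randomized hash, here is a minimal sketch with hypothetical names, not pylint's actual implementation:

```python
import random


class LineHash:
    """Sketch: meaningful lines get a deterministic hash; lines emptied by
    stripping (e.g. removed imports) get a random one, so two stripped-out
    lines are never considered equal to each other."""

    def __init__(self, text: str, is_docstring: bool = False) -> None:
        if text or is_docstring:
            self._value = hash(text)  # classical, reproducible hash
        else:
            self._value = random.getrandbits(64)  # unique per instance

    def __hash__(self) -> int:
        return self._value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, LineHash) and self._value == other._value
```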
Great work you did here @hippo91 🚀 I'm not sure I fully understand it yet, but those docstrings really helped.
During testing, I have seen a reduction from 6-8min down to 5:30min (multiprocessing with 2 jobs). So it's definitely an improvement.
I left a few comments with suggestions which I think would make some parts even better. Mostly minor things, except for the return type of Similar._find_common: I would suggest using NamedTuples there, as it otherwise becomes quite easy to accidentally mix up the Tuple parameters.
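For context, a minimal sketch of the NamedTuple suggestion (the field names here are illustrative, not the ones ultimately adopted):

```python
from typing import NamedTuple


class CommonChunk(NamedTuple):
    """Illustrative alternative to a bare tuple: each field is self-describing."""

    start_line_1: int  # first matching line in the first file
    start_line_2: int  # first matching line in the second file
    length: int        # number of successive common lines


chunk = CommonChunk(start_line_1=10, start_line_2=42, length=4)
chunk.start_line_2  # no risk of mixing up the two start indices
```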
pylint/checkers/similar.py (Outdated)

```diff
@@ -93,15 +376,32 @@ def run(self):
     def _compute_sims(self):
         """compute similarities in appended files"""
         no_duplicates = defaultdict(list)
```
Can you add a type annotation for no_duplicates?
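For illustration, such an annotation could look like the following; the value type is a placeholder, since it depends on what _compute_sims actually stores per duplication count:

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

# Placeholder alias: (lineset index, start line, end line) of one duplicate.
ChunkLimits = Tuple[int, int, int]

no_duplicates: DefaultDict[int, List[Set[ChunkLimits]]] = defaultdict(list)
```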
@cdce8p I did it but I would like to have your opinion on it.
… in the __init__ method
Left a few more comments. Might be the last real changes needed before we can ship this. Really good work so far @hippo91 🚀
LGTM! Thanks @hippo91 🐬
@Pierre-Sassoulas What do you think about doing the v2.10 release next, so we can ship this one?
Well, 2.9.4 is not out yet. I thought about releasing it when astroid 2.6.3 is out, then pylint 2.10 (with the xdg home changes and similarity checker performance). There are still blocker issues not yet fixed for 2.9.4 and I don't have much time to work on it this weekend.
Steps
doc/whatsnew/<current release.rst>.

Description
This PR is an answer to #4120. The poor performance mentioned in the issue was mainly due to the Similarity checker, and more specifically to the Similar class.
The checker has been deeply reworked and the algorithm has been changed. To compare two files, the first step is to remove from those files the comments, docstrings, imports and function signatures, according to the selected options.
The class LineSet, which already existed, has undergone only minor changes. Its main purpose is to hold the real lines of the corresponding file and the stripped lines of the same file, that is to say all the lines except those that have been removed. Thus the comparison algorithm focuses on the stripped line collections of both files.
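A minimal sketch of that two-view idea (simplified relative to pylint's actual LineSet):

```python
from typing import List, Tuple


class SimpleLineSet:
    """Simplified stand-in for LineSet: holds every real line of a file for
    reporting, plus the stripped lines that survive the removal of comments,
    docstrings, imports and signatures."""

    def __init__(self, name: str, real_lines: List[str],
                 stripped_lines: List[Tuple[int, str]]) -> None:
        self.name = name
        self.real_lines = real_lines          # all lines, used for reporting
        self.stripped_lines = stripped_lines  # (original line number, text)

    def __len__(self) -> int:
        # The comparison cost depends only on the stripped view.
        return len(self.stripped_lines)
```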
For each stripped line, the hash of that line and the N-1 successive stripped lines is computed, where N is the minimum duplicated lines option. Each hash is associated with its starting index in each collection. Thus two collections of hashes are obtained, one for each file; computing such hashes is the purpose of the hash_lineset function. Equal hashes between the two collections mean there are at least N common successive lines in both files.
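A hedged sketch of that hashing step (hash_windows is a hypothetical stand-in, not pylint's hash_lineset): every window of N successive stripped lines is hashed and mapped to its starting indices, so candidate duplications are found by set intersection instead of pairwise line comparison.

```python
from collections import defaultdict
from typing import DefaultDict, List


def hash_windows(stripped_lines: List[str], min_lines: int) -> DefaultDict[int, List[int]]:
    """Map the hash of each chunk of min_lines successive stripped lines
    to the indices at which that chunk starts."""
    index: DefaultDict[int, List[int]] = defaultdict(list)
    for start in range(len(stripped_lines) - min_lines + 1):
        chunk = tuple(stripped_lines[start : start + min_lines])
        index[hash(chunk)].append(start)
    return index


file_1 = ["a = 1", "b = 2", "c = a + b", "print(c)", "d = 0"]
file_2 = ["x = 9", "a = 1", "b = 2", "c = a + b", "print(c)"]
common = set(hash_windows(file_1, 4)) & set(hash_windows(file_2, 4))
# Each shared hash marks at least 4 common successive lines in both files.
```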
If matching hashes are found for two (or more) successive line numbers, it means that there are in fact N+1 (or more) common successive lines. The remove_successives function deals with this situation (a sketch follows below). Other changes are mainly adaptations of the new algorithm to legacy code.
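A hedged sketch of that collapsing step (the real remove_successives works on richer objects): a match starting at (i + 1, j + 1) is redundant because it is already covered by extending the match starting at (i, j).

```python
from typing import List, Tuple


def collapse_successive(couples: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only the first couple of each diagonal run of matches."""
    seen = set(couples)
    return [(i, j) for i, j in sorted(couples) if (i - 1, j - 1) not in seen]


# (5, 12) and (6, 13) merely extend the match starting at (4, 11):
collapse_successive([(4, 11), (5, 12), (6, 13), (20, 3)])
# -> [(4, 11), (20, 3)]
```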
The execution time can be reduced by a factor as high as 60.
Type of Changes
Related Issue
Closes #4120