Optimize getting node tokens #8450

fatkodima · 2020-08-04T16:21:42Z

Ran on 30k files.

Before

(1) 74659  (    6.8%)  RuboCop::Cop::Layout::SpaceInsideReferenceBrackets#on_send
(2) 39346  (    3.6%)  RuboCop::Cop::Layout::SpaceInsideArrayLiteralBrackets#on_array
    38575  (    3.5%)  RuboCop::Cop::Style::RedundantSelf#on_block
    33935  (    3.1%)  RuboCop::Cop::Layout::SpaceBeforeFirstArg#on_send
    26757  (    2.4%)  RuboCop::Cop::Layout::FirstArgumentIndentation#on_send
    26271  (    2.4%)  RuboCop::Cop::StringHelp#on_str
    24590  (    2.2%)  RuboCop::Cop::Layout::DefEndAlignment#on_send
    23826  (    2.2%)  RuboCop::Cop::Layout::IndentationConsistency#on_begin
(3) 17870  (    1.6%)  RuboCop::Cop::Layout::SpaceInsideHashLiteralBraces#on_hash
    16582  (    1.5%)  RuboCop::Cop::MultilineExpressionIndentation#on_send
    15623  (    1.4%)  RuboCop::Cop::Interpolation#on_dstr
......

After

    45373  (    3.7%)  RuboCop::Cop::Style::RedundantSelf#on_block
    40617  (    3.3%)  RuboCop::Cop::Layout::SpaceBeforeFirstArg#on_send
    36688  (    3.0%)  RuboCop::Cop::Layout::FirstArgumentIndentation#on_send
    31966  (    2.6%)  RuboCop::Cop::StringHelp#on_str
    31289  (    2.5%)  RuboCop::Cop::Layout::DefEndAlignment#on_send
    29682  (    2.4%)  RuboCop::Cop::Layout::IndentationConsistency#on_begin
(1) 25490  (    2.1%)  RuboCop::Cop::Layout::SpaceInsideReferenceBrackets#on_send
    23245  (    1.9%)  RuboCop::Cop::MultilineExpressionIndentation#on_send
    18734  (    1.5%)  RuboCop::Cop::Style::AccessModifierDeclarations#on_send
    17297  (    1.4%)  RuboCop::Cop::Layout::ArgumentAlignment#on_send
    16772  (    1.4%)  RuboCop::Cop::Style::FormatStringToken#on_str
    15592  (    1.3%)  RuboCop::Cop::Style::NumericPredicate#on_send
    14655  (    1.2%)  RuboCop::Cop::Layout::LineLength#on_potential_breakable_node
    14547  (    1.2%)  RuboCop::Cop::Layout::DotPosition#on_send
    13943  (    1.1%)  RuboCop::Cop::Layout::EmptyLinesAroundAccessModifier#on_send
    13836  (    1.1%)  RuboCop::Cop::Layout::SpaceAroundOperators#on_send
    13180  (    1.1%)  RuboCop::Cop::Style::TrailingCommaInArguments#on_send
    12772  (    1.0%)  RuboCop::Cop::Style::ConditionalAssignment#on_send
    12746  (    1.0%)  RuboCop::Cop::Metrics::BlockLength#on_block
    11655  (    0.9%)  RuboCop::Cop::Style::ZeroLengthPredicate#on_send
    11447  (    0.9%)  RuboCop::Cop::Style::MethodCallWithoutArgsParentheses#on_send
    11365  (    0.9%)  RuboCop::Cop::CheckAssignment#on_send
    11055  (    0.9%)  RuboCop::Cop::Layout::IndentationWidth#on_send
    10931  (    0.9%)  RuboCop::Cop::MethodComplexity#on_def
    10492  (    0.9%)  RuboCop::Cop::MethodComplexity#on_def
    10311  (    0.8%)  RuboCop::Cop::Style::InverseMethods#on_send
    10222  (    0.8%)  RuboCop::Cop::StringHelp#on_str
    10211  (    0.8%)  RuboCop::Cop::Layout::ClosingParenthesisIndentation#on_send
    10131  (    0.8%)  RuboCop::Cop::Interpolation#on_dstr
    9837  (    0.8%)  RuboCop::Cop::Style::SignalException#on_send
    9732  (    0.8%)  RuboCop::Cop::Layout::BlockAlignment#on_block
    9620  (    0.8%)  RuboCop::Cop::Style::RedundantSort#on_send
(2) 9319  (    0.8%)  RuboCop::Cop::Layout::SpaceInsideArrayLiteralBrackets#on_array
    8877  (    0.7%)  RuboCop::Cop::Style::EmptyLiteral#on_send
    8427  (    0.7%)  RuboCop::Cop::Style::RedundantException#on_send
    8300  (    0.7%)  RuboCop::Cop::Style::ExpandPathArguments#on_send
    8288  (    0.7%)  RuboCop::Cop::Lint::SafeNavigationChain#on_send
    7771  (    0.6%)  RuboCop::Cop::CheckAssignment#on_send
    7580  (    0.6%)  RuboCop::Cop::Style::EvalWithLocation#on_send
    7485  (    0.6%)  RuboCop::Cop::Style::NonNilCheck#on_send

Improvement: 9-10%

marcandre · 2020-08-04T18:29:11Z

Awesome!
It's unfortunate that tokens aren't always sorted, I didn't know this 🙇‍♂️

I'm ok to merge this, but one easy thing we really should be doing is putting the caching / searching as methods of ProcessedSource. This would have two advantages: methods not being accessible to cops (the fact that your mixin's methods are private makes no difference) and the cache doesn't have to be rebuilt for different cops (although I imagine they might not access the same nodes?). WDYT?

Also: did you try sorting the tokens by positions?

marcandre · 2020-08-04T18:30:48Z

lib/rubocop/cop/tokens_util.rb

+
+    # rubocop:disable Metrics/AbcSize
+    def tokens(node)
+      @tokens ||= {}


I realize you just copy-pasted, but {}.compare_by_identity would be nicer to use.
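
For illustration, a minimal sketch of what `compare_by_identity` changes for a node-keyed cache. The `Node` struct is a hypothetical stand-in chosen because structs hash by value; it is not RuboCop's AST node class, and whether real nodes collide this way depends on how they define `#hash` and `#eql?`.

```ruby
# Hypothetical stand-in for an AST node; Struct defines #hash/#eql? by value.
Node = Struct.new(:type, :source)

a = Node.new(:array, '[1, 2]')
b = Node.new(:array, '[1, 2]') # structurally equal, but a different object

by_value = {}
by_value[a] = :tokens_for_a
by_value[b] = :tokens_for_b # replaces the entry stored under `a`, since a.eql?(b)

by_identity = {}.compare_by_identity
by_identity[a] = :tokens_for_a
by_identity[b] = :tokens_for_b # separate entry: keys are matched by object identity

puts by_value.size    # 1
puts by_identity.size # 2
```

An identity-keyed hash also avoids computing a structural hash of the key on every lookup.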

fatkodima · 2020-08-04T18:43:55Z

and the cache doesn't have to be rebuilt for different cops (although I imagine they might not access the same nodes?)

Yes, it looks like the cache is not needed in this case. I have also removed one such cache (https://github.com/rubocop-hq/rubocop/pull/8450/files#diff-938b46da0d108c71380ac9427b83dde0L47-L56) and didn't notice any difference. In my previous PRs, where I tried to reduce memory usage, I saw that this method allocated a fair amount of memory (I can't remember how much), so that is a win here as well.

WDYT?

Yes, will move those methods to ProcessedSource.

Also: did you try sorting the tokens by positions?

This will always work in O(nlogn), while the original works in O(n), and the proposed approach will almost always work in O(logn + #node_tokens). Or am I missing something?
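
A rough sketch of the O(logn + #node_tokens) lookup, assuming the token list is already ordered by start position and tokens do not overlap. `Tok` and `tokens_within` are hypothetical names used only for this example; this is not the code from the PR.

```ruby
# Hypothetical token with absolute source offsets (not RuboCop's Token class).
Tok = Struct.new(:text, :begin_pos, :end_pos)

# Returns the tokens lying inside [begin_pos, end_pos].
# Assumes `sorted_tokens` is ordered by begin_pos.
def tokens_within(sorted_tokens, begin_pos, end_pos)
  # Binary search for the first token starting at or after begin_pos: O(log n).
  first = sorted_tokens.bsearch_index { |t| t.begin_pos >= begin_pos }
  return [] if first.nil?

  # Walk forward only over the node's own tokens: O(#node_tokens).
  result = []
  i = first
  while i < sorted_tokens.size && sorted_tokens[i].end_pos <= end_pos
    result << sorted_tokens[i]
    i += 1
  end
  result
end

# Tokens for the source `x = [1, 2]`
tokens = [
  Tok.new('x', 0, 1), Tok.new('=', 2, 3), Tok.new('[', 4, 5),
  Tok.new('1', 5, 6), Tok.new(',', 6, 7), Tok.new('2', 8, 9),
  Tok.new(']', 9, 10)
]

p tokens_within(tokens, 4, 10).map(&:text) # ["[", "1", ",", "2", "]"]
```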

marcandre · 2020-08-04T19:14:07Z

This will always work in O(nlogn)

Correct

while the original works in O(n)

Yes, each search is O(n), but the number of searches is also roughly linear with the number of tokens, so overall it was O(n^2). With bsearch it's now O(n log(n)) overall, so a sort (assuming it's done once per ProcessedSource and not per cop) may be similar in speed (and simpler), or it could be slower or faster; I don't know.
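
To make the overall cost concrete, here is a small counting sketch. It assumes one lookup per ten tokens, a ratio picked only for illustration, and uses plain integers as stand-ins for sorted token start positions. The method names are hypothetical.

```ruby
# Each method returns how many position comparisons all lookups needed in total.

def linear_comparisons(positions, targets)
  count = 0
  targets.each do |target|
    # Linear scan from the start of the list, as in the original approach.
    positions.find_index do |pos|
      count += 1
      pos >= target
    end
  end
  count
end

def binary_comparisons(positions, targets)
  count = 0
  targets.each do |target|
    # Binary search over the sorted list.
    positions.bsearch_index do |pos|
      count += 1
      pos >= target
    end
  end
  count
end

n = 10_000
positions = (0...n).to_a                         # sorted stand-in positions
targets = positions.each_slice(10).map(&:first)  # number of lookups grows with n

puts "linear scan:   #{linear_comparisons(positions, targets)} comparisons"
puts "binary search: #{binary_comparisons(positions, targets)} comparisons"
```

Doubling `n` roughly quadruples the first count, while the second grows only slightly faster than linearly.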

A better option would probably be to index the tokens by start and end position and do straight lookups, since we should always find tokens that start and end exactly where we are looking, right? Building the indices would be O(n) and each lookup would be O(1).
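
A minimal sketch of that indexing idea. `Tok` and `TokenIndex` are hypothetical names, and slicing between the two indices assumes the token array is in source order, which (per the comment above about tokens not always being sorted) may not hold without a sort first.

```ruby
# Hypothetical token with absolute source offsets (not RuboCop's Token class).
Tok = Struct.new(:text, :begin_pos, :end_pos)

class TokenIndex
  # Builds both position indices in a single O(n) pass over the token list.
  def initialize(tokens)
    @tokens = tokens
    @by_begin = {}
    @by_end = {}
    tokens.each_with_index do |tok, i|
      @by_begin[tok.begin_pos] = i
      @by_end[tok.end_pos] = i
    end
  end

  # Two O(1) hash lookups, then a slice of the node's own tokens.
  # Returns [] when no token starts or ends exactly at the given positions.
  def tokens_within(begin_pos, end_pos)
    first = @by_begin[begin_pos]
    last = @by_end[end_pos]
    return [] if first.nil? || last.nil?

    @tokens[first..last]
  end
end

# Tokens for the source `x = [1, 2]`
index = TokenIndex.new([
  Tok.new('x', 0, 1), Tok.new('=', 2, 3), Tok.new('[', 4, 5),
  Tok.new('1', 5, 6), Tok.new(',', 6, 7), Tok.new('2', 8, 9),
  Tok.new(']', 9, 10)
])

p index.tokens_within(4, 10).map(&:text) # ["[", "1", ",", "2", "]"]
```

The trade-off is the two extra hashes, each holding one entry per token.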

fatkodima · 2020-08-04T19:24:46Z

but the number of searches is also roughly linear with the number of tokens

Why? I would say it is a constant. Most of the time we are not performing searches for every token, but just for some of them. In Layout/SpaceInsideArrayLiteralBrackets, for example, for the whole file we may be searching only for [ and ].

marcandre · 2020-08-04T19:38:26Z

but the number of searches is also roughly linear with the number of tokens

Why? I would say it is a constant. Most of the time we are not performing searches for every token, but just for some of them. In Layout/SpaceInsideArrayLiteralBrackets, for example, for the whole file we may be searching only for [ and ].

Right, but how many [ ... ] literals will you find in a Ruby file of n bytes, on average? I imagine that the best approximation is linear with n. The size in bytes, the number of tokens, the number of def nodes, of comments, of lines of code, of string literals, of array literals, etc., all of these should be roughly linear to one another, even if they are not directly related...

What do you think of my idea of indexing the begin and end position of tokens?

fatkodima · 2020-08-04T20:01:58Z

What do you think of my idea of indexing the begin and end position of tokens?

I still doubt that this will make things faster. I will implement all 3 approaches (original, sorting and indexing) and report the results here.

marcandre · 2020-08-04T20:18:20Z

I still doubt that this will make things faster. I will implement all 3 approaches (original, sorting and indexing) and report the results here.

That would be great, but I wouldn't want you to feel forced; as stated previously, I'm 👍 to merge this as is.

fatkodima · 2020-08-05T01:08:55Z

but I wouldn't want you to feel forced

No problems 😄

OK, I have tested the 3 implementations and was a bit surprised to get almost identical results for them. I rechecked that I was testing 3 different versions, not the same one, so everything is OK here.

The sorting version gives the most concise code and, unlike the indexing version, does not allocate extra space, so I will implement that and submit a PR to rubocop-ast tomorrow.

marcandre · 2020-08-05T01:53:56Z

OK, I have tested the 3 implementations and was a bit surprised to get almost identical results for them. I rechecked that I was testing 3 different versions, not the same one, so everything is OK here.

Cool! Glad you checked 👍

bbatsov · 2020-08-05T06:56:02Z

Good work!

marcandre · 2020-08-05T12:27:12Z

Good work!

🤣 we'll be picking a different implementation, but it won't be difficult to revert this

bbatsov · 2020-08-05T12:38:17Z

Yeah, I saw this, but I'm thinking of cutting a release later today, so that's going to benefit our users in the meantime.

Optimize getting node tokens

ecc91af

marcandre reviewed Aug 4, 2020

bbatsov merged commit c29441f into rubocop:master Aug 5, 2020

fatkodima mentioned this pull request Aug 5, 2020

Add ProcessedSource#tokens_within, ProcessedSource#first_token_of and ProcessedSource#last_token_of rubocop/rubocop-ast#92

Merged

fatkodima mentioned this pull request Sep 16, 2020

Use token helpers from rubocop-ast #8729

Merged
