bpo-43014: Improve performance of tokenize by 20-30% #24311
Conversation
Great optimization, though there are 2 concerns of mine:
- For people who are not using the `tokenize` module to generate tokens (`detect_encoding`/`open` are the most common functions), they'd have to pay this cost.
- Also, even though breaking them is somewhat OK, there are wild usages out there that monkeypatch `PseudoToken` to change the behavior (add new tokens) of the `tokenize` module.

Maybe there is a solution that would both optimize this and also not cause any new regressions for normal users (something like `@lru_cache` on `_compile`, maybe?)
I initially approached this with lru_cache, however the function call alone accounts for 6% of the execution, so the performance gains aren't as significant.
Maybe we could set it to a global (…
from my tests this performs the same as the lru_cache approach (within a few 1s of …
Force-pushed from bc2dc35 to 2025476
Thanks a lot @asottile!
@@ -95,6 +96,7 @@ def _all_string_prefixes():
         result.add(''.join(u))
     return result

+@functools.lru_cache
I think we should explicitly limit the max size of the cache in this call (I am aware there is a default of 128) so this doesn't blow up in some super intensive cases.
In the `re` module we use the same trick, and we have 512 as the max size of the cache.
I commented on the other PR; there are only 5 possible regex strings that go through this call.
Roger. Then it's fine, although I would have preferred to manually pre-compile those and avoid importing `functools`.
that was my original patch, it increased import time of this module by ~20ms
How much is importing `functools`? Also, we can still go that way if we do the compilation lazily, but I don't think it's worth it.
it is already imported via tokenize => re => functools, so 0
Oh, then I retract my concern. This is absolutely a better approach. Thanks for following up with me :)
https://bugs.python.org/issue43014