Conversation

Contributor

@asottile asottile commented Jan 24, 2021

Member

@isidentical isidentical left a comment

Great optimization, though I have 2 concerns:

  • People who are not using the tokenize module to generate tokens (detect_encoding/open are the most common such entry points) would still have to pay this cost.
  • Also, even though breaking them is somewhat OK, there are usages in the wild that monkeypatch PseudoToken to change the behavior of the tokenize module (e.g. to add new tokens).

Maybe there is a solution that would both optimize this and also not cause any new regressions for normal users (something like @lru_cache on _compile, maybe?)

@asottile
Contributor Author

I initially approached this with lru_cache; however, the function call alone accounts for 6% of the execution time, so the performance gains aren't as significant.

@isidentical
Member

> I initially approached this with lru_cache; however, the function call alone accounts for 6% of the execution time, so the performance gains aren't as significant.

Maybe we could set it to a global (_PSEUDO_TOKEN_RE = None; if ... is None: _PSEUDO_TOKEN_RE = compile())?

@asottile
Contributor Author

> I initially approached this with lru_cache; however, the function call alone accounts for 6% of the execution time, so the performance gains aren't as significant.

> Maybe we could set it to a global (_PSEUDO_TOKEN_RE = None; if ... is None: _PSEUDO_TOKEN_RE = compile())?

From my tests this performs the same as the lru_cache approach (within a few ms -- measurement noise). The lru_cache approach seems a reasonable middle ground (and it also avoids recompiling the triple-quoted-string regexes over and over).
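A rough way to compare the two approaches is a micro-benchmark like the following (illustrative only, not the PR's actual measurement; the helper names are made up):

```python
import re
import timeit
from functools import lru_cache

# lru_cache approach: memoize compilation keyed by the pattern string.
cached_compile = lru_cache(maxsize=None)(re.compile)

# Lazy-global approach: compile once, reuse the module-level object.
_RE = None

def global_compile(pattern):
    global _RE
    if _RE is None:
        _RE = re.compile(pattern)
    return _RE

t_cache = timeit.timeit(lambda: cached_compile(r"\d+"), number=100_000)
t_global = timeit.timeit(lambda: global_compile(r"\d+"), number=100_000)
print(f"lru_cache: {t_cache:.4f}s  global: {t_global:.4f}s")
```

On a typical machine the two timings land within noise of each other, which matches the observation above.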

@asottile asottile changed the title from "bpo-43014: Improve performance of tokenize by 25-35%" to "bpo-43014: Improve performance of tokenize by 20-30%" on Jan 24, 2021
@asottile asottile force-pushed the faster_tokenize_bpo-43014 branch from bc2dc35 to 2025476 on January 24, 2021 08:59
@isidentical
Member

Thanks a lot @asottile!

@asottile asottile deleted the faster_tokenize_bpo-43014 branch January 24, 2021 16:33
@@ -95,6 +96,7 @@ def _all_string_prefixes():
result.add(''.join(u))
return result

@functools.lru_cache
Member

@pablogsal pablogsal Jan 24, 2021

I think we should explicitly limit the max size of the cache in this call (I am aware there is a default of 128) so this doesn't blow up in some super intensive cases.

Member

In the re module we use the same trick and we have 512 as the max size of the cache.

Contributor Author

I commented on the other PR; there are only 5 possible regex strings that go through this call.

Member

Roger, then it's fine, although I would have preferred to manually pre-compile those and avoid importing functools.

Contributor Author

That was my original patch; it increased the import time of this module by ~20ms.

Member

How much is importing functools? Also, we could still go that way if we do the compilation lazily, but I don't think it's worth it.

Contributor Author

It is already imported via tokenize => re => functools, so 0.

Member

Oh, then I retract my concern. This is absolutely the better approach. Thanks for bearing with me :)
