
Conversation

podrybaum

@podrybaum podrybaum commented Dec 24, 2024

I don't know that this really required an issue, but I went ahead and created one just in case. The changes are relatively trivial and functionality has not changed, but the implementation of the changes did take a bit of thought, and I had several failing tests to fix after my first attempt. I'm currently working on some code that depends on the tokenizer, so I figured while I was in there looking through it, I might as well contribute to cleaning it up a bit.

I have developed what I believe is a novel parsing algorithm, and in some early testing it dramatically outperforms the parser generated by Pegen. It's not quite ready for release yet, as I'm currently reworking Pegen itself to implement the new algorithm for further testing. (The algorithm itself is quite trivial; generating code for it turns out not to be.) So that will come along... sometime. My hope is to do a thorough enough integration that the transition, if the community here agrees it's a good idea, will be very simple.

This PR simply cleans up the mess that has happened with the imports in the tokenize module, reducing all of this:

from token import *
from token import EXACT_TOKEN_TYPES
...

import token
__all__ = token.__all__ + ["tokenize", "generate_tokens", "detect_encoding",
                           "untokenize", "TokenInfo", "open", "TokenError"]
del token

to simply:

import token

# kind of a hack, but allows us to avoid importing the token module 3 times
globals().update({name: getattr(token, name) for name in token.__all__})
# ruff: noqa: F821

Explicitly pulling the objects from token into globals() was done to keep the API intact, because other modules in the standard library depend on names such as tokenize.ENCODING.
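
As a quick sanity check (not part of the PR itself, just illustrating the point), every public name from token should still be reachable through tokenize after the change:

import token
import tokenize

# Every name re-exported from the token module should still be an
# attribute of tokenize, since other stdlib modules rely on them.
for name in token.__all__:
    assert hasattr(tokenize, name), name

# For example, the constant mentioned above:
print(tokenize.ENCODING, tokenize.tok_name[tokenize.ENCODING])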

I further elected to rename all other uses of "token" as a local name in the module to "tok", rather than deleting the module reference as was done previously. My first scan through the module had me confused: I missed the del statement and couldn't figure out what was being accessed in the "token" namespace.

Apart from that, I added a brief comment to the _generate_tokens_from_c_tokenizer function to better explain what's happening, as it confused me at first.

def _generate_tokens_from_c_tokenizer(source, encoding=None, extra_tokens=False):
    """Tokenize a source reading Python code as unicode strings using the internal C tokenizer"""
    if encoding is None:
        it = _tokenize.TokenizerIter(source, extra_tokens=extra_tokens)
    else:
        it = _tokenize.TokenizerIter(
            source, encoding=encoding, extra_tokens=extra_tokens
        )
    try:
        for info in it:
            yield TokenInfo._make(info)
    except SyntaxError as e:
        # Error messages raised by the tokenizer are subclasses of SyntaxError,
        # so we should pass those through. If we get an actual SyntaxError (the
        # base class), transform it into a TokenError instead.
        if type(e) is not SyntaxError:
            raise e from None
        msg = _transform_msg(e.msg)
        raise TokenError(msg, (e.lineno, e.offset)) from None

Since the declaration of the SyntaxError subclasses involved is not local to this module, it is easy to be confused by the if block here, and quite tempting to delete it: a cursory glance can easily lead one to think the condition will never evaluate to True.
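
To make that concrete, here is a rough sketch of the two paths through that except block (illustrative only; the exact error messages can differ between Python versions):

import io
import tokenize

# A plain SyntaxError from the C tokenizer (e.g. an unterminated string)
# is converted into a TokenError by the block above.
try:
    list(tokenize.generate_tokens(io.StringIO('"unterminated').readline))
except tokenize.TokenError as e:
    print("TokenError:", e)

# An IndentationError (a SyntaxError subclass) is re-raised unchanged.
try:
    bad = "if x:\n        pass\n  pass\n"
    list(tokenize.generate_tokens(io.StringIO(bad).readline))
except IndentationError as e:
    print("IndentationError:", e)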

P.S. As this is my first contribution to CPython, I read the contribution guide and followed the link to the licensing page in an attempt to sign the agreement. The document it links to suggests that the process is automatic upon one's first contribution, but I have not seen anything asking me to sign an agreement up to this point. Perhaps it will follow the submission of this pull request. If not, please direct me to where I need to handle that, and I will get it done quickly.

All tests are passing; blurb to follow.

@ghost

ghost commented Dec 24, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app

bedevere-app bot commented Dec 24, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Member

@picnixz picnixz left a comment


Please revert all changes that were made because of ruff format or black. In addition, is there a real performance gain when importing tokenize or not? We generally don't accept cosmetic PRs unless they have a noticeable impact on performance or are otherwise justified (though this is left to the discretion of the module's maintainer(s)).

[Inline review comment on the diff hunk that reformats the __author__ and __credits__ assignments in Lib/tokenize.py]
Member


Please do not use ruff format on the file. Cosmetic changes are not accepted in general.

Member

@sobolevn sobolevn left a comment


I propose to close this PR, because the change has no real runtime effect. If there is one, please provide perf stats. Even if the impact is significant, all the other changes would still have to be reverted.

@bedevere-app

bedevere-app bot commented Dec 24, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@serhiy-storchaka
Member

We usually do not accept purely cosmetic changes.

@podrybaum
Author

Acknowledged. While cProfile showed some slight gains in execution speed, data from perf reflects the opposite: these changes actually slow the module down slightly. The formatting was unintentional; I had format-on-save enabled and didn't even think about it. I assumed there would be an easy win from not importing the token module three separate times, but with import caching there seems to be nothing to be gained. I had also forgotten to mention that I updated the legacy string-formatting calls to f-strings, which I had been led to believe are preferred for performance reasons, but according to perf those are slightly slower too (at least on my system, shrug). Thanks for taking the time to look.
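
For anyone curious, a micro-benchmark along these lines (illustrative only; results vary by machine and Python build) shows the kind of difference being described:

import timeit

# Illustrative only: compare %-formatting with an equivalent f-string.
setup = "lineno, col = 12, 34"
legacy = timeit.timeit("'%d,%d:' % (lineno, col)", setup=setup, number=1_000_000)
fstring = timeit.timeit("f'{lineno},{col}:'", setup=setup, number=1_000_000)
print(f"%-formatting: {legacy:.3f}s   f-string: {fstring:.3f}s")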

@podrybaum podrybaum deleted the trivial-tokenize-updates branch December 24, 2024 22:15
