
Conversation

podrybaum

@podrybaum podrybaum commented Dec 24, 2024

I don't know that this really required an issue, but I went ahead and created one just in case. The changes are relatively trivial and functionality has not changed, but the implementation of the changes did take a bit of thought, and I had several failing tests to fix after my first attempt. I'm currently working on some code that depends on the tokenizer, so I figured while I was in there looking through it, I might as well contribute to cleaning it up a bit.

I have developed what I believe is a novel parsing algorithm, and in some early testing it dramatically outperforms the parser generated by Pegen. It's not quite ready for release yet, as I'm currently reworking Pegen itself to implement the new algorithm for further testing. (The algorithm itself is quite trivial; generating code for it turns out not to be.) So that will come along... sometime. My hope is to do a thorough enough integration that the transition, if the community here agrees it's a good idea, will be very simple.

This PR simply cleans up the mess that has happened with the imports in the tokenize module, reducing all of this:

from token import *
from token import EXACT_TOKEN_TYPES
...

import token
__all__ = token.__all__ + ["tokenize", "generate_tokens", "detect_encoding",
                           "untokenize", "TokenInfo", "open", "TokenError"]
del token

to simply:

import token

# kind of a hack, but allows us to avoid importing the token module 3 times
globals().update({name: getattr(token, name) for name in token.__all__})
# ruff: noqa: F821

Explicitly pulling the objects from token into globals() was done to keep the API intact, because other modules in the standard library depend on names such as tokenize.ENCODING.
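
As a quick sanity check (not part of the PR itself, just illustrating the point), every public name from token should still be reachable through tokenize after the change:

import token
import tokenize

# Every name re-exported from the token module should still be an
# attribute of tokenize, since other stdlib modules rely on them.
for name in token.__all__:
    assert hasattr(tokenize, name), name

# For example, the constant mentioned above:
print(tokenize.ENCODING, tokenize.tok_name[tokenize.ENCODING])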

I further elected to rename all other uses of "token" as a local name in the module to "tok", rather than deleting the module reference as was done previously. My first scan through the module had me confused: I missed the del statement and couldn't figure out what was being accessed in the "token" namespace.

Apart from that, I added a brief comment to the _generate_tokens_from_c_tokenizer function to better explain what's happening, as it confused me at first.

def _generate_tokens_from_c_tokenizer(source, encoding=None, extra_tokens=False):
    """Tokenize a source reading Python code as unicode strings using the internal C tokenizer"""
    if encoding is None:
        it = _tokenize.TokenizerIter(source, extra_tokens=extra_tokens)
    else:
        it = _tokenize.TokenizerIter(
            source, encoding=encoding, extra_tokens=extra_tokens
        )
    try:
        for info in it:
            yield TokenInfo._make(info)
    except SyntaxError as e:
        # Error messages raised by the tokenizer are subclasses of SyntaxError,
        # so we should pass those through. If we get an actual SyntaxError (the
        # base class), transform it into a TokenError instead.
        if type(e) is not SyntaxError:
            raise e from None
        msg = _transform_msg(e.msg)
        raise TokenError(msg, (e.lineno, e.offset)) from None

Since the declaration of the SyntaxError subclasses involved is not local to this module, it is easy to be confused by the if block here, and quite tempting to delete it: a cursory glance can easily lead one to think the condition will never evaluate to True.
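
To make that concrete, here is a rough sketch of the two paths through that except block (illustrative only; the exact error messages can differ between Python versions):

import io
import tokenize

# A plain SyntaxError from the C tokenizer (e.g. an unterminated string)
# is converted into a TokenError by the block above.
try:
    list(tokenize.generate_tokens(io.StringIO('"unterminated').readline))
except tokenize.TokenError as e:
    print("TokenError:", e)

# An IndentationError (a SyntaxError subclass) is re-raised unchanged.
try:
    bad = "if x:\n        pass\n  pass\n"
    list(tokenize.generate_tokens(io.StringIO(bad).readline))
except IndentationError as e:
    print("IndentationError:", e)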

P.S. As this is my first contribution to CPython, I read the contribution guide and followed the link to the licensing page in an attempt to sign the agreement. The document it links to suggests that the process is automatic upon one's first contribution, but I have not seen anything asking me to sign an agreement up to this point. Perhaps it will follow the submission of this pull request. If not, please direct me to where I need to handle that, and I will get it done quickly.

All tests are passing; blurb to follow.

@ghost

ghost commented Dec 24, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app

bedevere-app bot commented Dec 24, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Member

@picnixz picnixz left a comment


Please revert all changes that were made because of ruff format or black. In addition, is there a real performance gain when importing tokenize or not? We generally don't accept cosmetic PRs unless they have a noticeable impact on performance or are otherwise justified (though this is left to the discretion of the module's maintainer(s)).

[Inline review comment on the diff hunk that reformats the __author__ and __credits__ assignments in Lib/tokenize.py]
Member


Please do not use ruff format on the file. Cosmetic changes are not accepted in general.

Member

@sobolevn sobolevn left a comment


I propose to close this PR, because the change has no real runtime effect. If there is one, please provide perf stats. Even if the impact is significant, all the other changes would still have to be reverted.

@bedevere-app

bedevere-app bot commented Dec 24, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@serhiy-storchaka
Member

We usually do not accept purely cosmetic changes.

@podrybaum
Author

Acknowledged. While cProfile showed some slight gains in execution speed, data from perf reflects the opposite: these changes actually slow the module down slightly. The formatting was unintentional; I had format-on-save enabled and didn't even think about it. I assumed there would be an easy win from not importing the token module three separate times, but with import caching there seems to be nothing to be gained. I had also forgotten to mention that I updated the legacy string-formatting calls to f-strings, which I had been led to believe are preferred for performance reasons, but according to perf those are slightly slower too (at least on my system, shrug). Thanks for taking the time to look.
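
For anyone curious, a micro-benchmark along these lines (illustrative only; results vary by machine and Python build) shows the kind of difference being described:

import timeit

# Illustrative only: compare %-formatting with an equivalent f-string.
setup = "lineno, col = 12, 34"
legacy = timeit.timeit("'%d,%d:' % (lineno, col)", setup=setup, number=1_000_000)
fstring = timeit.timeit("f'{lineno},{col}:'", setup=setup, number=1_000_000)
print(f"%-formatting: {legacy:.3f}s   f-string: {fstring:.3f}s")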

@podrybaum podrybaum deleted the trivial-tokenize-updates branch December 24, 2024 22:15
