gh-128223 - Minor updates to tokenize.py, cleaning up imports and added a comment. #128224
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.
Please revert all changes that were made because of ruff format or black. In addition, is there a real performance gain upon importing tokenize or not? We generally don't accept cosmetic PRs if they don't have a noticeable impact on performance or are otherwise justified (though this is left to the discretion of the module's maintainer(s)).
'Skip Montanaro, Raymond Hettinger, Trent Nelson, '
'Michael Foord')
__author__ = "Ka-Ping Yee <ping@lfw.org>"
__credits__ = (
Please do not use ruff format on the file. Cosmetic changes are not accepted in general.
I propose to close the PR, because there's no real runtime effect of the change. If there is, please provide perf stats. If the impact is significant, all other changes would still have to be reverted.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase: I have made the requested changes; please review again.
We usually do not accept pure cosmetic changes.
Acknowledged. While cProfile showed some slight gains in execution speed, data from perf reflects the opposite: these changes actually slow the module down slightly. The formatting was unintentional; I had format-on-save enabled and didn't even think about it. I assumed there would be some easy gain from not importing the token module 3 separate times, but with caching, it seems there is nothing to be gained. I had forgotten to mention that I also updated the legacy string formatting calls to f-strings, which I had been led to believe should be preferred for performance reasons, but according to perf, those are slightly slower too (at least on my system, shrug). Thanks for taking the time to look.
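For anyone wanting to reproduce the formatting comparison, here is a minimal micro-benchmark sketch using the standard timeit module; the statements timed are illustrative, not the actual calls from tokenize.py, and results vary by Python version, build, and hardware (which is one reason cProfile and perf can disagree):

```python
import timeit

# Illustrative values; any short string/int pair would do.
name, lineno = "ENCODING", 42

# Time legacy %-formatting against an f-string building the same message.
legacy = timeit.timeit(
    "'%s at line %d' % (name, lineno)",
    globals={"name": name, "lineno": lineno},
    number=1_000_000,
)
fstring = timeit.timeit(
    "f'{name} at line {lineno}'",
    globals={"name": name, "lineno": lineno},
    number=1_000_000,
)
print(f"%-formatting: {legacy:.3f}s  f-string: {fstring:.3f}s")
```

Both statements produce the identical string, so the comparison isolates formatting cost; which one wins is genuinely machine-dependent at this scale.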
I don't know that this really required an issue, but I went ahead and created one just in case. The changes are relatively trivial and functionality has not changed, but the implementation of the changes did take a bit of thought, and I had several failing tests to fix after my first attempt. I'm currently working on some code that depends on the tokenizer, so I figured while I was in there looking through it, I might as well contribute to cleaning it up a bit.
I have developed what I believe is a novel parsing algorithm, and in some early testing it dramatically outperforms the parser generated by Pegen. It's not quite ready for release yet, as I'm currently reworking Pegen itself to implement the new algorithm for some more testing. (The algorithm is quite trivial; generating code for it turns out not to be.) So that will come along... sometime. My hope is that I can do a thorough enough integration to make the transition very simple, if the community here agrees with me that it's a good idea.
This PR simply cleans up the mess that has happened with the imports in the tokenize module, consolidating the repeated imports of the token module into a single import.
Explicitly pulling the objects from token into the module's globals was done to keep the API intact, because other modules in the standard library depend on (for instance) tokenize.ENCODING.
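The re-export pattern being described can be sketched roughly as follows; this is an illustration of the technique, not the actual diff from the PR:

```python
# Import the token module once, under a short alias, instead of
# importing it several times throughout the file.
import token as tok

# Copy token's public names (NAME, NUMBER, ENCODING, ...) into this
# module's namespace so existing callers of e.g. tokenize.ENCODING
# keep working unchanged.
globals().update({name: getattr(tok, name) for name in tok.__all__})

# The re-exported constant and the original are the same object.
print(ENCODING == tok.ENCODING)  # → True
```

The `globals().update(...)` call does at import time what a `from token import *` would do, while still keeping the `tok` alias around for later use in the module.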
I further elected to change all other uses of "token" as a name in the module to "tok" rather than delete the reference as was being done previously. My first scan through the module had me confused, as I missed seeing the del statement and couldn't figure out what was being accessed in the "token" namespace.
Apart from that, I added a brief comment to the _generate_tokens_from_c_tokenizer method, to better explain what's happening, as it confused me at first.
Since the declaration of the SyntaxError subclasses involved is not local to the module, it is easy to be confused by the if block here, and quite tempting to delete it, since a cursory glance can easily lead one to think the condition will never evaluate to True.
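The reason such a check is not dead code: the tokenizer raises SyntaxError *subclasses* like TabError and IndentationError, so an except clause for SyntaxError can catch an exception whose exact type is not SyntaxError. A hypothetical demonstration (not the actual tokenize.py code) using compile(), which raises the same exception family:

```python
def classify(source: str) -> str:
    """Report whether compiling source raises a plain SyntaxError,
    a SyntaxError subclass, or nothing."""
    try:
        compile(source, "<demo>", "exec")
    except SyntaxError as e:
        # This condition CAN be True: IndentationError and TabError
        # are subclasses of SyntaxError, so they land in this handler
        # with a different exact type.
        if type(e) is not SyntaxError:
            return f"subclass: {type(e).__name__}"
        return "plain SyntaxError"
    return "ok"

print(classify("if True:\npass\n"))  # → subclass: IndentationError
print(classify("1 +"))               # → plain SyntaxError
print(classify("x = 1\n"))           # → ok
```

A glance at only the except clause suggests the type check is redundant, which is exactly the confusion the added comment in the PR is meant to prevent.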
P.S. I read the contribution guide as this is my first contribution to CPython, and followed the link to the licensing in an attempt to sign the agreement. The document it links to suggests that the process is automatic upon one's first contribution, but I have not seen anything asking me to sign an agreement up to this point. Perhaps it will follow the submission of this pull request. If not, please direct me to where I need to handle that, and I will get it done quickly.
All tests passing; blurb to follow.