Various changes, works with current Pygments now. #40
All builds are now equivalent to previous "wide unicode".
Allow non-empty regexes that match the empty string if there is a state transition associated. These cases cannot use `default` since they usually contain an assertion.
We don't need words() to generate optimal regexes; that would make it too complex.
Start with file:line: and then the further details. This lets us automatically jump to these from IDEs.
In re.UNICODE mode (which is the default on Py3), the builtin charclasses \wWsSdD match more than just ASCII. The logic can't handle this currently, so disable the checker. Also, the way negated classes are handled (by building the full complement set) will have to be changed to avoid storing millions of Unicode characters over and over.
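As a rough illustration of the `file:line:` message layout described in the commit messages above, here is a minimal sketch. The helper name `format_lint_message`, the exact field order after the location prefix, and the `LOCATION_RE` pattern are assumptions made for this example; they are not taken from regexlint itself.

```python
import logging
import re

# Hypothetical pattern an IDE or editor plugin could use to find the location.
LOCATION_RE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+): ")


def format_lint_message(filename, lineno, num, level, text):
    """Put file:line: first, then the details (layout of the details is assumed)."""
    level_letter = logging.getLevelName(level)[0]  # e.g. 'W' for WARNING
    return "%s:%d: %s%s %s" % (filename, lineno, level_letter, num, text)


line = format_lint_message(
    "lexers/example.py", 42, "123", logging.WARNING,
    "Regex can be written more simply",
)
match = LOCATION_RE.match(line)
print(line)
print(match.group("file"), int(match.group("line")))
```

Leading with the location is what lets tools that already understand compiler-style output jump straight to the offending line.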
def check_wide_unicode(reg, errs):
    num = '121'
    level = logging.WARNING
    msg = 'Wide unicode causes problems in narrow builds'
This code tested that patterns in Pygments that needed to use `unirange` actually do. It looks like this isn't necessary anymore after PEP 393, so the `unirange` in `pygments/util.py` can also go away.
Yep, that's already gone.
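For readers unfamiliar with the PEP 393 point made above, the following small sketch shows why the narrow-build workaround is no longer needed on Python 3.3 and later. The sample character and the character class are arbitrary choices for illustration.

```python
import re
import sys

# A code point outside the Basic Multilingual Plane (above U+FFFF).
astral = "\U0001F600"

# Since PEP 393 (Python 3.3) there are no narrow builds: every build
# addresses the full Unicode range, and astral characters are one item long.
full_range = sys.maxunicode == 0x10FFFF
astral_length = len(astral)

# A character class spanning astral code points can be written directly,
# without splitting it into surrogate pairs.
astral_class = re.compile("[\U00010000-\U0010FFFF]")
astral_matches = astral_class.fullmatch(astral) is not None

print(full_range, astral_length, astral_matches)
```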
 def check_charclass_simplify(reg, errs):
     num = '123'
     level = logging.WARNING
     msg = 'Regex can be written more simply: %s -> %s'

-    if any(ord(c) > 255 for c in reg.raw):
+    if any(ord(c) > 255 for c in reg.raw) or reg.effective_flags & re.UNICODE:
This is ok; I'll follow up with a more permanent fix before cutting a release.
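The reason for the `re.UNICODE` guard in the diff above can be seen with a short experiment. This is a sketch using only the standard `re` module; it reads the `flags` attribute of a compiled pattern as a stand-in for regexlint's `reg.effective_flags`, which is an assumption made for this example.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE

# str patterns on Python 3 are compiled with re.UNICODE unless re.ASCII is given.
unicode_digit = re.compile(r"\d")
ascii_digit = re.compile(r"\d", re.ASCII)

unicode_flag_set = bool(unicode_digit.flags & re.UNICODE)
ascii_flag_set = bool(ascii_digit.flags & re.UNICODE)

# In Unicode mode \d and \w accept far more than [0-9] and [A-Za-z0-9_].
digit_unicode = unicode_digit.match(arabic_three) is not None
digit_ascii = ascii_digit.match(arabic_three) is not None
word_unicode = re.match(r"\w", e_acute) is not None
word_ascii = re.match(r"\w", e_acute, re.ASCII) is not None

print(unicode_flag_set, ascii_flag_set)
print(digit_unicode, digit_ascii, word_unicode, word_ascii)
```

A simplification check that assumes `\d` means exactly `[0-9]` would therefore suggest rewrites that change what the pattern matches, which is why skipping the check in Unicode mode is the safe choice for now.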
# should be using default().
if not isinstance(raw_pat[1], Token.__class__):
    return
if raw_pat[0] != '' and len(raw_pat) > 2:
Could you include a couple of lines of comment? I don't follow this change. Is this for the callback functions mentioned a few lines above?
It's for patterns that effect a state change with a zero-width match, something like `(r"(?=xyz)", Text, "#pop")`. I don't think there's a better way to write it, and I don't think it's harmful since it will not lead to infinite looping in the same state.
I'll move the comments around a bit and clarify them.
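To make the zero-width case concrete, here is a minimal, standard-library-only sketch of a state-stack tokenizer in the style of Pygments' `RegexLexer`. The rule table, the string token names, and the `tokenize` function are all hypothetical and written for this illustration; they are not Pygments or regexlint code.

```python
import re

# Rules have the shape (regex, token, new_state), mirroring RegexLexer tuples.
RULES = {
    "root": [
        (r"\{", "Punctuation", "block"),
        (r"[^{]+", "Text", None),
    ],
    "block": [
        (r"\w+", "Name", None),
        (r"\s+", "Text", None),
        # Zero-width rule: consumes nothing, but leaves the state when a
        # closing brace comes next. default() cannot express the assertion.
        (r"(?=\})", "Text", "#pop"),
    ],
}


def tokenize(text):
    stack = ["root"]
    pos = 0
    tokens = []
    while pos < len(text):
        for regex, token, new_state in RULES[stack[-1]]:
            m = re.compile(regex).match(text, pos)
            if m is None:
                continue
            if m.group():
                tokens.append((token, m.group()))
            pos = m.end()
            if new_state == "#pop":
                stack.pop()
            elif new_state is not None:
                stack.append(new_state)
            break
        else:
            # No rule matched: emit an error token and move on one character.
            tokens.append(("Error", text[pos]))
            pos += 1
    return tokens


result = tokenize("a{b c}d")
print(result)
```

The empty match does not advance `pos`, but because it pops the state, the next iteration tries a different rule list, so the loop still terminates. An empty-matching rule with no state transition would be retried forever in the same state, which is the case the linter still needs to flag.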
Hi Tim, nice to hear from you! This PR is not very focused, as it contains all the things I fixed and changed while working on integrating regexlint in the Pygments CI workflow. But I hope most commit messages are clear enough :)
Just pushed a 2.0; I tried and abandoned a fix for unicode charclass in time.