Various changes, works with current Pygments now. #40

Merged
merged 17 commits into from Sep 7, 2020

birkenfeld
Contributor

No description provided.

All builds are now equivalent to previous "wide unicode".

Allow non-empty regexes that match the empty string if there
is a state transition associated.  These cases cannot use
`default` since they usually contain an assertion.

We don't need words() to generate optimal regexes; it would
make it too complex.

Start with file:line: and then the further details.  This lets us
automatically jump to these from IDEs.

In re.UNICODE mode (which is the default on Py3), the builtin
charclasses \wWsSdD match more than just ASCII.  The logic can't
handle this currently, so disable the checker.

Also, the way negated classes are handled (by building the full
complement set) will have to be changed to avoid storing millions
of Unicode characters over and over.
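
To illustrate the point about the builtin character classes, here is a minimal sketch using only the standard `re` module. The helper name `matches` and the sample characters are chosen here for illustration and are not taken from regexlint.

```python
import re
import sys


def matches(pattern, ch, flags=0):
    """Return True if `pattern` matches the single character `ch`."""
    return re.fullmatch(pattern, ch, flags) is not None


# For str patterns on Python 3, Unicode matching is the default, so the
# shorthand classes accept characters well outside ASCII.
non_ascii = {
    r"\w": "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE
    r"\d": "\u0663",  # ARABIC-INDIC DIGIT THREE
    r"\s": "\u2003",  # EM SPACE
}
for pattern, ch in non_ascii.items():
    # First column: default (Unicode) mode; second column: re.ASCII mode.
    print(pattern, matches(pattern, ch), matches(pattern, ch, re.ASCII))

# A one-character negated class such as [^a] has a complement that covers
# every code point except one, which is why materializing it is expensive.
complement_size = sys.maxunicode
```

With `re.ASCII` the same characters are rejected, which is the behavior the checker logic currently assumes.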
def check_wide_unicode(reg, errs):
    num = '121'
    level = logging.WARNING
    msg = 'Wide unicode causes problems in narrow builds'
Owner

@thatch thatch Sep 7, 2020

This code tested that Pygments patterns that needed to use unirange actually do. It looks like this isn't necessary anymore after PEP 393, so the unirange in pygments/util.py can also go away.

Contributor Author

Yep, that's already gone.
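
As a rough illustration of why the narrow-build check is no longer needed, the following sketch (standard library only; the variable names are illustrative) shows that on Python 3.3 and later a code point outside the Basic Multilingual Plane is a single character and can be used directly in a character-class range:

```python
import re
import sys

astral = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# Since PEP 393 (Python 3.3), every build can represent the full range.
full_range = sys.maxunicode == 0x10FFFF
single_char = len(astral) == 1

# A plain character-class range over astral code points compiles and matches,
# without any surrogate-pair workaround.
astral_class = re.compile("[\U00010000-\U0010FFFF]")
matched = astral_class.fullmatch(astral) is not None

print(full_range, single_char, matched)
```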

 def check_charclass_simplify(reg, errs):
     num = '123'
     level = logging.WARNING
     msg = 'Regex can be written more simply: %s -> %s'

-    if any(ord(c) > 255 for c in reg.raw):
+    if any(ord(c) > 255 for c in reg.raw) or reg.effective_flags & re.UNICODE:
Owner

This is ok; I'll follow up with a more permanent fix before cutting a release.

    # should be using default().
    if not isinstance(raw_pat[1], Token.__class__):
        return
    if raw_pat[0] != '' and len(raw_pat) > 2:
Owner

Could you include a couple of lines of comment? I don't follow this change. Is this for the callback functions mentioned a few lines above?

Contributor Author

It's for patterns that effect a state change with a zero-width match, something like `(r"(?=xyz)", Text, "#pop")`. I don't think there's a better way to write it, and I don't think it's harmful, since it cannot lead to infinite looping in the same state.

I'll move the comments around a bit and clarify them.
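
The zero-width case can be sketched with a toy state-machine tokenizer. This is not Pygments or regexlint code: the `RULES` table, the token names, and `tokenize` are hypothetical and only imitate the shape of `RegexLexer` rules, using the standard `re` module.

```python
import re

# Hypothetical rule tables: (regex, token name, optional state action).
RULES = {
    "root": [
        (r"\[", "Punct", "list"),  # push the "list" state
        (r"\w+", "Name", None),
        (r"\s+", "Space", None),
        (r"\]", "Punct", None),
    ],
    "list": [
        (r"\w+", "Item", None),
        (r",\s*", "Punct", None),
        # Zero-width lookahead: consumes nothing, but pops the state,
        # so the next iteration runs against a different rule list.
        (r"(?=\])", "Text", "#pop"),
    ],
}


def tokenize(text):
    stack, pos, out = ["root"], 0, []
    while pos < len(text):
        for regex, token, action in RULES[stack[-1]]:
            m = re.compile(regex).match(text, pos)
            if m is None:
                continue
            if m.end() > pos:  # skip empty tokens
                out.append((token, m.group()))
            pos = m.end()
            if action == "#pop":
                stack.pop()
            elif action is not None:
                stack.append(action)
            break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return out


print(tokenize("[a, b] c"))
```

Because the empty match always changes the state stack, the loop still makes progress. The same pattern without a state action would match at the same position forever, which is the case the checker should keep flagging.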

@birkenfeld
Contributor Author

Hi Tim, nice to hear from you! This PR is not very focused, as it contains all the things I fixed and changed while working on integrating regexlint into the Pygments CI workflow. But I hope most commit messages are clear enough :)

@thatch thatch merged commit d948c7e into thatch:master Sep 7, 2020
@thatch
Owner

thatch commented Sep 16, 2020

Just pushed a 2.0; I tried and abandoned a fix for unicode charclass in time.
