Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syntax: accept {,n} as an equivalent to {0,n} #1086

Merged
merged 7 commits into from Mar 26, 2024

Conversation

plusvic
Copy link
Contributor

@plusvic plusvic commented Sep 7, 2023

Most regular expression engines don't accept the {,n} syntax, but some other do it (namely Python's re library). This introduces a new parser configuration option that enables the {,n} syntax as equivalent to {0,n}. This option is disabled by default and not exposed in the regex create.

I understand this change may imply a deviation from the goals of the regex-syntax create. If the purpose of the regex-syntax crate is exposing the parser used by the regex create and supporting a very specific regex syntax flavor, then this PR can be closed. However, I'm bringing this PR into consideration because I'm using regex-syntax in project that requires support for the {,n} syntax for backward-compatibility with existing regular expressions. The alternative for me is maintaining a fork of regex-syntax forever.

Also, I think empty_min_range is not a very good name, but I couldn't come up with a better name. Better naming alternatives are welcomed, even if this PR is not finally merged.

Most regular expression engines don't accept the `{,n}` syntax, but some other do it (namely Python's `re` library). This introduces a new parser configuration option that enables the `{,n}` syntax.
@BurntSushi
Copy link
Member

The alternative for me is maintaining a fork of regex-syntax forever.

Out of curiosity, is this the only additional syntax you need to support and everything else is in sync? I would be surprised if so.

I ask this because if this is literally the only thing, then I can see how "maintain a fork of regex-syntax forever" would be a somewhat significant motivation to merge this or something like it. But if you're going to need to maintain a fork forever anyway because of other incompatibilities, then it might make more sense to just close this.

Another possible route here is to just unconditionally enable this and support it in regex proper. Namely, {,n} is invalid syntax today. I think the reason why I made it invalid despite supporting {n,} syntax is that in the latter version, it's not possible to express otherwise (since it's unbounded). But in the former version, you can write an equivalent {0,n} which I think reads better.

@plusvic
Copy link
Contributor Author

plusvic commented Sep 7, 2023

Actually there are a few other incompatibilities, but either they are less prevalent in real-life regexps, or they are actually benefitial in the long term.

One case is the support for the unscaped { character in a context where it is not a repetition. In other regexps engines this regexp is ok {abc}, { and } behaves here as a literals. And that's the behaviour my legacy program had. However, I'm observing that are a lot of places where people are using { as literals in cases where they didn't intent it, with regexps like (foo){1-3}, where the real intention was (foo){1,3} (but they didn't notice because (foo){1-3} is perfectly valid). So, being more strict and forcing { to be escaped makes sense.

In the case of {,n} however, it's much more widespread, and there's no real benefit in dropping its support. Users will consider it a nuisance, and there's no good argument in favor of the removal. That's why I decided to support it. But my plan is using regex-syntax with as few changes as possible, and so far this is the only change I've identified.

Maybe we can simply hold on this change until I advance a little more. If I eventually need some changes like this one, then I must go through the fork path, if not, I can retake this PR and even provide the full implementation
for {,n} in regexp if you think it makes more sense. I was so used to Python's regexp that I got suprised that this is not supported by most of the regexp engines out there.

@BurntSushi
Copy link
Member

BurntSushi commented Sep 7, 2023

Yeah I mean I would argue that the benefit here is that {0,n} is easier to read. But I understand the nuisance argument. That's ultimately why I ended up relenting and allowing superfluous escapes of non-meta punctuation characters (e.g., \/ used to be invalid syntax). As far as things like (foo){1-3} go, yeah things like that were on my mind by requiring that { be escaped.

I like the idea of holding off until you advance a bit more. If this turns out to literally be the only thing you need then I might be able to get on board with that.

Now `RepetitionCountUnclosed` error has precedence over `RepetitionCountDecimalEmpty`, even if the `{,n}` syntax is not accepted. I think this is even desirable, as `RepetitionCountUnclosed` describes the situation better when an unclosed `{` is found in the regex.
@plusvic
Copy link
Contributor Author

plusvic commented Feb 1, 2024

@BurntSushi I would like to rescue this PR for your consideration again. After using regex-syntax for a while I haven't identified any additional change that I would need. The only feature that I like to include is support for the {,n} syntax, either as an opt-in feature or as the default behaviour.

This PR adds support for {,n} as an opt-in via ParserBuilder::new().empty_min_range(true), but if you prefer I can remove the configuration option and make it the default. I'm still unconvinced about the empty_min_range name (maybe allow_empty_min_range is better?)

A side-effect is that some errors change from RepetitionCountDecimalEmpty to RepetitionCountUnclosed (even when this feature disabled), but I would say that it's also a good thing, as RepetitionCountUnclosed describes the situation better when the regex contains something like foo.{.

@plusvic
Copy link
Contributor Author

plusvic commented Mar 15, 2024

@BurntSushi any comments on this?

@plusvic
Copy link
Contributor Author

plusvic commented Mar 26, 2024

Hi @BurntSushi,

I would like to publish the crate I was working on, which depends on a modified version of regex-syntax that includes this change. So far, I was using a git dependency in my Cargo.toml file, pointing to my clone of this repository. However, I just discovered that I can't publish a crate that has git dependencies, as all published crates must depend on crates that are also published in crates.io. So, either I publish my own clone of regex-syntax (which doesn't make too much as it would clutter crates.io), or I include the whole regex-syntax source code inside my own crate, which I would like to avoid if possible.

If you are ok with merging this change that would save my day, if not, I think next my best option is including regex-syntax as part of my own crate. I just need some feedback from you so I can make an informed decision on how to proceed.

Not being allowed to publish a crate that depends on some public GitHub repository (even if not published in crates.io) was quite unexpected too me :(

@BurntSushi
Copy link
Member

Yeah, sorry about my absence here. I've been focused on other things. I think my main decision point here is whether I just want to enable it for everyone (so don't make it conditional), or to keep it as you have here via an option. I do kind of feel like overall the syntax {,n} is an aberration and it would be better to not have it, which means I think it should be an option like you have here.

@BurntSushi BurntSushi merged commit f5d0b69 into rust-lang:master Mar 26, 2024
16 checks passed
@BurntSushi
Copy link
Member

This is on crates.io in regex-syntax 0.8.3.

@plusvic
Copy link
Contributor Author

plusvic commented Mar 26, 2024

That was quick! Thank you so much.

plusvic added a commit to VirusTotal/yara-x that referenced this pull request Mar 26, 2024
Now that rust-lang/regex#1086 was merged, I can stop depending on my own clone and can rely on the official implementation of `regex-syntax`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants