Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better support for empty patterns #677

Closed
wants to merge 4 commits into from
Closed

better support for empty patterns #677

wants to merge 4 commits into from

Conversation

v-gb
Copy link
Contributor

@v-gb v-gb commented May 11, 2020

I propose fixing the known issue in compile.rs where empty patterns + alternations can cause compilation errors or be compiled into a wrong automaton.

I understand that the plan is to replace compile.rs with a better version, but this doesn't seem to be about to happen, and I think the fix is not an undue amount of work. If possible, I would prefer fixing regex than work around these issues (it's easy to avoid them in hand written regexes but when generating regexes, it's more annoying).

About the change itself: all I do is compile Empty into Ok(None) instead of Ok(Patch{ entry: the-next-instruction, ..}) (as that is not correct in general) and follow the type system. I tried fixing this in other ways with less diff, but I think this version is simpler: the diff is bigger (mostly due to changes types and indentation, which are easy to review) but the performance is clearly ok and the type system forces me to consider where changes need to happen. The kind of diff self.c(expr)? -> self.c(expr)?.unwrap_or(self.next_inst()); is saying "at this call site, the old behavior worked".

The same automation should be generated in the case that were previously correctly and successfully compiled, except a few corner cases that involve empty subexpressions, where we may generate fewer instructions (like (?:)?).

@v-gb
Copy link
Contributor Author

v-gb commented May 11, 2020

For the test failure:

thread 'hir::translate::tests::analysis_is_alternation_literal' panicked at 'assertion failed: t(r"").is_alternation_literal()', regex-syntax/src/hir/translate.rs:3127:9

should the documentation of hir and implementation of exec.rs be changed to accept Empty nodes in is_alternation_literal?

@BurntSushi
Copy link
Member

@sliquister Good question. The docs suggest that "" shouldn't be an alternation literal, so I'd go with what you have and adjust the tests as appropriate. Updating the docs to add "" to the list of negative examples would also be good.

v-gb added 3 commits May 11, 2020 11:55
To avoid this assertion in tests when empty alternations are allowed:

   internal error: entered unreachable code: expected literal or concat, got Hir { kind: Empty, info: HirInfo { bools: 1795 } }', src/exec.rs:1568:18

The code in exec.rs relies on the documented invariant for is_alternation_literal:

        /// ... This is only
        /// true when this HIR expression is either itself a `Literal` or a
        /// concatenation of only `Literal`s or an alternation of only `Literal`s.
Some were rejected explicitly with this "alternations cannot currently
contain empty sub-expressions" error. Some were causing alternations
to be compiled into automaton matching a different regex.

The root of the issue is that the compilation of the Empty pattern
returns Patch { entry: self.insts.len(), .. }, that is it claims its
entry point is the next instruction. This only works if the next
instruction is unconditionally executed, which is not true for
alternations.

So we return None instead, and let the callers decide how to handle an
empty regex.
@v-gb
Copy link
Contributor Author

v-gb commented May 11, 2020

Thanks, the tests all pass now (and I've learned of the existence of cargo test --all).

@v-gb
Copy link
Contributor Author

v-gb commented May 27, 2020

Have you had a chance to consider this? (I'm trying to decide if I need to workaround the underlying issue)

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks so much! Sorry about the delay.

@BurntSushi
Copy link
Member

This PR is in regex 1.3.8 on crates.io.

As a bonus, this also fixed #659!

@v-gb v-gb deleted the alternation-big-fixes branch May 28, 2020 16:11
@v-gb
Copy link
Contributor Author

v-gb commented May 28, 2020

Marvelous, thank you! (especially with this change being released immediately)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants