Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overloading precedence marker (e.g. (pattern)) with 'posix' group markers (also parenthesis) causes loss of full functionality #308

Closed
dtp555-1212 opened this issue Sep 14, 2020 · 6 comments

Comments

@dtp555-1212
Copy link

dtp555-1212 commented Sep 14, 2020

The overloading of the posix group boundaries markers (e.g. (group) ) and the normal precedence group (e.g. also parenthesis) make all but trivial patterns impractical/impossible to define. (i.e. most non-trivial patterns contain the normal precedence markers, thus seem to be incompatible with the posix group markers. And the trivial cases are forced to be known by their position rather than the pattern type they matched.)

Is there (or could there be) a way to ‘override/redefine’ the markers (e.g. () ) that posix group uses, thus freeing up the use of the normal precedence markers (e.g. () ) along with the posix submatch groups?

Better yet, the optimal solution would look something like @x(pattern) … and when that pattern is matched, a callback (macro) sends back the start, end, a whatever X was given… This would overcome the limitation of stag, by allowing repetition (like mtag), and additionally, would improve the efficiency over the mtag method (e.g. not having function calls ‘during’ the pattern match phase, like mtag, but only at the end (or when accept of sub patterns) like posix (and stag)… and would overcome mtag not allowing the same tag being used more than once in the solution… In summary, it finds the boundaries, then communicates not only ‘where’ they are but ‘what’ they bound.

If there is ‘any’ way that the current system is capable of doing this, please let me know. If not, what is the feasibility of enhancing the capability to handle this use case efficiently and simply. If you have any question on what I mean, please let me know, and I will try to clarify.

Thanks

@dtp555-1212 dtp555-1212 changed the title Overloading 'posix' group markers with 'precedence' marker causes loss of full functionalty Overloading precedence marker (e.g. (pattern)) with 'posix' group markers (also parenthesis) causes loss of full functionality Sep 14, 2020
@skvadrik
Copy link
Owner

Hi @dtp555-1212, there are two subproblems here:

  • adding named capturing groups that would allow ordinary parenthesized group to be non-capturing
  • the "callback" idea that would allow repeated capturing groups to be matched more efficiently

The two subproblems are orthogonal, so let's consider each of them separately.

Named capturing groups are a good idea, and not very difficult to implement. They can be used both with POSIX longest-match semantics, and with Perl leftmost-greedy semantics. The main question here is syntax. I like your suggestion @name(...), but before deciding on something it is a good idea to consider existing syntax like (<name> ... ), and also it should not introduce any syntactic ambiguity. There also must be a way to refer to the opening and closing parentheses of each named group, so instead of a tag name must be a pair of tags (a struct perhaps).

The "callback" idea is problematic. Maybe I don't fully understand what you suggest, but I don't think it can improve m-tags efficiency. Saving repeated matches requires storing submatch values on all repetitions along the way, and that is what YYMTAGP and YYMTAGN are for. They don't have to be implemented as functions --- instead, they can be C macros or free-form pieces of code that are inlined by re2c. But they have to store the necessary information during matching. Using trie-like data structure allows to implement that efficiently.

@dtp555-1212
Copy link
Author

Thanks for your quick reply, and being open to the concepts. I totally agree with your wanting to nail down a consistent, unambiguous syntax. I am open to anything that would solve the use case.

A named ‘closing’ parenthesis would certainly disambiguate it, but it could also be that be a rule that the parenthesis must be perfectly nested, which would automatically disambiguate the matching parenthesis. (That would be sufficient for ‘my’ use case, but I guess you ‘could’ considering allowing ‘overlapping’ regions, but that seems like it may cause a lot of complications.

As for the mtag inefficiencies, I alluded to … from what I see, that macro is instantiated ‘a lot’. Since per the mtag example code, that turns into a function call, that could add a lot of bulk and overhead to the code, for large grammars. I ‘think’ since a callback (a macro) would only get inserted for ‘accepts’ I suspect that would be very infrequent compared to what I see in some grammars that use the mtag protocol. Also, the current mtag protocol doesn’t seem to have a way to communicate the ‘what’, only the 'where' (and even the where seems to count on the ‘many’ instances of the YYMYAGP & YYMYAGN macros, as each call only is a start, end, null, not the entire span in one shot). As envisioned, only at the ‘accept’ of the pattern would the callback (a macro) need to be called with the start, end, and the ‘name’ of the pattern. I guess it ‘could’ be called repeatedly for the ‘minimal’ through all potential matches, if there were multiple ambiguous solutions, and then the application could decide what to such as extend the range, branch, etc… (because it would have all the data to do so). However, with lookahead, I suspect most well formed grammars are going to resolve to ideally a single solution (or very few solutions).

In the end, being able to achieve this use case ‘anyway’ is better than no way. I love the re2c program. Because the performance has been improved over time, I am trying more and more complex grammars. Thanks for all your hard work! I will obviously make myself available as a resource for reviewing/testing any proposal you suggest for a solution.

Thanks again

@dtp555-1212
Copy link
Author

P.S. Here was an idea for a ‘failed attempt’ at a workaround that may spark an idea… I tried to get the addresses of the tag, so I could refer to the address when the mtag was reported. The suffered from two issues… There was not a 1-to-1 correspondence for the ‘t’ being sent via the macro. And, the internal yytxx variable associated with the tag was changing whenever there was a change to the grammar.

Another observation is that the posix form looks very efficient internally (no function calls triggered by macro expansion…) and knows all the sub matches at the conclusion of the function. That looks like the ideal place to be able to instantiate a macro that would convey back the start, end and ‘name’ of the spans, rather than insert positionally into the yypmatch array.

Hope that helps. (But ignore, if you have another straightforward plan in mind)

@skvadrik
Copy link
Owner

Unfortunately, submatch extraction is a more complex problem than it might seem.

I ‘think’ since a callback (a macro) would only get inserted for ‘accepts’ I suspect that would be very infrequent compared to what I see in some grammars that use the mtag protocol.

This assumption is wrong. It is not possible to wait until the lexer is in accepting state and then save all m-tags, because by that time the necessary information is not available anymore. When a lexer is in accepting state, how is it to know what parts of the matched input correspond to different tags? Right, it needs to save tags along the way. For s-tags this means saving a pointer, and for m-tags it means saving a list of pointers.

The suffered from two issues… There was not a 1-to-1 correspondence for the ‘t’ being sent via the macro. And, the internal yytxx variable associated with the tag was changing whenever there was a change to the grammar.

And this is another misunderstanding. There is no 1-to-1 corresponding between tags and tag variables for a reason.

First, it may be impossible to implement one tag (s-tag or m-tag) with just one variable. Under the hood, every path in DFA is a bunch of parallel NFA paths (recall that a DFA state is a set of NFA states). Each NFA path may have its own tagged transitions for the same tag, and as a result one tag may have different values on different NFA paths that all go to the same DFA state. At the end, only one of the parallel NFA paths comes to an accepting state and "wins", and the final tag value is its value on that winning NFA path. But it is unknown which path is going to win until the end of match, therefore it is necessary to save all possible tag values. That's how one tag ends up having multiple versions in one DFA state. In re2c terminology, the maximum number of different versions of the same tag in a DFA state is called the tag's "degree of nondeterminism", and there is a warning -Wnondeterministic-tags which warns about tags with degree 2 or more.

Second, one variable may be enough to implement multiple tags. This becomes quite obvious if you think of cases like a* @x @y b*, where tags x and y are both in the same place. But there are less obvious cases, where different tags do not interfere with each other, and can both use the same tag variable. In truth, the relation between tags and tag variables is similar to the relation between C-level variables and CPU registers (and re2c allocates tag variables with an algorithm similar to register allocation).

That's why tag variables change so much when you change the grammar, and re2c tries to hide these internal details.

If you are interested in the gory details of submatch extraction, there are papers about it:

@dtp555-1212
Copy link
Author

Thanks for the insights... To be clear, there may have been some been some comingling of ideas between the posix and mtag paradigm. I think I understand why the mtag paradigm does what is does, and it makes sense why it is done the way it is done. However, in the 'posix' paradigm, all the examples of the code generated when using that (so far), 'appears' to make the assignment to the yypmatch array at the conclusion of the routine. (Maybe I just haven't seen a counter example.) As such, at the conclusion of the routine, it knows the final resolved start and end for each submatch. If that is true, then sending the answers back via a callback macro rather than inserting them into yypmatch, 'seems' like a plausible alternative, but as mentioned, any solution is fine. If those observations don't help to lead to one, that is ok. I am not trying to enforce how you do it. I was just trying to help by sparking some additional ideas. Hope that helps.

Thanks again for your thought and consideration of this use case.

skvadrik added a commit that referenced this issue Mar 24, 2023
Allow the use of named definitions that cause implicit grouping and
circumvent the absence of syntax for non-capturing groups. This is not
allowed by the POSIX standard, but this feature is useful in practice
and many regexp engines support it.

Update the tests that should no longer result into a compilation error.

This addresses bugs #438 "POSIX captures processing" and #308
"Overloading precedence marker (e.g. (pattern)) with 'posix' group
markers (also parenthesis) causes loss of full functionality".
skvadrik added a commit that referenced this issue Mar 25, 2023
This is useful in POSIX mode (with --posix-captures option) in order to
avoid unnecessary captures and reduce the amount of tags.

The new syntax for non-capturing groups is `(? ...)`.

Add tests for the new feature. Note that test results coincide with the
corresponding test for implicit groups (where the same group is not
parenthesized at all, but implicit grouping is achieved with the help
of a named definition).

This addresses bugs #438 "POSIX captures processing" and #308
"Overloading precedence marker (e.g. (pattern)) with 'posix' group
markers (also parenthesis) causes loss of full functionality".
@skvadrik
Copy link
Owner

skvadrik commented Mar 25, 2023

I added syntax (? ...) for non-capturing groups: 1edd25d.

skvadrik added a commit that referenced this issue Mar 25, 2023
In the case of POSIX disambiguation (with --posix-captures option) it is
necessary to add structural tags around non-capturing groups, otherwise
disambiguation will be incorrect.

Commit f519385 that allowed the use of
implicit non-capturing groups didn't properly add structural parentheses
around implicit capturing groups (which are needed if the group contains
nested capturing groups).

Commit 1edd25d that added explicit
non-capturing groups repeated the same error, only for the newly added
syntax `(? ...)`.

This resulted in `(? "a"* | ("aa"))*` being parsed incorrectly, as if
leftmost greedy disambiguation was used instead of POSIX disambiguation,
giving priority to the shorter first alternative `"a"` rather than the
longer match `"aa"` as POSIX requires.

This commit ensures correctness for both explicit and implicit
non-capturing groups by adding proper structural tags for the case of
AST sub-node `REF`. It is expected that it will bring back some of the
overhead on POSIX disambiguation reduced by the introduction of
non-capturing groups.

Related bugs: #438, #308.
@skvadrik skvadrik closed this as completed Jul 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants