
POSIX captures processing #438

@dtp555-1212

I have a ‘re’ file that processes reasonably quickly and without error when I don’t use the ‘-P’ flag to enable POSIX captures, but when I enable it, I notice two behaviors: it gets substantially slower (e.g. minutes vs. seconds), and in the worst case it crashes with ‘bad_alloc’.

The third behavior is that when POSIX captures are enabled, a message is issued saying that ‘implicit grouping’ is forbidden. This leads me to a theoretical enhancement that may improve speed and produce smaller results as well.

I ‘think’ that the implicit group rule is arbitrary. I understand why such a thing would exist, but there may be a way to accomplish both.

A named definition e.g.
num = [0-9]+;

num { return 1; }

is a way to facilitate reuse and readability, rather than …

[0-9]+ { return 1; }

Using a named definition has other benefits as well, since it also conceptually serves as a way to ‘group’ without using the () and their defined POSIX capture meaning.

In the case of the POSIX captures, by enforcing the explicit grouping, I think it has a wasteful side effect. (e.g. it ‘must’ track the subpattern start and stop for every named definition).

Since it is possible to create a valid and unambiguous grammar without the extra explicit grouping (e.g. num = ([0-9]+); ), forcing explicit grouping takes away some potential. Using the () only when you want to explicitly gather the substring would reduce the size of the output, speed up the processing, and reduce the size of yypmatch to only what is desired to be saved.

For example…
num = [0-9]+;

(num) ‘ ‘ (num) { return 1; }

Using the () only when and where you want to capture gives full control and reduces the waste. Of course, this is a toy example, but you can see the ‘big’ negative effect even with a relatively small grammar (e.g. your unicode_indentifier.re example): without the -P it processes in seconds, but with the -P (and after you add the () to satisfy the program) it takes minutes. It also results in a yynmatch of 6 (since it is tracking all the subpatterns, which in this case are only there to facilitate the definition and are not desired ‘keeper’ subpatterns) rather than just the 1 that bounds the identifier.
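To illustrate the counting argument outside of re2c, here is a rough analogy using Python's `re` module, which distinguishes capturing groups `(...)` from non-capturing groups `(?:...)`. This is only a sketch of the idea, not re2c syntax or behavior: the names `NUM` and `define` are hypothetical helpers standing in for a named definition, and the non-capturing form stands in for the implicit grouping I am asking for.

```python
import re

# Hypothetical stand-in for the named definition `num = [0-9]+;`.
NUM = "[0-9]+"


def define(body, capture):
    """Wrap a definition body in a capturing or a non-capturing group."""
    return f"({body})" if capture else f"(?:{body})"


# Every definition is forced to capture, in addition to the captures
# written explicitly at the point of use.
forced = re.compile(f"({define(NUM, True)}) ({define(NUM, True)})")

# Definitions only group; captures appear only where they are wanted.
chosen = re.compile(f"({define(NUM, False)}) ({define(NUM, False)})")

print(forced.groups)                    # number of groups when every definition captures
print(chosen.groups)                    # number of groups when only the use sites capture
print(chosen.match("12 345").groups())  # the two substrings that were asked for
```

Both patterns accept the same input, but the second one tracks only the two submatches that were actually requested, which is the behavior I would hope for from (num) ‘ ‘ (num).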

I love what you have done, and hopefully such a change is possible and can result in a dramatic speedup of both re2c processing and the runtime result.

Thanks for your consideration
