POSIX captures processing
I have a ‘re’ file that is processed reasonably quickly and without error when I don’t use the ‘-P’ flag to enable POSIX captures. When I enable it, I notice two behaviors: processing gets substantially slower (minutes vs. seconds), and in the worst case I get a crash due to ‘bad_alloc’.
The third behavior is that when POSIX captures are enabled, a message is issued saying that ‘implicit groupings’ are forbidden. This leads me to a theoretical enhancement that may bring a speed improvement and smaller results as well.
I ‘think’ that the implicit group rule is arbitrary. I understand why such a thing would exist, but there may be a way to accomplish both.
A named definition e.g.
num = [0-9]+;
num { return 1; }
is a way to facilitate reuse and readability, rather than …
[0-9]+ { return 1; }
Using a named definition has other benefits as well: conceptually, it also serves as a way to ‘group’ without using () and their defined POSIX capture meaning.
In the case of POSIX captures, I think enforcing explicit grouping has a wasteful side effect (e.g. it ‘must’ track the subpattern start and stop for every named definition).
Since it is possible to create a valid and unambiguous grammar without the extra explicit grouping (e.g. num = ([0-9]+); ), forcing the explicit grouping takes away some potential. Using () only where you want to explicitly gather the substring would reduce the size of the output, speed up processing, and shrink yypmatch to only what you actually want to save.
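To make the bookkeeping cost concrete, here is a minimal sketch in C. It uses the POSIX <regex.h> library rather than re2c, so it is only an analogy: it assumes that wrapping a named definition in () behaves like adding one more parenthesized group to a POSIX extended regex. The helper name count_groups is hypothetical.

```c
#include <regex.h>

/* Hypothetical helper: returns how many capture groups POSIX assigns
   to `pattern` (the re_nsub field), or -1 if it does not compile. */
static int count_groups(const char *pattern)
{
    regex_t re;
    int n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = (int)re.re_nsub;   /* one slot per '(' in the pattern */
    regfree(&re);
    return n;
}
```

With this helper, "([0-9]+) ([0-9]+)" reports 2 groups, while "((([0-9])+)) ((([0-9])+))", where every intermediate definition is also wrapped, reports 6, even though only two of those substrings are wanted.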
For example…
num = [0-9]+;
(num) ' ' (num) { return 1; }
Using () only when and where you want to capture gives full control and reduces the waste. Of course, this is a toy example, but you can see the ‘big’ negative effect even with a relatively small grammar (e.g. your unicode_indentifier.re example). Without -P it processes in seconds, but with -P (after you add the () to satisfy the program) it takes minutes. It also results in 6 for yynmatch rather than just the 1 that bounds the identifier, since it tracks all the subpatterns, which in this case are only there to facilitate the definition and are not desired ‘keeper’ subpatterns.
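As an illustration of ‘capture only where you ask for it’, here is a small sketch in C using the POSIX <regex.h> library (not re2c, so treat it as an analogy rather than a description of re2c internals). The pattern has parentheses only around the two numbers to keep, so only those two substrings are tracked besides the whole match. The helper name capture and the fixed limit of 8 slots are assumptions made for this example.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies capture group `group` of `pattern`,
   matched against `text`, into `out`. Returns 0 on success, -1 otherwise. */
static int capture(const char *pattern, const char *text, size_t group,
                   char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pm[8];      /* slot 0 is the whole match */
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (group < 8 && group <= re.re_nsub
        && regexec(&re, text, 8, pm, 0) == 0
        && pm[group].rm_so != -1) {
        size_t len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (len < out_size) {
            memcpy(out, text + pm[group].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Calling capture("([0-9]+) ([0-9]+)", "12 34", 1, buf, sizeof buf) fills buf with "12", and group 2 gives "34". Nothing else in the pattern needs a start and stop position.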
I love what you have done, and hopefully such a change is possible and can result in a dramatic speedup of both re2c processing and the runtime result.
Thanks for your consideration