RegularParser improvements #72

Merged · 6 commits · Dec 9, 2018
Conversation

@thunderer (Owner) commented Dec 8, 2018

Results are based on Blackfire runs of the test data provided in #71. Current master finishes in 12.5s and uses 514MiB of memory.

  • improved the lexer regex generator: non-token input parts are now reported as a single token instead of as per-character matches. Blackfire reports ~8s and 337MiB, respectively,
  • backtracks now rely on their offsets only instead of per-backtrack token storage. Blackfire reports an over 143x performance gain, ~85ms and ~9MiB respectively,
  • during the verification phase I accidentally replicated #70 ("On large HTML files, RegularParser does not fire any handler") and to fix it I needed to greatly simplify the lexer regex and switch the token comparison method from named capture groups to string comparison, which ate a significant part of the performance gains mentioned above. Still, the final result is 850ms and ~47MiB. Not as good as I hoped, but I'm still happy with it,
  • inlined the RegularParser::content() body inside RegularParser::shortcode(), which halves the nesting level requirement causing problems in #71 ("RegularParser = Uncaught Error: Maximum function nesting level of '256' reached, aborting!"),
  • there is no way of detecting the current nesting level and safely manipulating its value to avoid the \Error being thrown in the middle of the parsing process, therefore I disable the xdebug.max_nesting_level setting by setting it to -1 and restore its previous value afterwards. AFAIK PHP itself has no nesting level limit; it is strictly an XDebug mechanism, and the error is reproducible only when XDebug is enabled with its default nesting limit of 256. Disabling XDebug or increasing the nesting limit also works (the test data required 1024, or 512 after inlining content()),
  • the lexer was reverted to named capture groups; for some reason the slightly reordered alternations no longer cause memory limit errors.
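The first change above can be illustrated with a small sketch (Python for brevity, with hypothetical token names; the actual lexer is a generated PHP/PCRE regex and its token set differs). The key idea is a final catch-all alternative that greedily consumes runs of non-token characters, so plain text between shortcodes yields one token instead of one token per character:

```python
import re

# Hypothetical token types loosely mirroring a shortcode lexer. The
# last alternative, TEXT, matches the longest run of characters that
# cannot start any other token, so non-token input is reported as a
# single token rather than as per-character matches.
TOKENS = [
    ("OPEN",   r"\["),
    ("CLOSE",  r"\]"),
    ("SLASH",  r"/"),
    ("WS",     r"\s+"),
    ("STRING", r"[\w-]+"),
    ("TEXT",   r"[^\[\]/\w\s-]+"),  # catch-all for non-token runs
]
LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKENS))

def tokenize(text):
    # lastgroup is the name of the alternative that matched
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(text)]
```

With a per-character fallback (`.` instead of the `+`-quantified class), an input like `!!!` would produce three tokens; here it produces one, which is where the token-count reduction comes from.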

That's my kind of well spent Saturday afternoon (and Sunday).

… characters, added opcode optimized function aliases
…orage, profiler reports over 143x performance improvement for RegularParser, a well spent Saturday afternoon
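The offset-only backtracking behind the 143x figure can be sketched as follows (a Python illustration with hypothetical names; the library itself is PHP). Instead of copying consumed tokens into each backtrack so they can be replayed on failure, the parser remembers only its position in the token stream and rewinds to it, which is O(1) and allocation-free:

```python
class TokenStream:
    """Sketch of offset-based backtracking over a fixed token list."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.position = 0

    def save(self):
        # A backtrack point is just the current offset, not a token copy.
        return self.position

    def restore(self, offset):
        # Rewinding copies nothing; it only resets an integer.
        self.position = offset

    def next(self):
        token = self.tokens[self.position]
        self.position += 1
        return token
```

The old approach of storing tokens per backtrack scales with the amount of consumed input; storing an integer offset does not, which matches the reported drop from seconds to milliseconds.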
@thunderer (Owner, Author)

Unfortunately, after additional manual tests, I think I accidentally reproduced #70. For some reason there are only 9 shortcodes returned from #71's input. Still working on it.

@thunderer (Owner, Author)

I found and fixed #70: preg_match_all() silently failed and returned only a small part of the expected results. Unfortunately, the currently used lexer regex is way too complex for large inputs and the named-capture-based tokenizer won't work. Maybe PCRE2 in PHP 7.3 will help in some way, I will need to check that. I simplified the regex to complete bare bones (a basic alternation with grouped \s+ and [\w-]+) and the mechanism was switched to string-compare token matching, which ate a significant part of the "143x performance gain" mentioned above. Still, from the baseline we're at -93% in time (850ms) and -91% in memory (~47MiB).
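The bare-bones alternation plus string-comparison matching described above might look like this (a Python sketch with hypothetical names; the real tokenizer is PHP/PCRE). The regex stays trivial, and token types are assigned afterwards by comparing the matched string rather than by inspecting named capture groups:

```python
import re

# Hypothetical bare-bones lexer: whitespace, word-ish runs, or any
# single character. No named groups, so the pattern stays simple enough
# for very large inputs; token kinds are decided by string comparison.
LEXER = re.compile(r"\s+|[\w-]+|.")
SPECIAL = {"[": "OPEN", "]": "CLOSE", "/": "SLASH", "=": "EQUALS"}

def tokenize(text):
    tokens = []
    for match in LEXER.finditer(text):
        value = match.group()
        if value.isspace():
            kind = "WS"
        else:
            # string comparison instead of named capture groups
            kind = SPECIAL.get(value, "STRING")
        tokens.append((kind, value))
    return tokens
```

The trade-off matches the comment above: classifying each match in userland code is slower than letting the regex engine label it, but the simpler pattern avoids the silent failures on large inputs.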

… small part of expected results, named captures based tokenizer was replaced with string matching which eats significant portion of previous performance gains, minor RegularParser and Processor improvements
@rhukster commented Dec 9, 2018

Thanks for your efforts @thunderer; unfortunately, the loop/out-of-memory error has returned with the latest commit.

During this process, I noticed that xdebug was partially to blame. With xdebug on (how I run 99% of the time) I get this error, but with it off I don't. I think a lot of other people run with xdebug on, in their dev environments at least, hence the number of issues reported.

But with the latest commit, the original issue remains.

…alved, simplified lookahead() method body as type is always specified
… and restore it afterwards: there is no reliable way of detecting the current nesting level and safely manipulating its value in runtime to avoid process-breaking Error being thrown
…longer able to reproduce PCRE_JIT_STACKLIMIT_ERROR
@thunderer (Owner, Author)

@rhukster I think I've got both issues right this time, but this needs some explanation.

Nesting: You're right that the nesting level errors are XDebug's "fault", as PHP itself does not have such a limit. That's why I decided to disable xdebug.max_nesting_level by setting it to -1 at the beginning of RegularParser::parse() and restoring the previous value just before returning the result. I also implemented another change (inlining content()) which halved the required nesting level, but that's as far as I can go for now. This issue should be resolved.
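The shape of that fix can be shown with a Python analogue (the actual library does this in PHP with ini_set() around RegularParser::parse(); Python's nearest equivalent of a nesting limit is the recursion limit). The important part is raising the limit on entry and restoring the previous value on every exit path:

```python
import sys
from contextlib import contextmanager

# Python analogue of the save/override/restore dance described above.
# The finally block guarantees the previous limit comes back even if
# parsing raises, mirroring the restore-before-return in the PHP code.
@contextmanager
def raised_recursion_limit(limit):
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)
```

Usage would wrap the recursive descent: `with raised_recursion_limit(2048): parse(input)`.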

Memory: The memory limit is a different story. It is possible to change it at runtime (just like the nesting case above), but I don't think that's the right solution. I don't control the size or complexity of the target project's data, so I can only make the library use as little memory as possible, while being aware that users will always be able to construct an input large enough for it to fail. Your example from #71 uses ~47MiB, so it is well under the 128MiB limit reported in getgrav/grav-plugin-shortcode-core#53 and gantry/gantry5#2426. This issue should also be resolved now, but if you still experience memory limit issues, please provide the exact input and the amount of memory it takes to process it in your environment, so we can compare results and analyze where the differences come from. For large enough inputs the only solution is to increase the amount of available memory.
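For anyone wanting to produce the requested memory figure, here is one way to measure peak usage of a call, sketched in Python (in PHP one would typically report memory_get_peak_usage() after processing instead; the function name here is hypothetical):

```python
import tracemalloc

def peak_memory_mib(func, *args):
    """Run func(*args) and return its traced peak memory usage in MiB."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)
```

Reporting a number measured this way, together with the exact input, makes results from different environments directly comparable.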

@rhukster commented Dec 9, 2018

Tested and can confirm the latest changes resolve the loop/memory issues I was seeing for large HTML and also with XDebug on. Looks good to me!
