RegularParser improvements #72

Merged · 6 commits · Dec 9, 2018
Conversation

@thunderer (Owner) commented Dec 8, 2018

Results are based on Blackfire runs of the test data provided in #71. Current master finishes in 12.5s and uses 514MiB of memory.

  • improved the lexer regex generator: non-token input parts are now reported as a single token instead of as per-character matches. Blackfire reports ~8s and 337MiB, respectively,
  • backtracks now rely on their offsets only instead of per-backtrack token storage. Blackfire reports an over 143x performance gain, ~85ms and ~9MiB respectively,
  • during the verification phase I accidentally replicated #70 ("On large HTML files, RegularParser does not fire any handler") and to fix it I needed to greatly simplify the lexer regex and switch the token comparison method from named capture groups to string comparison, which ate a significant part of the performance gains mentioned above. Still, the final result is 850ms and ~47MiB. Not as good as I hoped, but I'm still happy with it,
  • inlined the RegularParser::content() body inside RegularParser::shortcode(), which halves the nesting level requirement causing problems in #71 ("RegularParser = Uncaught Error: Maximum function nesting level of '256' reached, aborting!"),
  • there is no way of detecting the current nesting level and safely manipulating its value to avoid the \Error being thrown in the middle of the parsing process, therefore I disable the xdebug.max_nesting_level setting by setting it to -1 and restore its previous value afterwards. AFAIK PHP itself has no nesting level limit; it is strictly an XDebug mechanism, and the error is reproducible only when XDebug is enabled with its default nesting limit of 256. Disabling XDebug or increasing the nesting limit also works (the test data required 1024, or 512 after inlining content()),
  • the lexer was reverted to named capture groups; for some reason the slightly reordered alternations no longer cause memory limit errors.
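The first change above can be illustrated with a small sketch (Python for brevity, with hypothetical token names; the actual lexer is a generated PHP/PCRE regex and its token set differs). The key idea is a final catch-all alternative that greedily consumes runs of non-token characters, so plain text between shortcodes yields one token instead of one token per character:

```python
import re

# Hypothetical token types loosely mirroring a shortcode lexer. The
# last alternative, TEXT, matches the longest run of characters that
# cannot start any other token, so non-token input is reported as a
# single token rather than as per-character matches.
TOKENS = [
    ("OPEN",   r"\["),
    ("CLOSE",  r"\]"),
    ("SLASH",  r"/"),
    ("WS",     r"\s+"),
    ("STRING", r"[\w-]+"),
    ("TEXT",   r"[^\[\]/\w\s-]+"),  # catch-all for non-token runs
]
LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKENS))

def tokenize(text):
    # lastgroup is the name of the alternative that matched
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(text)]
```

With a per-character fallback (`.` instead of the `+`-quantified class), an input like `!!!` would produce three tokens; here it produces one, which is where the token-count reduction comes from.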

That's my kind of well spent Saturday afternoon (and Sunday).

… characters, added opcode optimized function aliases
…orage, profiler reports over 143x performance improvement for RegularParser, a well spent Saturday afternoon
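The offset-only backtracking behind the 143x figure can be sketched as follows (a Python illustration with hypothetical names; the library itself is PHP). Instead of copying consumed tokens into each backtrack so they can be replayed on failure, the parser remembers only its position in the token stream and rewinds to it, which is O(1) and allocation-free:

```python
class TokenStream:
    """Sketch of offset-based backtracking over a fixed token list."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.position = 0

    def save(self):
        # A backtrack point is just the current offset, not a token copy.
        return self.position

    def restore(self, offset):
        # Rewinding copies nothing; it only resets an integer.
        self.position = offset

    def next(self):
        token = self.tokens[self.position]
        self.position += 1
        return token
```

The old approach of storing tokens per backtrack scales with the amount of consumed input; storing an integer offset does not, which matches the reported drop from seconds to milliseconds.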
@thunderer (Owner, Author)

Unfortunately, after additional manual tests, I think I accidentally reproduced #70. For some reason there are only 9 shortcodes returned from #71's input. Still working on it.

@thunderer (Owner, Author)

I found and fixed #70: preg_match_all() silently failed and returned only a small part of the expected results. Unfortunately, the currently used lexer regex is way too complex for large inputs and the named-capture-based tokenizer won't work. Maybe PCRE2 in PHP 7.3 will help in some way, I will need to check that. I simplified the regex to complete bare bones (a basic alternation with grouped \s+ and [\w-]+) and the mechanism was switched to string-compare token matching, which ate a significant part of the "143x performance gain" mentioned above. Still, from the baseline we're at -93% in time (850ms) and -91% in memory (~47MiB).
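The bare-bones alternation plus string-comparison matching described above might look like this (a Python sketch with hypothetical names; the real tokenizer is PHP/PCRE). The regex stays trivial, and token types are assigned afterwards by comparing the matched string rather than by inspecting named capture groups:

```python
import re

# Hypothetical bare-bones lexer: whitespace, word-ish runs, or any
# single character. No named groups, so the pattern stays simple enough
# for very large inputs; token kinds are decided by string comparison.
LEXER = re.compile(r"\s+|[\w-]+|.")
SPECIAL = {"[": "OPEN", "]": "CLOSE", "/": "SLASH", "=": "EQUALS"}

def tokenize(text):
    tokens = []
    for match in LEXER.finditer(text):
        value = match.group()
        if value.isspace():
            kind = "WS"
        else:
            # string comparison instead of named capture groups
            kind = SPECIAL.get(value, "STRING")
        tokens.append((kind, value))
    return tokens
```

The trade-off matches the comment above: classifying each match in userland code is slower than letting the regex engine label it, but the simpler pattern avoids the silent failures on large inputs.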

… small part of expected results, named captures based tokenizer was replaced with string matching which eats significant portion of previous performance gains, minor RegularParser and Processor improvements
@rhukster commented Dec 9, 2018

Thanks for your efforts @thunderer; unfortunately, the loop/out-of-memory error has returned with the latest commit.

During this process, I noticed that xdebug was partially to blame. With xdebug on (how I run 99% of the time) I get this error, but with it off I don't. I think a lot of other people run with xdebug on, in their dev environments at least, hence the number of issues reported.

But with the latest commit, the original issue remains.

…alved, simplified lookahead() method body as type is always specified
… and restore it afterwards: there is no reliable way of detecting the current nesting level and safely manipulating its value in runtime to avoid process-breaking Error being thrown
…longer able to reproduce PCRE_JIT_STACKLIMIT_ERROR
@thunderer (Owner, Author)

@rhukster I think I've got both issues right this time, but this needs some explanation.

Nesting: You're right that the nesting level errors are XDebug's "fault", as PHP itself does not have such a limit. That's why I decided to disable xdebug.max_nesting_level by setting it to -1 at the beginning of RegularParser::parse() and restoring the previous value just before returning the result. I also implemented another change (inlining content()) which halved the required nesting level, but that's as far as I can go for now. This issue should be resolved.
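The shape of that fix can be shown with a Python analogue (the actual library does this in PHP with ini_set() around RegularParser::parse(); Python's nearest equivalent of a nesting limit is the recursion limit). The important part is raising the limit on entry and restoring the previous value on every exit path:

```python
import sys
from contextlib import contextmanager

# Python analogue of the save/override/restore dance described above.
# The finally block guarantees the previous limit comes back even if
# parsing raises, mirroring the restore-before-return in the PHP code.
@contextmanager
def raised_recursion_limit(limit):
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)
```

Usage would wrap the recursive descent: `with raised_recursion_limit(2048): parse(input)`.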

Memory: The memory limit is a different story. It is possible to change it at runtime (just like the nesting case above), but I don't think that's the right solution. I don't control the size or complexity of the target project's data, so I can only make the library use as little memory as possible, while being aware that users will always be able to construct an input large enough for it to fail. Your example from #71 uses ~47MiB, so it is well under the 128MiB limit reported in getgrav/grav-plugin-shortcode-core#53 and gantry/gantry5#2426. This issue should also be resolved now, but if you still experience memory limit issues, please provide the exact input and the amount of memory it takes to process it in your environment, so we can compare results and analyze where the differences come from. For large enough inputs the only solution is to increase the amount of available memory.
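For anyone wanting to produce the requested memory figure, here is one way to measure peak usage of a call, sketched in Python (in PHP one would typically report memory_get_peak_usage() after processing instead; the function name here is hypothetical):

```python
import tracemalloc

def peak_memory_mib(func, *args):
    """Run func(*args) and return its traced peak memory usage in MiB."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)
```

Reporting a number measured this way, together with the exact input, makes results from different environments directly comparable.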

@rhukster commented Dec 9, 2018

Tested and can confirm the latest changes resolve the loop/memory issues I was seeing for large HTML and also with XDebug on. Looks good to me!
