Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YAML] Improve performance of YAML parser #34058

Merged

Conversation

@NamelessCoder
Copy link
Contributor

NamelessCoder commented Oct 22, 2019

Q A
Branch? master
Bug fix? no
New feature? no
Deprecations? no
License MIT

Optimise various methods and conditions to use best
performing alternatives where possible. Roughly:

  • Uses methods that do not copy memory, e.g. strncmp
    as alternative for strpos matching beginning of string.
  • Switches order of some conditions to put the cheapest
    checks first in order.
  • Checks input before calling trim() - despite the function
    returning the same string as input, it still costs memory
    and introduces unnecessary overhead.
  • Extracts variables for repeated identical function calls.
  • Uses negative substring offsets instead of strlen + substr.
  • Replaces single-char substr usages with substring access.

Profiling method

Profiled using a custom script which splits and parses all provided fixture files from the YAML component's test directory, then profiled this through Blackfire and identified frequent method calls.

Refactoring strategy

Most important change: switching strpos to strncmp to avoid scanning a full (and potentially very long) YAML line for occurrence of a substring.

Whenever possible, I've gone for PHP methods that do not copy memory and replaced some instances of function calls which could be replaced with substring access.

In methods which are called frequently I've gone for guard clauses to prevent further processing if a YAML line is, for example, empty. Such as avoiding trim() on already empty lines.

Profiling results

A Blackfire profiling delta can be seen on https://blackfire.io/profiles/compare/90fd3005-8b9f-4534-8bd8-1e66832bf247/graph. Taken with 200 samples which render every YAML fixture from the component's test dir.

@nicolas-grekas

This comment has been minimized.

Copy link
Member

nicolas-grekas commented Oct 22, 2019

Can you please ensure you had opcache enabled when doing the profiling?
On the cli it's disabled by default, so this change might still be nice for the console, but maybe it changes little when opcache is used? I'd like to know :)
opcache.enable_cli=1

@nicolas-grekas nicolas-grekas added this to the next milestone Oct 22, 2019
@NamelessCoder

This comment has been minimized.

Copy link
Contributor Author

NamelessCoder commented Oct 22, 2019

Absolutely, I will do some further profiling and maybe throw some more complex files at it to test the difference in long lines particularly.

And fix the various style/test problems, of course ;)

src/Symfony/Component/Yaml/Inline.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Inline.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Inline.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Parser.php Outdated Show resolved Hide resolved
@NamelessCoder

This comment has been minimized.

Copy link
Contributor Author

NamelessCoder commented Oct 22, 2019

Can you please ensure you had opcache enabled when doing the profiling?

The difference is much less pronounced with opcache enabled, but should increase with YAML data sets containing long lines.

https://blackfire.io/profiles/compare/4db7ae64-4fa7-4102-b6e3-16aea9b8c9e9/graph

(profile done with fixes for issues mentioned by @stof applied, which has increased the calls to strncmp - still working on resolving 12 failing tests)

@NamelessCoder NamelessCoder force-pushed the NamelessCoder:features/yaml-performance branch 2 times, most recently from bea0fd7 to 70e3a6c Oct 23, 2019
@NamelessCoder

This comment has been minimized.

Copy link
Contributor Author

NamelessCoder commented Oct 23, 2019

Updated profile comparison:

https://blackfire.io/profiles/compare/60e240e4-6fc6-4101-ba57-2eb7d3acf05b/graph

With opcache the optimisations don't have as much impact, however, the costs incurred by strpos vs. strncmp are visible even with the current test fixtures which don't contain long strings. Profiling continues here.

Hopefully I've taken care of all the review comments raised.

Should you decide against the PR on the grounds that it moves a lot of code for a relatively small benefit I will of course understand. I'm hoping to prove a more pronounced difference with some more real-life-like fixtures.

(thanks btw everyone for a good experience with my first Symfony PR!)

src/Symfony/Component/Yaml/Inline.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Parser.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Parser.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Parser.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Parser.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Parser.php Outdated Show resolved Hide resolved
src/Symfony/Component/Yaml/Parser.php Show resolved Hide resolved
@nicolas-grekas

This comment has been minimized.

Copy link
Member

nicolas-grekas commented Dec 11, 2019

(rebase + retarget needed for master)

@nicolas-grekas nicolas-grekas changed the base branch from 4.4 to master Feb 7, 2020
@nicolas-grekas nicolas-grekas force-pushed the NamelessCoder:features/yaml-performance branch from 70e3a6c to 806f76a Feb 7, 2020
Optimise various methods and conditions to use best
performing alternatives where possible. Roughly:

* Uses methods that do not copy memory, e.g. strncmp
  as alternative for strpos matching beginning of string.
* Switches order of some conditions to put the cheapest
  checks first in order.
* Checks input before calling trim() - despite the function
  returning the same string as input, it still costs memory
  and introduces unnecessary overhead.
* Extracts variables for repeated identical function calls.
* Uses negative substring offsets instead of strlen + substr.
* Replaces single-char substr usages with substring access.
@nicolas-grekas nicolas-grekas force-pushed the NamelessCoder:features/yaml-performance branch from 806f76a to 7a7c966 Feb 7, 2020
Copy link
Member

nicolas-grekas left a comment

I applied @xabbuh suggestions.

@nicolas-grekas

This comment has been minimized.

Copy link
Member

nicolas-grekas commented Feb 7, 2020

Thank you @NamelessCoder.

nicolas-grekas added a commit that referenced this pull request Feb 7, 2020
This PR was merged into the 5.1-dev branch.

Discussion
----------

[YAML] Improve performance of YAML parser

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | no
| Deprecations? | no
| License       | MIT

Optimise various methods and conditions to use best
performing alternatives where possible. Roughly:

* Uses methods that do not copy memory, e.g. strncmp
  as alternative for strpos matching beginning of string.
* Switches order of some conditions to put the cheapest
  checks first in order.
* Checks input before calling trim() - despite the function
  returning the same string as input, it still costs memory
  and introduces unnecessary overhead.
* Extracts variables for repeated identical function calls.
* Uses negative substring offsets instead of strlen + substr.
* Replaces single-char substr usages with substring access.

Profiling method
-----------------

Profiled using a custom script which splits and parses all provided `fixture` files from the YAML component's test directory, then profiled this through Blackfire and identified frequent method calls.

Refactoring strategy
--------------------

Most important change: switching strpos to strncmp to avoid scanning a full (and potentially very long) YAML line for occurrence of a substring.

Whenever possible, I've gone for PHP methods that do not copy memory and replaced some instances of function calls which could be replaced with substring access.

In methods which are called frequently I've gone for guard clauses to prevent further processing if a YAML line is, for example, empty. Such as avoiding trim() on already empty lines.

Profiling results
----------------

A Blackfire profiling delta can be seen on https://blackfire.io/profiles/compare/90fd3005-8b9f-4534-8bd8-1e66832bf247/graph. Taken with 200 samples which render every YAML fixture from the component's test dir.

Commits
-------

7a7c966 [YAML] Improve performance of YAML parser
@nicolas-grekas nicolas-grekas merged commit 7a7c966 into symfony:master Feb 7, 2020
1 of 3 checks passed
1 of 3 checks passed
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
fabbot.io Your code looks good.
Details
@NamelessCoder

This comment has been minimized.

Copy link
Contributor Author

NamelessCoder commented Feb 7, 2020

Thank you so much for driving this one home @nicolas-grekas, I'm particularly happy to see the null-coalesce changes!

@NamelessCoder NamelessCoder deleted the NamelessCoder:features/yaml-performance branch Feb 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

7 participants
You can’t perform that action at this time.