Optimize PHP html_entity_decode function #18092

Open
wants to merge 2 commits into
base: master

ArtUkrainskiy

Improvements affect the C function traverse_for_entities:

  • Use memchr to search for '&' instead of scanning character by character.
  • Use memchr to locate ';' and find candidate entity boundaries instead of relying on process_named_entity_html, which avoids unnecessary per-character validation.
  • Use memcpy instead of character-by-character copying.
  • Refactor code for improved structure and readability.
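
The scanning strategy in the list above can be illustrated with a small standalone sketch. This is not the code from the PR and not PHP's traverse_for_entities: the function name decode_sketch, the entity struct, and the four-entry table are hypothetical stand-ins, and numeric entities and doctype handling are left out. It only shows the shape of the approach: memchr to jump to the next '&', memchr to find a candidate ';', and memcpy to copy the plain text in between.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical four-entry table; PHP's real entity lookup is far larger. */
struct entity { const char *name; size_t len; char value; };
static const struct entity entities[] = {
    { "amp", 3, '&' }, { "lt", 2, '<' }, { "gt", 2, '>' }, { "quot", 4, '"' }
};

/* Decodes src[0..len) into dst, which must hold at least len + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL. */
size_t decode_sketch(const char *src, size_t len, char *dst)
{
    const char *p = src;
    const char *end = src + len;
    char *out = dst;

    while (p < end) {
        /* Jump straight to the next '&' instead of testing every byte. */
        const char *amp = memchr(p, '&', (size_t)(end - p));
        if (amp == NULL) {
            break; /* no '&' left: the tail is copied below */
        }

        /* Bulk-copy the plain text in front of the '&'. */
        memcpy(out, p, (size_t)(amp - p));
        out += amp - p;
        p = amp;

        /* Look for a ';' that could close the entity. */
        const char *name = amp + 1;
        const char *semi = memchr(name, ';', (size_t)(end - name));
        if (semi == NULL) {
            break; /* no ';' anywhere after this '&': nothing left to decode */
        }

        size_t name_len = (size_t)(semi - name);
        int matched = 0;
        for (size_t i = 0; i < sizeof(entities) / sizeof(entities[0]); i++) {
            if (entities[i].len == name_len
                && memcmp(name, entities[i].name, name_len) == 0) {
                *out++ = entities[i].value;
                p = semi + 1; /* skip past the whole "&name;" */
                matched = 1;
                break;
            }
        }
        if (!matched) {
            *out++ = '&'; /* not an entity: keep the '&' and move on */
            p = amp + 1;
        }
    }

    /* Copy whatever is left after the last decoded position. */
    memcpy(out, p, (size_t)(end - p));
    out += end - p;
    *out = '\0';
    return (size_t)(out - dst);
}
```

With this sketch, an input that ends in a lone '&' leaves the loop after two memchr calls and one memcpy, which is the kind of case the "String ends with &" benchmark row targets.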

Benchmark for 4K-character strings:

--------------------------------------------------------------------------------------------------
|                  Test | html_entity_decode avg(ns) | html_entity_decode_new avg(ns) |  diff(%) |
--------------------------------------------------------------------------------------------------
| 200 valid entities    |                       5640 |                           2607 |  146.34% |
--------------------------------------------------------------------------------------------------
| 200 invalid entities  |                       5294 |                           2094 |  152.82% |
--------------------------------------------------------------------------------------------------
| 200 &                 |                       4016 |                           1754 |  128.96% |
--------------------------------------------------------------------------------------------------
| String ends with &    |                       2294 |                            170 | 1249.41% |
--------------------------------------------------------------------------------------------------

All tests pass!

…mize scanning for '&' and ';' using memchr

Use memcpy instead of character-by-character copying

@ArtUkrainskiy force-pushed the html_entity_decode/improve-memchr branch from d166abe to 66f5709 on March 16, 2025 17:52
char *output_ptr = ZSTR_VAL(output);
int doctype = flags & ENT_HTML_DOC_TYPE_MASK;

assert(*input_end == '\0');
Member

If you change the input parameter to a zend_string* you'd be guaranteed this. Moreover, please use ZEND_ASSERT() instead.
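
As a rough illustration of the point above, the sketch below uses a stand-in length-carrying string type. The names lp_string, lp_string_new, and count_ampersands are hypothetical and are not PHP's zend_string API; the plain assert() stands where php-src code would use ZEND_ASSERT(). The idea is that when the only constructor always writes the terminator, any function receiving the string type can rely on it, and the assertion merely documents the invariant.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a length-carrying string; NOT PHP's zend_string. */
typedef struct {
    size_t len;
    char val[]; /* always holds len bytes plus a terminating NUL */
} lp_string;

/* The only constructor writes the terminator, so every lp_string has one. */
lp_string *lp_string_new(const char *data, size_t len)
{
    lp_string *s = malloc(sizeof(*s) + len + 1);
    if (s == NULL) {
        return NULL;
    }
    s->len = len;
    memcpy(s->val, data, len);
    s->val[len] = '\0';
    return s;
}

/* Counts '&' bytes. Taking the string type rather than a raw pointer pair
 * means the terminator is guaranteed by construction. */
size_t count_ampersands(const lp_string *input)
{
    const char *p = input->val;
    const char *end = input->val + input->len;
    size_t n = 0;

    assert(*end == '\0'); /* in php-src this would be ZEND_ASSERT(...) */

    while ((p = memchr(p, '&', (size_t)(end - p))) != NULL) {
        n++;
        p++; /* continue the search after this '&' */
    }
    return n;
}
```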


unsigned code = 0, code2 = 0;
const char *entity_end_ptr = NULL;
int valid_entity = 1;
Member

Suggested change
- int valid_entity = 1;
+ bool valid_entity = true;

@bukka
Member

bukka commented Mar 17, 2025

@Girgias are you going to review the logic as well? Just checking if I should look into this or if you are happy to handle it all?

@Girgias
Member

Girgias commented Mar 17, 2025

> @Girgias are you going to review the logic as well? Just checking if I should look into this or if you are happy to handle it all?

Please do review the logic, I only had a cursory glance :)

@bukka
Member

bukka commented Mar 17, 2025

Ok I will check it out next week if no one is quicker.

@ArtUkrainskiy force-pushed the html_entity_decode/improve-memchr branch from 9b3e96d to f093c30 on March 17, 2025 16:30
3 participants