Optimize PHP html_entity_decode function #18092

ArtUkrainskiy · 2025-03-16T17:38:04Z

Improvements affect the C function traverse_for_entities:

Use memchr to search for '&' instead of scanning character by character.
Use memchr to locate ';' to determine potential entity boundaries instead of process_named_entity_html, avoiding unnecessary per-character validations.
Use memcpy instead of character-by-character copying.
Refactor code for improved structure and readability.
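
For readers who want to see the shape of the approach, here is a minimal sketch of the memchr/memcpy scanning pattern described above. It is not the code from this PR: decode_sketch, struct entity and the three-entry ENTITIES table are hypothetical names invented for illustration, and the real traverse_for_entities handles numeric entities, charsets and doctype flags that are ignored here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical three-entry table; the real decoder supports far more. */
struct entity { const char *name; size_t name_len; char ch; };
static const struct entity ENTITIES[] = {
    { "amp", 3, '&' }, { "lt", 2, '<' }, { "gt", 2, '>' },
};

/* Decode `in` (length `len`) into `out`, which must hold len + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL. */
static size_t decode_sketch(const char *in, size_t len, char *out)
{
    const char *p = in, *end = in + len;
    char *o = out;

    while (p < end) {
        /* Jump straight to the next '&' instead of testing every byte. */
        const char *amp = memchr(p, '&', (size_t)(end - p));
        if (amp == NULL) {
            break;
        }
        /* Bulk-copy the literal run that precedes the '&'. */
        memcpy(o, p, (size_t)(amp - p));
        o += amp - p;

        /* Look for the ';' that would close a candidate entity. */
        const char *semi = memchr(amp + 1, ';', (size_t)(end - (amp + 1)));
        char decoded = 0;
        if (semi != NULL) {
            size_t name_len = (size_t)(semi - (amp + 1));
            for (size_t i = 0; i < sizeof ENTITIES / sizeof ENTITIES[0]; i++) {
                if (ENTITIES[i].name_len == name_len &&
                    memcmp(amp + 1, ENTITIES[i].name, name_len) == 0) {
                    decoded = ENTITIES[i].ch;
                    break;
                }
            }
        }
        if (decoded) {
            *o++ = decoded;   /* known entity: emit one byte, skip past ';' */
            p = semi + 1;
        } else {
            *o++ = '&';       /* not an entity: keep the '&' literally */
            p = amp + 1;
        }
    }
    /* Copy whatever follows the last '&' (or the whole input if none). */
    memcpy(o, p, (size_t)(end - p));
    o += end - p;
    *o = '\0';
    return (size_t)(o - out);
}
```

Calling decode_sketch on "a &amp; b &bogus; c&" writes "a & b &bogus; c&" to the output buffer. Note that this sketch searches for ';' all the way to the end of the input after every '&'; a production version would presumably bound that search.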

Benchmark for 4K-character strings:

--------------------------------------------------------------------------------------------------
|                  Test | html_entity_decode avg(ns) | html_entity_decode_new avg(ns) |  diff(%) |
--------------------------------------------------------------------------------------------------
| 200 valid entity      |                       5640 |                           2607 |  116.34% |
--------------------------------------------------------------------------------------------------
| 200 invalid entity    |                       5294 |                           2094 |  152.82% |
--------------------------------------------------------------------------------------------------
| 200 &                 |                       4016 |                           1754 |  128.96% |
--------------------------------------------------------------------------------------------------
| String ends with &    |                       2294 |                            170 | 1249.41% |
--------------------------------------------------------------------------------------------------
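
The diff(%) column is not defined in the description. The figures look consistent with a relative speedup of (old - new) / new * 100, which is an inference from the numbers rather than something the author states. A small sketch of that calculation, with speedup_percent as a hypothetical helper name:

```c
/* Relative speedup in percent: how much longer the old run takes
 * compared with the new one. Assumed formula, inferred from the table. */
static double speedup_percent(double old_ns, double new_ns)
{
    return (old_ns - new_ns) / new_ns * 100.0;
}
```

With the 5294 ns and 2094 ns timings this gives roughly 152.82, and with 2294 ns and 170 ns roughly 1249.41, matching the values reported for those rows.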

All tests pass.

Girgias · 2025-03-17T13:49:57Z

ext/standard/html.c

+    char *output_ptr        = ZSTR_VAL(output);
+    int doctype             = flags & ENT_HTML_DOC_TYPE_MASK;
+
+    assert(*input_end == '\0');


If you change the input parameter to a zend_string* you'd be guaranteed this. Moreover, please use ZEND_ASSERT() instead.

Girgias · 2025-03-17T13:50:15Z

ext/standard/html.c

+
+        unsigned code = 0, code2 = 0;
+        const char *entity_end_ptr = NULL;
+        int valid_entity = 1;


Suggested change

-        int valid_entity = 1;
+        bool valid_entity = true;

bukka · 2025-03-17T14:18:21Z

@Girgias are you going to review the logic as well? Just checking if I should look into this or if you are happy to handle it all?

Girgias · 2025-03-17T14:23:11Z

> @Girgias are you going to review the logic as well? Just checking if I should look into this or if you are happy to handle it all?

Please do review the logic, I only had a cursory glance :)

bukka · 2025-03-17T14:32:04Z

Ok I will check it out next week if no one is quicker.

ArtUkrainskiy requested a review from bukka as a code owner March 16, 2025 17:38

github-actions bot added the Extension: standard label Mar 16, 2025

ArtUkrainskiy closed this Mar 16, 2025

ArtUkrainskiy reopened this Mar 16, 2025

Refactor traverse_for_entities (used in unescape_html_entities): Optimize scanning for '&' and ';' using memchr Use memcpy instead of character-by-character copying language

66f5709

ArtUkrainskiy force-pushed the html_entity_decode/improve-memchr branch from d166abe to 66f5709 Compare March 16, 2025 17:52

Girgias reviewed Mar 17, 2025

CR, refactoring, codestyle

f093c30

fix logic

ArtUkrainskiy force-pushed the html_entity_decode/improve-memchr branch from 9b3e96d to f093c30 Compare March 17, 2025 16:30
