Optimization for htmlspecialchars function #18126

Open: 2 commits to be merged into base: master
@ArtUkrainskiy ArtUkrainskiy commented Mar 21, 2025

Optimization for htmlspecialchars function.

This PR adds a dedicated php_htmlspecialchars function, used instead of the “universal” php_escape_html_entities_ex.
Because we work only with ASCII-compatible encodings, we can scan byte by byte and use a lookup table to identify special characters. For c < 0x80, the lookup table is used; for potentially multi-byte characters, we continue to rely on get_next_char.
This approach gives a noticeable performance improvement for ASCII strings and a smaller improvement for multi-byte strings, thanks to the more optimized logic.
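
To illustrate the idea only, here is a minimal sketch of a lookup-table scan. It is not the code from this PR: `escape_table` and `sketch_escape` are hypothetical names, the replacements assume both quote characters are escaped, and bytes >= 0x80 are simply copied through, whereas the PR hands them to get_next_char.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: replacement text for each ASCII byte, NULL if the
 * byte needs no escaping. Unlisted entries are zero-initialized (NULL). */
static const char *const escape_table[128] = {
    ['&']  = "&amp;",
    ['<']  = "&lt;",
    ['>']  = "&gt;",
    ['"']  = "&quot;",
    ['\''] = "&#039;",
};

/* Escapes len bytes of src into dst and returns the number of bytes
 * written, not counting the terminating NUL.
 * dst must hold at least len * 6 + 1 bytes (longest replacement is 6). */
size_t sketch_escape(const unsigned char *src, size_t len, char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        /* One table load decides whether an ASCII byte is special.
         * Simplification: bytes >= 0x80 are copied without validation. */
        const char *rep = (c < 0x80) ? escape_table[c] : NULL;

        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(dst + out, rep, n);
            out += n;
        } else {
            dst[out++] = (char)c;
        }
    }
    dst[out] = '\0';
    return out;
}
```

For a clean ASCII input the loop does one table load and one byte store per character, which is consistent with the "Clean ASCII string" case showing the largest gain.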

Important!
We have changed the entity validation logic so that the length of an entity name can no longer exceed LONGEST_ENTITY_LENGTH.

Benchmarks on 4k-character strings:
htmlspecialchars with UTF-8 encoding

 ---------------------------------------------------------------------------------------------------------
 |                      Test |  htmlspecialchars avg(ns) | htmlspecialchars_new avg(ns) |        diff(%) |
 ---------------------------------------------------------------------------------------------------------
 | 300 specialchars          |                     40657 |                        11559 |        251.73% |
 ---------------------------------------------------------------------------------------------------------
 | 300 spchrs. with entities |                     40986 |                        11550 |        254.86% |
 ---------------------------------------------------------------------------------------------------------
 | Clean ASCII string        |                     38348 |                         7401 |        418.15% |
 ---------------------------------------------------------------------------------------------------------
 | Cyrillic                  |                     50003 |                        38302 |         30.55% |
 ---------------------------------------------------------------------------------------------------------
 | Chinese                   |                     55860 |                        46308 |         20.63% |
 ---------------------------------------------------------------------------------------------------------
 | Japanese                  |                     56832 |                        46991 |         20.94% |
 ---------------------------------------------------------------------------------------------------------
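
The diff(%) column is not defined in the text. The rows are consistent with (old - new) / new, i.e. how much longer the old function takes relative to the new one. A small sketch under that assumption (`diff_percent` is a hypothetical helper, not part of the benchmark code):

```c
/* Assumed definition of diff(%): extra time taken by the old function,
 * as a percentage of the new function's time. */
double diff_percent(double old_ns, double new_ns)
{
    return (old_ns - new_ns) / new_ns * 100.0;
}
```

With the "300 specialchars" row, diff_percent(40657, 11559) comes out at roughly 251.73, matching the table. In other words, the new function is about 3.5 times as fast on that input.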

htmlspecialchars with language-specific encodings, without an encoding hint

------------------------------------------------------------------------------------------------------
|                      Test  |  htmlspecialchars avg(ns) | htmlspecialchars_new avg(ns) |    diff(%) |
-----------------------------------------------------------------------------------------------------
| Cyrillic CP1251            |                     41518 |                        35148 |     18.12% |
-----------------------------------------------------------------------------------------------------
| Chinese  Big5              |                     57282 |                        47005 |     21.86% |
-----------------------------------------------------------------------------------------------------
| Japanese SJIS              |                     54703 |                        49887 |      9.65% |
-----------------------------------------------------------------------------------------------------

htmlspecialchars with language-specific encodings, with an encoding hint

------------------------------------------------------------------------------------------------------
|                      Test  |  htmlspecialchars avg(ns) | htmlspecialchars_new avg(ns) |    diff(%) |
-----------------------------------------------------------------------------------------------------
| Cyrillic CP1251            |                     37185 |                        29197 |     27.36% |
-----------------------------------------------------------------------------------------------------
| Chinese  Big5              |                     48877 |                        38191 |     27.98% |
-----------------------------------------------------------------------------------------------------
| Japanese SJIS              |                     46282 |                        38317 |     20.79% |
-----------------------------------------------------------------------------------------------------

More benchmarks may be needed.
This is not the final optimization: get_next_char remains suboptimal for this function because it performs extra character-detection steps that aren’t required under the default flags.

The new htmlspecialchars function respects the maximum entity size, defined as LONGEST_ENTITY_LENGTH.
The HTML and XML specifications set no strict limit on the length of a numeric entity,
but in practice the longest meaningful one is &#x10FFFF;, which takes up 10 characters.
Any numeric entity longer than this is effectively invalid and will not be processed by browsers.
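
As a rough illustration of a length-bounded check for numeric entities, the sketch below stops scanning once the 10-character limit mentioned above is reached. It is an assumption-laden example, not the validation code from this PR: the constant and function names are hypothetical, and the 10-byte bound is taken from the &#x10FFFF; argument rather than from the value of LONGEST_ENTITY_LENGTH.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical constant: byte length of "&#x10FFFF;". */
#define SKETCH_MAX_NUMERIC_ENTITY_LEN 10

/* Returns the byte length of a numeric entity at the start of s,
 * or 0 if there is no well-formed one within the length limit. */
size_t sketch_numeric_entity_len(const char *s, size_t len)
{
    /* The shortest candidate is "&#1;" (4 bytes). */
    if (len < 4 || s[0] != '&' || s[1] != '#') {
        return 0;
    }

    size_t i = 2;
    int hex = 0;
    if (s[i] == 'x' || s[i] == 'X') {
        hex = 1;
        i++;
    }

    size_t digits_start = i;
    size_t limit = len < SKETCH_MAX_NUMERIC_ENTITY_LEN
                       ? len : SKETCH_MAX_NUMERIC_ENTITY_LEN;

    /* Never look past the limit, so a stray '&' costs bounded work. */
    while (i < limit) {
        unsigned char c = (unsigned char)s[i];
        if (c == ';') {
            /* At least one digit is required before the ';'. */
            return (i > digits_start) ? i + 1 : 0;
        }
        if (hex ? !isxdigit(c) : !isdigit(c)) {
            return 0;
        }
        i++;
    }
    return 0; /* no ';' within the limit */
}
```

Under these assumptions, "&#x10FFFF;" is accepted with length 10, while "&#x10FFFFF;" is rejected because its terminator lies beyond the limit. The sketch checks length and digit syntax only; it does not verify that the value is a valid code point.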
@ArtUkrainskiy ArtUkrainskiy marked this pull request as ready for review March 21, 2025 14:45
@ArtUkrainskiy ArtUkrainskiy requested a review from bukka as a code owner March 21, 2025 14:45