Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PRCE error handling #67

Closed
nlemoine opened this issue Mar 9, 2022 · 5 comments
Closed

PRCE error handling #67

nlemoine opened this issue Mar 9, 2022 · 5 comments

Comments

@nlemoine
Copy link

nlemoine commented Mar 9, 2022

On a legacy project, I'm facing issues with some client who (I don't really know how) has inserted base64 encoded images.

Further more, those images aren't optimized at all (uncompressed PNG for photos...), resulting in huge base64 strings (multiple images up to 2mo):

<img src="....." />
<!-- base64 string is shortened for clarity 😅  -->

What is this feature about (expected vs actual behaviour)?

When minifying such HTML, I end up with an empty string because this replacement fails:

$html = (string) \preg_replace_callback(
'#<([^/\s<>!]+)(?:\s+([^<>]*?)\s*|\s*)(/?)>#',
static function ($matches) {
return '<' . $matches[1] . \preg_replace('#([^\s=]+)(=([\'"]?)(.*?)\3)?(\s+|$)#su', ' $1$2', $matches[2]) . $matches[3] . '>';
},
$html
);

Because it exceeds the pcre.backtrack-limit and returns a PREG_BACKTRACK_LIMIT_ERROR.

I'm fully aware this is all wrong: wrong image format for photo and more importantly, large images sources should be inserted as URLs.

However, technically speaking, all this crap is still valid HTML (the page displays when unminified) and I think such cases (even if it's kind of an edge case) should not fail.

Thus, I wonder if a sanity check should be inserted somewhere to prevent the script from failing and returning an empty string? Maybe catch errors on all preg_ calls and return the unminified HTML if an error occured? Or check the string length against the pcre.backtrack-limit value?

Wrapping preg_ calls the Guzzle json_decode way could be a nice solution (because it would not only handle backtrack-limit errors but all kinds of preg_ errors)?
https://github.com/guzzle/guzzle/blob/74ca2cb463a7a99a0b99f195ca809cc4ba6c3147/src/Utils.php#L281-L301

try {
    $html = (string) Utils::preg_replace_callback(
        '#<([^/\s<>!]+)(?:\s+([^<>]*?)\s*|\s*)(/?)>#',
        static function ($matches) {
            return '<' . $matches[1] . \preg_replace('#([^\s=]+)(=([\'"]?)(.*?)\3)?(\s+|$)#su', ' $1$2', $matches[2]) . $matches[3] . '>';
        },
        $html
    );
} catch (\Exception $e) {
    return $html;
}

What do you think?

How can I reproduce it?

Try to minify some HTML with length that exceeds the pcre.backtrack-limit.

Does it take minutes, hours or days to fix?

I think a couple hours should be enough, I can submit a PR but would like your feedback on the different proposed solutions before working something out.

voku added a commit that referenced this issue Mar 9, 2022
@voku
Copy link
Owner

voku commented Mar 9, 2022

Hi, thanks for the good bug report. :) The problem is *? here, I already had similar problems with another library here (voku/anti-xss@f8a2eef) so I added the same fix here, what do you think about it? 6d26d4f#diff-379408ffd163375efe70c39a5decbe4b7abfbea017cabe052a8b404ed969bfd4

https://regex101.com/r/7EBtDM/1
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)

voku added a commit that referenced this issue Mar 9, 2022
@nlemoine
Copy link
Author

nlemoine commented Mar 9, 2022

Thanks for your fast feedback! 🙂 Looks like a smart solution, works for me!

@nlemoine nlemoine closed this as completed Mar 9, 2022
@nlemoine
Copy link
Author

nlemoine commented Mar 9, 2022

Will you release a tag for this fix? Just so I can adjust my constraint setting in my composer.json file.

@voku
Copy link
Owner

voku commented Mar 9, 2022

Version 4.4.9 with upstream fixes from "voku/simple_html_dom" are released.

@nlemoine
Copy link
Author

nlemoine commented Mar 9, 2022

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants