preg_match_all returns false with PREG_BAD_UTF8_ERROR (4) #105

zedzedzed · 2015-09-02T02:16:40Z

preg_match_all with the u switch returns false when there are invalid UTF-8 characters in the pattern or subject. Matching stops at that point, so there are no matches and hence no TOC for the page. In every case seen so far, the cause has been bad characters in the subject.

Is there a way to suppress the error and continue regardless?
Is there a WordPress core function that may be useful to filter the_content?
Why isn't it failing for other WordPress core things, considering it is the_content after all?
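
One possible way to "continue regardless", sketched under assumptions: detect the failure with preg_last_error() and retry on a scrubbed copy of the subject. The function name toc_safe_match_all and the sample content are hypothetical and not part of the plugin. The scrubbing step relies on mb_convert_encoding() replacing invalid byte sequences with its substitute character when converting UTF-8 to UTF-8. WordPress core also ships wp_check_invalid_utf8(), which may be worth evaluating as a filter on the_content, but it is not used here so the sketch runs outside WordPress.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): run preg_match_all() and,
 * if the subject is rejected as invalid UTF-8, retry on a scrubbed copy.
 */
function toc_safe_match_all( $pattern, $subject, &$matches ) {
    $result = preg_match_all( $pattern, $subject, $matches );

    if ( false === $result && PREG_BAD_UTF8_ERROR === preg_last_error() ) {
        // Converting UTF-8 to UTF-8 makes mbstring replace each invalid
        // byte sequence with its substitute character ("?" by default).
        $scrubbed = mb_convert_encoding( $subject, 'UTF-8', 'UTF-8' );
        $result   = preg_match_all( $pattern, $scrubbed, $matches );
    }

    return $result;
}

// "\xE9" is a stray Latin-1 byte that makes the whole subject invalid UTF-8.
$content = "<h2>First</h2><p>caf\xE9</p><h2>Second</h2>";
$count   = toc_safe_match_all( '/<h2>(.*?)<\/h2>/u', $content, $matches );
```

The scrubbing cost is only paid when the first attempt fails, so valid pages are matched exactly as before.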

zedzedzed · 2015-09-02T02:18:15Z

In extract_headings, you can test the subject during debugging with:
var_dump( mb_check_encoding( $content, 'UTF-8' ) );

zedzedzed · 2015-09-02T03:36:38Z

http://php.net/manual/en/reference.pcre.pattern.modifiers.php

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
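
A minimal sketch of the behaviour described in that excerpt, assuming a simplified heading pattern and made-up sample strings (neither comes from the plugin). It runs the same pattern against a valid and an invalid subject and records what preg_match_all() and preg_last_error() report in each case.

```php
<?php
// Hypothetical sample data: "\xC3\xA9" is a valid UTF-8 "é", while a lone
// "\xE9" is the Latin-1 encoding of the same letter and invalid as UTF-8.
$valid   = "<h2>Caf\xC3\xA9</h2>";
$invalid = "<h2>Caf\xE9</h2>";

$pattern = '/<h2>(.*?)<\/h2>/u';

$ok_result = preg_match_all( $pattern, $valid, $ok_matches );
$ok_error  = preg_last_error();

$bad_result = preg_match_all( $pattern, $invalid, $bad_matches );
$bad_error  = preg_last_error();

var_dump( $ok_result, $ok_error === PREG_NO_ERROR );
var_dump( $bad_result, $bad_error === PREG_BAD_UTF8_ERROR );
```

No warning is raised for the invalid subject, which is why the failure is easy to miss unless the return value is compared against false.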

zedzedzed · 2015-09-02T04:00:45Z

http://us3.php.net/manual/en/function.preg-match-all.php#86366
http://gotoanswer.com/?q=UTF-8+characters+in+preg_match_all+%28PHP%29

zedzedzed · 2015-09-04T10:17:00Z

Help is needed from developers who are experienced with UTF-8 and PHP.

dsent · 2015-09-04T13:53:06Z

I'm not experienced enough in the matter to provide a solution to the original problem, but I know that parsing HTML with regex is generally considered a bad idea. Maybe switching to some DOM library will solve both the incorrect UTF-8 problem and general limitations of HTML regex-parsing?
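
To make the DOM suggestion concrete, here is a rough sketch using PHP's bundled DOMDocument and DOMXPath classes. The function name toc_extract_headings_dom and the returned array shape are assumptions for illustration only. The sample input is valid UTF-8; how libxml reacts to invalid bytes in real content would still need to be tested.

```php
<?php
/**
 * Hypothetical sketch: collect h1-h6 headings with DOMDocument
 * instead of a regular expression.
 */
function toc_extract_headings_dom( $html ) {
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of printing
    // them, and hint the encoding so the parser does not assume ISO-8859-1.
    $previous = libxml_use_internal_errors( true );
    $doc->loadHTML( '<?xml encoding="UTF-8">' . $html );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $headings = array();
    $xpath    = new DOMXPath( $doc );

    // The union query returns the matching nodes in document order.
    foreach ( $xpath->query( '//h1|//h2|//h3|//h4|//h5|//h6' ) as $node ) {
        $headings[] = array(
            'level' => (int) substr( $node->nodeName, 1 ),
            'text'  => trim( $node->textContent ),
        );
    }

    return $headings;
}

$headings = toc_extract_headings_dom(
    "<h2>Intro</h2><p>Text</p><h3>Caf\xC3\xA9</h3>"
);
```

Parsing a full document is likely to cost more than a single regex pass, so it would be worth measuring on long posts before switching.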

hieptd · 2015-09-05T09:13:40Z

    // Replace accented Vietnamese letters with their unaccented ASCII equivalents.
    $aPattern = array(
        "a" => "á|à|ạ|ả|ã|ă|ắ|ằ|ặ|ẳ|ẵ|â|ấ|ầ|ậ|ẩ|ẫ|Á|À|Ạ|Ả|Ã|Ă|Ắ|Ằ|Ặ|Ẳ|Ẵ|Â|Ấ|Ầ|Ậ|Ẩ|Ẫ",
        "o" => "ó|ò|ọ|ỏ|õ|ô|ố|ồ|ộ|ổ|ỗ|ơ|ớ|ờ|ợ|ở|ỡ|Ó|Ò|Ọ|Ỏ|Õ|Ô|Ố|Ồ|Ộ|Ổ|Ỗ|Ơ|Ớ|Ờ|Ợ|Ở|Ỡ",
        "e" => "é|è|ẹ|ẻ|ẽ|ê|ế|ề|ệ|ể|ễ|É|È|Ẹ|Ẻ|Ẽ|Ê|Ế|Ề|Ệ|Ể|Ễ",
        "u" => "ú|ù|ụ|ủ|ũ|ư|ứ|ừ|ự|ử|ữ|Ú|Ù|Ụ|Ủ|Ũ|Ư|Ứ|Ừ|Ự|Ử|Ữ",
        "i" => "í|ì|ị|ỉ|ĩ|Í|Ì|Ị|Ỉ|Ĩ",
        "y" => "ý|ỳ|ỵ|ỷ|ỹ|Ý|Ỳ|Ỵ|Ỷ|Ỹ",
        "d" => "đ|Đ",
    );
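    // each() and ereg_replace() have been removed from PHP (8.0 and 7.0);
    // foreach with preg_replace() does the same job. $return holds the input string.
    foreach ( $aPattern as $key => $value ) {
        $return = preg_replace( '/' . $value . '/', $key, $return );
    }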

dsent · 2015-09-05T09:21:21Z

@hieptd This code won't do what @zedzedzed needs.

zedzedzed · 2015-09-07T03:43:04Z

dsent is correct. Additionally, what that code snippet does is already covered by WordPress's remove_accents function, as mentioned in #70; that change was rolled out in version 1509.

I'm troubleshooting why preg_match_all fails completely when there is a bad UTF-8 character in the subject. I'm also after options, opinions, thoughts and alternatives that aren't too slow (costly to compute). I know there have been big improvements in PHP 7, but waiting for its release is not an option until WordPress core requires it as a minimum.
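
One cheap alternative, sketched under the assumption that the heading pattern itself contains only ASCII: drop the u modifier. PCRE then matches byte by byte and skips UTF-8 validation of the subject entirely. The pattern and content below are hypothetical stand-ins, not the plugin's actual ones.

```php
<?php
// Hypothetical heading pattern; it is ASCII-only, so it works without u.
$pattern = '/<h([1-6])[^>]*>(.*?)<\/h\1>/is';

// "\xE9" is a stray byte that would trigger PREG_BAD_UTF8_ERROR under /u.
$content = "<h2>Caf\xE9 menu</h2><p>Body</p><h3>Prices</h3>";

$count = preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );
$error = preg_last_error();

var_dump( $count, $error === PREG_NO_ERROR );
```

The trade-off is that the captured heading text still contains the bad bytes, so anything downstream that expects valid UTF-8 (anchors, escaping) has to cope with them, and the i flag only folds ASCII letters.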

zedzedzed added this to the Table of Contents Plus Next milestone Sep 4, 2015
