preg_match_all returns false with PREG_BAD_UTF8_ERROR (4) #105

Open
zedzedzed opened this issue Sep 2, 2015 · 8 comments

Comments

@zedzedzed
Owner

preg_match_all with the u modifier returns false when there are invalid UTF-8 characters in the pattern or the subject. Matching stops at that point, so there are no matches and hence no TOC for the page. In every case seen so far, the cause has been bad characters in the subject.

Is there a way to suppress the error and continue regardless?
Is there a WordPress core function that could be used to filter the_content?
Why doesn't this fail elsewhere in WordPress core, considering it is the_content after all?
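
The failure described above can be reproduced in isolation. This is a minimal sketch for a stand-alone script; the helper name `sketch_match_report` and the sample pattern are made up for illustration and are not the plugin's code. It shows that a subject containing a bad byte makes `preg_match_all` return `false`, and that `preg_last_error()` then reports `PREG_BAD_UTF8_ERROR`.

```php
<?php
// Hypothetical helper (not part of the plugin): run preg_match_all() and
// report the PCRE error code alongside the result.
function sketch_match_report( $pattern, $subject ) {
    $matches = array();
    $count   = preg_match_all( $pattern, $subject, $matches );
    $error   = preg_last_error();

    return array(
        'count'    => $count,   // int on success, false on failure
        'error'    => $error,   // PREG_NO_ERROR after a successful call
        'bad_utf8' => ( false === $count && PREG_BAD_UTF8_ERROR === $error ),
    );
}

$valid   = "<h2>Caf\xC3\xA9</h2>"; // "é" correctly encoded as the bytes C3 A9
$invalid = "<h2>Caf\xE9</h2>";     // lone Latin-1 byte E9, not valid UTF-8

$ok  = sketch_match_report( '/<h2>(.*?)<\/h2>/u', $valid );
$bad = sketch_match_report( '/<h2>(.*?)<\/h2>/u', $invalid );
```

On the WordPress side, core ships `wp_check_invalid_utf8()`, which takes an optional flag to strip invalid bytes. It may be worth a look for the second question, though whether it suits the_content in this plugin is untested here.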

@zedzedzed
Owner Author

In extract_headings, you can test the subject during debugging with:
var_dump( mb_check_encoding( $content, 'UTF-8' ) );
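
A slightly more defensive version of that check, as a sketch only: the function name `sketch_is_valid_utf8` is hypothetical, and the fallback assumes the mbstring extension may be missing on some hosts. Passing the encoding explicitly avoids depending on the mbstring internal encoding, and `var_dump` is used because `echo` prints nothing at all for `false`.

```php
<?php
// Hypothetical debugging helper; the name is not from the plugin.
function sketch_is_valid_utf8( $text ) {
    if ( function_exists( 'mb_check_encoding' ) ) {
        // Name the encoding explicitly instead of relying on the internal default.
        return mb_check_encoding( $text, 'UTF-8' );
    }

    // Fallback without mbstring: an empty pattern with the u modifier
    // only matches when the whole subject is valid UTF-8.
    return 1 === preg_match( '//u', $text );
}

var_dump( sketch_is_valid_utf8( "Caf\xC3\xA9" ) ); // well-formed two-byte sequence
var_dump( sketch_is_valid_utf8( "Caf\xE9" ) );     // lead byte with no continuation bytes
```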

@zedzedzed
Owner Author

http://php.net/manual/en/reference.pcre.pattern.modifiers.php

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

@zedzedzed zedzedzed added this to the Table of Contents Plus Next milestone Sep 4, 2015
@zedzedzed
Owner Author

Help is needed from developers who are experienced with UTF-8 and PHP.

@dsent

dsent commented Sep 4, 2015

I'm not experienced enough in the matter to provide a solution to the original problem, but I know that parsing HTML with regex is generally considered a bad idea. Maybe switching to some DOM library will solve both the incorrect UTF-8 problem and general limitations of HTML regex-parsing?
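
To make the DOM suggestion concrete, here is a rough sketch of heading extraction with PHP's bundled DOM extension. The function name `sketch_headings_via_dom` and the returned array shape are assumptions for illustration, not anything the plugin defines. The `<?xml encoding="UTF-8">` prefix is a common workaround to make `loadHTML` read the fragment as UTF-8.

```php
<?php
// Hypothetical sketch: collect h1..h6 elements from an HTML fragment via DOM.
function sketch_headings_via_dom( $html ) {
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors( true );
    $doc->loadHTML( '<?xml encoding="UTF-8">' . $html );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $xpath    = new DOMXPath( $doc );
    $headings = array();

    // The union query returns the nodes in document order.
    foreach ( $xpath->query( '//h1|//h2|//h3|//h4|//h5|//h6' ) as $node ) {
        $headings[] = array(
            'level' => (int) substr( $node->nodeName, 1 ),
            'text'  => trim( $node->textContent ),
        );
    }

    return $headings;
}

$found = sketch_headings_via_dom( "<h2>Caf\xC3\xA9</h2><p>Body</p><h3>Sub</h3>" );
```

This only shows the parsing side. How libxml treats invalid bytes in the input is a separate question and would need checking before relying on it for the UTF-8 problem.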

@hieptd

hieptd commented Sep 5, 2015

    // Replace accented Vietnamese letters with their unaccented lowercase ASCII equivalents
    $aPattern = array(
        "a" => "á|à|ạ|ả|ã|ă|ắ|ằ|ặ|ẳ|ẵ|â|ấ|ầ|ậ|ẩ|ẫ|Á|À|Ạ|Ả|Ã|Ă|Ắ|Ằ|Ặ|Ẳ|Ẵ|Â|Ấ|Ầ|Ậ|Ẩ|Ẫ",
        "o" => "ó|ò|ọ|ỏ|õ|ô|ố|ồ|ộ|ổ|ỗ|ơ|ớ|ờ|ợ|ở|ỡ|Ó|Ò|Ọ|Ỏ|Õ|Ô|Ố|Ồ|Ộ|Ổ|Ỗ|Ơ|Ớ|Ờ|Ợ|Ở|Ỡ",
        "e" => "é|è|ẹ|ẻ|ẽ|ê|ế|ề|ệ|ể|ễ|É|È|Ẹ|Ẻ|Ẽ|Ê|Ế|Ề|Ệ|Ể|Ễ",
        "u" => "ú|ù|ụ|ủ|ũ|ư|ứ|ừ|ự|ử|ữ|Ú|Ù|Ụ|Ủ|Ũ|Ư|Ứ|Ừ|Ự|Ử|Ữ",
        "i" => "í|ì|ị|ỉ|ĩ|Í|Ì|Ị|Ỉ|Ĩ",
        "y" => "ý|ỳ|ỵ|ỷ|ỹ|Ý|Ỳ|Ỵ|Ỷ|Ỹ",
        "d" => "đ|Đ",
    );
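    // $return is assumed to already hold the UTF-8 string to clean up.
    foreach ( $aPattern as $key => $value ) {
        // each() and ereg_replace() no longer exist in current PHP; note that
        // preg_replace() with the u modifier returns null on invalid UTF-8.
        $return = preg_replace( '/' . $value . '/u', $key, $return );
    }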

@dsent

dsent commented Sep 5, 2015

@hieptd This code won't do what @zedzedzed needs.

@zedzedzed
Owner Author

dsent is correct. Additionally, what that code snippet does is already covered by WordPress's remove_accents function, as mentioned in #70 and rolled out in version 1509.

I'm troubleshooting why preg_match_all fails completely when there is a bad UTF-8 character in the subject. I'm also after options, opinions, thoughts and alternatives that aren't too slow (costly to compute). I know there have been big improvements in PHP 7, but waiting for its release is not an option until WordPress core requires it as a minimum.
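
One cheap option, sketched under assumptions: try the UTF-8 aware match first, and only when it fails with `PREG_BAD_UTF8_ERROR` retry the same pattern without the u modifier. The function name `sketch_find_headings` and the pattern are illustrative and not the plugin's actual extract_headings code. Pages with clean content pay no extra cost, since the second pass runs only after a failure.

```php
<?php
// Hypothetical helper, not the plugin's extract_headings().
function sketch_find_headings( $content ) {
    // Illustrative pattern (an assumption): opening heading tag, inner HTML,
    // and the matching closing tag.
    $pattern = '/<h([1-6])[^>]*>(.*?)<\/h\1>/is';
    $matches = array();

    // First attempt: UTF-8 aware matching via the u modifier.
    $count = preg_match_all( $pattern . 'u', $content, $matches, PREG_SET_ORDER );

    if ( false === $count && PREG_BAD_UTF8_ERROR === preg_last_error() ) {
        // Second attempt: byte-wise matching. The heading tags themselves are
        // plain ASCII, so they are still found despite bad bytes elsewhere.
        $count = preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );
    }

    return ( false === $count ) ? array() : $matches;
}

$damaged  = "<h2>One</h2><p>bad \xE9 byte</p><h3>Two</h3>";
$headings = sketch_find_headings( $damaged );
```

The trade-off is that the byte-wise pass treats multi-byte characters as separate bytes, so any captured heading text may still contain the invalid bytes and would need cleaning before it is output.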
