Add multibyte support for string export. #437
|
Hi, thanks for the patch! Can you explain to me why […]? I've tested it on my machine and it worked out for everything but […]. Also, can you move the tests that don't need regular expressions into […]? Testing […] Sorry for my confusion there. |
|
In that case we should probably also use Unicode character properties. I.e. something like |
…er 0x7f is treated as printable.
Yes, a lot of legacy software uses non-UTF-8 multibyte encodings internally; in Japan especially, the EUC-JP encoding is heavily used. The Shift_JIS encoding can also be used when multibyte support is enabled with the --enable-zend-multibyte option (Zend Multibyte). Additionally, as of PHP 5.4.0, Zend Multibyte is available by default. Thus I think that non-UTF-8 characters should be displayed as a normal string.
The character 0x7f is the control code DELETE. However, that does not matter here: PHP's pattern for method names is [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*, so the byte is safe to display. The data set array(chr(0x7f) . chr(0x80), '/^Binary String: 0x7f80$/') covers a string that mixes printable and non-printable characters; it should be treated as a binary string.
Yes, but I think the intent is clearer when the rule that decides whether a string is displayed as a normal or a hexadecimal string lives in one place rather than two.
I believe that my test is now more explicit and clean, but I would appreciate your changes. |
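For readers following the thread, the rule described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code that was merged: the function name `exportString`, the default encoding, and the exact set of control bytes are assumptions, and it relies on the mbstring extension being loaded. It treats 0x7f as printable and falls back to the hexadecimal form when the bytes are not valid in the given encoding.

```php
<?php
// Hypothetical helper, not part of the project: decides whether a string
// is shown as a normal string or as a hexadecimal "Binary String".
function exportString(string $value, string $encoding = 'UTF-8'): string
{
    // Control bytes other than TAB, LF, VT, FF and CR mark the string as
    // binary. 0x7f (DELETE) is deliberately not in this set.
    $hasControlBytes = preg_match('/[\x00-\x08\x0e-\x1f]/', $value) === 1;

    // A byte sequence that is not well-formed in the assumed encoding
    // is also shown in hexadecimal form.
    $isWellFormed = mb_check_encoding($value, $encoding);

    if ($hasControlBytes || !$isWellFormed) {
        return 'Binary String: 0x' . bin2hex($value);
    }

    return "'" . $value . "'";
}

echo exportString(chr(0x7f) . chr(0x80)), "\n"; // Binary String: 0x7f80
echo exportString('日本語'), "\n";               // '日本語'
```

Keeping both checks in a single function matches the point made above about having the normal-versus-hexadecimal decision in one place. Passing another encoding name that mbstring supports (for example `'EUC-JP'`) would apply the same well-formedness check to that encoding.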
|
Merged. Hope that works out for you. Like I said, it's not a 100% solution, but if it fixes your use cases it's good to have it. Thanks for the pull. |
|
I've tested with the upstream branch, and it worked for me. Thank you for your response. |
|
@edorian: 0x80 is the first value above the range (0x00 to 0x7F) that UTF-8 can encode in a single byte. A good edge case! Please also include 0x7F, 0x81, 0x00, 0x01, 0xFF, and 0xFE in the tests. |
Binary string export was addressed by #306, but that change introduced another issue: with the current implementation, a string that includes one or more multibyte characters is always treated as a binary string. This pull request fixes that.