Add multibyte support for string export. #437

iteman · 2011-12-10T05:05:02Z

Binary string export was addressed by #306, but that change introduced another issue: the current implementation always treats a string containing one or more multibyte characters as a binary string. This pull request fixes that.

edorian · 2011-12-10T14:26:11Z

Hi,

thanks for the patch!

Can you explain to me why ~[^[:print:][:space:]]~u wouldn't do the job for you? Is it because of non-UTF-8 multibyte sequences?

I've tested it on my machine and it worked for everything except chr(0x7f) . chr(0x80), which is printable (it seems?) but I can't match against it.
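One possible explanation for the chr(0x7f) . chr(0x80) case: with the u modifier, PCRE validates the subject as UTF-8 before matching, and a lone 0x80 byte is not valid UTF-8, so preg_match() returns false rather than 0 or 1. The following is a minimal sketch of that behavior; the helper name matchStatus is made up for illustration and is not part of PHPUnit.

```php
<?php
// Illustrative helper (hypothetical, not part of PHPUnit): reports how
// preg_match() reacts to a subject when the pattern uses the /u modifier.
function matchStatus(string $subject): string
{
    $result = preg_match('~[^[:print:][:space:]]~u', $subject);

    if ($result === false) {
        // The engine rejected the subject; for malformed UTF-8,
        // preg_last_error() reports PREG_BAD_UTF8_ERROR.
        return preg_last_error() === PREG_BAD_UTF8_ERROR ? 'invalid UTF-8' : 'error';
    }

    return $result === 1 ? 'match' : 'no match';
}

echo matchStatus(chr(0x7f) . chr(0x80)), "\n"; // invalid UTF-8
echo matchStatus('abc'), "\n";                 // no match
```

Because false is falsy, a plain if (preg_match(...)) check would silently treat such a string as "nothing unprintable found", which may be why the pattern appears not to match.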

Also can you move the tests that don't need regular expressions into exportProvider and make the other tests a little more explicit?

Testing /^'.+'$/s is enough to see that it doesn't output "binary string" but if it would be possible to say "x Chars" that would feel more explicit.

Sorry for my confusion there

nikic · 2011-12-10T15:55:59Z

In that case we should probably also use Unicode character properties. I.e. something like ~[^\PC\p{Xps}]~u or ~(*UCP)[^[:print:][:space:]]~u.

iteman · 2011-12-10T17:44:44Z

> Can you explain to me why ~[^[:print:][:space:]]~u wouldn't do the job for you. Non utf-8 multibyte sequences?

Yes. Much legacy software uses non-UTF-8 multibyte encodings internally; in Japan in particular, EUC-JP is heavily used. Shift_JIS can also be used when PHP is built with the --enable-zend-multibyte option (Zend Multibyte), and as of PHP 5.4.0 Zend Multibyte is available by default. I therefore think that non-UTF-8 characters should be displayed as a normal string.
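To illustrate the point, here is a minimal sketch of a byte-oriented check that does not assume UTF-8. The function name exportStringSketch and the exact byte ranges are assumptions made for illustration and are not necessarily the condition used in this pull request; only the "Binary String: 0x..." output format is taken from the test data in this thread.

```php
<?php
// Hypothetical sketch: treat a string as binary only if it contains control
// bytes other than whitespace (0x09-0x0d). Bytes 0x20-0xff are accepted
// without decoding, so EUC-JP or Shift_JIS text is not flagged.
function exportStringSketch(string $value): string
{
    // No /u modifier: the pattern works on raw bytes, not on code points.
    if (preg_match('/[^\x09-\x0d\x20-\xff]/', $value)) {
        return 'Binary String: 0x' . bin2hex($value);
    }

    return "'" . $value . "'";
}

$eucJp = "\xC6\xFC\xCB\xDC\xB8\xEC"; // "日本語" encoded as EUC-JP

echo exportStringSketch("\x00\x01"), "\n";                   // Binary String: 0x0001
var_dump(exportStringSketch($eucJp) === "'" . $eucJp . "'"); // bool(true)
```

The trade-off of such a rule is that it cannot tell legacy multibyte text from arbitrary high-byte binary data; it only guarantees that control bytes trigger the hexadecimal form.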

> I've tested it on my machine and it worked out for everything but chr(0x7f) . chr(0x80) which is printable (it seems?) but I can't match against it.

The character 0x7f is the control code DELETE. However, that does not matter here: PHP's pattern for valid method names is [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*, so the character is safe to display.

The data set array(chr(0x7f) . chr(0x80), '/^Binary String: 0x7f80$/') covers a string that mixes printable and non-printable characters; such a string should be treated as a binary string.

> Also can you move the tests that don't need regular expressions into exportProvider and make the other tests a little more explicit?

Yes, but I think the intent is clearer when the specification of whether a string is displayed as a normal string or as a hexadecimal string lives in one place rather than two.

> Testing /^'.+'$/s is enough to see that it doesn't output "binary string" but if it would be possible to say "x Chars" that would feel more explicit.

I believe my test is now more explicit and cleaner, but I would welcome your changes.

edorian · 2011-12-11T14:27:22Z

Merged. Hope that works out for you. Like I said it's not a 100% solution but if it fixes your use cases it's good to have it.

Thanks for the pull

iteman · 2011-12-11T14:56:03Z

I've tested with the upstream branch, and it worked for me. Thank you for your response.

hakre · 2011-12-22T01:26:32Z

@edorian: 0x80 is just above the range (7 bits, up to 0x7F) that UTF-8 can encode in a single byte. A good edge case! Please also include 0x7F, 0x81, 0x00, 0x01, 0xFF and 0xFE in the tests.

Add multibyte support for string export.

f64e0cb

iteman added 2 commits December 11, 2011 02:39

Change the condition for binary/non-binary string so that the character 0x7f is treated as printable.

f90fca7

Improve testStringExport() and its data.

d82d78d

edorian added a commit that referenced this pull request Dec 11, 2011

Tidy up #437. It's not pretty but it does the job for the current use cases better

6bf52f0

edorian merged commit d82d78d into sebastianbergmann:3.6 Dec 11, 2011
