Skip to content

Conversation

@iteman
Copy link
Contributor

@iteman iteman commented Dec 10, 2011

Binary string export has been solved by #306, but another issue has been brought by it. A string that includes one or more multibyte characters is always treated as a binary string by the current implementation. My request solves this issue.

@edorian
Copy link
Contributor

edorian commented Dec 10, 2011

Hi,

thanks for the patch!

Can you explain to me why ~[^[:print:][:space:]]~u wouldn't do the job for you. Non utf-8 multibyte sequences?

I've tested it on my machine and it worked out for everything but �� ( chr(0x7f) . chr(0x80)) which is printable (it seems?) but I can't match against it.

Also can you move the tests that don't need regular expressions into exportProvider and make the other tests a little more explicit?

Testing /^'.+'$/s is enough to see that it doesn't output "binary string" but if it would be possible to say "x Chars" that would feel more explicit.

Sorry for my confusion there

@nikic
Copy link

nikic commented Dec 10, 2011

In that case we should probably also use Unicode character properties. I.e. something like ~[^\PC\p{Xps}]~u or ~(*UCP)[^[:print:][:space:]]~u.

@iteman
Copy link
Contributor Author

iteman commented Dec 10, 2011

Can you explain to me why ~[^[:print:][:space:]]~u wouldn't do the job for you. Non utf-8 multibyte sequences?

Yes, many legacy software have been using non UTF-8 multibyte characters as their internal encoding, especially in Japan the EUC-JP encoding is heavily used. And also the Shift_JIS encoding can be used with multibyte support by the --enable-zend-multibyte option (Zend Multibyte). Additionally as of PHP 5.4.0 Zend Multibyte is available as the default. Thus I think that non UTF-8 characters should be displayed as normal string.

I've tested it on my machine and it worked out for everything but �� ( chr(0x7f) . chr(0x80)) which is printable (it seems?) but I can't match against it.

The character 0x7f is the control code DELETE. However it's no matter since PHP's method definition is [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]* and it is safe for displaying.

The data array(chr(0x7f) . chr(0x80), '/^Binary String: 0x7f80$/') is for a string that consists of printable and non printable characters, it should be treated as a binary string.

Also can you move the tests that don't need regular expressions into exportProvider and make the other tests a little more explicit?

Yes, but I think that it is more intentional that the specification for whether a string should be displayed as a normal or hexadecimal string is in one place than in two places.

Testing /^'.+'$/s is enough to see that it doesn't output "binary string" but if it would be possible to say "x Chars" that would feel more explicit.

I believe that my test is now more explicit and clean, but I would appreciate your changes.

edorian added a commit that referenced this pull request Dec 11, 2011
@edorian edorian merged commit d82d78d into sebastianbergmann:3.6 Dec 11, 2011
@edorian
Copy link
Contributor

edorian commented Dec 11, 2011

Merged. Hope that works out for you. Like I said it's not a 100% solution but if it fixes your use cases it's good to have it.

Thanks for the pull

@iteman
Copy link
Contributor Author

iteman commented Dec 11, 2011

I've tested with the upstream branch, and it worked for me. Thank you for your response.

@hakre
Copy link
Contributor

hakre commented Dec 22, 2011

@edorian: 0x80 (hex value) is above a byte (8 bits) that UTF8 is able to pack (encode) into one byte. A good edge case! Please also include 0x7F, 0x81, 0x00, 0x01, 0xFF, 0xFE into tests as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants