::filter doesn't work well #7

jonnybarnes · 2013-10-14T17:17:08Z

This is where someone tells me I'm doing this completely wrong, but given the following code

<?php
include("vendor/autoload.php");

use \Patchwork\Utf8 as u;

\Patchwork\Utf8\Bootup::initAll();
\Patchwork\Utf8\Bootup::filterRequestUri();
\Patchwork\Utf8\Bootup::filterRequestInputs();

header("Content-Type: text/html; charset=utf-8");

$txt = "Iñtërnâtiônàlizætiøn \xFC\xA1\xA1\xA1\xA1\xA1";

if(u::isUtf8($txt) != true) {
    $txt = u::filter($txt);
}

echo $txt;

I'd like the i18n word to be preserved. Instead the output is IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n ü¡¡¡¡¡. I'd like the output to be more like Iñtërnâtiônàlizætiøn.

nicolas-grekas · 2013-10-15T09:03:21Z

This is the expected behavior, but the documentation is a bit lacking...

I slightly updated the readme on this point, see the penultimate paragraph in the Usage section.

The reasoning is the following:

- no data exists in the wild that should be fixed this way, exploits excluded.
- security considerations on unicode.
- information preservation (no data deletion).

jonnybarnes · 2013-10-15T13:08:50Z

So just to clarify, and I don’t mean to sound like a prick, but the expected behaviour is that my perfectly encoded utf-8 word gets mangled when there is some trailing invalid utf-8 by the ::filter() method?

Having read S3.6.1 I can see why you wouldn't want to remove the invalid bytes. But why does Iñtërnâtiônàlizætiøn get turned into IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n?

Again, sorry if I'm coming across as a prick asking these questions.

nicolas-grekas · 2013-10-15T15:20:32Z

This is a tricky point, you are right to ask, no problem at all.

Your word is perfectly valid UTF-8, but the whole string is not, and u::filter() works on the string as a whole.
In your case, it checks whether the full string is valid UTF-8, which is not the case.
Then it assumes CP1252 (this is also the choice of HTML5) and converts the string to UTF-8.
This conversion does not see the "ñ" as a single char, but as two CP1252 bytes, which are converted to two UTF-8 chars: Ã then ±.
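To make the byte-level effect concrete, here is a minimal sketch of the fallback described above. It is not Patchwork's code: `filterSketch()` is a hypothetical stand-in built on the mbstring extension (assumed to be installed), and it leaves out anything else the real u::filter() may do.

```php
<?php
// Illustrative stand-in for the fallback described above; NOT Patchwork's code.
// Assumes the mbstring extension is available.
function filterSketch(string $bytes): string
{
    // The check covers the whole string: one bad byte fails all of it.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Fallback: treat every byte as one Windows-1252 (CP1252) character.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

$valid = "\xC3\xB1";      // "ñ" encoded as UTF-8
$mixed = "\xC3\xB1\xFC";  // the same "ñ" followed by one invalid byte

echo bin2hex(filterSketch($valid)), "\n"; // c3b1 (unchanged)
echo bin2hex(filterSketch($mixed)), "\n"; // c383c2b1c3bc ("Ã", "±", "ü")
```

The second result shows why the valid word is affected: once the string as a whole fails the check, the bytes C3 and B1 are each converted on their own.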

Do you have a real case where this string can come up in your data flow?
No browser has behaved like that for years, so that doesn't happen in real life.
But prove me wrong :)

jonnybarnes · 2013-10-15T18:30:01Z

I’m just playing around trying to understand how UTF-8 works and am writing a little script to hex-dump the byte values of a UTF-8 string: https://gist.github.com/jonnybarnes/6951138
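For readers following along, a hex dump of this kind can be sketched in a few lines. This is not the script from the gist; `hexDump()` is a hypothetical helper that uses only PHP built-ins.

```php
<?php
// Hypothetical helper, not the script from the gist linked above.
function hexDump(string $bytes): string
{
    if ($bytes === '') {
        return '';
    }
    // bin2hex() yields two hex digits per byte; split them into pairs.
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

echo hexDump("\xC3\xB1"), "\n";          // C3 B1       ("ñ" as UTF-8)
echo hexDump("\xC3\x83\xC2\xB1"), "\n";  // C3 83 C2 B1 ("Ã±" as UTF-8)
```

Dumping a string before and after filtering makes it easy to see whether a character was kept as one code point or re-encoded byte by byte.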

So I suppose it’s not a real case of invalid utf-8 coming up in my data flow. And to be honest, other than manually creating some invalid utf-8 a la $invutf8 = "\xC0\xC1" I have no idea how one would paste invalid UTF-8 into the textarea. But I was thinking hypothetically if someone did.

If I set the default value of $txt to include some invalid bytes as well as the fancy i18n word then as I said above the whole word gets garbled.

But as you said, the only sensible way of dealing with an invalid UTF-8 string is to convert the characters into UTF-8, which is causing the valid portion of the string to get converted as well.

nicolas-grekas · 2013-10-15T20:22:43Z

I hope I answered your question. BTW, you should understand now that you shouldn't call isUtf8 before calling filter.

jonnybarnes · 2013-10-16T14:46:45Z

So would a decent workflow be to filter inputs, then if (!isUtf8($input)) { throw an error }?

nicolas-grekas · 2013-10-16T14:51:13Z

Filtering your input with u::filter() guarantees that you will get UTF-8, so the exception will never be thrown.

nicolas-grekas · 2013-10-16T15:03:34Z

In fact, this is what \Patchwork\Utf8\Bootup::filterRequestInputs(); does for all autoglobals ($_GET, $_POST, etc.)!
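The idea of filtering every request value once, after which a validity check can no longer fail, can be sketched as follows. Both functions are hypothetical stand-ins written for this illustration (mbstring assumed); they are not Patchwork's implementation, and this sketch only touches array values, not keys.

```php
<?php
// Hypothetical stand-ins; Patchwork's real implementation differs in detail.
function filterString(string $bytes): string
{
    // Valid UTF-8 passes through; anything else is read as Windows-1252.
    return mb_check_encoding($bytes, 'UTF-8')
        ? $bytes
        : mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

function filterInputs(array $input): array
{
    // Walk every leaf value, however deeply nested.
    array_walk_recursive($input, function (&$value) {
        if (is_string($value)) {
            $value = filterString($value);
        }
    });
    return $input;
}

// A hand-built array standing in for $_GET.
$get = ['q' => "caf\xE9", 'tags' => ["\xC3\xB1", "\xFC"]];
$clean = filterInputs($get);

echo bin2hex($clean['q']), "\n"; // 636166c3a9 ("café" as UTF-8)
```

Because every string leaves `filterString()` as valid UTF-8, a later `if (!isUtf8(...))` branch would be dead code in this workflow.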

jonnybarnes · 2013-10-16T15:09:45Z

I was just about to say I'm using ::filterRequestInputs(). I love that in the test file you can manually construct the $_GET variable and the ::filterRequestInputs() method will still filter it.

Thanks for the help :)

jonnybarnes closed this as completed Oct 16, 2013
