This repository has been archived by the owner on Jan 8, 2021. It is now read-only.

::filter doesn't work well #7

Closed
jonnybarnes opened this issue Oct 14, 2013 · 9 comments

@jonnybarnes

This is where someone tells me I'm doing this completely wrong, but given the following code:

<?php
include("vendor/autoload.php");

use \Patchwork\Utf8 as u;

\Patchwork\Utf8\Bootup::initAll();
\Patchwork\Utf8\Bootup::filterRequestUri();
\Patchwork\Utf8\Bootup::filterRequestInputs();

header("Content-Type: text/html; charset=utf-8");

$txt = "Iñtërnâtiônàlizætiøn \xFC\xA1\xA1\xA1\xA1\xA1";

if(u::isUtf8($txt) != true) {
    $txt = u::filter($txt);
}

echo $txt;

I'd like the i18n word to be preserved. Instead the output is IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n ü¡¡¡¡¡. I'd like the output to be more like Iñtërnâtiônàlizætiøn.

@nicolas-grekas
Contributor

This is the expected behavior, but the documentation is a bit lacking...

I slightly updated the readme on this point, see the penultimate paragraph in the Usage section.

The reasoning is the following:

@jonnybarnes
Author

So just to clarify, and I don't mean to sound like a prick, but the expected behaviour is that the ::filter() method mangles my perfectly encoded UTF-8 word when the string has some trailing invalid UTF-8?

Having read S3.6.1 I can see why you wouldn't want to remove the invalid bytes. But why does Iñtërnâtiônàlizætiøn get turned into IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n?

Again, sorry if I'm coming across as a prick asking these questions.

@nicolas-grekas
Contributor

This is a tricky point, you are right to ask, no problem at all.

Your word is perfectly valid UTF-8, but the whole string is not, and u::filter() works on the string as a whole.
In your case, it checks whether the full string is valid UTF-8, which it is not.
Then it assumes CP-1252 (this is also the choice of HTML5) and converts the string to UTF-8.
This conversion does not see the "ñ" as a single character, but as two CP-1252 bytes (0xC3 0xB1), which are converted to two UTF-8 characters: Ã then ±.

Do you have a real case where this string can come up in your data flow?
No browser has behaved like that for years, so it doesn't happen in real life.
But prove me wrong :)
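
The behaviour described in this comment can be reproduced with a minimal sketch. It assumes PHP's mbstring extension is available, and `sketchFilter()` is a hypothetical helper written for illustration; it is not Patchwork's actual u::filter() implementation, which may do more than this.

```php
<?php
// Hypothetical helper, NOT the library's code: validate the whole string,
// and if it is not valid UTF-8, reinterpret every byte as CP-1252.
function sketchFilter(string $s): string
{
    // The check applies to the full string, not to individual words.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    // Fallback: each byte is treated as one CP-1252 character, so a valid
    // two-byte sequence such as "\xC3\xB1" (ñ) becomes two characters.
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

// "Iñt" in UTF-8 followed by two bytes that make the string invalid.
echo sketchFilter("I\xC3\xB1t \xFC\xA1"), "\n"; // prints "IÃ±t ü¡"
```

A single invalid byte anywhere is enough to send the whole string down the CP-1252 path, which is why the valid word is re-encoded along with the trailing bytes.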

@jonnybarnes
Author

I’m just playing around trying to understand how UTF-8 works and am writing a little script to hex-dump the byte values of a UTF-8 string: https://gist.github.com/jonnybarnes/6951138

So I suppose it’s not a real case of invalid utf-8 coming up in my data flow. And to be honest, other than manually creating some invalid utf-8 a la $invutf8 = "\xC0\xC1" I have no idea how one would paste invalid UTF-8 into the textarea. But I was thinking hypothetically if someone did.

If I set the default value of $txt to include some invalid bytes as well as the fancy i18n word then as I said above the whole word gets garbled.

But as you said, the only sensible way of dealing with an invalid UTF-8 string is to convert the characters into UTF-8, which is causing the valid portion of the string to get converted as well.
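
For readers who want to inspect the bytes themselves, here is a minimal hex-dump sketch. `hexDump()` is a hypothetical name chosen for this example and is not taken from the linked gist.

```php
<?php
// Hypothetical helper: show each byte of a string as two uppercase hex digits.
function hexDump(string $s): string
{
    // bin2hex() yields two hex digits per byte; split them into pairs.
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

// "ñ" is the two-byte UTF-8 sequence C3 B1; the last byte here is not valid UTF-8.
echo hexDump("I\xC3\xB1\xFC"), "\n"; // prints "49 C3 B1 FC"
```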

@nicolas-grekas
Contributor

I hope I answered your question. BTW, you should understand now that you shouldn't call isUtf8 before calling filter.

@jonnybarnes
Author

So would a decent workflow be to filter inputs, then if (!isUtf8($input)) { throw an error }?

@nicolas-grekas
Contributor

Filtering your input with u::filter() guarantees that you will get UTF-8, so the exception will never be thrown.

@nicolas-grekas
Contributor

In fact, this is what \Patchwork\Utf8\Bootup::filterRequestInputs(); does for all autoglobals ($_GET, $_POST, etc.)!
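
As a rough illustration of filtering once at the input boundary, the sketch below walks a request-like array and normalizes every string value. Both function names are hypothetical, the CP-1252 fallback is the assumption discussed earlier in this thread, and the code relies on the mbstring extension; it is not the code behind Bootup::filterRequestInputs().

```php
<?php
// Hypothetical per-string filter: keep valid UTF-8, else assume CP-1252.
function sketchFilterString(string $s): string
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: filter every string value in a nested array.
// Array keys and non-string values are left untouched in this sketch.
function sketchFilterInputs(array $input): array
{
    array_walk_recursive($input, function (&$value): void {
        if (is_string($value)) {
            $value = sketchFilterString($value);
        }
    });

    return $input;
}

// A manually constructed stand-in for $_GET.
$get = ['q' => "\xFC", 'nested' => ['ok' => "\xC3\xB1", 7]];
$clean = sketchFilterInputs($get);
```

After this step every string value is valid UTF-8, so a later isUtf8 check on those values has nothing left to reject.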

@jonnybarnes
Author

I was just about to say I'm using ::filterRequestInputs(). I love that in the test file you can manually construct the $_GET variable and the ::filterRequestInputs() method will still filter it.

Thanks for the help :)
