Conversation

@kasei (Contributor) commented Aug 15, 2017

I believe this fixes the bug reported in debian and forwarded on to RT.

@kjetilk (Contributor) commented Aug 15, 2017

Yup, seems good to me.

I hope we can get a release with this as soon as possible, since most of Perl+RDF will be removed from Debian soon if this isn't fixed...

@karenetheridge

NOOOOOO this is a horrible misuse of is_utf8. DO NOT APPLY THIS PATCH.

@kasei (Contributor, Author) commented Aug 21, 2017

Ok. But this is a real problem. Instead of just “NOOOOOO”, can you provide a constructive suggestion for how to proceed?

@karenetheridge

The code needs to look at the charset in the headers, and decide if that means it needs to do utf8-decoding on the content. There is no way of determining that by inspecting the character stream itself - it must be informed by the headers.

@kasei (Contributor, Author) commented Aug 21, 2017

I don't believe this code always has headers to look at since the content can come from a variety of sources (string, file, http request...). At the point this code is run, the content in the args might be a byte or character string, and the code that is then called requires an encoded byte string. That's my understanding, anyway.

@kjetilk (Contributor) commented Aug 21, 2017

Yeah, users do not necessarily have any headers to rely on. This problem came to light because of a command line tool.

So, any other suggestions?

@karenetheridge

Then the interface needs to change to either require consistent inputs, or to allow the encoding to be passed as an argument.

In general, bytestreams should be decoded to characters only on the fringes of the application (e.g. when reading from the network, or slurping a file from disk). Most code should be dealing with characters, not bytes.

@kasei (Contributor, Author) commented Aug 22, 2017

I'm not terribly familiar with this code, but I think it's trying to occupy space in that "fringe of the application." It has some interaction with XML::LibXML which, AFAIK, has a need to differentiate bytes from characters in scalars passed in, and this module tries to abstract strings, file handles, and network requests from the user so that they don't have to know the character set ahead of time (it can be in a header, or declared at the top of the html, or use a heuristic).

I'm OK trying to figure out the right way to do this, but I'm not entirely clear why using is_utf8 is a "horrible misuse" in this case. The input can be coming from a variety of places, and when we get to that code, I think there are only two options: it's an encoded byte string, or it's a character string (which will have is_utf8 as true(?)). In the latter case, we need to re-encode it so that the xml-parsing code which requires an encoded byte string can do its thing.

We can try to push for this API being updated to differentiate these cases, but I believe that would be a breaking change to consumers of this API, and would have a much bigger impact than this fix which I think is restricted to code that currently breaks in a subset of cases where heuristics are used to identify utf8 encoded files loaded from disk without a declared charset. That approach would also probably involve a much longer timeframe than a fix like this which we're trying to do quickly to avoid a large dependency chain from being pulled out of Debian as a result of the underlying bug.

@karenetheridge

is_utf8 does not mean "is this a character string, or is it utf8-encoded?" It's unfortunate that the function name suggests that's what it does, but that's totally not the case. You can easily have a unicode character string where is_utf8($var) returns true, or it returns false. All it's looking at is an internal flag to see how it's represented internally; utf8::upgrade and utf8::downgrade flip between these two different states. See the documentation here (https://metacpan.org/pod/release/WOLFSAGE/perl-5.27.3/lib/utf8.pm) for more, and also the links that it points to in perldoc perlunifaq.

Let's back up. What's the problem upstream in Debian? It doesn't look like there are many users of this distribution on CPAN. Which of them are of concern in Debian-land? It will probably be easier to fix those directly by moving them off of HTML::HTML5::Parser onto something more actively maintained that doesn't have a broken API.

@jonassmedegaard

The most prominent user is RDF::RDFa::Parser, which in turn is used, optionally (and therefore commonly hidden on CPAN), by RDF::Trine, which has a slew of users.

@kasei (Contributor, Author) commented Aug 22, 2017

Yes, I believe I understand what is_utf8 is doing. I'm not looking for whether the content is utf8, I'm looking to determine whether perl has marked the SV as bytes or characters. That's correct, right?

The Debian bug that started this had a use case where two different utf-8 encoded html files, differing by a single non-ascii character, resulted in data loss and bad unicode handling when using the tools in this package. I believe the issue is caused by calling a function that requires encoded bytes with scalars that can be either bytes or characters, depending on the exact code path followed. There are many such code paths (I believe this is at least a product of source -- string, file, network -- and whether there is an explicit character encoding available to perl), and I'm not sure it's possible to make backwards compatible changes to the API to separate these cases.

I believe what my patch does (even if it's not done in an appropriate way) is one possible solution to this problem. Regardless of where the data comes from, the code that is changed in the patch should be passing on encoded bytes to the $self->{parser}->parse_byte_string call. So I've attempted, when necessary, to encode characters into utf8 bytes before making that call. Does that sound reasonable?

@karenetheridge commented Aug 22, 2017

No, this is not correct. It is simply not possible to tell, by inspecting the SV, whether a string is unicode characters or if it's encoded as UTF-8 (or any other encoding for that matter). What you are doing happens to work for the particular test case you've concocted, but will fail for many other situations.

The only correct solution is for the function to either consistently take one or the other, or to pass another argument that indicates which of these cases applies.

@kasei (Contributor, Author) commented Aug 22, 2017

Can you point me to where I might see an example of where an SV is marked with the utf8 flag but is actually treated as bytes? I'm not clear how this could happen.

I'm not sure I'm in a position to suggest a change to what seems to amount to the entire API surface area of this module if you're saying there isn't any way to test if an SV will be treated as characters or bytes by perl.

@karenetheridge

here's a really simple example, with an unencoded unicode character string:

perl -CS -wle'my $str = "ಠ_ಠ"; my $str2 = "ಠ_ಠ";  utf8::upgrade($str); utf8::downgrade($str2); print "str1 is_utf8: ", (utf8::is_utf8($str) ? 1 : 0); print "str2 is_utf8: ", (utf8::is_utf8($str2) ? 1 : 0);'

@karenetheridge

I'm not sure I'm in a position to suggest a change to what seems to amount to the entire API surface area of this module

This is what I said earlier -- the interface itself needs to be redesigned, or (more likely, especially in the immediate term) the callers fixed to use something else.

@Grinnz commented Aug 22, 2017

As an example, a CP-1252 (Windows latin1 extension) string with non-ascii characters may be represented in both the upgraded and downgraded formats. If it's upgraded, is_utf8 will return true; if it's downgraded, it will return false. This flag reflects whether the internal SV (visible to XS code) is UTF-8 encoded or in the native platform encoding (NOT whether the string is characters or bytes). But the content of the string (to any perl code) will still be those same non-ascii characters regardless of upgrading or downgrading. Thus you can end up with a situation where is_utf8 is false but the string is neither ascii nor encoded bytes. These strings often break XS modules that don't use strings correctly, like DBD::mysql (which was fixed and then re-broken).

On the other side, a string of encoded bytes can be upgraded, and then is_utf8 will be true again; it is still a valid string of encoded bytes, and only the internal representation changes. If you were to incorrectly assume it was now a string of characters, or use the internal representation directly, you would end up with a double-encoded string.

Using is_utf8 as a boolean to determine whether to run encode or decode is always incorrect. You must determine whether you are dealing with characters or bytes by another means, or require that the user provide one or the other.

@kasei (Contributor, Author) commented Aug 22, 2017

Sorry to keep dragging this out. I'm really trying to understand where I'm going wrong here. I'm willing to believe I've got it wrong, but I'm clearly missing something.

I'm not sure I follow that example as a response to my question. That string doesn't seem "unencoded" to me. It's a utf8-encoded byte string that you then twiddle the utf8 flag on, right?

perl -MDevel::Peek -CS -wle'my $str = "ಠ_ಠ"; Dump($str); utf8::upgrade($str); Dump($str)'

So while the utf8 flag is set on $str, I'm not sure I'd really consider that being "treated as bytes", because you've explicitly declared that the bytes are characters, right? Feels to me like it's being "treated as characters," but you've done some extra work to define what those characters are. In the absence of you explicitly making those declarations with utf8::upgrade, it still looks to me like is_utf8 would indicate whether the scalar will be treated as bytes or as characters.

Just to restate, so that there's no confusion about where I'm coming from here, I'm not interested in what encoding the data is in if it arrives in the function in any encoded form (what I'm thinking of as "perl will treat it as bytes"). That simply gets passed on to the xml parser. I'm only interested in the case where perl would treat the scalar as a collection of characters, in which case I want to turn those characters into a sequence of utf8 encoded bytes, and send those bytes on to the xml parser.

@Grinnz commented Aug 22, 2017

is_utf8 does not indicate any such thing. It indicates how the string is internally represented. Nothing more, nothing less. As I mentioned, byte strings can be internally represented in UTF-8, and character strings can be internally represented in the native platform encoding.

@Grinnz commented Aug 22, 2017

Here, some examples.

A byte string representing the UTF-8 encoding of the snowman character; when printed directly, the terminal renders it correctly, as it expects UTF-8 encoded bytes. It is then upgraded so that its internal representation is UTF-8 (thus double-encoded, if assumed to be a character string). To perl, the string's content does not change.

perl -MDevel::Peek -MEncode=encode -E'my $str = encode "UTF-8", "\x{2603}"; say $str; Dump($str); utf8::upgrade($str); say $str; Dump($str);'

A character string with two characters, which has to be UTF-8 encoded to print correctly to a terminal expecting UTF-8 encoded bytes. Before upgrading, its internal representation is latin-1. To perl, the string's content does not change. Note that an upgrade like this can happen automatically in string operations, such as when concatenating with an upgraded character string.

perl -MDevel::Peek -MEncode=encode -E'my $str = "\x{e9}\x{a2}"; say encode "UTF-8", $str; Dump($str); utf8::upgrade($str); say encode "UTF-8", $str; Dump($str);'

@kasei (Contributor, Author) commented Aug 22, 2017

@Grinnz

I think the examples of utf8::upgrade are instructive, but I honestly wonder if it's relevant here. The snowman is a fine example. I've expanded it a bit, with the scalar starting out holding the literal value (I think this could just as easily be done after the fact with an additional decode call):

perl -MDevel::Peek -MEncode=encode -E'my $str = "\x{2603}"; say $str; say substr($str, 0, 1); Dump($str); $str = encode "UTF-8", $str; say $str; say substr($str, 0, 1); Dump($str); utf8::upgrade($str); say $str; say substr($str, 0, 1); Dump($str);'

The crux of the issue here is finding a way to know ahead of time if the substr call will return a snowman character, or something else (the first byte of whatever encoding is being used, e.g. "\xe2"). From this discussion, it sounds like what you're saying is that there is no way in perl to do this.

@Grinnz commented Aug 22, 2017

That is correct. The only way to know whether you receive bytes or characters is to define your interface (documentation) to work with one or the other.

(The intent of the utf8::upgrade examples is just to show what is_utf8 actually means.)

@kasei (Contributor, Author) commented Aug 22, 2017

OK. I'm surprised by that, but it sounds like there's no way to fix this bug with the current API. I'll close the PR and I guess somebody else can pursue this.

For my own sake, can you help me understand two things about these unicode-related issues?

  1. Can you give me an example where a substr($str, 0, 1) will return something that isn't a byte (i.e. has ord >= 256) when is_utf8 is false? I'm not sure I've seen one yet, which I think is a big part of my confusion as to why what I had done wasn't going to work in all cases.

  2. Does this mean that it's unsafe to have an API that accepts a filehandle for reading for which you don't know all of the perlio layers? It seems like the same issue would occur if you couldn't tell ahead of time a read was going to give you back encoded bytes or decoded characters. I and many others have APIs in code that accept file handles like this, and I'm now wondering if all of that code is subject to the same API issues as we're dealing with here.

Thanks.

@Grinnz commented Aug 22, 2017

  1. If is_utf8 is false, you are correct that the string cannot contain characters at ordinal 256 or above. However, this does not mean it is a byte string. This is where my second example from before comes in: it is a character string with is_utf8 false, and it is not ascii, so if you assume it is a byte string, it will be incorrectly handled.

  2. Yes. This is a big reason why (for example) File::Slurp's interface accepting filehandles has commonly been an issue, and the recommended alternative File::Slurper only accepts filenames. The only way you can accept filehandles reliably is by documenting that the passed filehandle must have no (extra) layers, or a particular set of layers.

@kasei (Contributor, Author) commented Aug 22, 2017

Good to know about file handles. Thanks. I guess I'll work on deprecating those methods.

Regarding point 1, I'm not clear on what your second example is intended to show. Since it's latin-1, couldn't we consider that both a byte string and a character string (since the width of characters is always 1 byte)? When you say "if you assume it is a byte string, it will be incorrectly handled" what does that mean? Wouldn't determination of correctness depend entirely on how you used the bytes after making the assumption? I'm not entirely sure how that part of this codebase works, but I'm imagining a case where you don't know the encoding, but you attempt to treat the byte string as ascii which in many cases is good enough to find charset="foo" in the first N bytes to guide the parsing heuristics.

@Grinnz commented Aug 22, 2017

I mean that if your output expects UTF-8, and you output that "byte string", it will not be correctly interpreted. It is a "byte string" in that sense, but it's latin-1 encoded (unless you're on EBCDIC), so you need to use it as a character string (meaning, encode it to whatever encoding is expected) to be compatible with anything that doesn't expect latin-1. And sure, if you ignore every character over ordinal 128, you are left with an ASCII string that is identical in byte and character representation.

@kasei (Contributor, Author) commented Aug 22, 2017

But the code in question doesn't output directly. If no explicit charset is supplied, it uses heuristics in order to correctly interpret the data. In fact, it does something very similar to what I described as a possible scenario, sniffing for BOMs in the bytes, and then falling back to using IO::HTML and some algorithm defined by HTML5:

https://github.com/kasei/p5-html-html5-parser/blob/master/lib/HTML/HTML5/Parser/TagSoupParser.pm#L558

@Grinnz commented Aug 22, 2017

That would probably work as a heuristic input, but that's not the case you presented earlier.

I'm only interested in the case where perl would treat the scalar as a collection of characters, in which case I want to turn those characters into a sequence of utf8 encoded bytes, and send those bytes on to the xml parser.

Your patch here assumes that is_utf8 being true means it is a decoded character string, but any string can be upgraded in perl, and only fully ascii strings are unaffected by UTF-8 encoding, so that can't work. Unless, similarly, you drop any non-ascii characters thus making the encoding irrelevant. The further problem is that the metadata found by the heuristics will only tell you what encoding the source was originally in; it won't tell you if it's already been decoded, which is the main issue here.

@kasei (Contributor, Author) commented Aug 22, 2017

Fair point. That was my mistake. I've been having some trouble keeping in my head what the code is trying to do and what I'm trying to fix :) If somebody explicitly upgrades a scalar, though, isn't that a case where they kind of take ownership of the outcome if the data isn't actually in the right format?

Honestly, I'm not sure what all the html5 sniffing stuff does, and I'm not in a position to dive deeply into it to find out. My take is that if you've resorted to sniffing, it's a best-effort sort of thing, and if it gets the wrong encoding, well that's just how things go.

Like I said, I'll take your advice and just close this PR and the issue can remain until somebody figures out a solution and/or rewrites the entire thing.

@Grinnz commented Aug 22, 2017

Scalars can be upgraded (or downgraded) for various reasons, including automatic upgrades as I mentioned earlier, and theoretically automatic downgrades (don't know if this occurs in practice). It's a bit tautological, but these operations have no effect on code that correctly uses perl strings (unless it's code dealing with the internals of the string, of course), so a manual upgrade or downgrade is usually an attempt to work around incorrect code (often XS code), and this causes bugs in other incorrect code. Unfortunately I also don't have a good understanding of this codebase to suggest solutions that don't involve is_utf8.

@kjetilk (Contributor) commented Sep 8, 2017

@jonassmedegaard brought up Encoding::FixLatin as a possible quick fix. Is it?

@karenetheridge

I just re-found this ticket, which is long but covers the is_utf8 problem in great detail -- it was certainly very illuminating for me when I first read it: https://rt.cpan.org/Ticket/Display.html?id=104433

@kjetilk (Contributor) commented Dec 19, 2017

Nice, thanks!

So, HTML5 parsing is fairly peripheral to the RDF stuff where @kasei and I have our main interest, and we've been working to mitigate the effects of this bug in our code.

I would be very happy if someone else stepped up to fix this. Also, @tobyink isn't working so much on it nowadays either, so if someone would assume maintainership of this module, I think it would make everybody happy.

Cheers,

Kjetil

@kjetilk (Contributor) commented Jul 4, 2019

With the time that has passed, have there been any advancements in other code that could help fix this?

@tobyink tobyink merged commit e15df7c into tobyink:master Sep 8, 2021
@Grinnz commented Sep 8, 2021

Please revert this, for the reasons discussed above.

@Grinnz commented Sep 8, 2021

Note that the PR author indicated that they understood that this is not a correct fix.

@kasei (Contributor, Author) commented Sep 8, 2021

Note that the PR author indicated that they understood that this is not a correct fix.

Well, I said I was OK closing it. I still think this might be a reasonable fix given the odd place this code sits in between libraries that do heuristic charset work, and the existing API surface that tries to make that work in situations where the caller doesn't know the underlying charset. But I don't have the expertise or desire to get dragged into a fight over it.

@kjetilk (Contributor) commented Sep 8, 2021

Yeah, it would be good to finally have a fix. As a first step, it would be interesting to have a failing test that demonstrates the problem more clearly. I had a test in #4, but that apparently needs to be extended.
