Text chunk processing should be in terms of the Encoding Standard #273

annevk · 2023-03-03T18:07:57Z

There's a couple of issues with the text here as far as I can tell:

It's not clear how invalid UTF-8 byte sequences are handled.
It's not clear if Latin1 means what it maps to in browsers (windows-1252) or isomorphic decode (https://infra.spec.whatwg.org/#isomorphic-decode).

I'm not sure it needs to say anything about encoders having data in other encodings. As there are no fields to store other encodings, it seems self-evident they have to convert.

svgeesus · 2023-03-06T16:14:23Z

Hi @annevk sounds good, do you have examples of other specs that correctly describe handling of invalid UTF-8 byte sequences by referencing Encoding?

Browsers tend not to display the text chunks inside images, so it is hard to test how they handle bad Latn-1 in practice.

annevk · 2023-03-06T17:41:28Z

See https://encoding.spec.whatwg.org/#specification-hooks for the possible UTF-8 paths. I suspect you want "UTF-8 decode".

If this is true Latin1 you want https://infra.spec.whatwg.org/#isomorphic-decode. Otherwise you want https://encoding.spec.whatwg.org/#decode with the windows-1252 encoding. I guess we could see what various operating systems do and go with that?

svgeesus · 2023-03-14T20:15:32Z

Looks like we will need some WPT tests with iTXt containing a UTF-8 BOM, to check it gets discarded rather then presented as literal characters or replacement characters.

svgeesus · 2023-03-14T20:33:22Z

TweakPNG can be used to view, and edit, the content of iTXt so should be good for making test files too.

ProgramMax · 2023-03-15T05:12:03Z

Yeah, TweakPNG is nice. I used it for the tRNS WPT test.

Do browsers not ignore iTXt? Is there a way for the browser to access it? (So it can be used in a WPT?)

svgeesus · 2023-04-27T15:57:33Z

It is possible to get at any part of a PNG file from JavaScript, though it does require writing one's own parser. As an example, PNG file chunk inspector

annevk · 2023-04-27T16:33:47Z

(I'm pretty sure browsers don't have an API to get at image data currently, though there have been proposals in the past.)

ProgramMax · 2023-04-28T04:38:48Z

I might be able to compile libpng to wasm, giving us a parser. But this seems pretty extreme for a WPT. And it wouldn't really be testing the browser's iTXt handling, anyway.

annevk · 2023-04-28T07:19:55Z

How are we writing tests at the moment?

I think ideally tests are in some kind of format that many different kind of implementations can consume. And then implementations that end up ignoring iTXt chunks simply don't run those tests (or only run them to ensure they do the correct thing decoding-wise).

ProgramMax · 2023-04-28T14:25:56Z

Currently, the tests let the browser raster the image and then query pixels to confirm they are the expected color:
https://wpt.fyi/results/png?label=master&label=experimental&aligned&view=subtest&q=png

The problem I mean is I bet most (all?) browsers see a iTXt chunk and skip past it. I don't think they parse the contents at all.
If I write a parser, I'm not testing the browser. And if my parser then handles the iTXt chunk correctly, all I've done is confirm my own parser behaves correctly. It wouldn't test any sort of conformance.

annevk · 2023-04-28T14:32:49Z

Right, so

iTXt probably needs to be optional in terms of that it is okay to skip past it. That also means that if you do process it you probably don't want to bail out on the entire image if there's an error (i.e., don't use a fatal text decoder).
Some aspects of this can be tested across all implementations.
Some tests won't be useful for all implementations.

I don't think we want to drop the feature so that's the best we can do here.

ProgramMax · 2023-04-28T15:09:49Z

Oh, I think I follow now.
We can create an image with a deliberately broken iTXt chunk and make sure the tests still display an image.

svgeesus · 2023-05-24T18:09:21Z

@annevk wrote:

iTXt probably needs to be optional in terms of that it is okay to skip past it.

PNG already has this: critical chunks are "absolutely required in order to successfully decode a PNG image" while ancillary chunks "may be ignored by a decoder." iTXt is ancillary.

That also means that if you do process it you probably don't want to bail out on the entire image if there's an error (i.e., don't use a fatal text decoder).

Covered under PNG decoders and viewers ... Error handling, specifically:

Recover from an error, if possible; otherwise fail gracefully. Errors that have little or no effect on the processing of the image may be ignored,

and

This is a specific kind of criticality and one that is not necessarily relevant to every conceivable decoder. For example, a program whose sole purpose is to extract text annotations (for example, copyright information) does not require a viewable image. Another decoder might consider the tRNS and gAMA chunks essential to its proper execution.

or, in this specific case, a program whose sole purpose is to extract text annotations should decode UTF-8 correctly (insert link to Encoding Standard).

@ProgramMax wrote:

We can create an image with a deliberately broken iTXt chunk and make sure the tests still display an image.

Yes.

xfq added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Mar 5, 2023

w3cbot mentioned this issue Mar 5, 2023

Text chunk processing should be in terms of the Encoding Standard w3c/i18n-activity#1668

Closed

svgeesus added the blocking-3rd-edition-cr label Mar 6, 2023

svgeesus added needs testcase (wpt) needs edits labels May 24, 2023

svgeesus mentioned this issue Jun 7, 2023

Follow Encoding standard on UTF-8 encode and decode #318

Merged

svgeesus closed this as completed in #318 Jun 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Text chunk processing should be in terms of the Encoding Standard #273

Text chunk processing should be in terms of the Encoding Standard #273

annevk commented Mar 3, 2023

svgeesus commented Mar 6, 2023

annevk commented Mar 6, 2023

svgeesus commented Mar 14, 2023

svgeesus commented Mar 14, 2023 •

edited

Loading

ProgramMax commented Mar 15, 2023

svgeesus commented Apr 27, 2023

annevk commented Apr 27, 2023

ProgramMax commented Apr 28, 2023

annevk commented Apr 28, 2023

ProgramMax commented Apr 28, 2023

annevk commented Apr 28, 2023 •

edited

Loading

ProgramMax commented Apr 28, 2023

svgeesus commented May 24, 2023

Text chunk processing should be in terms of the Encoding Standard #273

Text chunk processing should be in terms of the Encoding Standard #273

Comments

annevk commented Mar 3, 2023

svgeesus commented Mar 6, 2023

annevk commented Mar 6, 2023

svgeesus commented Mar 14, 2023

svgeesus commented Mar 14, 2023 • edited Loading

ProgramMax commented Mar 15, 2023

svgeesus commented Apr 27, 2023

annevk commented Apr 27, 2023

ProgramMax commented Apr 28, 2023

annevk commented Apr 28, 2023

ProgramMax commented Apr 28, 2023

annevk commented Apr 28, 2023 • edited Loading

ProgramMax commented Apr 28, 2023

svgeesus commented May 24, 2023

svgeesus commented Mar 14, 2023 •

edited

Loading

annevk commented Apr 28, 2023 •

edited

Loading