Handle encoding of subresources #46

Treora · 2019-06-01T10:40:40Z

Freeze-dry produces garbled output when a stylesheet or framed document is encoded in UTF-16, UTF-32, or possibly other encodings. We decode these resources with FileReader.readAsText, which assumes UTF-8 by default. That assumption holds most of the time, but when it does not, the resource is effectively unreadable.
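To illustrate the failure mode, here is a minimal sketch using TextDecoder rather than FileReader (an assumption made so that it runs outside a page). The sample bytes are made up: the stylesheet `a{}` encoded as UTF-16LE with a BOM.

```javascript
// Hypothetical sample: the stylesheet "a{}" encoded as UTF-16LE, with BOM.
const utf16Bytes = new Uint8Array([0xFF, 0xFE, 0x61, 0x00, 0x7B, 0x00, 0x7D, 0x00])

// Decoding as UTF-8 (the default assumption) turns the BOM into replacement
// characters and leaves a NUL after every ASCII character.
const asUtf8 = new TextDecoder('utf-8').decode(utf16Bytes)

// Decoding with the right label recovers the text; the BOM is stripped.
const asUtf16 = new TextDecoder('utf-16le').decode(utf16Bytes)

console.log(JSON.stringify(asUtf8))
console.log(JSON.stringify(asUtf16))
```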

I do not know enough about the standards, but I suppose the decoder should look at the HTTP Content-Type header, the file’s byte order mark (BOM), and in-document declarations (@charset in CSS, <meta charset=…> in HTML).
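A rough sketch of such a decoder for stylesheets, assuming the priority order BOM, then Content-Type charset, then @charset, then a UTF-8 fallback. The order is my reading of the CSS and Encoding specs and should be double-checked. All function names are hypothetical, and the @charset handling is simplified. Note that TextDecoder has no UTF-32 support, so that case is not covered.

```javascript
// Hypothetical helpers, not part of freeze-dry.

// Return the encoding indicated by a byte order mark, or null if none.
function sniffBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) return 'utf-8'
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) return 'utf-16be'
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) return 'utf-16le'
  return null
}

// Extract the charset parameter from a Content-Type header value, or null.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '')
  return match ? match[1].toLowerCase() : null
}

// Look for an @charset rule at the very start of the stylesheet bytes.
function charsetFromCssAtRule(bytes) {
  // Read the first bytes one code unit per byte; the rule itself is ASCII.
  const head = String.fromCharCode(...bytes.subarray(0, 1024))
  const match = /^@charset "([^"]*)";/.exec(head)
  return match ? match[1].toLowerCase() : null
}

// Decode stylesheet bytes (ArrayBuffer or Uint8Array) to a string.
function decodeStylesheet(buffer, contentType) {
  const bytes = new Uint8Array(buffer)
  const label = sniffBom(bytes)
    || charsetFromContentType(contentType)
    || charsetFromCssAtRule(bytes)
    || 'utf-8'
  let decoder
  try {
    decoder = new TextDecoder(label)
  } catch (error) {
    // TextDecoder throws a RangeError for labels it does not know.
    decoder = new TextDecoder('utf-8')
  }
  return decoder.decode(bytes)
}
```

An HTML variant would replace the @charset step with a scan for `<meta charset=…>`, which is more involved and left out here.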

This detection and decoding problem seems so generic that it should not have to burden this repo, but I have not yet discovered the right tool. Some options I thought of:

The browser’s fetch, which unfortunately does not appear to help with decoding; its Response.text() is spec'd to "return the result of running UTF-8 decode on bytes".
XMLHttpRequest.responseText, which does seem to respect the HTTP header and the BOM, though I am not sure about in-document declarations. It also feels a little outdated, as I think fetch was supposed to make it obsolete; but perhaps not.
Some JavaScript module? I have not yet found anything that comes close.
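For the fetch option, one possible workaround is to skip Response.text() and decode the raw bytes ourselves. The sketch below is only an assumption about how that could look: `responseToText` is a hypothetical name, and it honours only the Content-Type charset, not in-document declarations.

```javascript
// Hypothetical helper: decode a fetch Response using the charset from its
// Content-Type header, instead of the fixed UTF-8 of Response.text().
async function responseToText(response) {
  const contentType = response.headers.get('Content-Type') || ''
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType)
  let decoder
  try {
    decoder = new TextDecoder(match ? match[1] : 'utf-8')
  } catch (error) {
    // Unknown charset label: fall back to UTF-8.
    decoder = new TextDecoder('utf-8')
  }
  // arrayBuffer() gives the undecoded bytes of the response body.
  return decoder.decode(await response.arrayBuffer())
}
```

It would be used along the lines of `const css = await responseToText(await fetch(stylesheetUrl))`, where `stylesheetUrl` is a placeholder.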

Tips welcome.

Note this issue is similar to issue #29, but that one concerns the DOM that the browser has already decoded for us; this issue is about subresources we fetch.

Treora added the "snapshot quality" label (Improving fidelity/size/durability/etc of the output) on Jun 1, 2019