
Encoding issue in Node app #68

Open
scripting opened this issue Mar 14, 2018 · 7 comments
@scripting (Owner) commented Mar 14, 2018

Christoph Knopf reports re River5 a problem with reading umlauts.

He offers a feed that illustrates the problem.

I wrote a simple Node app that reads the file using the standard request package, and what he reports is observed. The umlauts appear as � characters.
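The failure can be reproduced without any feed at all: decoding ISO-8859-1 bytes as if they were UTF-8 is what produces the � (U+FFFD) replacement characters. A minimal illustration in Node (the byte values here are just an example):

```javascript
// "ü" is the single byte 0xFC in ISO-8859-1, which is not valid UTF-8.
const bytes = Buffer.from([0x46, 0xfc, 0x72]); // "Für" encoded as ISO-8859-1

console.log(bytes.toString("utf8"));   // "F�r" -- the invalid byte becomes U+FFFD
console.log(bytes.toString("latin1")); // "Für" -- decoded with the right charset
```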

I found this Stack Overflow thread that says the answer is to use iconv-lite. Others seem to confirm this is the way to go. Before I contemplate making a change to River5, I wanted to get the opinion of the braintrust.

Thanks in advance. 🚀

@andrewshell

I think it's worth it to make sure special characters are supported.

@scripting (Owner, Author)

Andrew, yes I totally agree. I just want some feedback on this approach. River5 has been deployed enough now that it's important to do this slowly and carefully. ;-)

@andrewshell

I haven't used the Node library iconv-lite, but I know PHP has an iconv package, so I'm assuming it's a fairly common library/API for handling this sort of thing.

@ttepasse

The feed in question does nothing wrong. The document is encoded in ISO 8859-1. As it is an XML document, it has encoding information in its XML declaration as mandated. And as recommended by RFC 3023 the response sends the correct charset information:

Content-Type: application/xml; charset=ISO-8859-1

The problem seems to be that none of the infrastructure makes use of this information. River5 uses davereader, davereader fetches with request, then parses the feed with feedparser, which uses sax-js to parse the XML structure. Nowhere in that chain of packages does anyone pay attention to encoding. It seems everyone assumes the web is already UTF-8 only. Depressing.
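Honoring the charset doesn't take much for the non-streaming case. A minimal sketch, assuming request is asked for a raw Buffer (its `encoding: null` option) -- the helper names are illustrative, not functions from River5 or davereader, and Node's built-in TextDecoder stands in for iconv-lite here:

```javascript
// Illustrative helpers, not part of River5 or davereader.
function charsetFromContentType(contentType) {
  // "application/xml; charset=ISO-8859-1" -> "iso-8859-1"
  const match = /charset=([^;\s"']+)/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : "utf-8";
}

function decodeBody(bodyBuffer, contentType) {
  // With iconv-lite this would be iconv.decode(bodyBuffer, charset);
  // TextDecoder (built into Node) covers the common charsets.
  return new TextDecoder(charsetFromContentType(contentType)).decode(bodyBuffer);
}

const body = Buffer.from([0x46, 0xfc, 0x72]); // "Für" in ISO-8859-1
console.log(decodeBody(body, "application/xml; charset=ISO-8859-1")); // "Für"
```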

request has an encoding option per its Readme, but it seems it only gets used when transforming the response buffer to a string. davereader instead pipes the stream directly to feedparser, which then uses sax-js, which assumes ... Unicode. The correct place to transform from other charsets to Unicode seems to be before piping to feedparser.

feedparser agrees: they have an example in their repository, iconv.js, which uses the same approach. And unlike the Stack Overflow answer, it respects the HTTP charset information and does the transform on a stream.

(The example uses the iconv package; the difference from iconv-lite seems to be that the former is a binding to the C library whereas the latter is a JS-only package with fewer encodings. The JS APIs seem to be the same.)

@scripting (Owner, Author) commented Mar 14, 2018 via email

@scripting (Owner, Author) commented Mar 14, 2018 via email

@scripting (Owner, Author)

Okay it's all packaged up as an NPM thing -- davefeedread.

Here's the example code --

https://github.com/scripting/feedRead/blob/master/testing/test.js

Dave
