Encoding issue in Node app #68
Comments
I think it's worth it to make sure special characters are supported. |
Andrew, yes I totally agree. I just want some feedback on this approach. River5 has been deployed enough now that it's important to do this slowly and carefully. ;-) |
I haven't used the node library |
The feed in question does nothing wrong. The document is encoded in ISO 8859-1. As it is an XML document, it has encoding information in its XML declaration, as mandated <https://www.w3.org/TR/REC-xml/#charencoding>. And as recommended by RFC 3023 <https://tools.ietf.org/html/rfc3023#section-3.2>, the response sends the correct charset information:
Content-Type: application/xml; charset=ISO-8859-1
The problem seems to be that none of the infrastructure makes use of this information. River5 uses davereader, davereader fetches with request, then parses the feed with feedparser <https://github.com/danmactough/node-feedparser>, which uses sax-js <https://github.com/isaacs/sax-js> to parse the XML structure. Nowhere in that chain of packages does anyone pay attention to encoding issues. It seems everyone assumes the web is already UTF-8 only. Depressing.
request has an encoding option per its Readme, but it seems to be used only when transforming the response buffer to a string. davereader instead pipes the stream directly to feedparser, which then uses sax-js, which assumes ... Unicode. The correct place to transform from other charsets to Unicode seems to be before piping it to feedparser.
feedparser agrees: they have an example in their repository, iconv.js <https://github.com/danmactough/node-feedparser/blob/master/examples/iconv.js>, which uses the same approach. And unlike the Stack Overflow answer, it does so in a way that respects the HTTP charset information and does the transform on a stream.
(The example uses the iconv package; the difference from iconv-lite seems to be that the former is a binding for the C library whereas the latter is a JS-only package with fewer supported encodings. The JS APIs seem to be the same.) |
I am the maintainer of River5.
So let's fix it so it does it right.
What's the proper way to do that?
|
I just read the sample from feedParser. I'll just do what he does. Easy.
|
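The approach settled on in this thread (read the charset from the HTTP Content-Type header, then convert the byte stream to UTF-8 before it reaches feedparser) might look roughly like the sketch below. This is not the iconv.js example from the feedparser repository: it uses only Node built-ins and handles just ISO 8859-1, and the helper names charsetFromContentType, latin1ChunkToUtf8 and createUtf8Transcoder are made up for illustration. Treating a missing charset as UTF-8 is also an assumption of the sketch. A real fix would more likely hand other charsets to iconv or iconv-lite.

```javascript
'use strict';
const { Transform } = require('stream');

// Hypothetical helper: extract the charset parameter from a Content-Type
// header value. Returns a lowercase name, or null if none is declared.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^\s;"]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: re-encode one ISO 8859-1 chunk as UTF-8.
// ISO 8859-1 is a single-byte encoding, so a character can never be split
// across two chunks and each chunk can be converted on its own.
function latin1ChunkToUtf8(chunk) {
  return Buffer.from(chunk.toString('latin1'), 'utf8');
}

// Hypothetical helper: build a Transform stream that emits UTF-8 bytes.
// Assumption: a missing charset is treated as UTF-8.
function createUtf8Transcoder(charset) {
  const name = (charset || 'utf-8').toLowerCase();
  if (name === 'utf-8' || name === 'utf8') {
    // Already UTF-8: pass chunks through unchanged.
    return new Transform({
      transform(chunk, _encoding, done) { done(null, chunk); }
    });
  }
  if (name === 'iso-8859-1' || name === 'latin1') {
    return new Transform({
      transform(chunk, _encoding, done) { done(null, latin1ChunkToUtf8(chunk)); }
    });
  }
  // Anything else is out of scope for this sketch.
  throw new Error('Unsupported charset in this sketch: ' + name);
}

// Intended use, assuming `response` is the HTTP response stream and
// `feedparser` is a FeedParser instance (neither is created here):
//   const charset = charsetFromContentType(response.headers['content-type']);
//   response.pipe(createUtf8Transcoder(charset)).pipe(feedparser);
```

The key design point is the one made in the thread: the conversion sits between the HTTP response and the parser, so the parser only ever sees UTF-8.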
Okay it's all packaged up as an NPM thing -- davefeedread. Here's the example code -- https://github.com/scripting/feedRead/blob/master/testing/test.js Dave |
Christoph Knopf reports re River5 a problem with reading umlauts.
He offers a feed that illustrates the problem.
I wrote a simple Node app that reads the file using the standard request package, and what he reports is observed. The umlauts appear as � characters.
I found this Stack Overflow thread that says the answer is to use iconv-lite. Others seem to confirm this is the way to go. Before I contemplate making a change to River5, I wanted to get the opinion of the braintrust.
Thanks in advance. 🚀
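
The symptom in this report can be reproduced without fetching anything. The sketch below is a made-up illustration, not the test app mentioned above: it takes the bytes of an ISO 8859-1 string containing an umlaut and decodes them twice, once as UTF-8 (which is effectively what happens when nothing in the chain looks at the declared charset) and once as Latin-1.

```javascript
'use strict';

// "Müller" encoded as ISO 8859-1: the umlaut is the single byte 0xFC.
const latin1Bytes = Buffer.from([0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72]);

// Decoding those bytes as UTF-8 fails on 0xFC, which is never a valid byte
// in UTF-8, so the decoder substitutes U+FFFD (the replacement character).
const wrong = latin1Bytes.toString('utf8');

// Decoding with the encoding the document actually uses recovers the text.
const right = latin1Bytes.toString('latin1');

console.log(JSON.stringify(wrong));
console.log(JSON.stringify(right));
```

Once the bytes have been decoded as UTF-8, the original character cannot be recovered from the resulting string, which is why the conversion has to happen on the raw bytes.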