Decoder output garbled text on occasions #88

bitinn · 2014-12-17T18:45:17Z

We are running into issues where some part of the response body are garbled upon decoding, and more annoyingly it does not always happen to the same page, not even the same text. So it's difficult for us to determine whether this is problem for needle or iconv-lite.

For example: http://www.huanqiukexue.com/html/newgc/2014/1215/25011.html is a page with charset gb2312, and upon multiple fetch we see different result on decode.

(text between question mark symbol are garbled, hopefully it show up the same on all OS)

attempt 1:

2005年，美国加利福尼亚大学河滨分校的心理学家索尼娅•柳�┟锥�斯基（Sonja Lyubomirsky）和她的同事一起，对多项研究的结论进行了综述，这些研究显示幸福与成功呈正比。

attempt 2:

2005年，美国加利福尼亚大学河滨分校的心理学家索尼娅•柳博米尔斯基（Sonja Lyubomirsky）和她的同事一起，对多项研究的结论进行了综述，这些研究显示幸福与成功呈正比。

There are also instance where both attempts result in the same garbled text.

attempt 1:

那些能够产生“心流”的任务不能太单调沉闷，�荒苋萌瞬�生挫败感；它还要具有充分的挑战性，要求一个人全神贯注。

attempt 2:

那些能够产生“心流”的任务不能太单调沉闷，�荒苋萌瞬�生挫败感；它还要具有充分的挑战性，要求一个人全神贯注。

and we use needle similar to this:

    stream.on('data', function(chunk) {
        if (chunk === null) {
            return;
        }

        raw.push(chunk);
    });

    stream.on('end', function(err) {
        body = Buffer.concat(raw).toString();
    });

The text was updated successfully, but these errors were encountered:

bitinn · 2014-12-18T10:15:47Z

I tried to set both parse and decode to false and listen on headers event to inspect the header but charset=xxx are somehow missing from content-type, is this a node http module feature or am I just that unlucky...

$ curl -I http://bitinn.net/11084/
HTTP/1.1 200 OK
Date: Thu, 18 Dec 2014 10:12:17 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Set-Cookie: __cfduid=d3c54904565e8e93a88573cde1a7d6a211418897537; expires=Fri, 18-Dec-15 10:12:17 GMT; path=/; domain=.bitinn.net; HttpOnly
X-Powered-By: PHP/5.4.17RC1
X-Pingback: http://bitinn.net/xmlrpc.php
Link: <http://bitinn.net/?p=11084>; rel=shortlink
Server: cloudflare-nginx
CF-RAY: 19aa94ca892d11e3-SJC

needle.on('headers', function(headers) {
  console.log(headers);
});

{ date: 'Thu, 18 Dec 2014 10:14:12 GMT',
  'content-type': 'text/html',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=de102602fcd12a37cfcb6d934514556831418897652; expires=Fri, 18-Dec-15 10:14:12 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  server: 'cloudflare-nginx',
  'cf-ray': '19aa979644b611e3-SJC',
  'content-encoding': 'gzip' }

chrome

HTTP/1.1 200 OK
Date: Thu, 18 Dec 2014 10:17:07 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.4.17RC1
X-Pingback: http://bitinn.net/xmlrpc.php
Server: cloudflare-nginx
CF-RAY: 19aa9bddb34311e3-SJC
Content-Encoding: gzip

bitinn · 2014-12-18T11:03:04Z

So I ended up writing my own charset detection checks and convert charset on stream end instead of using the built-in stream transform decoder. And the problem is gone, so I suspect needle decoder is somehow messing up the buffer (chunk).

Looking at the decoder.js, I don't see how it prevent chunk from causing invalid multi-byte cut-off. So while the charset is detected correctly, there is no guarantee that the start/end of each chunk does not cut-off a multi-byte character, thus the issue I am observing...

So it looks like my previous concern is legit after all, and best practice should be to keep the chunk in array and concat on end of stream, then convert encoding altogether.

ref:

https://github.com/ashtuchkin/iconv-lite/wiki/Use-Buffers-when-decoding, while the problem is different, the solution applies.
http://nodejs.org/api/stream.html#stream_readable_setencoding_encoding, explain the problem, though in needle's case it's not toString causing problem, but chunk contains partial multi-byte.

tomas · 2014-12-18T15:13:31Z

Good find!

I guess the problem is that we're trying to decode individual chunks instead of the whole thing once the transfer ends, which leads to multibyte chars being cut, as you mention.

I'm not sure if there's a way around this, though. Did you try the .collect trick explained in the first link (iconv-lite wiki)?

tomas · 2014-12-18T15:41:49Z

On second thought, rather than using .collect we might try using iconv.decodeStream(charset) instead of new StreamDecoder(charset) in line 49 in decoder.js. That way we can rely on iconv-lite's internal stream decoding logic instead of calling iconv.decode manually for each chunk.

Please give it a try and let me know if it works!

bitinn · 2014-12-18T20:23:04Z

If we know the original charset beforehand (say charset is present in header content-type), then it works; but if we need to extract the charset from body (say charset is from meta tag), then we must work with the first chunk.

PS: I still haven't figure out why is charset often missing from needle headers event (and some headers dropped), I can't reproduce this with curl or chrome...

{ date: 'Thu, 18 Dec 2014 19:57:49 GMT',
  'content-type': 'text/html; charset=UTF-8',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=d0054f11661f55f78714aed3a5f67e3091418932669; expires=Fri, 18-Dec-15 19:57:49 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  link: '<http://bitinn.net/?p=11084>; rel=shortlink',
  'x-pingback': 'http://bitinn.net/xmlrpc.php',
  server: 'cloudflare-nginx',
  'cf-ray': '19adee7f8bb611e9-SJC',
  'content-encoding': 'gzip' }

{ date: 'Thu, 18 Dec 2014 19:58:03 GMT',
  'content-type': 'text/html',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=dea4aa77a2174c06c88f3b403e900cc2e1418932683; expires=Fri, 18-Dec-15 19:58:03 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  server: 'cloudflare-nginx',
  'cf-ray': '19adeed7571311e3-SJC',
  'content-encoding': 'gzip' }

not a huge problem but quite annoying when trying to debug what's going on...

leesei · 2015-06-25T03:56:19Z

I encounter the same issue with the latest release (0.9.2).
@bitinn, indeed many server won't report encoding in HTTP header.
The HTML meta detection with the first chunk works correctly and detects the page's encoding as big5.
Some of above snippets are not relevant anymore so I'm restating the issue here.

Say the BIG5 string "國際" (0xB0EA, 0xBBDA) in HTML is split across chunk.

// chunk = <Buffer B0 EA BB>, this.charset = 'big5'
// this would produce garbage
res = iconv.decode(chunk, this.charset);

So no one has tried iconv.decodeStream(charset) yet?

bitinn · 2015-06-25T06:12:17Z

@leesei shameless self-plug, I ended up writing this module: https://github.com/bitinn/node-fetch

You should be able to decode res.body as a stream, using iconv.decodeStream. But in most case you probably don't need a html stream, right? So we make sure res.text() decode correctly by buffering.

leesei · 2015-06-25T16:35:44Z

@bitinn Thanks for the info. I'll wait and see if there's any progress on this issue.

I tried to create a StreamDecoder after getting charset at decoder.js:28, but my stream-fu is not powerful enough to get the job done.
And I don't know how error should be handled when using iconv's stream API.

tomas · 2018-02-13T21:58:50Z

I'm closing this issue for the time being. If anyone wants more context, here's the related discussion on the iconv-lite repo.

tomas closed this as completed Feb 13, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decoder output garbled text on occasions #88

Decoder output garbled text on occasions #88

bitinn commented Dec 17, 2014

bitinn commented Dec 18, 2014

bitinn commented Dec 18, 2014

tomas commented Dec 18, 2014

tomas commented Dec 18, 2014

bitinn commented Dec 18, 2014

leesei commented Jun 25, 2015

bitinn commented Jun 25, 2015

leesei commented Jun 25, 2015

tomas commented Feb 13, 2018

Decoder output garbled text on occasions #88

Decoder output garbled text on occasions #88

Comments

bitinn commented Dec 17, 2014

bitinn commented Dec 18, 2014

bitinn commented Dec 18, 2014

tomas commented Dec 18, 2014

tomas commented Dec 18, 2014

bitinn commented Dec 18, 2014

leesei commented Jun 25, 2015

bitinn commented Jun 25, 2015

leesei commented Jun 25, 2015

tomas commented Feb 13, 2018