Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoder output garbled text on occasions #88

Closed
bitinn opened this issue Dec 17, 2014 · 9 comments
Closed

Decoder output garbled text on occasions #88

bitinn opened this issue Dec 17, 2014 · 9 comments

Comments

@bitinn
Copy link

bitinn commented Dec 17, 2014

We are running into issues where some part of the response body are garbled upon decoding, and more annoyingly it does not always happen to the same page, not even the same text. So it's difficult for us to determine whether this is problem for needle or iconv-lite.

For example: http://www.huanqiukexue.com/html/newgc/2014/1215/25011.html is a page with charset gb2312, and upon multiple fetch we see different result on decode.

(text between question mark symbol are garbled, hopefully it show up the same on all OS)

attempt 1:

2005年,美国加利福尼亚大学河滨分校的心理学家索尼娅•柳�┟锥�斯基(Sonja Lyubomirsky)和她的同事一起,对多项研究的结论进行了综述,这些研究显示幸福与成功呈正比。

attempt 2:

2005年,美国加利福尼亚大学河滨分校的心理学家索尼娅•柳博米尔斯基(Sonja Lyubomirsky)和她的同事一起,对多项研究的结论进行了综述,这些研究显示幸福与成功呈正比。

There are also instance where both attempts result in the same garbled text.

attempt 1:

那些能够产生“心流”的任务不能太单调沉闷,�荒苋萌瞬�生挫败感;它还要具有充分的挑战性,要求一个人全神贯注。

attempt 2:

那些能够产生“心流”的任务不能太单调沉闷,�荒苋萌瞬�生挫败感;它还要具有充分的挑战性,要求一个人全神贯注。

and we use needle similar to this:

    stream.on('data', function(chunk) {
        if (chunk === null) {
            return;
        }

        raw.push(chunk);
    });

    stream.on('end', function(err) {
        body = Buffer.concat(raw).toString();
    });
@bitinn
Copy link
Author

bitinn commented Dec 18, 2014

I tried to set both parse and decode to false and listen on headers event to inspect the header but charset=xxx are somehow missing from content-type, is this a node http module feature or am I just that unlucky...

$ curl -I http://bitinn.net/11084/
HTTP/1.1 200 OK
Date: Thu, 18 Dec 2014 10:12:17 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Set-Cookie: __cfduid=d3c54904565e8e93a88573cde1a7d6a211418897537; expires=Fri, 18-Dec-15 10:12:17 GMT; path=/; domain=.bitinn.net; HttpOnly
X-Powered-By: PHP/5.4.17RC1
X-Pingback: http://bitinn.net/xmlrpc.php
Link: <http://bitinn.net/?p=11084>; rel=shortlink
Server: cloudflare-nginx
CF-RAY: 19aa94ca892d11e3-SJC
needle.on('headers', function(headers) {
  console.log(headers);
});

{ date: 'Thu, 18 Dec 2014 10:14:12 GMT',
  'content-type': 'text/html',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=de102602fcd12a37cfcb6d934514556831418897652; expires=Fri, 18-Dec-15 10:14:12 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  server: 'cloudflare-nginx',
  'cf-ray': '19aa979644b611e3-SJC',
  'content-encoding': 'gzip' }
chrome

HTTP/1.1 200 OK
Date: Thu, 18 Dec 2014 10:17:07 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.4.17RC1
X-Pingback: http://bitinn.net/xmlrpc.php
Server: cloudflare-nginx
CF-RAY: 19aa9bddb34311e3-SJC
Content-Encoding: gzip

@bitinn
Copy link
Author

bitinn commented Dec 18, 2014

So I ended up writing my own charset detection checks and convert charset on stream end instead of using the built-in stream transform decoder. And the problem is gone, so I suspect needle decoder is somehow messing up the buffer (chunk).

Looking at the decoder.js, I don't see how it prevent chunk from causing invalid multi-byte cut-off. So while the charset is detected correctly, there is no guarantee that the start/end of each chunk does not cut-off a multi-byte character, thus the issue I am observing...

So it looks like my previous concern is legit after all, and best practice should be to keep the chunk in array and concat on end of stream, then convert encoding altogether.

ref:

@tomas
Copy link
Owner

tomas commented Dec 18, 2014

Good find!

I guess the problem is that we're trying to decode individual chunks instead of the whole thing once the transfer ends, which leads to multibyte chars being cut, as you mention.

I'm not sure if there's a way around this, though. Did you try the .collect trick explained in the first link (iconv-lite wiki)?

@tomas
Copy link
Owner

tomas commented Dec 18, 2014

On second thought, rather than using .collect we might try using iconv.decodeStream(charset) instead of new StreamDecoder(charset) in line 49 in decoder.js. That way we can rely on iconv-lite's internal stream decoding logic instead of calling iconv.decode manually for each chunk.

Please give it a try and let me know if it works!

@bitinn
Copy link
Author

bitinn commented Dec 18, 2014

If we know the original charset beforehand (say charset is present in header content-type), then it works; but if we need to extract the charset from body (say charset is from meta tag), then we must work with the first chunk.

PS: I still haven't figure out why is charset often missing from needle headers event (and some headers dropped), I can't reproduce this with curl or chrome...

{ date: 'Thu, 18 Dec 2014 19:57:49 GMT',
  'content-type': 'text/html; charset=UTF-8',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=d0054f11661f55f78714aed3a5f67e3091418932669; expires=Fri, 18-Dec-15 19:57:49 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  link: '<http://bitinn.net/?p=11084>; rel=shortlink',
  'x-pingback': 'http://bitinn.net/xmlrpc.php',
  server: 'cloudflare-nginx',
  'cf-ray': '19adee7f8bb611e9-SJC',
  'content-encoding': 'gzip' }
{ date: 'Thu, 18 Dec 2014 19:58:03 GMT',
  'content-type': 'text/html',
  'transfer-encoding': 'chunked',
  connection: 'close',
  'set-cookie': [ '__cfduid=dea4aa77a2174c06c88f3b403e900cc2e1418932683; expires=Fri, 18-Dec-15 19:58:03 GMT; path=/; domain=.bitinn.net; HttpOnly' ],
  'x-powered-by': 'PHP/5.4.17RC1',
  server: 'cloudflare-nginx',
  'cf-ray': '19adeed7571311e3-SJC',
  'content-encoding': 'gzip' }

not a huge problem but quite annoying when trying to debug what's going on...

@leesei
Copy link

leesei commented Jun 25, 2015

I encounter the same issue with the latest release (0.9.2).
@bitinn, indeed many server won't report encoding in HTTP header.
The HTML meta detection with the first chunk works correctly and detects the page's encoding as big5.
Some of above snippets are not relevant anymore so I'm restating the issue here.

Say the BIG5 string "國際" (0xB0EA, 0xBBDA) in HTML is split across chunk.

// chunk = <Buffer B0 EA BB>, this.charset = 'big5'
// this would produce garbage
res = iconv.decode(chunk, this.charset);

So no one has tried iconv.decodeStream(charset) yet?

@bitinn
Copy link
Author

bitinn commented Jun 25, 2015

@leesei shameless self-plug, I ended up writing this module: https://github.com/bitinn/node-fetch

You should be able to decode res.body as a stream, using iconv.decodeStream. But in most case you probably don't need a html stream, right? So we make sure res.text() decode correctly by buffering.

@leesei
Copy link

leesei commented Jun 25, 2015

@bitinn Thanks for the info. I'll wait and see if there's any progress on this issue.

I tried to create a StreamDecoder after getting charset at decoder.js:28, but my stream-fu is not powerful enough to get the job done.
And I don't know how error should be handled when using iconv's stream API.

@tomas
Copy link
Owner

tomas commented Feb 13, 2018

I'm closing this issue for the time being. If anyone wants more context, here's the related discussion on the iconv-lite repo.

@tomas tomas closed this as completed Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants