Decoder output garbled text on occasions #88
Comments
I tried to set both
So I ended up writing my own charset detection checks and converting the charset on the stream.
Looking at decoder.js, I don't see how it prevents a chunk boundary from cutting a multi-byte character in two. So while the charset is detected correctly, there is no guarantee that the start or end of each chunk does not split a multi-byte character, hence the issue I am observing...
So it looks like my previous concern is legitimate after all, and the best practice should be to keep the chunks in an array and concat on ref:
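To illustrate the concern above, here is a minimal sketch comparing per-chunk decoding with buffering the chunks and decoding once. It uses UTF-8 so that it runs with Node built-ins only; the same reasoning should apply to gb2312 or big5 when decoding with iconv-lite. The chunk split is simulated and the variable names are made up for this example.

```javascript
// Sketch: decode per chunk vs. decode once after concatenation.
// UTF-8 is used here only so the example needs no third-party module.
const text = '國際';
const bytes = Buffer.from(text, 'utf8'); // 6 bytes, 3 per character

// Simulate a transfer that splits the first character across two chunks.
const chunks = [bytes.subarray(0, 2), bytes.subarray(2)];

// Decoding each chunk on its own: the split character cannot be recovered.
const perChunk = chunks.map((c) => c.toString('utf8')).join('');

// Keeping the chunks in an array and decoding once at the end.
const collected = [];
for (const c of chunks) collected.push(c);
const whole = Buffer.concat(collected).toString('utf8');

console.log(perChunk === text); // false
console.log(whole === text);    // true
```

The trade-off is memory: the whole body is held until the transfer ends, which is what makes the single decode safe.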
Good find! I guess the problem is that we're trying to decode individual chunks instead of the whole thing once the transfer ends, which leads to multibyte chars being cut, as you mention. I'm not sure if there's a way around this, though. Did you try the
On second thought, rather than using .collect we might try using
Please give it a try and let me know if it works!
If we know the original charset beforehand (say, the charset is present in the Content-Type header), then it works; but if we need to extract the charset from the body (say, from a meta tag), then we must work with the first chunk. PS: I still haven't figured out why the charset is often missing from needle
Not a huge problem, but quite annoying when trying to debug what's going on...
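The two cases described above (charset in the Content-Type header versus charset in a meta tag inside the body) could be sketched roughly as follows. The helper names, the 1024-byte sniffing window, and the 'utf-8' fallback are all assumptions made for this example; none of them come from needle or iconv-lite.

```javascript
// Hypothetical helpers for illustration; not part of needle's API.
const CHARSET_RE = /charset\s*=\s*["']?([\w-]+)/i;

// Case 1: the charset is announced in the Content-Type header.
function charsetFromContentType(header) {
  const match = CHARSET_RE.exec(header || '');
  return match ? match[1].toLowerCase() : null;
}

// Case 2: the charset is only declared in a meta tag, so we sniff the first chunk.
function charsetFromFirstChunk(chunk) {
  // latin1 maps each byte to one character, so the ASCII markup stays readable
  // even if the rest of the chunk is in a multi-byte encoding.
  const head = chunk.subarray(0, 1024).toString('latin1');
  const match = CHARSET_RE.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Prefer the header, then the body; the 'utf-8' fallback is an assumption.
function detectCharset(contentType, firstChunk) {
  return (
    charsetFromContentType(contentType) ||
    charsetFromFirstChunk(firstChunk) ||
    'utf-8'
  );
}

console.log(detectCharset('text/html; charset=GB2312', Buffer.alloc(0)));       // gb2312
console.log(detectCharset('text/html', Buffer.from('<meta charset="big5">'))); // big5
```

Note that sniffing only the first chunk assumes the meta tag arrives within it, which is exactly the limitation mentioned above.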
I encounter the same issue with the latest release (0.9.2). Say the BIG5 string "國際" (0xB0EA, 0xBBDA) in HTML is split across chunks.
// chunk = <Buffer B0 EA BB>, this.charset = 'big5'
// this would produce garbage
res = iconv.decode(chunk, this.charset);
So no one has tried
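One possible way to handle the BIG5 split described above is to hold back a dangling lead byte and prepend it to the next chunk before decoding. The sketch below only finds the safe byte boundary and does no decoding itself; `splitAtBig5Boundary` is a hypothetical helper, not something needle or iconv-lite provides, and it assumes every byte in 0x81-0xFE that starts a character is a lead byte followed by exactly one trail byte (invalid sequences are not checked).

```javascript
// Hypothetical helper: split a chunk so that `complete` ends on a character
// boundary and `rest` holds a trailing lead byte whose trail byte is missing.
function splitAtBig5Boundary(buf) {
  let i = 0;
  while (i < buf.length) {
    const isLead = buf[i] >= 0x81 && buf[i] <= 0xfe;
    if (!isLead) {
      i += 1; // single-byte (ASCII) character
      continue;
    }
    if (i + 1 >= buf.length) break; // lead byte without its trail byte
    i += 2; // complete two-byte character
  }
  return { complete: buf.subarray(0, i), rest: buf.subarray(i) };
}

// The chunks from the example: "國際" (B0 EA, BB DA) split after the third byte.
const chunk1 = Buffer.from([0xb0, 0xea, 0xbb]);
const chunk2 = Buffer.from([0xda]);

const first = splitAtBig5Boundary(chunk1);
// Carry the leftover byte over to the next chunk.
const second = splitAtBig5Boundary(Buffer.concat([first.rest, chunk2]));

console.log(first.complete);  // <Buffer b0 ea>
console.log(first.rest);      // <Buffer bb>
console.log(second.complete); // <Buffer bb da>
```

Each `complete` buffer could then be passed to `iconv.decode(complete, 'big5')` without cutting a character in half.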
@leesei shameless self-plug, I ended up writing this module: https://github.com/bitinn/node-fetch
You should be able to decode res.body as a stream, using
@bitinn Thanks for the info. I'll wait and see if there's any progress on this issue. I tried to create a
I'm closing this issue for the time being. If anyone wants more context, here's the related discussion on the iconv-lite repo.
We are running into issues where some parts of the response body are garbled upon decoding and, more annoyingly, it does not always happen to the same page, or even to the same text. So it's difficult for us to determine whether this is a problem with needle or iconv-lite.
For example: http://www.huanqiukexue.com/html/newgc/2014/1215/25011.html is a page with charset gb2312, and across multiple fetches we see different results on decode.
(The text between question mark symbols is garbled; hopefully it shows up the same on all OSes.)
attempt 1:
attempt 2:
There are also instances where both attempts result in the same garbled text.
attempt 1:
attempt 2:
and we use needle similar to this: