Issue with multi-byte representations of Unicode code points. #23
Comments
Could you provide a repro case so I can debug? Backgrounder: JS uses UTF-16 internally to represent strings, which still allows multi-byte symbols, but I am not sure it is even supported by JS.
Here's an example.
As all string translation is done by Node, it is likely a general problem, not just It is an interesting problem how Node actually processes multi-byte characters when reading files (or whatever) rather than an artificially chunked input. But I do understand this problem -- my other project The rational way to deal with it is to prepend Try them out and see if you still have this problem. If it works for you, I'll close the ticket and update the documentation, alerting users to the possible problem and the way to solve it. Otherwise, I can write this helper stream myself.
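For reference, one way to preprocess chunks (not necessarily the helper linked above) is Node's built-in `StringDecoder`, which holds back an incomplete multi-byte sequence until the remaining bytes arrive. A minimal sketch, assuming the input arrives as an array of `Buffer` chunks; `decodeChunks` is a hypothetical name used only for illustration:

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: decodes Buffer chunks into one string without
// breaking UTF-8 sequences that straddle a chunk boundary.
function decodeChunks(chunks) {
  const decoder = new StringDecoder('utf8');
  let text = '';
  for (const chunk of chunks) {
    // write() returns only complete characters; an incomplete trailing
    // sequence stays buffered inside the decoder until the next call.
    text += decoder.write(chunk);
  }
  // end() flushes whatever is still buffered.
  return text + decoder.end();
}

// "c" followed by U+030C (combining caron), split mid-sequence.
const splitInput = [Buffer.from([0x63, 0xcc]), Buffer.from([0x8c])];
console.log(decodeChunks(splitInput) === 'c\u030c'); // true
```

In a real pipeline the same decoder would sit in front of the parser; calling `setEncoding('utf8')` on the source readable stream should have a similar effect, since Node then decodes chunks with multi-byte handling before passing them on.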
Pressed the wrong button. :-)
I think it would be nice if the parsers handled it, and I wasn't sure whether you were aware of the issue. I just wanted to give you a heads up. But you are right, preprocessing the chunks works. Thanks for the links, too.
OK, now I need to update the docs.
By looking at the source of `Parser.js`, I suspect an issue with characters that are represented in UTF-8 by multiple bytes.

Example: `č` may be encoded as `[0x63, 0xcc, 0x8c]`, but there is no guarantee that these three bytes are passed to `transform` within the same buffer. And if, for example, it is passed within separate buffers `[0x63, 0xcc]` followed by `[0x8c]`, then it will be decoded as the JavaScript string `c��` before it is passed through to the actual parsing code.