
Issue with multi-byte representations of unicode code points. #23

Closed
alwinb opened this issue Jun 13, 2017 · 6 comments
@alwinb
alwinb commented Jun 13, 2017

Looking at the source of Parser.js, I suspect an issue with characters that are represented in UTF-8 by multiple bytes.

Example: č may be encoded as [0x63, 0xcc, 0x8c], but there is no guarantee that these three bytes are passed to transform within the same buffer. If, for example, they arrive as the separate buffers [0x63, 0xcc] followed by [0x8c], they will be decoded as the JavaScript string c�� before being passed through to the actual parsing code.

@uhop
Owner

uhop commented Jun 13, 2017

Could you put together a repro case, so I can debug?

Background: JS uses UTF-16 internally to represent strings, which still allows multi-unit symbols (surrogate pairs), but I am not sure how well JS supports them.

@uhop uhop self-assigned this Jun 13, 2017
@alwinb
Author

alwinb commented Jun 14, 2017

Here's an example.

var makeSource = require('stream-json')

var log = console.log.bind(console)

var sjon1 = new makeSource().on('stringValue', log).input
var sjon2 = new makeSource().on('stringValue', log).input

// "č" as the bytes 0x63 0xcc 0x8c, split across buffers two ways:
var chunks1 = ['"', Buffer.from([0x63, 0xcc]), Buffer.from([0x8c]), '"']
var chunks2 = ['"', Buffer.from([0x63]), Buffer.from([0xcc, 0x8c]), '"']

chunks1.forEach(function (c) { sjon1.write(c) }) // Results in `c��`
chunks2.forEach(function (c) { sjon2.write(c) }) // Results in `č`

@uhop
Owner

uhop commented Jun 14, 2017

As all string translation is done by Node, this is likely a general problem, not one specific to stream-json. I suspect that if, in your example, you implemented a super-simple stream instead of stream-json, one that prints its input character by character, you would see the same problem.

It is an interesting question how Node actually processes multi-byte characters when reading files (or whatever) rather than an artificially chunked input. But I do understand this problem: my other project re2 actually transforms strings to UTF-8 and back, and I had to deal with similar problems anyway.

The rational way to deal with it is to prepend stream-json with a stream that sanitizes the input, converting UTF-8 characters while properly handling multi-byte boundaries. A quick search revealed at least two projects on npm that appear to do just that:

Try them out and see if you still have this problem. If it works for you, I'll close the ticket and update the documentation to alert users about the possible problem and the way to solve it. Otherwise, I can write this helper stream myself.

@uhop uhop closed this as completed Jun 14, 2017
@uhop uhop reopened this Jun 14, 2017
@uhop
Owner

uhop commented Jun 14, 2017

Pressed the wrong button. :-)

@alwinb
Author

alwinb commented Jun 14, 2017

I think it would be nice if the parser handled it, and I wasn't sure whether you were aware of the issue, so I just wanted to give you a heads-up. But you are right: preprocessing the chunks works. Thanks for the links, too.

@uhop
Owner

uhop commented Jun 14, 2017

OK, now I need to update the docs.

@uhop uhop closed this as completed in 112067c Jun 16, 2017