
Issue with multi-byte representations of unicode code points. #23

Closed
alwinb opened this issue Jun 13, 2017 · 6 comments
@alwinb
alwinb commented Jun 13, 2017

Looking at the source of Parser.js, I suspect an issue with characters that are represented in UTF-8 by multiple bytes.

Example: č may be encoded as [0x63, 0xcc, 0x8c], but there is no guarantee that these three bytes are passed to transform within the same buffer. If, for example, they arrive as the separate buffers [0x63, 0xcc] followed by [0x8c], they will be decoded as the JavaScript string c�� before being passed through to the actual parsing code.

@uhop
Owner

uhop commented Jun 13, 2017

Could you put together a repro case, so I can debug?

Background: JS uses UTF-16 internally to represent strings, which still allows multi-unit symbols (surrogate pairs), but I am not sure how well JS supports them.

@uhop uhop self-assigned this Jun 13, 2017
@alwinb
Author

alwinb commented Jun 14, 2017

Here's an example.

var makeSource = require('stream-json')

var log = console.log.bind(console)

var sjon1 = new makeSource().on('stringValue', log).input
var sjon2 = new makeSource().on('stringValue', log).input

// "č" as the bytes 0x63 0xcc 0x8c, split across buffers two ways:
var chunks1 = ['"', Buffer.from([0x63, 0xcc]), Buffer.from([0x8c]), '"']
var chunks2 = ['"', Buffer.from([0x63]), Buffer.from([0xcc, 0x8c]), '"']

chunks1.forEach(function (c) { sjon1.write(c) }) // Results in `c��`
chunks2.forEach(function (c) { sjon2.write(c) }) // Results in `č`

@uhop
Owner

uhop commented Jun 14, 2017

As all string translation is done by Node, this is likely a general problem, not one specific to stream-json. I suspect that if, in your example, you implemented a super-simple stream instead of stream-json, one that prints its input character by character, you would see the same problem.

It is an interesting question how Node actually processes multi-byte characters when reading files (or whatever) rather than an artificially chunked input. But I do understand this problem: my other project re2 actually transforms strings to UTF-8 and back, and I had to deal with similar problems anyway.

The rational way to deal with it is to prepend stream-json with a stream that sanitizes the input, converting UTF-8 characters while properly handling multi-byte boundaries. A quick search revealed at least two projects on npm that appear to do just that:

Try them out and see if you still have this problem. If it works for you, I'll close the ticket and update the documentation to alert users about the possible problem and the way to solve it. Otherwise, I can write this helper stream myself.

@uhop uhop closed this as completed Jun 14, 2017
@uhop uhop reopened this Jun 14, 2017
@uhop
Owner

uhop commented Jun 14, 2017

Pressed the wrong button. :-)

@alwinb
Author

alwinb commented Jun 14, 2017

I think it would be nice if the parser handled it, and I wasn't sure whether you were aware of the issue, so I just wanted to give you a heads-up. But you are right: preprocessing the chunks works. Thanks for the links, too.

@uhop
Owner

uhop commented Jun 14, 2017

OK, now I need to update the docs.

@uhop uhop closed this as completed in 112067c Jun 16, 2017