
End not working, does not work with require, and not performant. #3

Closed

iklotzko opened this issue Feb 28, 2022 · 2 comments

@iklotzko

End does not seem to work, and performance is not what I expected.

I'm sure I'm doing something wrong.

I am reading in a string that is an array; I need to separate the string into its array elements and pass them on. I was using JSON.parse but thought I'd give a SAX-like, event-based parser a try.

Also, this doesn't seem to work with require.

I get:

Instead change the require of mod.js in /home/iklotzko/monorepo/packages/common/lib/utils.js to a dynamic import() which is available in all CommonJS modules.


/** I'm using an await import in my requires **/
JsonHigh.then((jh) => {
  if (this.parsing) {
    return;
  }
  this.parsing = true;
  let c = [];
  let level = 0;
  let start2 = hrtime.bigint();
  const stream = jh({
    openArray: () => {
      level++;
      if (level === 1) {
        c = arguments['0'];
      }
      return true;
    },
    closeArray: () => {
      recordStat('JsonHighLow', start2);
      level--;
      if (level === 0) {
        if (`[${c.join(',')}]` != JSON.stringify(b)) {
          console.log(`c is not equal`);
        }
        this.parsing = false;
        return stream.end();
      }
    }
  });
  stream.chunk(buf.toString());
  stream.end(() => {
    console.log(`ended FINALLY`);
  });
});

@djedr
Member

djedr commented Mar 2, 2022

Hello,
Thanks for filing the issue!

doesn't seem to work with require

Indeed, it won't work with require: the library is written as ECMAScript modules (ESM).

To make it work out of the box in Node.js, you can enable ESM. I recommend this if possible.

Otherwise you'd have to resort to transpilation or a manual workaround such as a dynamic import().
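
For instance (a rough sketch, assuming a named JsonHigh export and a local './JsonHigh.js' path; adjust the specifier to wherever the module actually resolves in your setup):

// Option 1: enable ESM, e.g. by setting "type": "module" in package.json
// (or using an .mjs file), then use a regular static import:
//
//   import {JsonHigh} from './JsonHigh.js'
//
// Option 2: stay on CommonJS and load the module lazily via dynamic import():
async function makeStream(handlers) {
  const {JsonHigh} = await import('./JsonHigh.js')
  return JsonHigh(handlers)
}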

I suspect you have performance issues because your buf is large and you are converting it to a string before passing it on. That may incur a significant penalty.

An optimal low-level way to do this instead could look like this:

import {JsonLow} from './JsonLow.js'
import {JsonLowToHigh} from './JsonLowToHigh.js'

// this is copied from https://github.com/xtao-org/utf8x2x
// it's a simple low-level library for handling utf-8 byte streams as code points
// which is compatible with JsonHilo; it's not prepared for use with Node.js yet, so
// I copied it in here, so you don't have to figure this out manually 
// you can treat it as a black box
const Utf8bs2c = (next) => {
  const {codePoint} = next

  let partialCodePoint = 0
  let bytesRemain = 0

  return {
    bytes: (bytes) => {
      for (const byte of bytes) switch (bytesRemain) {
      case 0:
        if (byte < 128) codePoint(byte)
        else if ((byte >> 5) === 0b110) {
          bytesRemain = 1
          partialCodePoint = (byte & 0b00011111) << 6
        }
        else if ((byte >> 4) === 0b1110) {
          bytesRemain = 2
          partialCodePoint = (byte & 0b00001111) << 12
        }
        else if ((byte >> 3) === 0b11110) {
          bytesRemain = 3
          partialCodePoint = (byte & 0b00000111) << 18
        }
        else {
          throw Error(`Unexpected byte ${byte} (0x${byte.toString(16)} = 0b${byte.toString(2)})!`)
        }
      break
      case 1:
        bytesRemain = 0
        codePoint(partialCodePoint | (byte & 0b00111111))
      break
      case 2:
        bytesRemain = 1
        partialCodePoint |= (byte & 0b00111111) << 6
      break
      case 3:
        bytesRemain = 2
        partialCodePoint |= (byte & 0b00111111) << 12
      break
      }
    },
    end: () => {
      if (bytesRemain > 0) {
        throw Error(`Unexpected end! Expected at least ${bytesRemain} more bytes. Incomplete code point at the end: ${partialCodePoint}.`)
      }
      return next.end?.()
    }
  }
}

// creating a stream which will accept bytes and pass them on to successive parser streams
const stream = Utf8bs2c(JsonLow(JsonLowToHigh({
  openArray: () => {
    // do something
  },
  closeArray: () => {
    // do something
  },
  end() {console.log('ended')}
})))

process.stdin.on('data', (buf) => { 
  stream.bytes(buf)
})
process.stdin.on('end', () => {
  stream.end()
})

This sketch uses Node.js standard input to receive the UTF-8 bytes.

You'd run it like so:

node script.js < bigjsonfile.json

That's all I'm able to explain for now. I hope it's better than nothing. I will try to expand on it later if you're interested.

In the long run I suspect you might be interested in a higher-level library based on JsonHilo that I've been sketching out:

https://github.com/xtao-org/jsonstrum

It may be sensible for your use case.

If you send me a sample file structured the way your input is and tell me what you want the output to look like, I can look into it specifically.

Last thing: you shouldn't call stream.end() inside the event handlers (closeArray in your example).

You only call end() once, to signal to the parser stream that the input is finished. It will then verify that everything is valid and return the result of your end handler.
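
To illustrate (a sketch using the names from your snippet, not a drop-in fix):

const stream = jh({
  openArray: () => { /* track nesting */ },
  closeArray: () => { /* track nesting, but don't call stream.end() here */ },
  end: () => console.log('all input consumed')
})
// feed all of the input first...
stream.chunk(buf.toString())
// ...then signal end of input exactly once, outside the handlers;
// the end handler above is invoked when parsing completes
stream.end()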

Thanks again for filing the issue and considering the library. It's a motivation to improve.

Cheers

@iklotzko
Author

iklotzko commented Mar 2, 2022

@djedr,

Thank you very much for your good advice! I was actually considering parsing based off the buffer, but it turns out buffer.toString() is pretty swift (< 10% of the total time). Still, it could become more significant once I work out the kinks in my approach.
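
(For context, that figure comes from a rough process.hrtime.bigint() comparison along these lines:)

const start = process.hrtime.bigint()
const text = buf.toString()            // the conversion being timed
const afterToString = process.hrtime.bigint()
// ... hand `text` to the parser here ...
const afterParse = process.hrtime.bigint()
console.log(`toString: ${(afterToString - start) / 1000n} us, parse: ${(afterParse - afterToString) / 1000n} us`)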

Also, looks like I need to brush up on my module/script loading knowledge.

I can't send anything at the moment but when I get a chance I'll send some scrubbed data and a sample program.

I used SAX about 20 years ago when I was primarily using Java; it was so much faster than the state-of-the-art DOM loaders at the time.

Nice work!

@djedr djedr closed this as completed Oct 20, 2022