Parquet Maximum call stack size exceeded error & simple wasm benchmark #2144

Open
kylebarron opened this issue Apr 9, 2022 · 1 comment

Comments

@kylebarron
Collaborator

I was trying to do a simple benchmark of the JS parquet library in modules/parquet. With this example Parquet file (1 million rows, 1 row group, no compression) I got a Maximum call stack size exceeded error (traceback below).

I figured this might have something to do with having 1 million rows in a single row group, so I tried the same file with 20 row groups (i.e. with 50,000 rows in each row group). This file worked, but took 29.949s; for comparison a benchmark with the same file using the wasm loader took around 62ms (both in Node v16.14.0).

Given these results, I'd like to get the wasm parquet loader in #2103 cleaned up sometime soon.

I couldn't get the ParquetLoader to work in a standalone NPM project; even after installing polyfills I kept getting "Blob is not defined" errors. The only way I could get the ParquetLoader to work was inside the existing test cases, so I just modified one of the existing tests to load these new files:

test.only('load file', async (t) => {
  const url = '@loaders.gl/parquet/test/data/20-partition-none.parquet';
  console.time('load Parquet');
  const data = await load(url, ParquetLoader, {parquet: {url}, worker: false});
  console.timeEnd('load Parquet');

  t.equal(data.length, 1000000);
  t.end();
});

The call stack error occurs when trying to load the single-row-group Parquet file with this code:

test.only('load file', async (t) => {
  const url = '@loaders.gl/parquet/test/data/1-partition-none.parquet';
  const data = await load(url, ParquetLoader, {parquet: {url}, worker: false});
  t.end();
});

Output:

not ok 1 RangeError: Maximum call stack size exceeded
  ---
    operator: error
    expected: |-
      undefined
    actual: |-
      [RangeError: Maximum call stack size exceeded]
    at: bound (/Users/kyle/github/mapping/loaders.gl/node_modules/onetime/index.js:30:12)
    stack: |-
      RangeError: Maximum call stack size exceeded
          at Object.decodeValues (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/codecs/rle.ts:95:14)
          at decodeValues (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/decoders.ts:216:35)
          at decodeDataPage (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/decoders.ts:276:15)
          at decodePage (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/decoders.ts:105:20)
          at decodeDataPages (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/decoders.ts:58:24)
          at ParquetEnvelopeReader.readColumnChunk (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/parquet-envelope-reader.ts:140:18)
          at processTicksAndRejections (node:internal/process/task_queues:96:5)
          at ParquetEnvelopeReader.readRowGroup (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/parquet-envelope-reader.ts:81:43)
          at ParquetCursor.next (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/parquetjs/parser/parquet-cursor.ts:48:25)
          at parseParquetFileInBatches (/Users/kyle/github/mapping/loaders.gl/modules/parquet/src/lib/parse-parquet.ts:20:22)
  ...
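
A RangeError like this in a decode step is often caused by spreading a very large array into a single call (for example `values.push(...run)`), because every element becomes a separate call argument. Whether that is what happens at rle.ts:95 is an assumption here and is not confirmed anywhere in this issue. A minimal sketch of a loop-based alternative, where `appendAll` is a hypothetical helper and not part of loaders.gl or parquetjs:

```typescript
// Hypothetical helper (not a loaders.gl or parquetjs API): append every
// element of `source` to `target` without using spread or apply, so the
// number of elements is not limited by the engine's argument/stack limits.
function appendAll<T>(target: T[], source: ArrayLike<T>): T[] {
  for (let i = 0; i < source.length; i++) {
    target.push(source[i]);
  }
  return target;
}

// One decoded "run" of a million identical values, similar in size to a
// single 1-million-row row group.
const run: number[] = new Array<number>(1000000).fill(7);
const values: number[] = [];

// `values.push(...run)` can throw "Maximum call stack size exceeded" on
// arrays this large in some engines; the loop version does not.
appendAll(values, run);

console.log(values.length);
```

If the decoder does follow the spread pattern, this would also be consistent with the 20-row-group file loading successfully, since each run is then much shorter.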
@ibgreen
Collaborator

ibgreen commented Apr 12, 2022

The JS / TypeScript version of the loader has not yet been optimized. The batches are read out row-by-row by a "row iterator" and then concatenated.

This can easily be made much faster. A good WASM loader can probably be faster than JS, but given the block memory loading model of Parquet, I doubt the perf differences between the two implementations would be significant. Instead:

  • For me the selling point of the WASM loader would mainly be that Parquet is a big spec and the Rust version is perhaps a better-maintained project.
  • The advantage of the TypeScript loader is that it is significantly easier for the typical loaders.gl user to maintain and modify that code.

Overall JS may also have a smaller bundle size, but that can be less of an issue if the code is loaded dynamically.
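
To illustrate the row-by-row reading described above, here is a minimal sketch contrasting it with a columnar read. This is not the actual loaders.gl code: `RowCursor`, `readRowByRow`, and `readColumnar` are hypothetical names, and the columns are assumed to be plain numeric arrays.

```typescript
type ColumnData = Record<string, number[]>;
type Row = Record<string, number>;

// Hypothetical cursor that materializes one row object per call.
class RowCursor {
  private index = 0;

  constructor(private columns: ColumnData, private rowCount: number) {}

  next(): Row | null {
    if (this.index >= this.rowCount) {
      return null;
    }
    const row: Row = {};
    for (const name of Object.keys(this.columns)) {
      row[name] = this.columns[name][this.index];
    }
    this.index++;
    return row;
  }
}

// Row-by-row: allocates one object per row and collects them in an array.
function readRowByRow(columns: ColumnData, rowCount: number): Row[] {
  const cursor = new RowCursor(columns, rowCount);
  const rows: Row[] = [];
  let row = cursor.next();
  while (row !== null) {
    rows.push(row);
    row = cursor.next();
  }
  return rows;
}

// Columnar: copies each column once into a typed array, no per-row objects.
function readColumnar(columns: ColumnData): Record<string, Float64Array> {
  const out: Record<string, Float64Array> = {};
  for (const name of Object.keys(columns)) {
    out[name] = Float64Array.from(columns[name]);
  }
  return out;
}

const columns: ColumnData = {id: [1, 2, 3], value: [0.5, 1.5, 2.5]};
const rows = readRowByRow(columns, 3);
const columnar = readColumnar(columns);
console.log(rows.length, columnar.value.length);
```

The row-based path does work proportional to rows times columns in small object allocations, while the columnar path does one bulk copy per column, which is one reason a columnar read could be expected to be faster.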
