Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WASM-based parquet #2103

Merged
merged 26 commits into from
Apr 28, 2022
Merged

WASM-based parquet #2103

merged 26 commits into from
Apr 28, 2022

Conversation

kylebarron
Copy link
Collaborator

No description provided.

@@ -42,6 +42,7 @@
"@loaders.gl/compression": "3.2.0-alpha.1",
"@loaders.gl/loader-utils": "3.2.0-alpha.1",
"@loaders.gl/schema": "3.2.0-alpha.1",
"apache-arrow": "^7.0.0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss... arrow 7 is a breaking change. I have a PR that updates to arrow 7 but have been holding it off for loaders.gl 4.0.

Copy link
Collaborator Author

@kylebarron kylebarron Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or make apache-arrow a peer dependency? The only functions used are arrow.tableFromIPC (for reading) and arrow.tableToIPC (for writing). Arrow 7 isn't necessary; I just did yarn add apache-arrow and it chose the latest version

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that those functions have changed name compared to pre v7 releases. v7 was a major refactor that massively improves tree-shaking of arrow but there are a bunch of breaking changes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I dropped the dev dependencies version down to ^4 and had to reimplement both tableToIPC and tableFromIPC using the lower level RecordBatchFileWriter and RecordBatchStreamReader.

export function tableToIPC(table: Table): Uint8Array {
return RecordBatchFileWriter.writeAll(table).toUint8Array(true);
}

function tableFromIPC(input: ArrayBuffer): Table {
const reader = RecordBatchStreamReader.from(input);
const recordBatches: RecordBatch[] = [];
for (const recordBatch of reader) {
recordBatches.push(recordBatch);
}
return new Table(recordBatches);
}

modules/parquet/test/index.js Outdated Show resolved Hide resolved
Comment on lines 32 to 36
{supportedJs: false, supportedWasm: true, title: 'data_index_bloom_encoding_stats', path: 'good/data_index_bloom_encoding_stats.parquet'},
{supportedJs: false, supportedWasm: false, title: 'delta_binary_packed', path: 'good/delta_binary_packed.parquet'},
{supportedJs: false, supportedWasm: false, title: 'delta_byte_array', path: 'good/delta_byte_array.parquet'},
{supportedJs: false, supportedWasm: false, title: 'delta_encoding_optional_column', path: 'good/delta_encoding_optional_column.parquet'},
{supportedJs: false, supportedWasm: false, title: 'delta_encoding_required_column', path: 'good/delta_encoding_required_column.parquet'},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are new files from https://github.com/apache/parquet-testing that didn't previously exist here.

* @param type Whether to serialize the Table as a file or a stream.
*/
export function tableToIPC(table: Table): Uint8Array {
return RecordBatchFileWriter.writeAll(table).toUint8Array(true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is writeAll() still present in v7 though? Those kind of static methods were a big part of what was removed in v7...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kylebarron
Copy link
Collaborator Author

I think the main thing that needs to be fixed in this PR is how to bundle the wasm in a way that works out of the box for node and browser. I tried to use a dynamic import(), though that gives an error when building:

modules/parquet/src/lib/encode-parquet-wasm.ts:10:24 - error TS1323: Dynamic imports are only supported when the '--module' flag is set to 'es2020', 'es2022', 'esnext', 'commonjs', 'amd', 'system', 'umd', 'node12', or 'nodenext'.

10   const module = await import('parquet-wasm/esm/arrow1');
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Previously, it was in an import/require ESM/CJS mess. Node was complaining about the use of require, even when TS was using import, so it's likely the current tsconfig has something to do with that.

I'm not too experienced with these bundling issues so help would be greatly appreciated 🙂

@kylebarron kylebarron marked this pull request as ready for review April 26, 2022 17:09
@@ -34,6 +34,7 @@ export function makeStream<ArrayBuffer>(
controller.close();
} else {
// TODO - ignores controller.desiredSize
// @ts-ignore
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should change this into a @ts-expect-error. I got an error about this originally but maybe it's now fixed on master.

@kylebarron kylebarron changed the title WIP: wasm-based parquet WASM-based parquet Apr 26, 2022
Copy link
Collaborator

@felixpalmer felixpalmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very excited for this PR! Added a few comments, although I must confess I have no experience bundling WASM modules so I may be totally off-base

modules/parquet/src/lib/parse-parquet-wasm.ts Outdated Show resolved Hide resolved
import {WASM_SUPPORTED_FILES} from './data/files';

const PARQUET_DIR = '@loaders.gl/parquet/test/data/apache';
const wasmUrl = 'http://localhost:5000/node_modules/parquet-wasm/arrow1/esm_bg.wasm';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole flow of referencing an external wasm file and using the localhost:5000 override seems close to the flow we have for webworkers. Would it make sense to unify the two? I.e. create a small web worker script that includes the WASM and implements the parse interface? I know this would mean losing the non-worker loader, but how much use is this if need to dynamically load a WASM blob from an external URL?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, I think both WASM and web workers need to be instantiated asynchronously, right? I don't know much about packaging web workers. In the geopackage loader (which relies on the WASM sql.js) we have a similar process where the WASM is loaded from an external source that the user can configure:

async function loadDatabase(arrayBuffer: ArrayBuffer, sqlJsCDN: string | null): Promise<Database> {
// In Node, `locateFile` must not be passed
let SQL: SqlJsStatic;
if (sqlJsCDN) {
SQL = await initSqlJs({
locateFile: (file) => `${sqlJsCDN}${file}`
});
} else {
SQL = await initSqlJs();
}
return new SQL.Database(new Uint8Array(arrayBuffer));
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more thinking in terms of packaging. A number of loaders have a build-worker target in package.json which then automatically creates a worker file on the npm and provides a way to use a local version of the worker in unit tests. The idea was that you could create a similar build-worker step for parquet-wasm which would abstract away all the WASM loading into that worker

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixpalmer In principle that is a good idea. Are you good at wasm bundling / loading? I have looked into this briefly in the past but overall this seemed like a messy corner of javascript and I didn't find any approach that I got excited about. If someone can demonstrate a convincing solution that seems reasonably simple, generic, portable and future proof I'd be happy to consider it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I have no experience and share the view that it seems to be a messy part of JavaScript

modules/parquet/package.json Show resolved Hide resolved
yarn.lock Outdated Show resolved Hide resolved
@kylebarron
Copy link
Collaborator Author

Thanks. Should be finally green now.

@kylebarron kylebarron merged commit 4a35b19 into master Apr 28, 2022
@kylebarron kylebarron deleted the kyle/parquet-wasm branch April 28, 2022 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants