
feat(arrow): Triangulate on worker #2789

Merged
ibgreen merged 14 commits into master from triangulate-on-worker on Dec 5, 2023
Conversation

@ibgreen (Collaborator) commented Nov 17, 2023

Starts wiring up earcut into the new triangulation worker.

This requires making some of the existing functions async.

@lixun910 My sense is that this is the wrong approach. The workload is too small, and it introduces too much async into a sync util library. It would be preferable to run the worker on one batch of geometries rather than on individual polygons.

@ibgreen (Collaborator, Author) commented Nov 17, 2023

Follow-up to #2788

@kylebarron (Collaborator):

> It would be preferable to run the worker on one batch of geometries rather than on individual polygons

It looked like the current code is running on batched geometries?

@ibgreen (Collaborator, Author) commented Nov 17, 2023

> It looked like the current code is running on batched geometries?

I may need to take another look. The current goal is to allow this to run once on each arrow RecordBatch. Still getting up to speed on all the code. I mostly wanted to make sure the plumbing was in place for Xun.

@kylebarron (Collaborator):

FWIW, I'm interested in worker scaffolding for arbitrary Arrow data, where you can postMessage with any arrow object and it'll "just work", automatically using transferables.

One complication with Arrow is ensuring that the view for the specific object instance you wish to transfer is not shared with other data. For example, if you load an Arrow Table with arrow.tableFromIPC, Arrow JS will make every Data object a view on the same original backing ArrayBuffer.

Sorry for the screenshot, but this is showing that this Table instance (in this case loaded from a 150MB Arrow IPC file) has PolygonData instances where the list offsets are both just views on the same 150MB ArrayBuffer:

[screenshot]

If you transferred that ArrayBuffer, the table would stop working on the main thread. To get around that, we'd need to check that the view isn't shared, or do a memcopy to a new "owned" Data instance, and then transfer that.

As an aside, this is another reason why https://github.com/kylebarron/arrow-js-ffi will be so nice when it stabilizes. Right now, parquet-wasm uses only Arrow IPC for moving arrow data from Wasm to JS. This means that the output ArrowJS table will always be shared views on a single backing buffer. But using arrow-js-ffi to move data from wasm to JS means that every Data instance will have its own backing ArrayBuffer.
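
As a rough illustration, a check-and-copy helper could look something like this (a minimal sketch, not Arrow JS or loaders.gl API; the name is made up):

```typescript
// Sketch: copy a typed array into its own ArrayBuffer when it is a view on a
// larger (shared) buffer, so that transferring it does not detach memory that
// the main-thread table still needs.
function toOwned<T extends Uint8Array | Int32Array | Float64Array>(view: T): T {
  const sharesBuffer =
    view.byteOffset !== 0 || view.byteLength !== view.buffer.byteLength;
  // TypedArray.slice() copies into a fresh backing ArrayBuffer;
  // subarray() would only create another view on the shared one.
  return (sharesBuffer ? view.slice() : view) as T;
}

// Usage (hypothetical): copy the list offsets, then transfer only the copy
// const offsets = toOwned(polygonData.valueOffsets);
// worker.postMessage({offsets}, [offsets.buffer]);
```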

@ibgreen (Collaborator, Author) commented Nov 17, 2023

Yes, that pesky memory-block problem is why I am focusing on adding triangulation for the binary data representation, after we have converted from arrow. On the upside, that should allow this triangulation worker to work for all table loaders that can convert to BinaryGeometries.

I am personally more interested in batched / streaming loads rather than the atomic load case in your example. In the streaming case the returned Arrow may be assembled from multiple incoming memory chunks. I am not 100% clear how the arrow library handles this in terms of ArrayBuffer sharing.

@lixun910 (Collaborator) commented Nov 20, 2023

Thanks, @ibgreen! I will try to continue working on this PR. The current code is running on batched geometries, so what I have in mind is that each worker can handle one batch of geometries and output binary geometries + triangle indices. I hope there is no memory issue making each batch transferable.

@lixun910 (Collaborator):

Thanks, @ibgreen! The worker infrastructure works great! I just added a job for parseGeoArrow.

From my experience with kepler:

  • Reading the arrow file into memory is very fast, e.g. 26 ms to read a 1.7 GB arrow file with 11 million polygons, so there is no need to use an arrow-worker
  • Parsing the geoarrow column, including triangulating and getting centroids, is slow on large data and blocks the main thread.

To address this issue and implement e.g. progressive map rendering on big datasets, we can move the parsing job from the main thread to parallel web workers:

  • each web worker will parse the geoarrow column using one chunk/batch of arrow data, and return binary geometries to the main thread.
  • the app on the main thread will render the binary geometries, so that the parsing will not block the main thread.

I think maybe we could rename the worker from `Triangulate` to `ParseGeoArrow`? With this worker, I think we can implement the progressive map rendering in kepler.

Let me know what you think. Thanks!

function parseGeoArrowBatch(data: ParseGeoArrowInput): ParseGeoArrowResult {
  let binaryGeometries: BinaryDataFromGeoArrow | null = null;
  const {arrowData, chunkIndex, geometryColumnName, geometryEncoding, meanCenter, triangle} = data;
  const arrowTable = arrow.tableFromIPC(arrowData);
Collaborator:

The raw Arrow ArrayBuffer is zero-copied to each web worker.

Collaborator:

Not sure where this arrowData is coming from, but if you call tableToIPC somewhere, you're making a copy.

@kylebarron (Collaborator):

FYI, for earcut specifically, I created https://github.com/geoarrow/geoarrow-js as a place for some arrow-specific operations (so far it wraps area and winding operations from math.gl/polygon, plus earcut), and I plan to implement a helper for post-messaging arrays.

@ibgreen (Collaborator, Author) commented Nov 22, 2023

> Thanks, @ibgreen! The worker infrastructure works great! I just added a job for parseGeoArrow.
>
> From my experience with kepler:
>
> • Reading the arrow file into memory is very fast, e.g. 26 ms to read a 1.7 GB arrow file with 11 million polygons, so there is no need to use an arrow-worker

100% Agree. An ArrowWorkerLoader doesn't make much sense. But as you describe below, a GeoArrowWorkerLoader might make more sense (though currently we are trying to create a more generic worker).

> • Parsing the geoarrow column, including triangulating and getting centroids, is slow on large data and blocks the main thread.
>
> To address this issue and implement e.g. progressive map rendering on big datasets, we can move the parsing job from the main thread to parallel web workers:
>
> • each web worker will parse the geoarrow column using one chunk/batch of arrow data, and return binary geometries to the main thread.
> • the app on the main thread will render the binary geometries, so that the parsing will not block the main thread.

Yes. My goal is primarily to avoid blocking the main thread (parallel is a bonus).

> I think maybe we could rename the worker from `Triangulate` to `ParseGeoArrow`? With this worker, I think we can implement the progressive map rendering in kepler.
>
> Let me know what you think. Thanks!

We could rename it. I would wait a little for a few reasons:

  1. There is a lot of plumbing involved, so renaming in all places takes a while to get right. Let's make sure we know the final name before we rename it again.
  2. We can just expose a new method parseArrowOnWorker; then the app doesn't care what the worker is called (see the sketch below).
  3. My hope was that we could create a common triangulation worker.
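
A rough sketch of (2), just to illustrate the shape (the import paths, worker object, and ParseGeoArrow types are assumptions based on this PR, not final API):

```typescript
// Sketch only: a stable app-facing function that hides the worker name, so we
// can rename the underlying worker later without touching application code.
// Import paths and type names are assumptions based on this PR.
import {processOnWorker} from '@loaders.gl/worker-utils';
import {TriangulationWorker} from '@loaders.gl/arrow';
import type {ParseGeoArrowInput, ParseGeoArrowResult} from '@loaders.gl/arrow';

export async function parseArrowOnWorker(
  input: ParseGeoArrowInput,
  options?: Record<string, unknown>
): Promise<ParseGeoArrowResult> {
  // Delegates to whatever worker currently implements the 'parse-geoarrow' operation
  return (await processOnWorker(TriangulationWorker, input, options)) as ParseGeoArrowResult;
}
```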

We have plenty of formats that need WKT/WKB parsing:

  • GeoArrow with arrow geometries, WKB, WKT
  • GeoParquet with WKT and WKB
  • CSV with WKT
  • CSV with hex WKB
  • ...

And for triangulation, also GeoJSON, Shapefile, GeoPackage, etc.

If we need to create one worker for every format, it will get very tedious and error-prone.

@ibgreen (Collaborator, Author) left a comment:

Good progress. See my questions and let's align quickly off-line before we land.

@@ -37,6 +37,7 @@ export function parseArrowInBatches(
      shape: 'arrow-table',
      batchType: 'data',
      data: new arrow.Table([recordBatch]),
      rawArrayBuffer: asyncIterator,
Collaborator (Author):

This (including the input async iterator in the batch) raises red flags. What are you trying to achieve?

Collaborator:

This should be removed.

@@ -21,17 +22,24 @@ export type TriangulationWorkerOutput =

export type ParseGeoArrowInput = {
  operation: 'parse-geoarrow';
  arrowData: ArrayBuffer;
  chunkData: {
Collaborator (Author):

Interesting that you choose to use the Arrow API on the worker.

A problem with Arrow is that tables are JS classes (data + methods), and they can't be transferred to workers.

To call this worker we need to extract the binary arrays from arrow.

What do we gain by putting them back into arrow again to do our calculations?

Collaborator:

My idea here is to pass the pure data (not the methods in the class) to the web worker, so we can reconstruct the arrow.Data and then arrow.Vector for parsing the specific chunk. Most of the pure data is metadata like offsets, indices, length, etc.
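
For a simple numeric column, the idea looks roughly like this on the worker side (a minimal sketch with hypothetical field names; nested geometry types also carry child Data and a DataType class instance that need extra handling):

```typescript
import {Float64, makeData, makeVector} from 'apache-arrow';

// Sketch: rebuild an arrow.Vector on the worker from plain fields posted for
// one chunk of a Float64 column. The DataType is a class, so it is re-created
// here rather than transferred with the message.
function rebuildFloat64Chunk(chunk: {
  length: number;
  nullCount: number;
  values: Float64Array;
}) {
  const data = makeData({
    type: new Float64(),
    length: chunk.length,
    nullCount: chunk.nullCount,
    data: chunk.values
  });
  return makeVector(data);
}
```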

chunkData.length,
chunkData.nullCount,
chunkData.buffers,
chunkData.children,
Collaborator (Author):

Note that while many things here are pure data, I expect that children can contain class instances that are not transferable to workers.

ibgreen marked this pull request as ready for review on November 28, 2023
ibgreen and others added 11 commits on November 28, 2023
@lixun910 (Collaborator):

@ibgreen Is there any way to specify the value of transferList in the function postMessage(data: any, transferList?: any[]) at

  postMessage(data: any, transferList?: any[]): void {
    transferList = transferList || getTransferList(data);
    // @ts-ignore
    this.worker.postMessage(data, transferList);
  }

Let me know if I'm wrong: it looks like this function will only be called in worker-job.ts:

  postMessage(type: WorkerMessageType, payload: WorkerMessagePayload): void {
    this.workerThread.postMessage({
      source: 'loaders.gl', // Lets worker ignore unrelated messages
      type,
      payload
    });
  }

but we can't specify transferList there, and therefore getTransferList() will be called to automatically get transferable objects in the message data. Thank you!
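
For reference, with the plain Worker API one can pass an explicit transfer list, which is what I'd like to be able to do here (sketch only; the worker file name is made up):

```typescript
// Sketch (plain Web Worker API, not the loaders.gl WorkerThread wrapper):
// an explicit transfer list controls exactly which buffers are detached and
// moved to the worker; everything else is structured-cloned.
const worker = new Worker(new URL('./parse-geoarrow-worker.js', import.meta.url));

const values = new Float64Array(1_000_000);
const payload = {operation: 'parse-geoarrow', values};

// Only values.buffer is transferred (it becomes unusable on this thread)
worker.postMessage(payload, [values.buffer]);
```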

Some notes:

I've tried to use this worker in kepler (see PR here), but failed. Originally, I was thinking of sending each batch of data (arrow.Data) to a worker and reconstructing the arrow.Data and arrow.Vector. However, the batch data points to the same underlying buffer of the arrow table.

That buffer will be put in the transferList by getTransferList(), and its memory will be detached, making the original arrow table unusable for e.g. showing the table or mouse-hover operations.

So I think we may need to make a hard copy of the slice of the buffer for each batch. Removing the raw buffer from the transferList is a quick way to test, but it will copy the entire raw buffer of the arrow table to each worker.

Or, we could borrow the _sliceBuffers function from apache-arrow and replace subarray with slice to make a copy of the arrow.Data from a specific part of the raw buffer (see the sketch below):
https://github.com/apache/arrow/blob/d555890141c55021077ea63380075e9cd1d68b1c/js/src/data.ts#L234
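
A rough sketch of that hard copy (hypothetical helper, not the actual apache-arrow internals):

```typescript
// Sketch (hypothetical helper): copy the byte range that one record batch
// actually uses out of the table's shared buffer. slice() copies into a new
// ArrayBuffer, whereas subarray() only creates a view on the shared one.
function copyBatchBytes(
  sharedBuffer: Uint8Array,
  byteOffset: number,
  byteLength: number
): Uint8Array {
  return sharedBuffer.slice(byteOffset, byteOffset + byteLength);
}
```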

@ibgreen (Collaborator, Author) commented Nov 30, 2023

> but we can't specify transferList and therefore getTransferList() will be called to automatically get transferable objects in the message data. Thank you!

I am not sure this is a good idea. I suspect that it requires enough plumbing that it is not super easy to add.

But more importantly, if the objects are not transferred, I assume they will be serialized/deserialized, which will kill performance.

I would normally just strip any data that I don't want to be transferred, or replace it with copies. I think that is the way we should go. A bit more effort but let's do it right. I heard some positive noises from @kylebarron on this topic, let's see if we can leverage his work.

@kylebarron (Collaborator) commented Dec 1, 2023

"hard cloning" the array isn't enough; you also need to monkey patch the data type before moving to a thread

ibgreen merged commit b677f33 into master on Dec 5, 2023 (4 checks passed)
ibgreen deleted the triangulate-on-worker branch on December 5, 2023