Data Manager RFC #4143
Conversation
Force-pushed from a5410ad to 40d14c4.
> In `json`, binary data can be referenced using the following syntax:
>
> ```
> @@binary[<id>]/<type>/<offset>/<length>
> ```
This feels similar in spirit to the `buffers`, `bufferViews`, and `accessors` of the glTF binary format:
https://kcoley.github.io/glTF/specification/2.0/figures/gltfOverview-2.0.0a.png
This looks fine to me; happy to implement it in @deck.gl/json and include it in pydeck.
I'm expanding on my thoughts behind the glTF similarities, with the goal of broadening the discussion and creating a record of the considerations, rather than suggesting we move to glTF. It has many graphics-related features, but adopting it in its entirety is a big step beyond what we need for deck.gl. tl;dr: these issues are addressed in glTF by expanding and separating out the binary metadata, rather than the simple and concise format proposed here. Nothing would prohibit moving forward with this proposal as-is and expanding to this more flexible structure definition when necessary.

---

Verbose section. Having said that, the goals for this RFC seem to be:
This proposal assumes both the JSON and the binary data are provided. It may be useful to include a workflow where just the JSON is provided and the binary data is linked by URL. This makes for a nice developer experience: data could simply be linked, and the system would handle fetching and hydrating it. This line of thinking takes influence from glTF, which breaks the "binary" description down into actual structured elements, rather than the condensed structured string proposed here. Specifically, it would look something like this:
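A sketch in the spirit of glTF's `buffers`/`bufferViews`/`accessors` split (all field names, URLs, and numbers here are illustrative, not a finalized proposal):

```json
{
  "buffers": [
    {"uri": "https://example.com/data/points.bin", "byteLength": 160000}
  ],
  "bufferViews": [
    {"buffer": 0, "byteOffset": 0, "byteLength": 120000},
    {"buffer": 0, "byteOffset": 120000, "byteLength": 40000}
  ],
  "accessors": [
    {"bufferView": 0, "componentType": "float32", "count": 10000, "size": 3},
    {"bufferView": 1, "componentType": "uint8", "count": 10000, "size": 4}
  ]
}
```

The `uri` field is what would enable the fetch-and-hydrate workflow described above: the JSON alone is enough for the system to retrieve the binary data.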
This is not a complete counter-proposal, but hopefully enough that you can see where and why glTF did what it did, and whether there is anything we could leverage to balance the flexibility vs. overhead of their approach.
> - `id`: key in `buffers`
> - `type`: one of `int8`, `uint8`, `int16`, `uint16`, `int32`, `uint32`, `float32`, `float64`
> - `offset` (optional): byte offset into the buffer, default `0`
> - `length` (optional): byte length of the data, defaults to the end of the buffer
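For illustration, a reference selecting 10,000 packed `float32` positions (10,000 points × 3 components × 4 bytes = 120,000 bytes) from a hypothetical buffer keyed `positions` would read:

```
"getPosition": "@@binary[positions]/float32/0/120000"
```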
This type of parameter is usually called `size`.
This approach seems sensible. The architecture of creating a "database" of buffers and referring to them by identifier is common in 3D engines, primarily for the purpose of re-using buffers. As an alternative to the glTF model @twojtasz suggested, we could frame it along the lines of WebGPU, where buffer instances, their usage, and the mapping between the two are handled separately.
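A rough sketch of that WebGPU-style separation (all field names here are hypothetical): buffer instances, their usage declarations, and the bindings between them each live in their own section:

```json
{
  "buffers": {"points": {"byteLength": 160000}},
  "usages": {
    "positions": {"type": "float32", "byteOffset": 0, "byteLength": 120000},
    "colors": {"type": "uint8", "byteOffset": 120000, "byteLength": 40000}
  },
  "bindings": {
    "getPosition": {"buffer": "points", "usage": "positions"},
    "getColor": {"buffer": "points", "usage": "colors"}
  }
}
```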
Highly appreciated!

For context, I have put Unfolded's high-level thoughts on the Transport protocol in a slide deck. It proposes that the transport could be defined to handle multiple aspects of client-server communication, binary data transport being one of them.

If that makes sense, maybe this can be considered a sub-RFC, and we could retitle this RFC to "Binary Buffer Transport RFC"?
> ```js
> converter.convert(json, buffers);
> ```
>
> Where `buffers` is a list/map of ArrayBuffers.
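A minimal usage sketch, assuming `converter` is a `JSONConverter` instance and that the map form keys buffers by the ids used in `@@binary[<id>]` references (the ids and data here are made up):

```js
// Hypothetical buffer ids matching the @@binary references in the JSON
const buffers = {
  positions: new Float32Array([0, 0, 0, 1, 1, 1]).buffer,
  colors: new Uint8Array([255, 0, 0, 255, 0, 255, 0, 255]).buffer
};
const props = converter.convert(json, buffers);
```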
Sending individual buffers is absolutely the reasonable starting point, though from a typical user persona's perspective this could be considered an advanced use case.

For binary data, it could be desirable to work towards supporting binary formats that contain all buffers (and some or all of the metadata required to access them) in a single file, e.g. Arrow (and possibly glTF), so that they could be sent as a unit and supplied directly as `Layer.data`.
The canonical example: a pydeck user just needs to use pyarrow to convert their pandas DataFrame to a binary Arrow blob, then pass that blob as data to a deck.gl layer. Ideally the user has to provide as little mapping as possible on the deck.gl side. (The in-progress Arrow support RFCs have the beginnings of the required data mappings.)
Per my understanding of the current framework's architecture, we don't want deck.gl to parse the Arrow format. Deck should understand the table category from loaders.gl. It's up to the application what format they want to use; they can pick the corresponding loader, which outputs the same normalized format.

If that is the case, then there is no built-in support for "a single binary file." Deck will always receive a JSON descriptor, which contains accessors and ArrayBuffers. It is not deck's concern how a single binary is parsed into that JSON descriptor.
> ```json
> {
>   "@@type": "PointCloudLayer",
>   "data": {
>     "length": 10000,
> ```
Ideas for the longer term: to make it a little easier on the user, it could be nice to let the user assemble the buffer "soup" into a data source that could be referenced from the layers, rather than having all these defs inline in the layer (see the sketch below). Though that would require starting to think about a data sources API.
"dataSources": {
"bufferTable": {
buffer1: {props},
This is along the lines of the glTF or WebGPU structure, right?
> Expand the `JSONConverter.convert` API to accept a second argument:
>
> ```js
> converter.convert(json, buffers);
> ```
There is not much detail on how buffers are transported, which is somewhat reasonable given that we may have widely varying implementations on different platforms.

But we probably need a buffer or data source manager (perhaps at the @deck.gl/json level), and it probably needs a defined API that the implementations can work against.

The requirement is of course that buffers can be sent separately from the JSON, and that the JSON can be updated multiple times, referencing the same buffers.
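A hypothetical shape for such a manager (the class and method names are invented for illustration, not an existing @deck.gl/json API):

```js
class BufferManager {
  constructor() {
    this.buffers = new Map(); // id -> ArrayBuffer
  }
  // Add or replace a named buffer; JSON converted later references it by id
  setBuffer(id, arrayBuffer) {
    this.buffers.set(id, arrayBuffer);
  }
  getBuffer(id) {
    return this.buffers.get(id);
  }
  // Explicit release, since the JSON may be re-converted many times
  // against the same set of buffers
  releaseBuffer(id) {
    this.buffers.delete(id);
  }
}
```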
> The requirement is of course that buffers can be sent separately from the JSON, and that the JSON can be updated multiple times, referencing the same buffers.

This part seems to be an important requirement not really covered here. The idea is that there is a set of state (JSON and buffers), and you can update both or just one, and the "correct" behavior should happen.

Does this also mean supporting partial updates? For example, I want to update only 1 buffer out of the 10 that were sent.
Yes, look at the message sequence diagram in the linked slides. Ideally there should be the ability to send independent messages with data "blobs", and independent messages with JSON styles.

I am envisioning e.g. a pydeck user working interactively with pyarrow, creating binary blobs, sending them to deck, then sending new JSON layers referencing those new blobs to update the animation, e.g. via transitions.

And yes, for that kind of use case it would be cool if "named" data blobs could be replaced (or, as you say, "partially updated"), i.e. not just added.
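A sketch of that message flow; the `send` function and message shapes are hypothetical, not part of this RFC:

```js
// 1. Send a named binary blob
send({type: 'binary', id: 'positions', buffer: positionsArrayBuffer});
// 2. Send JSON referencing it by name
send({type: 'json', data: layersJsonUsingBinaryReferences});
// 3. Later: replace just the named blob to drive an animation/transition,
//    without re-sending the JSON
send({type: 'binary', id: 'positions', buffer: nextFrameArrayBuffer});
```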
> ```js
> if (data.type === 'json') {
>   json = data.data;
> } else if (data.type === 'binary') {
>   dataBuffer.push(data);
> ```
A potential question is also whether named buffers can be replaced (e.g. allowing animation/transitions by sending new buffers).

It looks like that is possible here, though it seems like old buffers would never be released?
They can definitely be released/replaced. The idea is that the application should control buffer management, not deck. In this code sample (`let dataBuffer = <ring buffer>;`), the application is using a ring buffer implementation to store the buffers it receives.
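For reference, a minimal ring buffer along the lines this reply describes (an illustrative implementation, not the code under review):

```js
class RingBuffer {
  constructor(capacity) {
    this.entries = new Array(capacity);
    this.capacity = capacity;
    this.index = 0;
  }
  // The newest entry overwrites the oldest once capacity is reached,
  // so old buffers are dropped and become garbage-collectable
  push(entry) {
    this.entries[this.index] = entry;
    this.index = (this.index + 1) % this.capacity;
  }
}

let dataBuffer = new RingBuffer(10);
```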
Is this still true? Just jotting down a few possible benefits of looking into Arrow for the binary transport format:

**Apply accessors on the JS side.** The provided example is:

```json
{
  "@@type": "PointCloudLayer",
  "data": {
    "length": 10000,
    "attributes": {
      "getPosition": "@@binary[0]/float64",
      "getColor": "@@binary[1]/uint8"
    }
  }
}
```

This implies to me that layer accessors are always resolved in the binding instead of in deck.gl. That means that any time a user wants to update the color or any other attribute based on a function, the binding must apply the new function and then send a new binary buffer. If a new buffer must be sent on every style update, that seems to negate part of the benefit of separating the JSON and the data in the first place. I could envision passing a binary Arrow table from the binding to JS, and then possibly serializing a function definition in the JSON description to be applied in JS. If accessors are applied only or mostly inside JS, that could reduce code duplication across languages. Though I also understand if you think parsing a function from the JSON attributes is too hacky.

**Picking.** If an entire table is sent to JS from the binding, picking is able to return metadata along with the feature, rather than solely its geometry.

**Less custom binding code.** Arrow is pretty widespread across languages, so e.g. in Python a user working with pandas could pass an Arrow table to deck.gl frictionlessly, and the binding wouldn't need custom code to serialize into buffers.

**GPUTable module?** If the GPUTable module PR gets merged and Arrow is used on the GPU, it's not much of a stretch to use Arrow in JS internally for some things.

**Streaming.** Arrow has built-in support for chunked table streaming.
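For what it's worth, a serialized accessor in the JSON could look something like this sketch; the `@@=` prefix is borrowed from @deck.gl/json's expression support, and whether such expressions could be evaluated against binary columns (the `temperature` column here is made up) is hypothetical:

```json
{
  "@@type": "PointCloudLayer",
  "data": {
    "length": 10000,
    "attributes": {"getPosition": "@@binary[0]/float64"}
  },
  "getColor": "@@=[temperature * 2, 0, 255 - temperature * 2, 255]"
}
```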
Force-pushed from 40d14c4 to 923b533.
Force-pushed from 923b533 to 4217560.
Trying to generalize the need that arose from pydeck's binary support.