Data Manager RFC #4143
Conversation
Force-pushed from a5410ad to 40d14c4.
> In `json`, binary data can be referenced using the following syntax:
>
> ```
> @@binary[<id>]/<type>/<offset>/<length>
> ```
This feels similar in spirit to the `buffers`, `bufferViews`, and `accessors` of the glTF binary format:
https://kcoley.github.io/glTF/specification/2.0/figures/gltfOverview-2.0.0a.png
This looks fine to me; happy to implement it in @deck.gl/json and include it in pydeck.
I'm expanding on my thoughts behind the glTF similarities, with the goal of broadening the discussion and creating a record of the considerations, rather than suggesting we move to glTF. It has many graphics-related features, but adopting it in its entirety is a big step beyond what we need for deck.gl. tl;dr: these issues are addressed in glTF by expanding and separating out the binary metadata, rather than the simple and concise format proposed here. Nothing would prohibit moving forward with this proposal as-is and expanding to this more flexible structure definition when necessary.

---

Verbose section. Having said that, the goals for this RFC seem to be:
This proposal assumes both the JSON and the binary data are provided. It may be useful to include a workflow where just the JSON is provided and the binary data is linked by URL. This makes for a nice developer experience: data could simply be linked, and the system would handle fetching and hydrating it. This line of thinking takes influence from glTF, which breaks the "binary" description down into actual structured elements, rather than the condensed structured string proposed here. Specifically, it would look something like this:
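A sketch in the spirit of glTF's `buffers`/`bufferViews`/`accessors` split (all field names, URLs, and numbers here are illustrative, not a finalized proposal):

```json
{
  "buffers": [
    {"uri": "https://example.com/data/points.bin", "byteLength": 160000}
  ],
  "bufferViews": [
    {"buffer": 0, "byteOffset": 0, "byteLength": 120000},
    {"buffer": 0, "byteOffset": 120000, "byteLength": 40000}
  ],
  "accessors": [
    {"bufferView": 0, "componentType": "float32", "count": 10000, "size": 3},
    {"bufferView": 1, "componentType": "uint8", "count": 10000, "size": 4}
  ]
}
```

The `uri` field is what would enable the fetch-and-hydrate workflow described above: the JSON alone is enough for the system to retrieve the binary data.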
This is not a complete counter-proposal, but hopefully enough that you can see where and why glTF did what it did, and whether there is anything we could leverage to balance the flexibility vs. overhead of their approach.
> - `id`: key in `buffers`
> - `type`: one of `int8`, `uint8`, `int16`, `uint16`, `int32`, `uint32`, `float32`, `float64`
> - `offset` (optional): byte offset into the buffer, default `0`
> - `length` (optional): byte length of the data, defaults to the end of the buffer
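For illustration, a reference selecting 10,000 packed `float32` positions (10,000 points × 3 components × 4 bytes = 120,000 bytes) from a hypothetical buffer keyed `positions` would read:

```
"getPosition": "@@binary[positions]/float32/0/120000"
```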
This type of parameter is usually called `size`.
This approach seems sensible. The architecture of creating a "database" of buffers and referring to them by identifier is common in 3D engines, primarily for the purpose of re-using buffers. As an alternative to the glTF model @twojtasz suggested, we could frame it along the lines of WebGPU, where buffer instances, their usage, and the mapping between the two are handled separately.
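A rough sketch of that WebGPU-style separation (all field names here are hypothetical): buffer instances, their usage declarations, and the bindings between them each live in their own section:

```json
{
  "buffers": {"points": {"byteLength": 160000}},
  "usages": {
    "positions": {"type": "float32", "byteOffset": 0, "byteLength": 120000},
    "colors": {"type": "uint8", "byteOffset": 120000, "byteLength": 40000}
  },
  "bindings": {
    "getPosition": {"buffer": "points", "usage": "positions"},
    "getColor": {"buffer": "points", "usage": "colors"}
  }
}
```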
Highly appreciated!

For context, I have put Unfolded's high-level thoughts on the Transport protocol in a slide deck. It proposes that the transport could be defined to handle multiple aspects of client-server communication, binary data transport being one of them.

If that makes sense, maybe this can be considered a sub-RFC, and we could retitle this RFC to "Binary Buffer Transport RFC"?
> ```js
> converter.convert(json, buffers);
> ```
>
> Where `buffers` is a list/map of ArrayBuffers.
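A minimal usage sketch, assuming `converter` is a `JSONConverter` instance and that the map form keys buffers by the ids used in `@@binary[<id>]` references (the ids and data here are made up):

```js
// Hypothetical buffer ids matching the @@binary references in the JSON
const buffers = {
  positions: new Float32Array([0, 0, 0, 1, 1, 1]).buffer,
  colors: new Uint8Array([255, 0, 0, 255, 0, 255, 0, 255]).buffer
};
const props = converter.convert(json, buffers);
```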
Sending individual buffers is absolutely the reasonable starting point, though from a typical user persona's perspective this could be considered an advanced use case.

For binary data, it could be desirable to work towards supporting binary formats that contain all buffers (and some or all of the metadata required to access them) in a single file, e.g. Arrow (and possibly glTF), so that they could be sent as a unit and supplied directly as `Layer.data`.
The canonical example: a pydeck user just needs to use pyarrow to convert their pandas DataFrame to a binary Arrow blob, then pass that blob as data to a deck.gl layer. Ideally the user has to provide as little mapping as possible on the deck.gl side. (The in-progress Arrow support RFCs have the beginnings of the required data mappings.)
Per my understanding of the current framework's architecture, we don't want deck.gl to parse the Arrow format. Deck should understand the table category from loaders.gl. It's up to the application what format they want to use; they can pick the corresponding loader, which outputs the same normalized format.

If that is the case, then there is no built-in support for "a single binary file." Deck will always receive a JSON descriptor, which contains accessors and ArrayBuffers. It is not deck's concern how a single binary is parsed into that JSON descriptor.
> ```json
> {
>   "@@type": "PointCloudLayer",
>   "data": {
>     "length": 10000,
> ```
Ideas for the longer term: to make it a little easier on the user, it could be nice to let the user assemble the buffer "soup" into a data source that could be referenced from the layers, rather than having all these defs inline in the layer (see the sketch below). Though that would require starting to think about a data sources API.
"dataSources": {
"bufferTable": {
buffer1: {props},
This is along the lines of the glTF or WebGPU structure, right?
> Expand the `JSONConverter.convert` API to accept a second argument:
>
> ```js
> converter.convert(json, buffers);
> ```
There is not much detail on how buffers are transported, which is somewhat reasonable given that we may have widely varying implementations on different platforms.

But we probably need a buffer or data source manager (perhaps at the @deck.gl/json level), and it probably needs a defined API that the implementations can work against.

The requirement is of course that buffers can be sent separately from the JSON, and that the JSON can be updated multiple times, referencing the same buffers.
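A hypothetical shape for such a manager (the class and method names are invented for illustration, not an existing @deck.gl/json API):

```js
class BufferManager {
  constructor() {
    this.buffers = new Map(); // id -> ArrayBuffer
  }
  // Add or replace a named buffer; JSON converted later references it by id
  setBuffer(id, arrayBuffer) {
    this.buffers.set(id, arrayBuffer);
  }
  getBuffer(id) {
    return this.buffers.get(id);
  }
  // Explicit release, since the JSON may be re-converted many times
  // against the same set of buffers
  releaseBuffer(id) {
    this.buffers.delete(id);
  }
}
```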
> The requirement is of course that buffers can be sent separately from the JSON, and that the JSON can be updated multiple times, referencing the same buffers.

This part seems to be an important requirement not really covered here. The idea is that there is a set of state (JSON and buffers), and you can update both or just one, and the "correct" behavior should happen.

Does this also mean supporting partial updates? For example, I want to update only 1 buffer out of the 10 that were sent.
Yes, look at the message sequence diagram in the linked slides. Ideally there should be the ability to send independent messages with data "blobs", and independent messages with JSON styles.

I am envisioning e.g. a pydeck user working interactively with pyarrow, creating binary blobs, sending them to deck, then sending new JSON layers referencing those new blobs to update the animation, e.g. via transitions.

And yes, for that kind of use case it would be cool if "named" data blobs could be replaced (or, as you say, "partially updated"), i.e. not just added.
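A sketch of that message flow; the `send` function and message shapes are hypothetical, not part of this RFC:

```js
// 1. Send a named binary blob
send({type: 'binary', id: 'positions', buffer: positionsArrayBuffer});
// 2. Send JSON referencing it by name
send({type: 'json', data: layersJsonUsingBinaryReferences});
// 3. Later: replace just the named blob to drive an animation/transition,
//    without re-sending the JSON
send({type: 'binary', id: 'positions', buffer: nextFrameArrayBuffer});
```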
> ```js
> if (data.type === 'json') {
>   json = data.data;
> } else if (data.type === 'binary') {
>   dataBuffer.push(data);
> ```
A potential question is also whether named buffers can be replaced (e.g. allowing animation/transitions by sending new buffers).

It looks like that is possible here, though it seems like old buffers would never be released?
They can definitely be released/replaced. The idea is that the application should control buffer management, not deck. In this code sample (`let dataBuffer = <ring buffer>;`), the application is using a ring buffer implementation to store the buffers it receives.
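For reference, a minimal ring buffer along the lines this reply describes (an illustrative implementation, not the code under review):

```js
class RingBuffer {
  constructor(capacity) {
    this.entries = new Array(capacity);
    this.capacity = capacity;
    this.index = 0;
  }
  // The newest entry overwrites the oldest once capacity is reached,
  // so old buffers are dropped and become garbage-collectable
  push(entry) {
    this.entries[this.index] = entry;
    this.index = (this.index + 1) % this.capacity;
  }
}

let dataBuffer = new RingBuffer(10);
```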
Is this still true? Just jotting down a few possible benefits of looking into Arrow for the binary transport format:

**Apply accessors on the JS side.** The provided example is:

```json
{
  "@@type": "PointCloudLayer",
  "data": {
    "length": 10000,
    "attributes": {
      "getPosition": "@@binary[0]/float64",
      "getColor": "@@binary[1]/uint8"
    }
  }
}
```

This implies to me that layer accessors are always resolved in the binding instead of in deck.gl. That means that any time a user wants to update the color or any other attribute based on a function, the binding must apply the new function and then send a new binary buffer. If a new buffer must be sent on every style update, that seems to negate part of the benefit of separating the JSON and the data in the first place. I could envision passing a binary Arrow table from the binding to JS, and then possibly serializing a function definition in the JSON description to be applied in JS. If accessors are applied only or mostly inside JS, that could reduce code duplication across languages. Though I also understand if you think parsing a function from the JSON attributes is too hacky.

**Picking.** If an entire table is sent to JS from the binding, picking is able to return metadata along with the feature, rather than solely its geometry.

**Less custom binding code.** Arrow is pretty widespread across languages, so e.g. in Python a user working with pandas could pass an Arrow table to deck.gl frictionlessly, and the binding wouldn't need custom code to serialize into buffers.

**GPUTable module?** If the GPUTable module PR gets merged and Arrow is used on the GPU, it's not much of a stretch to use Arrow in JS internally for some things.

**Streaming.** Arrow has built-in support for chunked table streaming.
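For what it's worth, a serialized accessor in the JSON could look something like this sketch; the `@@=` prefix is borrowed from @deck.gl/json's expression support, and whether such expressions could be evaluated against binary columns (the `temperature` column here is made up) is hypothetical:

```json
{
  "@@type": "PointCloudLayer",
  "data": {
    "length": 10000,
    "attributes": {"getPosition": "@@binary[0]/float64"}
  },
  "getColor": "@@=[temperature * 2, 0, 255 - temperature * 2, 255]"
}
```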
Force-pushed from 40d14c4 to 923b533.
Force-pushed from 923b533 to 4217560.
Trying to generalize the need that arose from pydeck's binary support.