
@Pessimistress (Collaborator)

Trying to generalize the need that arose from pydeck's binary support.

@coveralls

Coverage remained the same at 83.364% when pulling 40d14c4 on x/json-buffer-rfc into 8a832f3 on master.

In `json`, binary data can be referenced using the following syntax:

```
@@binary[<id>]/<type>/<offset>/<length>
```

@twojtasz (Contributor)

This feels similar in spirit to the buffers, bufferViews, and accessors of the glTF binary format:
https://kcoley.github.io/glTF/specification/2.0/figures/gltfOverview-2.0.0a.png

@ajduberstein (Collaborator)

This looks fine to me; happy to implement it in @deck.gl/json and include it in pydeck.

@twojtasz (Contributor)

I'm expanding on my thoughts behind the glTF similarities, with the goal of broadening the discussion and creating a record of the consideration, rather than suggesting we move to glTF. glTF has many graphics-related features, but adopting it in its entirety is a big step beyond what we need for deck.gl.

tl;dr:
The current proposal is very specific to providing a mechanism for binary data and has a very tight scope. If we open that scope a bit, I would want to consider:

  1. a binary blob that contains multiple actual binary arrays (a single request for all data)
  2. the ability to encode binary data as a URI, making it part of the JSON rather than a side-car
  3. flexibility and the need for interleaved data in the binary buffer

glTF addresses these issues by expanding and separating out the binary metadata. Rather than the simple and concise format of `@@binary[1]/uint8`, it breaks this down into a few arrays of structured metadata describing the buffers, views into the buffers, type information, etc.

Nothing would prohibit moving forward with this proposal as is and expanding to this more flexible structure definition when necessary.

--- Verbose section

Having said that, the goals for this RFC seem to be:

  1. expand the JSON work to support binary-encoded data
  2. secondarily, consider environments not limited to the browser (pydeck being Python, for example)

This proposal assumes the JSON and binary data are both provided. It may be useful to include a workflow where just the JSON is provided and the binary data is linked to by a URL. This makes for a nice developer experience: data could simply be linked, letting the system handle fetching and hydrating it.

This line of thinking takes influence from glTF, which breaks the "binary" description down into actual structured elements, rather than the condensed structured string proposed here.

Specifically, it would look something like this:

```json
{
  "layers": [
    {
      "@@type": "PointCloudLayer",
      "data": {
        "length": 10000,
        "attributes": {
          "getPosition": "@@accessor[0]",
          "getColor": "@@accessor[1]"
        }
      }
    }
  ],
  "buffers": [
    {"byteLength": 500, "uri": "binary.bin"}
  ],
  "bufferViews": [
    {"buffer": 0, "byteOffset": 0, "byteLength": 400},
    {"buffer": 0, "byteOffset": 400, "byteLength": 100}
  ],
  "accessors": [
    {"bufferView": 0, "type": "VEC2", "componentType": 5126}
  ]
}
```

This is not a complete counter-proposal, but hopefully enough that you can see where and why glTF did what it did, and whether there is anything we could leverage to balance the flexibility vs. the overhead of their approach.

https://github.com/KhronosGroup/glTF-Tutorials/blob/master/gltfTutorial/gltfTutorial_005_BuffersBufferViewsAccessors.md
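For clarity, here is a sketch of how the hypothetical `@@accessor[<index>]` reference above could be resolved through the buffer/bufferView/accessor indirection (the helper and lookup table are illustrative only, reusing glTF's componentType constants; none of this is an existing deck.gl API):

```js
// Hypothetical resolver for the "@@accessor[<index>]" references above.
// Each accessor points at a bufferView, which points at a buffer.
const COMPONENT_TYPES = {
  5120: Int8Array,    // glTF BYTE
  5121: Uint8Array,   // glTF UNSIGNED_BYTE
  5122: Int16Array,   // glTF SHORT
  5123: Uint16Array,  // glTF UNSIGNED_SHORT
  5125: Uint32Array,  // glTF UNSIGNED_INT
  5126: Float32Array  // glTF FLOAT
};

function resolveAccessor(index, spec, binaries) {
  const accessor = spec.accessors[index];
  const view = spec.bufferViews[accessor.bufferView];
  // binaries[i] is the raw ArrayBuffer fetched from spec.buffers[i].uri
  const buffer = binaries[view.buffer];
  const ArrayType = COMPONENT_TYPES[accessor.componentType];
  return new ArrayType(buffer, view.byteOffset, view.byteLength / ArrayType.BYTES_PER_ELEMENT);
}
```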

@Pessimistress requested a review from tsherif on January 21, 2020
- `id`: key in `buffers`
- `type`: one of `int8`, `uint8`, `int16`, `uint16`, `int32`, `uint32`, `float32`, `float64`
- `offset` (optional): byte offset into the buffer, defaults to `0`
- `length` (optional): byte length of the data, defaults to the end of the buffer

(Contributor)

This type of parameter is usually called size (e.g.).
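
To make the proposed syntax concrete, a reference could be resolved along these lines (a minimal sketch; `resolveBinaryReference` and the type table are hypothetical helpers, not part of the RFC):

```js
// Hypothetical helper: resolves a "@@binary[<id>]/<type>/<offset>/<length>"
// reference against a map of ArrayBuffers into a typed-array view.
const TYPED_ARRAYS = {
  int8: Int8Array, uint8: Uint8Array,
  int16: Int16Array, uint16: Uint16Array,
  int32: Int32Array, uint32: Uint32Array,
  float32: Float32Array, float64: Float64Array
};

function resolveBinaryReference(reference, buffers) {
  const match = /^@@binary\[(.+?)\]\/(\w+)(?:\/(\d+))?(?:\/(\d+))?$/.exec(reference);
  if (!match) return null;
  const [, id, type, offset = '0', length] = match;
  const buffer = buffers[id];
  const ArrayType = TYPED_ARRAYS[type];
  const byteOffset = Number(offset);
  // `length` defaults to the rest of the buffer
  const byteLength = length ? Number(length) : buffer.byteLength - byteOffset;
  return new ArrayType(buffer, byteOffset, byteLength / ArrayType.BYTES_PER_ELEMENT);
}

// e.g. resolveBinaryReference('@@binary[1]/uint8', buffers) views all of buffer 1 as uint8
```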

@tsherif (Contributor) commented Jan 21, 2020

This approach seems sensible. This architecture of creating a "database" of buffers and referring to them by an identifier is common in 3D engines, primarily for the purpose of re-using buffers.

As an alternative to the glTF model @twojtasz suggested, we could frame it along the lines of WebGPU, where buffer instances, their usage, and the mapping between the two are handled separately.

@ibgreen (Collaborator) left a comment

Highly appreciated!

For context, I have put Unfolded's high-level thoughts on the Transport protocol in a slide deck.

This proposes that the transport could be defined to handle multiple aspects of client-server communication (binary data transport being one).

If that makes sense, maybe this can be considered a sub-RFC, and we could retitle this RFC to "Binary Buffer Transport RFC"?

```js
converter.convert(json, buffers);
```

Where `buffers` is a list/map of ArrayBuffers.

(Collaborator)

Sending individual buffers is absolutely a reasonable starting point, though from a typical user persona's perspective this could be considered an advanced use case.

For binary data, it could be desirable to work towards supporting binary formats that contain all buffers (and some or all of the metadata required to access them) in a single file, e.g. Arrow (and possibly glTF), so that they could be sent as a unit and supplied directly as `Layer.data`.

The canonical example: a pydeck user just needs to use pyarrow to convert their pandas data frame to a binary Arrow blob and pass that blob as `data` to a deck.gl layer. Ideally the user has to provide as little mapping as possible on the deck.gl side. (The in-progress Arrow support RFCs have the beginnings of the required data mappings.)

@Pessimistress (Collaborator, Author)

Per my understanding of the current framework's architecture, we don't want deck.gl to parse the Arrow format. Deck should understand the table category from loaders.gl. It's up to the application what format they want to use, and they can pick the corresponding loader, which outputs the same normalized format.

If that is the case, then there is no built-in support for "a single binary file." Deck will always receive a JSON descriptor, which contains accessors and ArrayBuffers. It is not deck's concern how a single binary is parsed into that JSON descriptor.
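
As a usage sketch of the proposed two-argument API (the `configuration` object, the buffer contents, and the `deckgl` instance are placeholders, not prescribed by the RFC):

```js
import {JSONConverter} from '@deck.gl/json';

const converter = new JSONConverter({configuration});

// Keys match the ids used in "@@binary[<id>]/..." references in the JSON
const buffers = {
  0: positionsArrayBuffer, // e.g. 10000 float64 position values
  1: colorsArrayBuffer     // e.g. 10000 uint8 color values
};

// Proposed second argument: the JSON descriptor is resolved against `buffers`
const props = converter.convert(json, buffers);
deckgl.setProps(props);
```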

```json
{
  "@@type": "PointCloudLayer",
  "data": {
    "length": 10000,
```

(Collaborator)

Ideas for the longer term: to make things a little easier on the user, it could be nice to let them assemble the buffer "soup" into a data source that could be referenced in the layers, rather than having all these defs inline in the layer. Though that would require starting to think about the data sources API.

"dataSources": {
   "bufferTable": {
     buffer1: {props},


(Contributor)

This is along the lines of the glTF or WebGPU structure, right?

Expand the `JSONConverter.convert` API to accept a second argument:

```js
converter.convert(json, buffers);
```

(Collaborator)

There is not much detail on how buffers are transported, which is somewhat reasonable given that we may have widely varying implementations on different platforms.

But we probably need a buffer or data source manager (perhaps at the @deck.gl/json level), and it probably needs to have a defined API that the implementations can work against.

The requirement is of course that buffers can be sent separately from the JSON, and that the JSON can be updated multiple times, referencing the same buffers.

(Contributor)

> The requirement is of course that buffers can be sent separately from the JSON, and that the JSON can be updated multiple times, referencing the same buffers.

This part seems to be an important requirement that isn't really covered here. The idea is that there is a set of state (JSON and buffers), and you can update both or just one, and the "correct" behavior should happen.

Does this also mean supporting partial updates? For example, I want to update only 1 buffer out of the 10 that were sent.

(Collaborator)

Yes, look at the message sequence diagram in the linked slides. Ideally there should be the ability to send independent messages with data "blobs" and independent messages with JSON styles.

I am envisioning e.g. a pydeck user working interactively with pyarrow, creating binary blobs, sending them to deck, then sending new JSON layers referencing those new blobs to update the animation via e.g. transitions.

And yes, for that kind of use case it would be cool if "named" data blobs could be replaced (or, as you say, "partially updated"), i.e. not just added.
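
To sketch what the buffer/data source manager mentioned above might expose (every name here is hypothetical; this is just one way to satisfy the send-separately/replace requirements discussed in this thread):

```js
// Hypothetical manager: named buffers arrive independently of the JSON,
// can be replaced to drive animations, and are looked up on each convert.
class BufferManager {
  constructor() {
    this.buffers = new Map();
  }
  // Setting an existing name replaces the buffer (a "partial update");
  // JSON referencing that name picks up the new data on the next convert.
  set(name, arrayBuffer) {
    this.buffers.set(name, arrayBuffer);
  }
  get(name) {
    return this.buffers.get(name);
  }
  // Explicit release so old buffers can be garbage-collected
  delete(name) {
    return this.buffers.delete(name);
  }
}
```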

```js
if (data.type === 'json') {
  json = data.data;
} else if (data.type === 'binary') {
  dataBuffer.push(data);
}
```

(Collaborator)

A potential question is also if named buffers can be replaced (e.g. allowing animation/transitions by sending new buffers).

It looks like that is possible here, though it seems like old buffers would never be released?

@Pessimistress (Collaborator, Author) commented Jan 24, 2020

They can definitely be released/replaced. The idea is that the application should control buffer management, not deck. In this code sample (`let dataBuffer = <ring buffer>;`), the application is using a ring buffer implementation to store the buffers that it has received.
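
For reference, the kind of ring buffer the application might use here can be as simple as this sketch (hypothetical; the point is that buffer retention is the application's choice, not deck's):

```js
// Minimal fixed-capacity ring buffer: once full, pushing a new buffer
// drops the oldest one, making it eligible for garbage collection.
class RingBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = [];
  }
  push(entry) {
    if (this.entries.length >= this.capacity) {
      this.entries.shift();
    }
    this.entries.push(entry);
  }
}

let dataBuffer = new RingBuffer(10);
```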

@kylebarron (Collaborator)

> Per my understanding of the current framework's architecture, we don't want deck.gl to parse the Arrow format

Is this still true? Just jotting down a few possible benefits of looking into Arrow for the binary transport format:

**Apply accessors on the JS side**

The provided example is:

```json
{
  "@@type": "PointCloudLayer",
  "data": {
    "length": 10000,
    "attributes": {
      "getPosition": "@@binary[0]/float64",
      "getColor": "@@binary[1]/uint8"
    }
  }
}
```

This implies to me that layer accessors are always resolved in the binding instead of in deck.gl. E.g. here `getColor` accesses RGB values directly, so if the user wants to apply a function to generate colors, that function must be applied in the binding.

That means that any time a user wants to update the color or any other attribute based on a function, the binding must apply the new function and then send a new binary buffer. If a new buffer must be sent on every style update, that seems to negate part of the benefit of separating the JSON and the data in the first place.

I could envision passing a binary Arrow table from the binding to JS, and then possibly serializing a function definition in the JSON description to be applied in JS. If there were a way to make sure this "eval" is safe, it could mean faster data updates. (I don't know how unsafe eval is in the browser, specifically.)

If accessors are applied only or mostly inside JS, that could reduce code duplication across languages. Though I also understand if you think parsing a function from the JSON attributes is too hacky.

**Picking**

If an entire table is sent to JS from the binding, picking can return metadata along with the feature, rather than solely its geometry.

**Less custom binding code**

Arrow is pretty widespread across languages, so e.g. in Python a user working with pandas could pass an Arrow table to deck.gl frictionlessly, and the binding wouldn't need custom code to serialize into buffers.

**GPUTable module?**

If the GPUTable module PR gets merged and Arrow is used on the GPU, it's not much of a stretch to use Arrow in JS internally for some things.

**Streaming**

Arrow has built-in support for chunked table streaming.

@coveralls commented May 27, 2020

Coverage increased (+2.4%) to 81.514% when pulling dc2522d on x/json-buffer-rfc into 79e6564 on master.

@Pessimistress changed the title from "Transport format RFC" to "Data Manager RFC" on May 29, 2020
@Pessimistress marked this pull request as ready for review on May 29, 2020
@Pessimistress merged commit 19be47d into master on May 29, 2020
@Pessimistress deleted the x/json-buffer-rfc branch on May 29, 2020