Improve serialization of Pandas DataFrames to ipyvega #2471

Closed
3 tasks done
jdfekete opened this issue Jun 4, 2021 · 5 comments

@jdfekete

jdfekete commented Jun 4, 2021

Hi,
Thanks for Altair. I have created a feature request issue for ipyvega that could also impact Altair:
vega/ipyvega#345

It boils down to creating a serializer to efficiently send a Pandas DataFrame to Vega. Currently, the communication in notebooks between Python and JS is very inefficient, especially with the verbose row-wise JSON format. It limits the amount of data that can reasonably be sent to JS, and limits the visible performance of Altair.
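
To make the row-wise overhead concrete, here is a minimal sketch using only the Python standard library. The table contents and column names are made up for illustration. It encodes the same data once as row-wise records and once as column arrays, then compares the size of the two JSON strings.

```python
import json

# Hypothetical table: 1,000 rows, three columns (names are illustrative only).
n = 1000
columns = {
    "x": [i * 0.5 for i in range(n)],
    "y": [i % 7 for i in range(n)],
    "category": ["a" if i % 2 == 0 else "b" for i in range(n)],
}

# Row-wise records: every row repeats every column name.
rows = [{name: values[i] for name, values in columns.items()} for i in range(n)]

row_json = json.dumps(rows)     # [{"x": 0.0, "y": 0, "category": "a"}, ...]
col_json = json.dumps(columns)  # {"x": [0.0, ...], "y": [0, ...], ...}

print("row-wise characters:   ", len(row_json))
print("column-wise characters:", len(col_json))
```

The row-wise string repeats each column name once per row, so that part of the payload grows with the number of rows, while the column-wise string names each column once.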

I am interested to see whether this point is important, critical, or just secondary to Altair's adoption. I think the limitation on data size is an issue, but I may be biased. Please comment on my feature request so I can decide how to address it.

Thanks in advance,
Jean-Daniel

Please follow these steps to make it more efficient to respond to your feature request.

  • Since Altair is a Python wrapper around the Vega-Lite visualization grammar, most feature requests should be reported directly to Vega-Lite. You can click the Action Button of your Altair chart and "Open in Vega Editor" to create a reproducible Vega-Lite example.
  • Search for duplicate issues.
  • Describe the feature's goal, motivating use cases, and its expected behavior.
@jakevdp
Collaborator

jakevdp commented Jun 4, 2021

More efficient data serialization would be useful, but such changes would first have to be supported in Vega-Lite.

@jheer
Member

jheer commented Jun 4, 2021

Thanks @jdfekete for raising the issue, and also flagging @domoritz.

Scale is a recurring issue for Altair users, at least as evidenced in my visualization courses at UW. (Some students benefit from the altair data server package, but that is not a one-size-fits-all solution.) Right now the scalability experience in Observable notebooks (where the data is already in JS) is often much better than with Altair due to this serialization overhead.

While I agree with @jakevdp that more might be done in Vega itself, perhaps there is also space for handling data serialization in the generated HTML/JS prior to invoking Vega/Vega-Lite. For example, one could imagine serializing a data table to an Apache Arrow byte array in Python and then passing that instead (even if only as a base64-encoded string) to be deserialized using the Arrow JS or Arquero libraries. If so, it seems to me the costs involved would largely be (1) having to load additional JS libraries client side, and (2) format-contingent HTML/JS code generation for deserializing data before passing it to Vega.
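
The "binary payload as a base64-encoded string" idea can be sketched roughly as follows. This is not Arrow: to stay within the standard library, a raw float64 buffer from the `array` module stands in for an Arrow byte array, and `encode_column` / `decode_column` are hypothetical names. In the scenario described above, the decode step would run in the browser with Arrow JS or Arquero rather than in Python.

```python
import base64
import json
from array import array


def encode_column(values):
    """Pack floats as raw float64 bytes, then base64 so they fit in HTML/JSON."""
    raw = array("d", values).tobytes()  # 8 bytes per value, native byte order
    return base64.b64encode(raw).decode("ascii")


def decode_column(payload):
    """Inverse of encode_column (stand-in for the client-side deserializer)."""
    out = array("d")
    out.frombytes(base64.b64decode(payload))
    return out.tolist()


# Illustrative data: floats whose decimal form is long when written as text.
values = [i / 3 for i in range(1000)]

payload = encode_column(values)
as_json = json.dumps(values)

print("base64 binary characters:", len(payload))
print("JSON text characters:    ", len(as_json))
print("round trip exact:        ", decode_column(payload) == values)
```

Even with the roughly one-third size overhead of base64, the binary form here is shorter than the JSON text, and the values survive the round trip bit for bit.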

How feasible might it be to have some kind of small plug-in system in Altair and/or ipyvega that allowed customized code for (a) serializing data on the Python side, and (b) adding library imports and deserialization code on the client side?
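
One possible shape for such a plug-in system, as a sketch only: a registry where each entry bundles (a) a Python-side serializer with (b) the script URLs and a JavaScript expression for deserialization. Every name here (`DataTransport`, `register_transport`, `get_transport`, the `payload` variable in the JS strings) is hypothetical and does not exist in Altair or ipyvega.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


@dataclass(frozen=True)
class DataTransport:
    """One pluggable format: a Python encoder plus client-side snippets."""

    name: str
    serialize: Callable[[Rows], str]  # (a) runs on the Python side
    js_imports: List[str] = field(default_factory=list)  # (b) script URLs to load
    js_deserialize: str = "JSON.parse(payload)"  # (b) JS expression over `payload`


_TRANSPORTS: Dict[str, DataTransport] = {}


def register_transport(transport: DataTransport) -> None:
    """Make a transport available under its name."""
    _TRANSPORTS[transport.name] = transport


def get_transport(name: str) -> DataTransport:
    """Look up a previously registered transport."""
    return _TRANSPORTS[name]


def rows_to_columns(rows: Rows) -> str:
    """Encode records as one JSON object of column arrays.

    Assumes every row has the same keys.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return json.dumps(columns)


# Default: plain row-wise JSON, no extra client-side libraries.
register_transport(DataTransport(name="json-rows", serialize=json.dumps))

# Column-wise JSON, turned back into records in the browser before rendering.
register_transport(
    DataTransport(
        name="json-columns",
        serialize=rows_to_columns,
        js_deserialize=(
            "(() => { const c = JSON.parse(payload); const k = Object.keys(c);"
            " return k.length ? c[k[0]].map((_, i) =>"
            " Object.fromEntries(k.map(n => [n, c[n][i]]))) : []; })()"
        ),
    )
)
```

A renderer could then look up the configured transport, emit its `js_imports` as script tags, and wrap the serialized payload in `js_deserialize` before handing the result to Vega/Vega-Lite. An Arrow-based transport would be registered the same way, with the Arrow JS library listed in `js_imports`.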

@domoritz
Member

domoritz commented Jun 4, 2021

I absolutely agree that improving data serialization would be a huge improvement.

The way I see it, Altair is a Python API to generate Vega-Lite specs and these specs can be rendered in different platforms. Therefore, we may need to look at each of the platforms and improve serialization there.

When I was working with Streamlit, I added some code to separate the data from the chart specification so that the data can be sent as an Arrow table. You can see how I did it at https://github.com/streamlit/streamlit/blob/9714e3e6f852c26e3f8a155d39c2d5028dff1d71/lib/streamlit/elements/altair.py#L305. We could do something similar in ipyvega (vega/ipyvega#345). I think sending the data as Arrow makes the most sense since it's columnar and even binary so e.g. floating point numbers are much more compact than as strings.
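
Separating the data from the chart specification can be sketched on a plain Vega-Lite spec dictionary. This is not the Streamlit code linked above. `split_inline_data` and the dataset name `table` are hypothetical, and only top-level inline `data.values` is handled. It relies on Vega-Lite's named data sources (`"data": {"name": ...}`), which let the rows be supplied separately from the spec.

```python
import copy


def split_inline_data(spec, name="table"):
    """Return (spec_without_data, datasets) for a Vega-Lite spec dict.

    Inline rows under spec["data"]["values"] are replaced by a named data
    source so the rows can travel over a different, more compact channel.
    The input spec is not modified.
    """
    spec = copy.deepcopy(spec)
    values = spec.get("data", {}).get("values")
    if values is None:
        return spec, {}  # nothing inline to extract
    spec["data"] = {"name": name}
    return spec, {name: values}


# Illustrative spec with two inline rows.
spec = {
    "mark": "point",
    "encoding": {
        "x": {"field": "a", "type": "quantitative"},
        "y": {"field": "b", "type": "quantitative"},
    },
    "data": {"values": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]},
}

light_spec, datasets = split_inline_data(spec)
print(light_spec["data"])  # the spec now only references the data by name
print(datasets["table"])   # the rows, ready for a separate serializer
```

The spec stays small JSON, while the extracted rows can be encoded however the platform prefers (for example as Arrow) and attached to the named source on the client.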

I don't think the overhead of Arrow JS in ipyvega is too large so I think we could always add it. We should measure the impact of serialization/deserialization compared to JSON to determine whether we want a flag to control whether the data is transferred as Arrow or JSON.
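
A minimal harness for that kind of measurement might look like the sketch below. It covers only the Python side; deserialization cost in the browser would need to be timed separately in JS. The binary codec is again a standard-library stand-in (raw float64 bytes plus base64), not Arrow, and `measure` is a hypothetical helper.

```python
import base64
import json
import time
from array import array


def measure(encode, decode, data, repeat=5):
    """Return payload size and best-of-`repeat` encode/decode times in seconds."""
    best_encode = best_decode = float("inf")
    round_trip_ok = True
    payload = ""
    for _ in range(repeat):
        t0 = time.perf_counter()
        payload = encode(data)
        t1 = time.perf_counter()
        decoded = decode(payload)
        t2 = time.perf_counter()
        best_encode = min(best_encode, t1 - t0)
        best_decode = min(best_decode, t2 - t1)
        round_trip_ok = round_trip_ok and decoded == data
    return {
        "size": len(payload),
        "encode_s": best_encode,
        "decode_s": best_decode,
        "round_trip_ok": round_trip_ok,
    }


def binary_encode(values):
    """Stand-in binary format: float64 buffer, base64-encoded."""
    return base64.b64encode(array("d", values).tobytes()).decode("ascii")


def binary_decode(payload):
    out = array("d")
    out.frombytes(base64.b64decode(payload))
    return out.tolist()


# Illustrative column of 100,000 floats.
data = [i / 7 for i in range(100_000)]

results = {
    "json": measure(json.dumps, json.loads, data),
    "binary": measure(binary_encode, binary_decode, data),
}

for name, stats in results.items():
    print(name, stats)
```

Numbers from a harness like this depend heavily on the data types and the machine, so they are better treated as input to the "flag or no flag" decision than as a general result.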

@joelostblom
Contributor

Closing this as there is nothing to do on the Altair side of things. See vega/ipyvega#346 for the current progress on this feature.

@joelostblom closed this as not planned on Jan 6, 2023
@domoritz
Member

I also want to point to https://vegafusion.io, which not only has efficient transport but also offloads computation to the backend making charts much more responsive.
