Support DataFrame Interchange Protocol #251

Merged
merged 2 commits into from
Mar 15, 2023

jonmmease commented Mar 15, 2023

Closes #243

This PR adds support for Python objects that implement the DataFrame Interchange Protocol. Altair is adding support for this protocol in version 5.
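As a rough illustration of what "implements the DataFrame Interchange Protocol" means in practice: the protocol's entry point is a `__dataframe__` method on the object. The sketch below only checks for that method; `supports_interchange` and `StubFrame` are hypothetical names used for illustration and are not part of VegaFusion or Altair.

```python
def supports_interchange(obj) -> bool:
    """Return True if obj exposes a callable __dataframe__ method."""
    return callable(getattr(obj, "__dataframe__", None))


class StubFrame:
    """Hypothetical stand-in for a library DataFrame (illustration only)."""

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        # A real implementation returns an interchange DataFrame object here.
        raise NotImplementedError


print(supports_interchange(StubFrame()))  # True
print(supports_interchange([1, 2, 3]))    # False
```

A check like this says nothing about whether the conversion will succeed for every column type; it only identifies candidates for the interchange path.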

Here is an example with polars (using Altair 5 and pyarrow 11.0.0):

import vegafusion as vf
import altair as alt
import polars as pl

vf.enable()
pl_df = pl.read_parquet("https://vegafusion-datasets.s3.amazonaws.com/vega/movies_1m.parquet")

chart = alt.Chart(pl_df).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    alt.Y("count()")
)
chart

[Image: visualization (1), the rendered chart]

Internally, the polars DataFrame is converted zero-copy into a PyArrow table, which DataFusion then processes. No conversion to or from pandas is involved, which keeps this path fast. Here is a speed comparison:
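The conversion step described above can be sketched with `pyarrow.interchange.from_dataframe`, which is available starting with pyarrow 11.0.0. This is a minimal sketch and not necessarily how VegaFusion performs the conversion internally; the helper name `interchange_to_arrow` is an assumption made for illustration.

```python
def interchange_to_arrow(df, allow_copy: bool = True):
    """Convert an object exposing __dataframe__ into a pyarrow.Table.

    Hypothetical helper for illustration; not VegaFusion's actual code.
    """
    if not hasattr(df, "__dataframe__"):
        raise TypeError(
            f"{type(df).__name__} does not implement the DataFrame Interchange Protocol"
        )

    # Imported lazily so pyarrow is only required when a conversion happens.
    from pyarrow.interchange import from_dataframe

    # allow_copy=False asks pyarrow to fail rather than copy buffers.
    return from_dataframe(df, allow_copy=allow_copy)
```

With the objects from the example above, this would be called as `interchange_to_arrow(pl_df)`. Whether a given conversion is actually zero-copy depends on the column types involved; passing `allow_copy=False` is one way to find out, since the call is then expected to raise instead of copying.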

Polars

%%time
pl_df = pl.read_parquet("https://vegafusion-datasets.s3.amazonaws.com/vega/movies_1m.parquet")

CPU times: user 297 ms, sys: 136 ms, total: 433 ms
Wall time: 1.2 s

%%time
chart = alt.Chart(pl_df).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    alt.Y("count()")
)
vf.transformed_data(chart)

CPU times: user 231 ms, sys: 69.5 ms, total: 301 ms
Wall time: 223 ms

pandas

import pandas as pd

%%time
pd_df = pd.read_parquet("https://vegafusion-datasets.s3.amazonaws.com/vega/movies_1m.parquet")

CPU times: user 454 ms, sys: 203 ms, total: 657 ms
Wall time: 3.2 s

%%time
chart = alt.Chart(pd_df).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    alt.Y("count()")
)
vf.transformed_data(chart)

CPU times: user 558 ms, sys: 101 ms, total: 659 ms
Wall time: 601 ms
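
To put the wall times above side by side, the ratios can be worked out directly. The numbers are copied from the single runs shown; the read step also includes a network download, so these ratios are only indicative. The `speedup` helper is a hypothetical name for illustration.

```python
def speedup(polars_seconds: float, pandas_seconds: float) -> float:
    """Return how many times longer the pandas run took than the polars run."""
    return pandas_seconds / polars_seconds


# Wall times reported above, in seconds: (polars, pandas).
wall_times = {
    "read_parquet": (1.2, 3.2),
    "transformed_data": (0.223, 0.601),
}

for step, (polars_s, pandas_s) in wall_times.items():
    print(f"{step}: pandas / polars = {speedup(polars_s, pandas_s):.1f}x")
```

Both steps come out at roughly 2.7x in these runs.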

Geo Interface

By being more careful about the input data types, this PR also closes #250.
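
One way to picture "being careful about input data types" is an explicit dispatch on the protocols an object exposes. The sketch below is an assumption about the general idea, not the logic in this PR; `classify_input` and the stub classes are hypothetical. It checks `__geo_interface__` first because an object such as a geopandas GeoDataFrame can expose both `__geo_interface__` and `__dataframe__`.

```python
def classify_input(obj) -> str:
    """Classify a chart data source by the protocol it exposes.

    Hypothetical helper for illustration; not VegaFusion's actual code.
    """
    # Geo objects are checked first: they may also expose __dataframe__.
    if hasattr(obj, "__geo_interface__"):
        return "geo"
    if callable(getattr(obj, "__dataframe__", None)):
        return "interchange"
    return "other"


class StubGeo:
    """Stand-in for an object exposing both protocols (illustration only)."""

    __geo_interface__ = {"type": "FeatureCollection", "features": []}

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        raise NotImplementedError


class StubTable:
    """Stand-in for a plain interchange-capable DataFrame (illustration only)."""

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        raise NotImplementedError


print(classify_input(StubGeo()))    # geo
print(classify_input(StubTable()))  # interchange
print(classify_input({"a": [1]}))   # other
```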

jonmmease merged commit 1b5bd6e into main Mar 15, 2023
mattijn mentioned this pull request Mar 15, 2023
Development

Successfully merging this pull request may close these issues.

support geo-interface
Support dataframe protocol