Support for type inference of dataframes using the DataFrame Interchange Protocol #3112

Closed
mattijn opened this issue Jul 17, 2023 · 7 comments · Fixed by #3114

Comments

@mattijn
Contributor

mattijn commented Jul 17, 2023

As was raised in #3109, the DataFrame Interchange Protocol is still experimental and currently lacks features such as type inference.

The current type inference for pandas dataframes cannot be used for all dataframes (that is, for those that are parsed through the DataFrame Interchange Protocol).

Altair has adopted pyarrow for support of the DataFrame Interchange Protocol, so these pyarrow data types will need to be mapped to the encoding data types available in Altair.

The current implementation of type inference for columns in Pandas DataFrames happens around here, which calls this infer_vegalite_type function.
This function needs expansion. Initially it is probably best to do it side-by-side so we keep the current implementation for pandas dataframes and a new implementation for dataframes that are parsed through the DataFrame Interchange Protocol.

Some example data of a pyarrow table that can be used during development of this feature request:

import pyarrow as pa
from datetime import datetime

# One column per Altair encoding type we want to be able to infer
dt_quantitative = pa.array([2, 4, 5])
dt_nominal = pa.array(["flamingo", "horse", "centipede"])
dt_temporal = pa.array([datetime(2004, 8, 1), datetime(2004, 9, 1), datetime(2004, 10, 1)])
dt_categorical = pa.array(['A', 'B', 'A'], pa.string()).dictionary_encode()
names = ["q", "n", "t", "c"]

pa_table = pa.Table.from_arrays([dt_quantitative, dt_nominal, dt_temporal, dt_categorical], names=names)
for col in pa_table.columns:
    print(col.type)
# Output:
# int64
# string
# timestamp[us]
# dictionary<values=string, indices=int32, ordered=0>

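To make the target of this request concrete, here is a minimal sketch of a mapping from the pyarrow type names printed above to Vega-Lite encoding types. The function name `vegalite_type_from_arrow_name` is hypothetical and is not part of Altair. The sketch matches on the printed type name only so that it stays dependency-free; real code would more likely use the `pa.types.is_*` predicates. The treatment of booleans as nominal and of ordered dictionaries as ordinal is an assumption.

```python
import re


def vegalite_type_from_arrow_name(arrow_type) -> str:
    """Map a pyarrow type (or its printed name) to a Vega-Lite encoding type.

    Hypothetical helper for illustration; it inspects str(arrow_type) only.
    """
    name = str(arrow_type)

    if name.startswith("dictionary<"):
        # Assumption: an ordered dictionary prints "ordered=1", mirroring
        # the "ordered=0" shown in the example output above.
        return "ordinal" if "ordered=1" in name else "nominal"

    if name.startswith(("timestamp", "date32", "date64")):
        return "temporal"

    is_integer = re.fullmatch(r"u?int(8|16|32|64)", name) is not None
    is_float = name in {"halffloat", "float", "double"}
    if is_integer or is_float or name.startswith("decimal"):
        return "quantitative"

    if name in {"string", "large_string", "bool"}:
        return "nominal"

    raise ValueError(f"No Vega-Lite type mapping for Arrow type {name!r}")


# The four type names printed by the example table above
for type_name in [
    "int64",
    "string",
    "timestamp[us]",
    "dictionary<values=string, indices=int32, ordered=0>",
]:
    print(type_name, "->", vegalite_type_from_arrow_name(type_name))
```

Raising on unknown types, rather than silently falling back to nominal, is a design choice for the sketch: it makes gaps in the mapping visible during development.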
@jonmmease
Contributor

One thing to be careful about here: I don't want to force the evaluation of lazy dataframe-like objects just to get the schema. For example, we don't want to trigger a full Ibis query just to get the schema info out.

For plain Altair, this isn't a big deal as long as we convert to Arrow at the same time as extracting the schema info. But for the "vegafusion" data transformer to have the chance to push computation down to the native data structure (e.g. into Ibis eventually), we don't want to convert the whole thing to Arrow up front.

@jcrist, do you know if there's a way to get schema info from the DataFrame Interchange Protocol without triggering an Ibis query?

If not, we might need to add some specialized schema extraction logic (which doesn't trigger full evaluation) for the backends that VegaFusion supports.

@jcrist

jcrist commented Jul 19, 2023

> @jcrist, do you know if there's a way to get schema info from the DataFrame Interchange Protocol without triggering an Ibis query?

Not with the way we currently implement the __dataframe__ protocol. With a small-ish amount of work we could change that though, if given a motivating use case (e.g. altair making use of those APIs). In that case we'd implement our own shim layer to make extracting certain query information possible without executing the query (schema, column names, ...). If you think this is the best way forward I'd be happy to add this feature.
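The shim idea described above could look roughly like the following. This is only a sketch under stated assumptions, not how Ibis implements `__dataframe__`: the class names `LazyInterchangeFrame` and `LazyColumn` and the `materialize` method are hypothetical, and only `column_names()`, `get_column_by_name()`, and the `dtype` tuple follow the interchange protocol. The point is that schema questions are answered from metadata while the query callable is invoked only when data is actually needed.

```python
class LazyColumn:
    """Column stand-in that knows its dtype but holds no data."""

    def __init__(self, dtype):
        self._dtype = dtype

    @property
    def dtype(self):
        # (kind, bit width, format string, endianness), as in the protocol
        return self._dtype


class LazyInterchangeFrame:
    """Hypothetical shim: schema comes from metadata, rows come on demand."""

    def __init__(self, schema, execute):
        self._schema = dict(schema)  # column name -> dtype tuple
        self._execute = execute      # callable that runs the real query
        self._result = None

    def column_names(self):
        # Answered from the stored schema; no query is executed.
        return list(self._schema)

    def get_column_by_name(self, name):
        # Also metadata-only.
        return LazyColumn(self._schema[name])

    def materialize(self):
        # Hypothetical entry point for data access: runs the query once
        # and caches the result for later calls.
        if self._result is None:
            self._result = self._execute()
        return self._result


executions = []


def run_query():
    executions.append("ran")
    return [{"a": 1}, {"a": 2}]


frame = LazyInterchangeFrame({"a": (0, 64, "l", "=")}, run_query)
print(frame.column_names(), frame.get_column_by_name("a").dtype[0])
print("queries executed so far:", len(executions))
```

In a real implementation the protocol's data-access methods would play the role that `materialize` plays here.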

If vegafusion has its own abstract layer though (as written about in vega/vegafusion#355), wouldn't you immediately convert to that wrapper class and use the generic apis described there instead? Or does altair still need access to the schema separately?

@jonmmease
Contributor

> If vegafusion has its own abstract layer though (as written about in vega/vegafusion#355), wouldn't you immediately convert to that wrapper class and use the generic apis described there instead? Or does altair still need access to the schema separately?

VegaFusion will always be optional for Altair, so we do need a way to support this in Altair core. I think the ideal situation is that core Altair only knows about the __dataframe__ protocol. And all of the specialization logic (e.g. automatically wrapping an Ibis table in a VegaFusion IbisDataset) is in VegaFusion.

I need to read the spec in more detail, do you know of any examples of using __dataframe__ to pull out column type info? Or is the shim approach you mentioned something separate from the spec?

If there's a path to libraries providing the type through __dataframe__ without materialization, I'd be very interested in updating Altair to use this approach for schema inference.

@jonmmease
Contributor

Oh, never mind. Just playing with pandas and pyarrow this does look straightforward from the Altair side:

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": ["A", "BB", "CCC"]})
dfi = df.__dataframe__()
dfi.column_names()
# Index(['a', 'b'], dtype='object')
dt = dfi.get_column_by_name('b').dtype
dt[0].name
# STRING
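Building on the calls above, schema inference for any object exposing `__dataframe__` might be sketched as follows. The function name `infer_vegalite_types_from_interchange` is hypothetical and this is not the implementation that landed in Altair. The `DtypeKind` enum is redefined locally, with the values listed in the interchange protocol specification, so the sketch has no dependencies; mapping booleans to nominal and ordered categoricals to ordinal is an assumption.

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    """Local copy of the dtype kinds enumerated in the interchange spec."""

    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


_KIND_TO_VEGALITE = {
    DtypeKind.INT: "quantitative",
    DtypeKind.UINT: "quantitative",
    DtypeKind.FLOAT: "quantitative",
    DtypeKind.BOOL: "nominal",
    DtypeKind.STRING: "nominal",
    DtypeKind.DATETIME: "temporal",
}


def infer_vegalite_types_from_interchange(dfi) -> dict:
    """Return {column name: Vega-Lite type} for an interchange object.

    Only column metadata is read (names, dtype, categorical description);
    no buffers are requested.
    """
    types = {}
    for name in dfi.column_names():
        column = dfi.get_column_by_name(name)
        # dtype is a tuple whose first element is the dtype kind
        kind = DtypeKind(column.dtype[0])
        if kind is DtypeKind.CATEGORICAL:
            # describe_categorical is a property returning a dict in the spec
            ordered = column.describe_categorical["is_ordered"]
            types[name] = "ordinal" if ordered else "nominal"
        else:
            types[name] = _KIND_TO_VEGALITE[kind]
    return types
```

With pandas installed, calling `infer_vegalite_types_from_interchange(df.__dataframe__())` on the dataframe above would be the intended usage.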

@jonmmease
Contributor

@mattijn, it looks like we don't need pyarrow to use the __dataframe__ protocol to get the column type info. So we may be able to use this approach all the time (at least with new enough pandas versions, I haven't looked into when it was added to pandas).

I'll give this a try soon.

@mattijn
Contributor Author

mattijn commented Jul 19, 2023

Nice! Good find👍

@jcrist

jcrist commented Jul 19, 2023

> I need to read the spec in more detail, do you know of any examples of using __dataframe__ to pull out column type info? Or is the shim approach you mentioned something separate from the spec?

The spec makes this possible (as you found in a later comment), but I don't know of any libraries currently consuming this in a way where making this lazy on ibis's side would be useful. The altair use case here would be the first one. If y'all want to go down this path I'll push up a patch to ibis so that calling these methods won't require executing the query.
