Supporting different dataframe libraries #719
-
Thanks for starting the discussion! Let's look at an example -
So, I don't think that the Consortium's DataFrame Standard (in its current form at least) would be sufficient for this task, because:
I'm tempted to suggest you just use the interchange protocol to convert to pandas, using
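The interchange route could look something like the following minimal sketch. The `to_pandas` helper name is hypothetical (not skrub code), but `pandas.api.interchange.from_dataframe` is the real entry point, available in pandas >= 1.5:

```python
import pandas as pd

def to_pandas(df):
    """Convert any dataframe supporting the interchange protocol to pandas.

    Hypothetical helper sketching the suggestion above, not actual skrub code.
    """
    if isinstance(df, pd.DataFrame):
        return df
    # Any object exposing __dataframe__ (polars, pyarrow, cudf, ...) can be
    # converted through the interchange protocol (pandas >= 1.5).
    return pd.api.interchange.from_dataframe(df)
```

With such a boundary function, the rest of the codebase could stay pandas-only while still accepting other inputs, at the cost of a copy on conversion.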
-
Coming back from travels. My thoughts:
Question 1 (what to do with an input that is not a pandas dataframe): my answer is to use the dataframe API to convert it to pandas (there are interesting suggestions above on how to do it, e.g. by @MarcoGorelli). But maybe worry about this after release 0.1. Consider vendoring.
Question 2 (how to support many dataframe implementations): my answer is to favor the dataframe API, but consider special-casing a few implementations (for now I have in mind pandas and polars) to have efficient codebases (e.g. lazy evaluation for polars, .query for pandas). Also, cover these in tests.
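The special-casing idea from Question 2 could be sketched like this. All names here are illustrative, not actual skrub code; the sketch assumes dispatching on the input's module so neither library has to be imported up front:

```python
def dataframe_module_name(df):
    """Top-level module of a dataframe object, e.g. 'pandas' or 'polars'."""
    return type(df).__module__.split(".", 1)[0]

def drop_null_rows(df):
    """Hypothetical transformer step with one efficient path per library."""
    module = dataframe_module_name(df)
    if module == "pandas":
        return df.dropna()
    if module == "polars":
        # drop_nulls works on both polars.DataFrame and polars.LazyFrame,
        # so the polars path can stay lazy as suggested above.
        return df.drop_nulls()
    raise TypeError(f"Unsupported dataframe type: {type(df)!r}")
```

The cost of this pattern is one branch per supported library in every transformer, which is why it is only proposed for a small, tested set of implementations.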
-
Hey all,
As we progressively broaden the scope of skrub, some long-term questions need to be addressed.
Question 1
We initially intended to work with pandas dataframes and numpy arrays, as scikit-learn does. However, with the rise of polars, duckdb, and many more, we have to wonder which modules to support.
We can choose to stay lean and only add polars for now, but what do we want in the long term?
[Edit]
It's important to note that, for the moment, all skrub transformers convert dataframes to numpy arrays. Some transformers, like the MinHashEncoder or the DatetimeEncoder, don't use scikit-learn estimators and could work with dataframes instead of numpy arrays, though. For transformers that rely on scikit-learn, like the GapEncoder, the TargetEncoder, and the Joiner, we need to make a decision. Should we:
Question 2
We shouldn't try to adapt the codebase to each dataframe module because it would quickly become unsustainable. Instead, we have two options:
Option 1: The dataframe API
As introduced during EuroScipy 2023 by @MarcoGorelli, the dataframe API is now available for both pandas and polars dataframes, plus polars lazyframes!
We can use it as described in the API specs.
However, the dataframe API is still in its early days, and important features such as joining are not there (yet?).
Option 2: Ibis
Ibis has been around for some time and supports a wide array of backends.
Because it compiles to SQL, it can also directly query databases like PostgreSQL or Snowflake, which is a significant feature in the long term if we want to operate on remote databases.
I also like its simpler syntax (except for the setup boilerplate) and the features that are already there.
Both solutions introduce an additional dependency (dataframe_api_compat or ibis) on top of the dataframe libraries themselves.
What do you all think of this? Do you see some other ways forward?