Support Ibis Backend #1105

cosmicBboy · 2023-03-09T16:07:41Z

Is your feature request related to a problem? Please describe.

Pandera currently doesn't support validating data in a persistent datastore (e.g. MySQL, Postgres, etc). It would benefit users to be able to write pandera schemas that can then be compiled to a query language (like SQL), executed on a remote DB, that either:

validates the data in-place, returning an error report if the data is invalid
validates the data and loads it into memory using some framework (e.g. pandas) for further processing
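
As a rough illustration of the first option, here is a minimal sketch that turns a schema into SQL and runs it inside the database, so only the failing values come back. It uses the standard-library `sqlite3` module as a stand-in for a remote DB. `SCHEMA`, `compile_check`, and `validate_in_place` are hypothetical names made up for this example; they are not pandera or ibis APIs.

```python
import sqlite3

# Hypothetical schema: column name -> SQL predicate that valid rows must satisfy.
SCHEMA = {
    "price": "price >= 0",
    "country": "country IN ('US', 'DE', 'JP')",
}


def compile_check(table: str, column: str, predicate: str) -> str:
    """Build a query that selects the values violating one check."""
    # NULLs are treated as failures here; that is a choice made for this sketch.
    return (
        f"SELECT '{column}' AS column_name, CAST({column} AS TEXT) AS failure_case "
        f"FROM {table} WHERE NOT ({predicate}) OR {column} IS NULL"
    )


def validate_in_place(conn: sqlite3.Connection, table: str, schema: dict) -> list:
    """Run every check inside the database and fetch only the failures."""
    query = " UNION ALL ".join(
        compile_check(table, column, predicate)
        for column, predicate in schema.items()
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (price REAL, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(9.5, "US"), (-1.0, "DE"), (3.0, "FR")],
)

# One (column_name, failure_case) tuple per violated check.
error_report = validate_in_place(conn, "orders", SCHEMA)
```

The valid rows never leave the database in this sketch; an empty `error_report` would mean the table passed every check.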

A high-leverage integration to enable this behavior would be with ibis, a data analytics framework that hooks into various backends (duckdb, mysql, postgres, etc).

Describe the solution you'd like

For the MVP integration with ibis:

Implement a schema specification for ibis
Implement a backend validator for ibis
Support writing custom checks

Describe alternatives you've considered
NA

cpcloud · 2023-03-13T16:07:16Z

@cosmicBboy Hey 👋🏻!

This looks pretty interesting!

Is there anything we can do over in ibis to help enable this? Happy to help!

cosmicBboy · 2023-03-13T22:06:00Z

Thanks @cpcloud ! The pandera internals re-write is still happening (the last PR should be merged soon #1109), after which I'm gonna start chipping away at a pandera-ibis package to see how well the rewritten internals fit the ibis programming model.

> Is there anything we can do over in ibis to help enable this? Happy to help!

At this stage some conceptual help would be much appreciated! The main uncertainty in my mind is how well the current pandera abstractions fit into ibis.

It would be awesome if the ibis team can take a look at the Schema and Schema Components classes described here and answer this high-level question:

Roughly speaking, how do pandas abstractions map into ibis?

For example:

pandas -> ibis

DataFrame -> Table
Series -> ?
Column -> ?
Index -> ?
MultiIndex -> ?

And a follow-up to this would be:

Do the pandera schema and schema components specification cover most of the properties that users would like to validate in an ibis table?

And finally, because pandera relies a lot on user-defined validation Checks:

How do ibis users specify custom operations that they want to do on tables/columns?

cosmicBboy · 2023-07-15T19:09:39Z

@gforsyth this tracks the ibis integration! I'll circle back when I have capacity to get started on an integration in earnest

csubhodeep · 2024-06-10T15:59:10Z

Hello all, my team and I are working on a couple of data projects, and for quite a few months we have been using pandera heavily for most of our row-wise data validation tasks.

Lately, we have moved to sourcing all our tables from a data warehouse, and our data size has grown (some tables have >50 columns and range from 100 to 250 GB). Since we did not want to refactor most of our transformation steps, which were originally written in pandas, we migrated to ibis.
While most of the migration was smooth, we are now facing difficulties in pulling the entire table into memory as a pandas.DataFrame and then validating it using pandera (which seems quite natural).

We did explore some hand-rolled alternatives, implementing a thin wrapper around pandera's validate method that pulls the table in chunks and validates each chunk. But it was not as efficient as we would have liked.
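
For readers unfamiliar with the pattern, a chunked wrapper of this kind can be sketched roughly as follows. This is not the code described above: it uses the standard-library `sqlite3` module in place of a warehouse and a plain Python predicate in place of a pandera schema, and `iter_chunks` and `validate_in_chunks` are hypothetical names.

```python
import sqlite3
from typing import Callable, Iterator


def iter_chunks(cursor: sqlite3.Cursor, chunk_size: int) -> Iterator[list]:
    """Yield successive lists of at most chunk_size rows from a cursor."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            return
        yield rows


def validate_in_chunks(
    cursor: sqlite3.Cursor,
    check_row: Callable[[tuple], bool],
    chunk_size: int = 2,
) -> list:
    """Validate chunk by chunk so only one chunk is held in memory at a time."""
    failures = []
    offset = 0  # running row position, because positions restart in each chunk
    for chunk in iter_chunks(cursor, chunk_size):
        # In a pandas/pandera setup, this is where the chunk would be turned
        # into a DataFrame and passed to the schema.
        for i, row in enumerate(chunk):
            if not check_row(row):
                failures.append((offset + i, row))
        offset += len(chunk)
    return failures


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("b", -2.0), ("c", 3.0), ("d", -4.0), ("e", 5.0)],
)
cursor = conn.execute("SELECT sensor, value FROM readings ORDER BY sensor")

# Rows with a negative value fail the check.
chunk_failures = validate_in_chunks(cursor, lambda row: row[1] >= 0)
```

Memory stays bounded, but every row is still transferred to the client, which is a plausible reason such a wrapper is slower than running the checks in the query engine.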

Then, to my surprise, I found this thread!

I know that a few core developers from the ibis project have started putting some effort (#1451) into supporting a pandera-ibis integration. While I am sure this will be a great addition to the pandera ecosystem, I have a few questions:

When we use pandera with ibis, would we be able to validate the data using the compute power of the query engine that the ibis backend is connected to (here I assume we are only connected to DB engine-like backends such as DuckDB, BigQuery, etc.)?
Would it be possible to do the validation without pulling the entire table into memory? (i.e. not materializing it as a memtable or pandas.DataFrame under the hood)
If yes, what would be the type and schema of the failure_cases?
- Type as in an in-memory data structure or a table in the backend? - Like a temporary/persistent table/view on the DB.
- Schema as in currently by default it provides an "index" column to help the developers find which row has failed for which check. But usually a table in the backend may or may not have a primary key that can be used to identify the row uniquely. How would the failure_cases table present this information?

Nonetheless, I am eagerly waiting for this feature to be rolled out so that my team can get their hands on it.

P.S. I am not an expert in pandera or ibis internals, just a happy & enthusiastic user. 😄

Thanks !

deepyaman · 2024-06-13T16:08:20Z

> When we use pandera with ibis, would we be able to validate the data using the compute power of the query engine that the ibis backend is connected to (here I assume we are only connected to DB engine-like backends such as DuckDB, BigQuery, etc.)?

Yes!

> Would it be possible to do the validation without pulling the entire table into memory? (i.e. not materializing it as a memtable or pandas.DataFrame under the hood)

Also yes! At least, the goal is to offload as much computation as possible to the Ibis backend.

> If yes, what would be the type and schema of the failure_cases?
>
> - Type as in an in-memory data structure or a table in the backend? - Like a temporary/persistent table/view on the DB.
> - Schema as in currently by default it provides an "index" column to help the developers find which row has failed for which check. But usually a table in the backend may or may not have a primary key that can be used to identify the row uniquely. How would the failure_cases table present this information?

I will need to look into this; I just tested using the DuckDB backend for the example in #1451, and failure_cases is a pandas dataframe. However, I don't recall trying to actually implement anything for failure_cases, so I will need to do some digging and figure out (1) what type is failure_cases for other pandera backends and (2) is there any reason it can't be an Ibis table for the Ibis backend. I imagine it could be an Ibis table, but I'm not knowledgeable enough on pandera right now to say so confidently.
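
To make the open question concrete, here is one possible shape for failure cases that live in the backend and do not depend on a primary key. It is purely a sketch under stated assumptions, not what pandera or the work in #1451 does: it uses the standard-library `sqlite3` module, and the view layout (`column_name`, `check_name`, `failure_case`, `n_rows`) is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("a@x.io", 31), ("b@x.io", -5), ("c@x.io", -5), ("d@x.io", 200)],
)

# The failure cases stay in the database as a temporary view. Instead of a
# row index, each distinct failing value is reported with a row count, so no
# primary key is required.
conn.execute(
    """
    CREATE TEMP VIEW failure_cases AS
    SELECT 'age' AS column_name,
           'in_range(0, 150)' AS check_name,
           age AS failure_case,
           COUNT(*) AS n_rows
    FROM users
    WHERE age NOT BETWEEN 0 AND 150
    GROUP BY age
    """
)

# Only this small summary is pulled into memory.
summary = conn.execute(
    "SELECT failure_case, n_rows FROM failure_cases ORDER BY failure_case"
).fetchall()
```

The trade-off in this shape is that it says which values failed and how often, but not which individual rows; identifying rows would still need a key or some other unique column.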

Hope that helps a bit! And sorry I have left the Ibis backend work in a dangling state; a number of priorities keep coming up, but I do hope to resume progress on it soon!

csubhodeep · 2024-06-13T22:46:25Z

Thanks a lot @deepyaman for your insights! Looking forward 🤞🏽 😃

cosmicBboy added the enhancement label (New feature or request) Mar 9, 2023

cosmicBboy mentioned this issue Mar 14, 2023

Allow fallback coercion function as column/field argument #1082

Closed

MarcoGorelli mentioned this issue Mar 17, 2023

potentially relevant usage patterns / targets for a developer-focused API data-apis/dataframe-api#71

Open

cosmicBboy mentioned this issue Oct 2, 2023

Design Data Types Library That Supports Both PySpark & Pandas #1360

Open

deepyaman mentioned this issue Dec 20, 2023

Implement basic validation backend for Ibis tables #1451

Open
