Support Ibis Backend #1105

Open
cosmicBboy opened this issue Mar 9, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@cosmicBboy
Collaborator

cosmicBboy commented Mar 9, 2023

Is your feature request related to a problem? Please describe.

Pandera currently doesn't support validating data in a persistent datastore (e.g. MySQL, Postgres, etc). It would benefit users to be able to write pandera schemas that can then be compiled to a query language (like SQL) and executed on a remote DB, so that the query either:

  1. validates the data in-place, returning an error report if the data is invalid
  2. validates the data and loads it into memory using some framework (e.g. pandas) for further processing

A high-leverage integration to enable this behavior would be with ibis, a data analytics framework that hooks into various backends (duckdb, mysql, postgres, etc).
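To make option 1 (validating in place) concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a remote database rather than ibis, and the `compile_check` and `validate_in_place` helpers and the `orders` table are hypothetical names made up for illustration; they are not pandera or ibis APIs.

```python
import sqlite3


def compile_check(table: str, column: str, predicate: str) -> str:
    """Compile one column check into a SQL query that counts violating rows.

    `predicate` is a SQL boolean expression that valid rows satisfy.
    NULL values are counted as failures in this sketch.
    """
    return (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE NOT ({predicate}) OR {column} IS NULL"
    )


def validate_in_place(con: sqlite3.Connection, table: str, checks: dict) -> dict:
    """Run every compiled check inside the database; only counts come back."""
    report = {}
    for column, predicate in checks.items():
        query = compile_check(table, column, predicate)
        (n_failures,) = con.execute(query).fetchone()
        if n_failures:
            report[column] = n_failures
    return report


# Hypothetical example data standing in for a table in a remote datastore.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, -3.0), (3, 20.0)]
)

checks = {"id": "id > 0", "amount": "amount >= 0"}
report = validate_in_place(con, "orders", checks)
print(report)  # {'amount': 1}
```

The point of the sketch is that the rows never leave the database: each check becomes one aggregate query, and only the error report is returned to the client.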

Describe the solution you'd like

For the MVP integration with ibis:

  • Implement a schema specification for ibis
  • Implement a backend validator for ibis
  • Support writing custom checks
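As a rough illustration of how these three pieces could fit together, the sketch below separates a schema specification, a backend validator, and user-defined checks. Everything here is an assumption for illustration only: `Column`, `TableSchema`, `SQLiteValidator`, `greater_than`, and `matches_glob` are hypothetical names, and sqlite3 stands in for an ibis backend.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable

# A check takes a column name and returns a SQL predicate that valid rows satisfy.
Check = Callable[[str], str]


def greater_than(value: float) -> Check:
    """Built-in style check: column values must exceed `value`."""
    return lambda col: f"{col} > {value}"


def matches_glob(pattern: str) -> Check:
    """Example of a custom, user-written check (SQLite GLOB pattern match)."""
    return lambda col: f"{col} GLOB '{pattern}'"


@dataclass
class Column:
    """Schema component: a column name plus the checks that apply to it."""
    name: str
    checks: list = field(default_factory=list)


@dataclass
class TableSchema:
    """Schema specification: a table name and its column components."""
    table: str
    columns: list


class SQLiteValidator:
    """Backend validator: turns each check into a query run by the database."""

    def __init__(self, con: sqlite3.Connection) -> None:
        self.con = con

    def validate(self, schema: TableSchema) -> list:
        failures = []
        for column in schema.columns:
            for check in column.checks:
                predicate = check(column.name)
                query = (
                    f"SELECT COUNT(*) FROM {schema.table} WHERE NOT ({predicate})"
                )
                (n_failed,) = self.con.execute(query).fetchone()
                if n_failed:
                    failures.append((column.name, predicate, n_failed))
        return failures


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (age INTEGER, email TEXT)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(25, "a@x.org"), (0, "b@x.org"), (40, "not-an-email")],
)

schema = TableSchema(
    table="users",
    columns=[
        Column("age", [greater_than(0)]),
        Column("email", [matches_glob("*@*")]),
    ],
)
failures = SQLiteValidator(con).validate(schema)
print(failures)
```

Keeping the schema objects free of any database logic means the same specification could, in principle, be handed to a different backend validator.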

Describe alternatives you've considered
NA

@cosmicBboy cosmicBboy added the enhancement New feature or request label Mar 9, 2023
@cpcloud

cpcloud commented Mar 13, 2023

@cosmicBboy Hey 👋🏻!

This looks pretty interesting!

Is there anything we can do over in ibis to help enable this? Happy to help!

@cosmicBboy
Collaborator Author

Thanks @cpcloud! The pandera internals re-write is still happening (the last PR, #1109, should be merged soon), after which I'm gonna start chipping away at a pandera-ibis package to see how well the rewritten internals fit the ibis programming model.

> Is there anything we can do over in ibis to help enable this? Happy to help!

At this stage some conceptual help would be much appreciated! The main uncertainty in my mind is how well the current pandera abstractions fit into ibis.

It would be awesome if the ibis team can take a look at the Schema and Schema Components classes described here and answer this high-level question:

Roughly speaking, how do pandas abstractions map into ibis?

For example:

pandas -> ibis

  • DataFrame -> Table
  • Series -> ?
  • Column -> ?
  • Index -> ?
  • MultiIndex -> ?

And a follow-up to this would be:

Do the pandera schema and schema components specification cover most of the properties that users would like to validate in an ibis table?

And finally, because pandera relies a lot on user-defined validation Checks:

How do ibis users specify custom operations that they want to do on tables/columns?

@cosmicBboy
Collaborator Author

@gforsyth this tracks the ibis integration! I'll circle back when I have capacity to get started on an integration in earnest

@csubhodeep

Hello all, my team and I are working on a couple of data projects, and for several months we have been using pandera heavily for most of our row-wise data validation tasks.

Lately, we have moved to sourcing all our tables from a data warehouse, and our data size has grown (some tables have more than 50 columns and range from 100 to 250 GB). Since we did not want to refactor most of our transformation steps, which were originally written in pandas, we migrated to ibis.
While most of the migration was smooth, we are now facing difficulties in pulling the entire table into memory as a pandas.DataFrame and then validating it using pandera (which seems quite natural).

We did explore some hand-rolled alternatives, implementing a thin wrapper around pandera's validate method that pulls the table in chunks and validates each chunk. But it was not as efficient as we would have liked.
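For readers unfamiliar with the approach, a chunked wrapper of this kind might look roughly like the sketch below. It is not our actual code: sqlite3 stands in for the warehouse, a plain `row_is_valid` callable stands in for a pandera schema's validate call on each chunk, and `iter_chunks` and `validate_in_chunks` are hypothetical helper names.

```python
import sqlite3
from typing import Callable, Iterator


def iter_chunks(
    con: sqlite3.Connection, query: str, chunk_size: int
) -> Iterator[list]:
    """Yield query results in lists of at most `chunk_size` rows."""
    cursor = con.execute(query)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            return
        yield chunk


def validate_in_chunks(
    con: sqlite3.Connection,
    query: str,
    row_is_valid: Callable[[tuple], bool],
    chunk_size: int = 1000,
) -> list:
    """Validate one chunk at a time so the full table is never held in memory."""
    failures = []
    for chunk in iter_chunks(con, query, chunk_size):
        failures.extend(row for row in chunk if not row_is_valid(row))
    return failures


# Hypothetical example table and check.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 10.0), ("b", 250.0), ("c", 55.5), ("d", -4.0), ("e", 99.9)],
)

failed_rows = validate_in_chunks(
    con,
    "SELECT sensor, value FROM readings",
    row_is_valid=lambda row: 0 <= row[1] <= 100,
    chunk_size=2,
)
print(failed_rows)
```

Memory stays bounded by the chunk size, but every row is still transferred to the client and checked there, which is one plausible reason this pattern scales poorly on tables of this size.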

Then, to my surprise, I found this thread!

I know that a few core developers from the ibis project have started putting some effort (#1451) into supporting a pandera-ibis integration. While I am sure this will be a great addition to the pandera ecosystem, I have a few questions:

  1. When we use pandera with ibis, will we be able to validate the data using the compute power of the query engine that the ibis backend is connected to? (Here I assume we are only connected to database-engine backends such as DuckDB, BigQuery, etc.)
  2. Would it be possible to do the validation without pulling the entire table into memory (i.e. without materializing it as a memtable or pandas.DataFrame under the hood)?
  3. If yes, what would be the type and schema of the failure_cases?
    • Type: would it be an in-memory data structure or a table in the backend, such as a temporary or persistent table or view on the DB?
    • Schema: currently, by default, failure_cases provides an "index" column to help developers find which row failed which check. But a table in the backend may or may not have a primary key that identifies each row uniquely. How would the failure_cases table present this information?
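To illustrate what I mean in question 3, here is one possible shape for backend-side failure cases, purely as a sketch and not a description of how pandera or ibis behave. It uses sqlite3 as a stand-in backend, and the `events` table, the `failure_cases` view, and its column names are all hypothetical. Since the table has no primary key, the view carries the failing row's own values instead of an "index" column.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical table without a primary key.
con.execute("CREATE TABLE events (user_name TEXT, score INTEGER)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ann", 50), ("bob", 120), ("cy", -1)],
)

# Failure cases defined as a temporary view that lives in the backend.
# Each record names the column and check that failed and repeats the row's
# values, because there is no key to point back at the original row.
con.execute(
    """
    CREATE TEMP VIEW failure_cases AS
    SELECT
        'score' AS column_name,
        'score BETWEEN 0 AND 100' AS check_name,
        user_name,
        score
    FROM events
    WHERE NOT (score BETWEEN 0 AND 100)
    """
)

# Only a count and a small sample need to be pulled into memory.
n_failures = con.execute("SELECT COUNT(*) FROM failure_cases").fetchone()[0]
sample = con.execute("SELECT * FROM failure_cases LIMIT 10").fetchall()
print(n_failures, sample)
```

With this shape the failure cases stay queryable in the database, and the client decides how much of them to materialize.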

Nonetheless, I am eagerly waiting for this feature to be rolled out so that my team can get their hands on it.

P.S. I am not an expert in pandera or ibis internals, just a happy & enthusiastic user. 😄

Thanks!


3 participants