Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement basic validation backend for Ibis tables #1451

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

deepyaman
Copy link
Contributor

@deepyaman deepyaman commented Dec 20, 2023

Taking a stab at #1105.

No need to review, still very much a WIP. Working on implementing the backend classes now.

Initial goal is to get this example running:

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
>>>
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>>
>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })
>>>
>>> schema_withchecks.validate(t)[["probability", "category"]]
   probability category
0         0.10      dog
1         0.40      dog
2         0.52      cat
3         0.23     duck
4         0.80      dog
5         0.76      dog

See #1451 (comment) for the current state.

Copy link

codecov bot commented Dec 20, 2023

Codecov Report

Attention: Patch coverage is 77.66323% with 65 lines in your changes missing coverage. Please review.

Project coverage is 82.38%. Comparing base (812b2a8) to head (132e803).
Report is 111 commits behind head on main.

Files Patch % Lines
pandera/backends/ibis/base.py 29.26% 29 Missing ⚠️
pandera/api/ibis/model.py 73.33% 12 Missing ⚠️
pandera/backends/ibis/container.py 78.26% 10 Missing ⚠️
pandera/engines/ibis_engine.py 80.00% 9 Missing ⚠️
pandera/backends/ibis/components.py 93.33% 3 Missing ⚠️
pandera/api/polars/container.py 0.00% 1 Missing ⚠️
pandera/api/polars/model.py 0.00% 1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (812b2a8) and HEAD (132e803). Click for more details.

HEAD has 103 uploads more than BASE | Flag | BASE (812b2a8) | HEAD (132e803) | |------|------|------| ||48|151|
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1451       +/-   ##
===========================================
- Coverage   94.28%   82.38%   -11.91%     
===========================================
  Files          91      127       +36     
  Lines        7013     8962     +1949     
===========================================
+ Hits         6612     7383      +771     
- Misses        401     1579     +1178     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cosmicBboy
Copy link
Collaborator

This is a great start @deepyaman ! I just created a ibis-dev branch that we can merge all of our work into: https://github.com/unionai-oss/pandera/tree/ibis-dev

@cosmicBboy cosmicBboy changed the base branch from main to ibis-dev December 22, 2023 15:14
@deepyaman
Copy link
Contributor Author

deepyaman commented Jan 5, 2024

@cosmicBboy Happy belated New Year! Hope you enjoyed the holidays.

I made some (admittedly-slow) progress on the Ibis backend, and I'd be happy to get a review to make sure things are on the right track. Been learning a lot about how Pandera works.

This is very incomplete, but I think I'm implemented a happy and unhappy path that work:

Happy path

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
>>> 
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
r0 := InMemoryTable
  data:
    PandasDataFrameProxy:
         probability category
      0         0.10      dog
      1         0.40      dog
      2         0.52      cat
      3         0.23     duck
      4         0.80      dog
      5         0.76      dog

Selection[r0]
  selections:
    probability: r0.probability
    category:    r0.category
>>> 

Unhappy path

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
>>> 
>>> df = pd.DataFrame({
...     "probability": [1, 4, 52, 23, 8, 76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/ibis/container.py", line 80, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 72, in validate
    error_handler.collect_error(
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 103, in run_schema_component_checks
    result = schema_component.validate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/pandas/components.py", line 169, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/components.py", line 69, in validate
    error_handler.collect_error(  # Why indent (unlike in container.py)?
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected column 'probability' to have type float64, got int64
>>> 

As a next step (if this looks good), I can probably work on writing tests for this functionality, and then implement additional checks (with the goal of getting the example in the PR description working first). Probably start with getting the value check on the same column (pa.Check(lambda s: (s >= 0) & (s <= 1))) working.

for result in results:
if result.passed:
continue
# Why cast `results` only in components.py, not in container.py?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmicBboy This is just a question I had and haven't yet answered myself, but I've tried to keep things consistent with the other backends.

Suggested change
# Why cast `results` only in components.py, not in container.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite remember, but I suspect the answer is as silly as: mypy complained about the type here and not in container.py 😅

check_output=result.check_output,
reason_code=result.reason_code,
)
error_handler.collect_error( # Why indent (unlike in container.py)?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmicBboy This is another thing I don't understand yet; in the case result.schema_error is not None, would the error handler not be triggered in component.py? But, in a very similar case, it would be triggered in container.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks like a bug!

type: Any = dataclasses.field(repr=False, init=False)
"""Native Ibis dtype boxed by the data type."""

def __init__(self, dtype: Any):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think I've written any code that hits this yet.

@deepyaman deepyaman marked this pull request as ready for review January 5, 2024 02:35
@datajoely
Copy link

I was searching to see if this exists and what a delight to see your name here @deepyaman thanks for kicking this off!

@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Feb 19, 2024

Thanks for this PR @deepyaman thanks so much for all your work on kicking this effort off!

I'm still in the process of looking through all the changes in this PR, but the examples you provide here is the right direction. If you can get some tests for these written, we can check this into the ibis-dev branch so we can get the initial foothold on this feature merged into the main repo.

Super excited to get this shipped! 🚀

@deepyaman
Copy link
Contributor Author

Thanks for this PR @deepyaman thanks so much for all your work on kicking this effort off!

I'm still in the process of looking through all the changes in this PR, but the examples you provide here is the right direction. If you can get some tests for these written, we can check this into the ibis-dev branch so we can get the initial foothold on this feature merged into the main repo.

Super excited to get this shipped! 🚀

@cosmicBboy Sounds great! I'm on holiday/unavailable until the end of the month, but I will prioritize adding the tests after that (and then continue making incremental process)!

@cosmicBboy
Copy link
Collaborator

Cool, enjoy your time off! Btw the beta release for polars support is out https://github.com/unionai-oss/pandera/releases/tag/v0.19.0b0

Part of it comes with a pandera.api.dataframe module that generalizes some of the common functionality between pandas and polars. It currently only does this for DataFrameModel, but I'm planning on doing the same for DataFrameSchema.

You may want to leverage these classes in your implementation to reduce repetitive logic, depending on how similar/different it is from polars/pandas.

@datajoely
Copy link

Hey @cosmicBboy @deepyaman I'd like to jump in and get this over the line. It's a little hard to follow where's best to get stuck in, do either of you have any recommendations?

@cosmicBboy
Copy link
Collaborator

cosmicBboy commented May 15, 2024

Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:

Work in each of these sections is fairly parallelizable. To help with implementation, pandera also provides library-agnostic base classes for some of the common class definitions:

This PR already implements a bunch of the pieces above. Probably the best way to start is to run the happy path code path described here, and start poking around.

@deepyaman
Copy link
Contributor Author

Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:

[...]

@cosmicBboy Thanks! This is actually a very helpful breakdown. I also briefly chatted with @datajoely earlier today about this (should've updated the issue), but it seems he's taken a look at the Polars backend more recently, and will push up a branch with some things he'd been trying. As a first step, it may make sense to update this PR to leverage some of the generalized functionality more; since I wrote this last December, I referenced the polars-dev branch a lot, but it was a WIP and things must have changed. I just haven't gotten a chance to look again yet. :)

I will have some time to dedicate to this once I get the stuff I'm currently working on out the door, hopefully by later this month. 🤞 But also happy to try to unblock @datajoely if anything is there I can do sooner.

@datajoely
Copy link

Excellent - super helpful :)

@datajoely
Copy link

So I've raised #1651 in draft - it would be great to get some thoughts whether it's smarter to mirror the quite complicated Polars approach or to continue with this basic approach.

@deepyaman
Copy link
Contributor Author

FYI I am working on rebasing this to main today and leveraging the new constructs mentioned! Hope to push something up soon.

@cosmicBboy
Copy link
Collaborator

Amazing @deepyaman! let me know if you need any help

@deepyaman
Copy link
Contributor Author

Amazing @deepyaman! let me know if you need any help

@cosmicBboy Thanks! Can you update unionai-oss:ibis-dev to main?

Also, would it be possible to set up ibis-dev to run the checks, at least for Ibis, if we're using that as the "main" while building out Ibis integration?

@cosmicBboy
Copy link
Collaborator

Amazing! I just updated the ibis-dev branch to catch up with main

@deepyaman deepyaman force-pushed the ibis-dev branch 2 times, most recently from ec1c6fc to 31a3fc9 Compare June 20, 2024 14:25
@deepyaman
Copy link
Contributor Author

Seems I hadn't fully rebased onto latest main before? In any case, done now, and the happy/unhappy paths from #1451 (comment) are still passing.

Let me add some more tests; other than that, I think I need to make sure Ibis is included in the requirements, and I was running into a few issues running mypy locally (getting a bunch of errors on other parts of the codebase), but should be almost ready to be merged. 🤞

I've also talked briefly to @datajoely. My suggestion was:

Maybe you can look at implementing the functionality corresponding to tests/polars/test_polars_dtypes.py, around coercion? I haven't really looked into it. Can start with the numeric types (I've only added a couple of numeric types so far).

I was planning to try adding a non-dtype check next myself. Once that machinery is there, I think can also parallelize implementation of checks.

@cosmicBboy if you have a better suggestion for where to get started/don't think coercion is a good place, I think that would be very much welcome.

@deepyaman deepyaman changed the base branch from main to ibis-dev June 25, 2024 21:04
@deepyaman deepyaman changed the base branch from ibis-dev to main June 26, 2024 05:46
@deepyaman deepyaman force-pushed the ibis-dev branch 2 times, most recently from 553af04 to 6578bab Compare June 26, 2024 06:24
@deepyaman deepyaman changed the base branch from main to ibis-dev June 26, 2024 13:46
@deepyaman deepyaman changed the base branch from ibis-dev to main June 26, 2024 21:29
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@deepyaman
Copy link
Contributor Author

FYI @cosmicBboy I rebased this onto latest main (to get CI to run). The main commit I rebased on top of is a041e06.

Looks like Pandera still supports 3.8, despite following NEP-29 (which is fine). However, Ibis dropped 3.8 support a while back, so ibis-framework resolves to 5.1.0 on 3.8—a very old version I don't plan to build for. Shall we skip Ibis tests on 3.8?

I think this is in a reviewable (potentially mergeable?) state. I will take a look if I can add some more tests. Would it be possible to rebase-merge this to ibis-dev instead of squashing? Have tried to maintain a clean history, and I think it may help in trying to debug some decisions as continue development on the ibis-dev branch.

@deepyaman deepyaman changed the base branch from main to ibis-dev June 26, 2024 21:55
@deepyaman deepyaman changed the base branch from ibis-dev to main June 26, 2024 21:56
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Jun 27, 2024

Thanks @deepyaman, I'll take a look at this PR next week.

Shall we skip Ibis tests on 3.8?

Yes!

Looks like Pandera still supports 3.8, despite following NEP-29 (which is fine)

Yes, I've been meaning to drop 3.8 support, maybe this happens when ibis-dev is merged onto main when it's ready for a beta release?

Would it be possible to rebase-merge this to ibis-dev instead of squashing?

Yep, we can do this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants