Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandera Polars datatype 'check' method is not provided a 'data_container' #1616

Closed
3 tasks done
cstabnick-datasense opened this issue May 6, 2024 · 6 comments
Closed
3 tasks done
Labels
bug Something isn't working

Comments

@cstabnick-datasense
Copy link

Describe the bug
When registering a custom datatype in the polars_engine, the check function is only able to do dtype validation, not validation across the data_container like we are able to do in the example here: https://pandera.readthedocs.io/en/stable/dtypes.html#logical-data-types.

I am curious if this is intentional, or if this validation would be a desired enhancement. If so, I would be happy to create a PR for this change. Thank you.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

from typing import Optional, Union, Iterable

import pandera.polars as pa
import polars as pl
from pandera import dtypes
from pandera.engines import polars_engine
from pandera.engines.polars_engine import PolarsDataContainer


@polars_engine.Engine.register_dtype
class MyLeadingIDStringType(polars_engine.Object):
    def check(self, pandera_dtype: dtypes.DataType, data_container: Optional[PolarsDataContainer] = None) -> Union[bool, Iterable[bool]]:
        if data_container is not None:
            # unreachable, data_container is always None
            print("data_container is not None")
            if key := data_container.key:
                if (
                    len(data_container.collect().select(pl.col(key).str.starts_with("id_").arg_true()))
                    == data_container.select(pl.col(key).len()).collect().item()
                ):
                    return True
                else:
                    return False
            else:
                raise NotImplementedError("Dataframe case unhandled")
        else:
            return True


class MySchema(pa.DataFrameModel):
    id: MyLeadingIDStringType


data = pl.DataFrame({"id": ["id_1", "id_2", "id_3", "id_4"]})
x = MySchema.validate(data.lazy())
print(x.collect())

Expected behavior

I am expecting my 'check' method to be provided a reference to the data_container when working with the polars_engine, similar to how the pandas implementation works.

Desktop (please complete the following information):

  • OS: Windows 11
  • Browser:
  • Version: pandera==0.19.0b0
  • Python==3.11.5
  • polars==0.20.16

Screenshots

If applicable, add screenshots to help explain your problem.

Additional context

This is because the call to our schema.dtype.check is not provided the check_obj here:

passed=schema.dtype.check(obj_dtype),

Where the pandas implementation is providing this check_obj here
dtype_check_results = schema.dtype.check(
Engine.dtype(check_obj.dtype),
check_obj,
)

 

@cstabnick-datasense cstabnick-datasense added the bug Something isn't working label May 6, 2024
@cosmicBboy
Copy link
Collaborator

this is a bug! looking into it

@cosmicBboy
Copy link
Collaborator

Okay, so this should fix it:

from pandera.api.polars.types import PolarsData

...
              CoreCheckResult(
                    passed=schema.dtype.check(
                        obj_dtype,
                        PolarsData(check_obj_subset, schema.selector),
                    ),
           ...
             )

Would you be able to open up a PR with some unit tests?

@cstabnick-datasense
Copy link
Author

Yes, I will be happy to do so later this evening. Thank you very much

@cstabnick
Copy link
Contributor

@cosmicBboy I have created the pr for this change.
#1622
Please let me know if you would like any changes.
Thank you

@cstabnick
Copy link
Contributor

cstabnick commented May 6, 2024

@cosmicBboy

Apologies if you clicked that while it was deleted, recreated to address the signoff requirement #1623

@cosmicBboy
Copy link
Collaborator

fixed by #1623

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants