Custom check erroneously passes when validating pl.LazyFrame #1566

Closed
2 of 3 tasks
philiporlando opened this issue Apr 11, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@philiporlando
Contributor

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

I've created a custom check function that should never return True based on my sample data. However, pandera does not raise an error when validating the fruit column. This may be related to #1565.

import polars as pl
import pandera.polars as pa


# Custom check function
def check_len(v: str) -> bool:
    return len(v) == 20

schema = pa.DataFrameSchema(
    {
        "fruit": pa.Column(
            dtype=str,
            checks=pa.Check(check_len, element_wise=True),
        ),
    }
)

lf = pl.LazyFrame(
    {
        "fruit": ["apple", "pear", "banana"],
    }
)

lf.pipe(schema.validate).collect()
# shape: (3, 1)
# ┌────────┐
# │ fruit  │
# │ ---    │
# │ str    │
# ╞════════╡
# │ apple  │
# │ pear   │
# │ banana │
# └────────┘

Converting from LazyFrame to DataFrame before performing the schema validation appears to raise the expected error:

import polars as pl
import pandera.polars as pa


# Custom check function
def check_len(v: str) -> bool:
    return len(v) == 20

schema = pa.DataFrameSchema(
    {
        "fruit": pa.Column(
            dtype=str,
            checks=pa.Check(check_len, element_wise=True),
        ),
    }
)

lf = pl.LazyFrame(
    {
        "fruit": ["apple", "pear", "banana"],
    }
)

df = lf.collect()
df.pipe(schema.validate)
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:74: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   passed = check_result.check_passed.collect().item()
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:88: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   failure_cases = check_result.failure_cases.collect()
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:112: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   check_output=check_result.check_output.collect(),
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "C:\local\.venv\Lib\site-packages\polars\dataframe\frame.py", line 5150, in pipe
#     return function(self, *args, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\api\polars\container.py", line 58, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 114, in validate
#     error_handler.collect_error(
#   File "C:\local\.venv\Lib\site-packages\pandera\api\base\error_handler.py", line 54, in collect_error
#     raise schema_error from original_exc
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 182, in run_schema_component_checks
#     result = schema_component.validate(check_obj, lazy=lazy)
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\api\polars\components.py", line 141, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 81, in validate
#     error_handler = self.run_checks_and_handle_errors(
#                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 147, in run_checks_and_handle_errors
#     error_handler.collect_error(
#   File "C:\local\.venv\Lib\site-packages\pandera\api\base\error_handler.py", line 54, in collect_error
#     raise schema_error from original_exc
# pandera.errors.SchemaError: Column 'fruit' failed validator number 0: <Check check_len> failure case examples: [{'fruit': 'apple'}, {'fruit': 'pear'}, {'fruit': 'banana'}]

Expected behavior

I would expect to see a schema validation error raised with the LazyFrame here since none of the fruit values have a string length of 20 characters.

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Chrome
  • Version: pandera==0.19.0b1
@philiporlando philiporlando added the bug Something isn't working label Apr 11, 2024
@cosmicBboy
Collaborator

cosmicBboy commented Apr 11, 2024

See the docs here https://pandera.readthedocs.io/en/latest/polars.html#error-reporting

This is intended behavior: LazyFrame validation will only do schema-level checks (so as not to materialize the data in a lazy method chain). Currently, pandera assumes that all custom checks operate on data. You can force data-level checks by explicitly setting export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA.
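
A minimal sketch of that opt-in from Python rather than the shell. It assumes the variable is named PANDERA_VALIDATION_DEPTH (as in the linked pandera docs) and that pandera reads it when first imported, which is why the imports are deferred until after it is set. The function name validate_fruit_lazily is made up for illustration.

```python
import os

# Assumption: pandera reads this variable once, at import time, so it has to
# be set before pandera is imported anywhere in the process.
os.environ["PANDERA_VALIDATION_DEPTH"] = "SCHEMA_AND_DATA"


def validate_fruit_lazily():
    """Validate the LazyFrame from the report with data-level checks enabled."""
    # Deferred imports so the environment variable above is already in place.
    import polars as pl
    import pandera.polars as pa

    schema = pa.DataFrameSchema(
        {
            "fruit": pa.Column(
                dtype=str,
                checks=pa.Check(lambda v: len(v) == 20, element_wise=True),
            ),
        }
    )
    lf = pl.LazyFrame({"fruit": ["apple", "pear", "banana"]})
    # With data-level checks switched on, this call is expected to raise
    # pandera.errors.SchemaError rather than return the frame unchanged.
    return lf.pipe(schema.validate).collect()
```

If pandera has already been imported elsewhere in the process, setting the variable this late will probably have no effect, so exporting it in the shell before starting Python is the safer route.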

@cosmicBboy
Collaborator

Is this a duplicate of #1565?

@philiporlando
Contributor Author

See the docs here https://pandera.readthedocs.io/en/latest/polars.html#error-reporting

This is intended behavior: LazyFrame validation will only do schema-level checks (so as not to materialize the data in a lazy method chain). Currently, pandera assumes that all custom checks operate on data. You can force data-level checks by explicitly setting export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA.

This is super helpful and makes total sense. Thanks for the feedback.

@philiporlando
Contributor Author

Is this a duplicate of #1565?

I don't think so. The error that I'm experiencing in #1565 is specific to pl.DataFrame.

@cosmicBboy
Collaborator

Gotcha, yeah looks like a bug, looking.

@cosmicBboy
Collaborator

@philiporlando would it make sense to add some logging at validation time to explicitly say what types of checks are being run? If so, would it make sense as logging.info, debug or something else?

@philiporlando
Contributor Author

@philiporlando would it make sense to add some logging at validation time to explicitly say what types of checks are being run? If so, would it make sense as logging.info, debug or something else?

I'm in favor of this! At the very least, I think it would be helpful to communicate which data-level checks are ignored whenever a LazyFrame is validated instead of a DataFrame. It might even make sense to log a warning here?
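
A rough sketch of what such a warning could look like, using only the standard library. CheckInfo and select_checks are hypothetical names invented for this illustration and are not part of pandera. The sketch assumes each check can be tagged as schema-level or data-level up front.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("validation_sketch")


@dataclass(frozen=True)
class CheckInfo:
    """Hypothetical stand-in for a registered check; not a pandera class."""

    name: str
    needs_data: bool  # True if the check must inspect actual values


def select_checks(checks: list[CheckInfo], is_lazy: bool) -> list[CheckInfo]:
    """Return the checks to run, warning about any that are skipped.

    Assumption: for a lazy input only schema-level checks run, so every
    data-level check is skipped and reported in a single warning.
    """
    if not is_lazy:
        return list(checks)
    skipped = [c.name for c in checks if c.needs_data]
    if skipped:
        logger.warning(
            "Skipping data-level checks on lazy input: %s", ", ".join(skipped)
        )
    return [c for c in checks if not c.needs_data]


checks = [
    CheckInfo("dtype", needs_data=False),
    CheckInfo("check_len", needs_data=True),
]
to_run = select_checks(checks, is_lazy=True)  # only the dtype check remains
```

A warning keeps the skipped checks visible under Python's default logging configuration, whereas info or debug messages would stay hidden unless the user lowers the log level.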

@philiporlando
Contributor Author

Gotcha, yeah looks like a bug, looking.

Thank you for looking into it!
