Bugfix/882 coercing twice #901

Merged
merged 12 commits into from
Aug 10, 2022

Conversation

ng-henry
Contributor

@ng-henry ng-henry commented Aug 2, 2022

This fixes the double-coercion bug described in #882 and reports all errors even if coercion fails.

@cosmicBboy About your message here:

In summary, if a column cannot be coerced to the intended type, in this case float, pandera won't apply any of the downstream Checks to that column (which is why the nullability check is not picked up in failure_cases).
The reasoning is that if the column is not even of the correct type, then it's reasonable to assume that the validation checks (which assume that type) wouldn't work on that column.

I think that even if some values cannot be coerced to the correct type, we can still do downstream checks. For example, unique and nullable checks can still be applied.

This PR starts to implement some of these features and also fixes the double-coercion bug. It applies downstream checks even if coercion fails. A better approach (TODO) is applying downstream checks only for cells that pass coercion. Thus, it's guaranteed that downstream checks receive the correct type.

If you give the go-ahead, then I can start implementing the TODO.
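
A minimal sketch of what the TODO could look like for a single column, using plain pandas rather than pandera internals. The sample data, the variable names, and the `failure_cases` dictionary layout are assumptions made for illustration; they are not the library's actual implementation or report format.

```python
import pandas as pd

# Hypothetical raw column: "a" cannot be coerced to float, None is a null.
raw = pd.Series(["1.0", "a", None, "1.1", "3.8"], dtype=object)

# errors="coerce" turns uncoercible cells into NaN instead of raising.
coerced = pd.to_numeric(raw, errors="coerce")

# A cell failed coercion if it was non-null before and is null afterwards.
failed_coercion = raw.notna() & coerced.isna()
null_cells = raw.isna()

# Value-level checks only see cells that coerced cleanly,
# so they are guaranteed to receive floats.
valid = coerced[~failed_coercion & ~null_cells]
isin_failures = valid[~valid.isin([1.0, 2.0, 3.0])]

# Illustrative report layout (an assumption, not pandera's format).
failure_cases = {
    "coerce_dtype('float64')": raw[failed_coercion].tolist(),
    "not_nullable": raw[null_cells].tolist(),
    "isin({1.0, 2.0, 3.0})": isin_failures.tolist(),
}
```

With this split, the nullability check and the value check each report their own failures, and the uncoercible cell is reported once as a coercion failure instead of suppressing the other checks.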

@ng-henry ng-henry force-pushed the bugfix/882-coercing-twice branch from acbdb8f to 7c92917 Compare August 2, 2022 19:28
@@ -948,6 +948,7 @@ def test_frictionless_schema_parses_correctly(frictionless_schema):
{"check": "not_nullable", "failure_case": "NaN"},
{"check": "isin({1.0, 2.0, 3.0})", "failure_case": 1.1},
{"check": "isin({1.0, 2.0, 3.0})", "failure_case": 3.8},
{"check": "dtype('float64')", "failure_case": "object"},
Collaborator

what's this change for?

Contributor Author

Because we no longer run coerce in schema_components, validate_column now runs here.

This means values that failed coercion are passed to validate_column. The error is caught here because the dtype is wrong.

It adds one more entry to the error list, which is why the test now expects that additional error.
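
A rough illustration of why the extra `dtype('float64')` failure case shows up, written with plain pandas. This is a sketch of the behaviour described above, not the code in `schemas.py` or `schema_components.py`; the sample column and the `errors` list are assumptions.

```python
import pandas as pd

# Hypothetical column: the string "a" cannot be coerced to float.
column = pd.Series(["1.0", "a", None], dtype=object, name="col")

try:
    checked = column.astype("float64")
except ValueError:
    # Coercion failed, so the original object-dtype column is validated as-is.
    checked = column

errors = []
# The dtype check now sees an object column and records one more failure.
if checked.dtype != "float64":
    errors.append(
        {"check": "dtype('float64')", "failure_case": str(checked.dtype)}
    )
# Checks that do not depend on the dtype still run.
if checked.isna().any():
    errors.append({"check": "not_nullable", "failure_case": "NaN"})
```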

@cosmicBboy
Collaborator

cosmicBboy commented Aug 5, 2022

I think that even if some values cannot be coerced to the correct type, we can still do downstream checks. For example, unique and nullable checks can still be applied.

A better approach (TODO) is applying downstream checks only for cells that pass coercion. Thus, it's guaranteed that downstream checks receive the correct type.

Agreed that this will improve UX and efficiency! The main concern here is that it introduces additional complexity in the validation routine. Among other things, it would need to:

  • partition coerced and uncoerced cells in the dataframe, handling cases where coerced cells across columns are potentially unaligned.
  • pass only coerced values to downstream checks, handling both column- and dataframe-level checks
  • ensure that the failure case reporting takes account of all of this complexity
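
To make the three requirements above concrete, here is a small pandas sketch. The column names, data, and the shape of `failure_cases` are hypothetical, and this is only one possible way to partition cells, not a proposal for the actual validation routine.

```python
import pandas as pd

# Hypothetical dataframe: row 1 fails for "price", row 2 fails for "qty".
df = pd.DataFrame(
    {
        "price": ["1.5", "oops", "3.0", None],
        "qty": ["2", "4", "x", "8"],
    },
    dtype=object,
)

# Coerce column by column; uncoercible cells become NaN.
coerced = df.apply(pd.to_numeric, errors="coerce")

# Cell-level mask of coercion failures. The masks differ per column,
# so the coerced cells are not aligned across columns.
failed = df.notna() & coerced.isna()

# Column-level checks: each column drops only its own failed cells.
column_inputs = {name: coerced.loc[~failed[name], name] for name in df.columns}

# Dataframe-level checks: keep only rows where every column coerced.
frame_input = coerced.loc[~failed.any(axis=1)]

# Failure-case reporting: one record per cell that failed coercion.
failure_cases = [
    {"index": idx, "column": col, "failure_case": df.at[idx, col]}
    for col in df.columns
    for idx in df.index[failed[col].to_numpy()]
]
```

Even in this toy version, column-level and dataframe-level checks receive differently shaped inputs, which hints at the bookkeeping the failure-case report would need to carry.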

I think this will be a lot easier to reason about and implement with the overhauled core API:
https://github.com/unionai-oss/pandera/tree/core-schema

The main idea is that the pandera.core subpackage is the schema specification, while the pandera.backends subpackage actually implements the validation logic.

For example, here's the DataFrameSchemaBackend, which implements a validate method and a few core parsers and checks as methods: https://github.com/unionai-oss/pandera/blob/core-schema/pandera/backends/pandas/container.py#L16.
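
As a rough picture of that split, the sketch below separates a specification object from a backend that validates against it. `ColumnSpec` and `ListBackend` are hypothetical names invented for this example; they are not classes from the core-schema branch, and the real `DataFrameSchemaBackend` may look quite different.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnSpec:
    """Specification only: describes a column and performs no validation."""

    name: str
    dtype: type
    nullable: bool = False


class ListBackend:
    """Validation logic for one kind of container (here, a plain list)."""

    def validate(self, spec: ColumnSpec, values: list) -> List[dict]:
        errors = []
        for i, value in enumerate(values):
            if value is None:
                # Nullability is checked independently of the dtype.
                if not spec.nullable:
                    errors.append(
                        {"index": i, "check": "not_nullable", "failure_case": value}
                    )
            elif not isinstance(value, spec.dtype):
                errors.append(
                    {
                        "index": i,
                        "check": f"dtype({spec.dtype.__name__})",
                        "failure_case": value,
                    }
                )
        return errors


spec = ColumnSpec(name="price", dtype=float)
errors = ListBackend().validate(spec, [1.0, None, "a"])
```

Because the specification carries no logic, a different backend could validate the same `ColumnSpec` against another container type without changing the spec.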

If you can create another issue for the problem of applying checks to an uncoercible column, we can try fleshing out a solution there, working off of the core-schema branch.

@codecov

codecov bot commented Aug 7, 2022

Codecov Report

Merging #901 (2151741) into dev (56265ff) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #901      +/-   ##
==========================================
- Coverage   97.41%   97.38%   -0.03%     
==========================================
  Files          43       43              
  Lines        4171     4172       +1     
==========================================
  Hits         4063     4063              
- Misses        108      109       +1     
Impacted Files Coverage Δ
pandera/schemas.py 99.24% <100.00%> (+<0.01%) ⬆️
pandera/schema_components.py 99.06% <0.00%> (-0.47%) ⬇️

@ng-henry
Contributor Author

ng-henry commented Aug 7, 2022

@cosmicBboy Not sure why some of these tests are failing here. I'm not too familiar with Dask and why it prints "2 graph layers" instead of "4 tasks" even though the dataframe is completely the same.

Expected:
    Dask DataFrame Structure:
                    state    city  price
    npartitions=2
    0              object  object  int64
    3                 ...     ...    ...
    5                 ...     ...    ...
    Dask Name: validate, 4 tasks
Got:
    Dask DataFrame Structure:
                    state    city  price
    npartitions=2                       
    0              object  object  int64
    3                 ...     ...    ...
    5                 ...     ...    ...
    Dask Name: validate, 2 graph layers

@cosmicBboy
Collaborator

Yeah @ng-henry, this isn't related to your changes. I just fixed this in #877; if you rebase on dev, the errors should be handled.

@cosmicBboy cosmicBboy force-pushed the bugfix/882-coercing-twice branch from 6091e3e to d2b39d7 Compare August 9, 2022 23:48
@ng-henry ng-henry force-pushed the bugfix/882-coercing-twice branch from df6b1e2 to 2151741 Compare August 10, 2022 12:36
@cosmicBboy cosmicBboy merged commit 13312a1 into unionai-oss:dev Aug 10, 2022
cosmicBboy added a commit that referenced this pull request Aug 10, 2022
* fix datetime strategy

* comment out coercing twice part

* add float64 error to test

* fix formatting

* implement fixes

* Update strategies.py

* fix datetime strategy

* comment out coercing twice part

* fix formatting

* implement fixes

* Update strategies.py

* ignore mypy

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>