Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handling single string constraints #670

Merged
merged 4 commits into from Nov 10, 2021
Merged

Correctly handling single string constraints #670

merged 4 commits into from Nov 10, 2021

Conversation

rbngz
Copy link
Contributor

@rbngz rbngz commented Nov 5, 2021

Hello

This PR should fix #591

Might also need to check the deserialization, but it seems that this part is correctly handled.

To verify:

Passing the following json to the from_frictionless_schema()-function:

import pandas as pd
from pandera.io import from_frictionless_schema

FRICTIONLESS_SCHEMA = {
    "fields": [{
        "name": "foo",
        "constraints": {
            "maxLength": 20
        },
        "type": "string"
    }]
}

schema = from_frictionless_schema(FRICTIONLESS_SCHEMA)

The frictionless schema is correctly parsed to a pandera schema:

schema.columns['foo'].checks

Output: [<Check str_length: str_length(None, 20)>]

However, serializing the schema to YAML using the to_yaml()-function yielded the following output:

schema_type: dataframe
version: 0.7.0
columns:
  foo:
    dtype: string[python]
    nullable: true
    checks:
      str_length: 20
    allow_duplicates: true
    coerce: true
    required: true
    regex: false
checks: null
index: null
coerce: true
strict: true

With proposed changes:

schema_type: dataframe
version: 0.7.0
columns:
  foo:
    dtype: string[python]
    nullable: true
    checks:
      str_length: 
        max_value: 20
    allow_duplicates: true
    coerce: true
    required: true
    regex: false
checks: null
index: null
coerce: true
strict: true

@cosmicBboy
Copy link
Collaborator

hi @rbngz thanks for the PR!

The change you proposed will break the serialization behavior for unary checks, see here: https://github.com/pandera-dev/pandera/runs/4118556698?check_suite_focus=true#step:9:576

So I think the more systemic fix for this issue is to delete this line over here:
https://github.com/pandera-dev/pandera/blob/dev/pandera/checks.py#L56

            check.statistics = {
                stat: args_dict.get(stat)
                for stat in statistics_args
                if args_dict.get(stat) is not None  # << delete this right here
            }

The problem is that statistics that are None will not be included in the check.statistics attribute. By deleting that line all statistics for multi-arg checks will be included.

You'll have to update test_io.py since a bunch of the serialized checks will change (including fixing the str_length issue). You can do pytest tests/core/test_io.py to make sure tests are happy

@rbngz
Copy link
Contributor Author

rbngz commented Nov 8, 2021

Hi @cosmicBboy thank you for the helpful suggestion!

Unfortunately, when removing the line that you pointed out, the check_statistics where args_dict.get(stat) is None for the in_range check will also be displayed in the resulting yaml. (Not sure if this is desired)

checks:
    in_range:
        min_value: 10
        max_value: 99
        include_min: null
        include_max: null

Therefore, I needed to exclude the in_range check from the proposed changes. (Unfortunately not as systemic as the initial proposal)

Let me know what you think :)

@cosmicBboy
Copy link
Collaborator

hi @rbngz

checks:
    in_range:
        min_value: 10
        max_value: 99
        include_min: null
        include_max: null

This is actually the correct behavior, no need to exclude in_range https://github.com/pandera-dev/pandera/pull/670/files#diff-833d2db138a3428cfce92ef7d355be247af30b170c32e778859498bd226615d7R56

Reading/writing more concise yaml format can be part of another PR, as I agree that keys with null bloats the yaml output, but for now it's better to have fully-specified yaml than the incorrect serialization/deserialization behavior.

But the unit tests will need to be updated so that the all the kwargs are included in the yaml file.

@codecov
Copy link

codecov bot commented Nov 9, 2021

Codecov Report

Merging #670 (30f64aa) into dev (c786f67) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##              dev     #670   +/-   ##
=======================================
  Coverage   98.85%   98.85%           
=======================================
  Files          30       30           
  Lines        3398     3398           
=======================================
  Hits         3359     3359           
  Misses         39       39           
Impacted Files Coverage Δ
pandera/checks.py 98.50% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c786f67...30f64aa. Read the comment docs.

@cosmicBboy
Copy link
Collaborator

awesome, thanks for the contribution @rbngz!

@cosmicBboy cosmicBboy merged commit c6d234b into unionai-oss:dev Nov 10, 2021
cosmicBboy pushed a commit that referenced this pull request Nov 11, 2021
* correctly handling single string contraints

* formatted changes with black

* Switch to a more systemic fix | update tests

* including null values for in_range checks

Co-authored-by: Gonzalez Avilés, Ruben <ruben.gonzalez_aviles@baloise.com>
@rbngz rbngz deleted the bugfix/591 branch November 11, 2021 07:50
cosmicBboy pushed a commit that referenced this pull request Nov 11, 2021
* correctly handling single string contraints

* formatted changes with black

* Switch to a more systemic fix | update tests

* including null values for in_range checks

Co-authored-by: Gonzalez Avilés, Ruben <ruben.gonzalez_aviles@baloise.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants