
Add add_missing_columns DataFrame schema config per enhancement #687 #1186

Merged: 10 commits merged into unionai-oss:main from feature/687 on Jun 30, 2023

Conversation

@derinwalters (Contributor) commented May 13, 2023

I took a crack at adding the 'add_missing_columns' DataFrame schema config, including tests. One thing I could use some advice on: I was getting missing-attribute pylint issues in pandera/backends/pandas/container.py before I even changed anything. They look related to the 'schema' method argument not having a typing annotation, so the referenced column schema attributes 'required', 'dtype', and 'coerce' are unknown. The typing annotation surrounding this variable looks rather complicated and I was afraid to make changes, so I marked the lines with "# type: ignore[union-attr]" in this pull request.

import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={i: Column(int, default=9) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)

print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))

# Output:
#    a  b  c
# 0  1  9  3
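
On the pylint/mypy note above: a minimal standalone illustration of the kind of union-attr error that an inline "# type: ignore[union-attr]" comment suppresses. The function and annotation here are hypothetical, not the actual code in pandera/backends/pandas/container.py.

from typing import Optional

import pandas as pd

def describe_dtype(schema: Optional[pd.Series]) -> str:
    # `schema` may be None under this annotation, so a strict type checker
    # flags the attribute access below (error code: union-attr) unless it is
    # guarded or suppressed with an inline ignore comment.
    return str(schema.dtype)  # type: ignore[union-attr]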

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
@cosmicBboy closed this Jun 14, 2023
@cosmicBboy reopened this Jun 14, 2023

codecov bot commented Jun 14, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.05% 🎉

Comparison is base (26470c6) 97.25% compared to head (33645c0) 97.31%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1186      +/-   ##
==========================================
+ Coverage   97.25%   97.31%   +0.05%     
==========================================
  Files          65       65              
  Lines        5106     5140      +34     
==========================================
+ Hits         4966     5002      +36     
+ Misses        140      138       -2     
Impacted Files                          Coverage Δ
pandera/api/pandas/model.py             92.25% <ø> (ø)
pandera/backends/pandas/base.py         100.00% <ø> (ø)
pandera/io/pandas_io.py                 100.00% <ø> (ø)
pandera/api/pandas/container.py         99.26% <100.00%> (+<0.01%) ⬆️
pandera/api/pandas/model_config.py      100.00% <100.00%> (ø)
pandera/backends/pandas/array.py        98.38% <100.00%> (+0.02%) ⬆️
pandera/backends/pandas/container.py    98.54% <100.00%> (+0.17%) ⬆️
pandera/errors.py                       100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes


@cosmicBboy (Collaborator) left a comment

Hey @derinwalters, thanks for this PR. I'm just getting to review it now; the last few weeks have been super busy! This looks great overall, and I can take a look at the pylint thing.

One question I have based on the docstring:

add missing column names with either default value, if specified in column schema, or NaN if column is nullable

What happens in the case that the default value is None and the column is not nullable? It may be best to raise a SchemaInitError exception in this case, since we know beforehand that this is an invalid configuration.
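
A minimal sketch of the kind of construction-time check being suggested here (illustrative only, with a hypothetical helper name; not pandera's implementation):

from pandera.errors import SchemaInitError

def _check_add_missing_columns_config(columns, add_missing_columns):
    # Hypothetical helper: reject schemas where a missing column could never
    # be back-filled, because it has neither a default nor nullable=True.
    if not add_missing_columns:
        return
    for name, column in columns.items():
        if column.default is None and not column.nullable:
            raise SchemaInitError(
                f"column '{name}' requires a default value or nullable=True "
                "when add_missing_columns is enabled"
            )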

…ng_columns is enabled and non-nullable columns without a default are added, per enhancement #687

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
@derinwalters (Contributor, Author) commented Jun 16, 2023

Fantastic idea for SchemaInitError. I added to pandera/api/pandas/container.py along with related tests. How about this?

import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={i: Column(int, default=9) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output:
#    a  b  c
# 0  1  9  3

schema = DataFrameSchema(
    columns={i: Column(int, nullable=True) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output:
#    a    b  c
# 0  1  NaN  3

schema = DataFrameSchema(
    columns={i: Column(int, nullable=False) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
# Output:
# pandera.errors.SchemaInitError: column 'a' requires a default value when non-nullable add_missing_columns is enabled
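
A hypothetical pytest sketch of the kind of test added alongside this change (not the PR's actual test code; note this construction-time behavior is revisited in the next comment):

import pytest
from pandera import Column, DataFrameSchema
from pandera.errors import SchemaInitError

def test_missing_column_without_default_or_nullable_is_rejected():
    # A non-nullable column without a default can never be back-filled,
    # so schema construction should fail up front.
    with pytest.raises(SchemaInitError):
        DataFrameSchema(
            columns={"a": Column(int, nullable=False)},
            add_missing_columns=True,
        )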

@derinwalters (Contributor, Author)

@cosmicBboy actually, after thinking about this more, I believe it would be better to dynamically validate each missing column in the backend along with all the other checks; I'd suggest a SchemaError instead. The reason is that doing the check in DataFrameSchema essentially forces an all-or-nothing approach, meaning every column must be declared with either a default value or as nullable. At least in my use case, which I believe would be common, there are some columns where I would be okay with adding a missing column using a default value, and others where I would not and would want a missing-column exception raised instead. Does this make sense?
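
A rough sketch of the per-column, backend-level approach described above (hypothetical function and names; not the code in this PR):

import pandas as pd
from pandera.errors import SchemaError

def fill_missing_columns(check_obj: pd.DataFrame, schema) -> pd.DataFrame:
    # Hypothetical parser: handle each declared-but-missing column on its own,
    # so only columns that truly cannot be filled raise an error.
    for name, column in schema.columns.items():
        if name in check_obj.columns:
            continue
        if column.default is not None:
            check_obj[name] = column.default
        elif column.nullable:
            check_obj[name] = pd.NA
        else:
            raise SchemaError(
                schema,
                check_obj,
                f"column '{name}' requires a default value when non-nullable "
                "add_missing_columns is enabled",
            )
    return check_obj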

…dd_missing_columns is enabled and non-nullable columns without a default are added, per enhancement #687"

This reverts commit 2a0ef1c.
…issing non-nullable columns without a default are found, per enhancement #687

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
@derinwalters (Contributor, Author) commented Jun 16, 2023

I reverted the SchemaInitError from this pull request in favor of the SchemaError approach. Testing is doing what I expected except for the fact that column "b" in the second validation example is getting assigned to 9, which is the default value for column "a", and not NaN as I expected. I confirmed that the add_missing_columns parser is returning the correct dataframe with NaN but the value is getting modified downstream in run_schema_component_checks. Please give me a little time to investigate before we proceed further.

import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={
        "a": Column(int, default=9),
        "b": Column(int, nullable=True),
        "c": Column(int),
    },
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[2, 3]], columns=["b", "c"])))
# Output
#    a  b  c
# 0  9  2  3

print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output
#    a  b  c
# 0  1  9  3

print(schema.validate(pd.DataFrame(data=[[1, 2]], columns=["a", "b"])))
# Output
# pandera.errors.SchemaError: column 'c' in DataFrameSchema {'a': <Schema Column(name=a, type=DataType(int64))>, 'b': <Schema Column(name=b, type=DataType(int64))>, 'c': <Schema Column(name=c, type=DataType(int64))>} requires a default value when non-nullable add_missing_columns is enabled
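
For context, a minimal pandas illustration of the whole-frame fill pattern that produces this kind of default leakage (hypothetical, not pandera's actual code):

import pandas as pd

df = pd.DataFrame({"a": [1.0], "b": [float("nan")]})

# Buggy pattern: filling defaults across the whole DataFrame leaks column
# "a"'s default of 9 into column "b"'s missing entries as well.
leaky = df.fillna(9)

# Scoped alternative: fill only the column that owns the default.
scoped = df.copy()
scoped["a"] = scoped["a"].fillna(9)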

… fills the entire dataframe, even in unrelated columns, per issue #1193

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
@derinwalters (Contributor, Author)

Turns out the strange default assignment mentioned above is related to issue #1193. I included a fix for that too. Now we have something that behaves as it should.

import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={
        "a": Column(int, default=9),
        "b": Column(pd.Int64Dtype(), nullable=True),
        "c": Column(int, nullable=True),
        "d": Column(int),
    },
    strict=True,
    coerce=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[2, 3, 4]], columns=["b", "c", "d"])))
# Output
#    a  b  c  d
# 0  9  2  3  4

print(schema.validate(pd.DataFrame(data=[[1, 3, 4]], columns=["a", "c", "d"])))
# Output
#    a     b  c  d
# 0  1  <NA>  3  4

try:
    print(schema.validate(pd.DataFrame(data=[[1, 2, 4]], columns=["a", "b", "d"])))
except Exception as e:
    print(e)
# Output
# Error while coercing 'c' to type int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type int64:
#    index failure_case
# 0      0          NaN

try:
    print(schema.validate(pd.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "c"])))
except Exception as e:
    print(e)
# Output
# column 'd' in DataFrameSchema {'a': <Schema Column(name=a, type=DataType(int64))>, 'b': <Schema Column(name=b, type=DataType(Int64))>, 'c': <Schema Column(name=c, type=DataType(int64))>, 'd': <Schema Column(name=d, type=DataType(int64))>} requires a default value when non-nullable add_missing_columns is enabled

@cosmicBboy (Collaborator)

Thanks @derinwalters! Looks like there are failing linter and unit tests; check out the contributing guide for instructions on how to run linters/tests locally.

@derinwalters (Contributor, Author)

Hmmm, I read the instructions and have been following them correctly, as far as I know. Perhaps I did something goofy when patching the default value issue and didn't properly check; I'll look again, apologies for that. For lint, is there something else to run other than "pre-commit run --all"? It's showing clean for me:

(pandera-dev) ➜  pandera git:(feature/687) pre-commit run --all

check python ast.........................................................Passed
check for case conflicts.................................................Passed
check for merge conflicts................................................Passed
check yaml...............................................................Passed
debug statements (python)................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mixed line ending........................................................Passed
isort....................................................................Passed
flynt....................................................................Passed
black....................................................................Passed
pylint...................................................................Passed
mypy.....................................................................Passed
- hook id: mypy
- duration: 0.88s

Success: no issues found in 125 source files

@cosmicBboy (Collaborator)

Weird, I'm able to reproduce the CI failures:

❯ pre-commit run --all
check python ast.........................................................Passed
check for case conflicts.................................................Passed
check for merge conflicts................................................Passed
check yaml...............................................................Passed
debug statements (python)................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mixed line ending........................................................Passed
isort....................................................................Passed
flynt....................................................................Passed
black....................................................................Failed
- hook id: black
- files were modified by this hook

reformatted pandera/api/pandas/model.py

All done! ✨ 🍰 ✨
1 file reformatted, 134 files left unchanged.

pylint...................................................................Failed
- hook id: pylint
- exit code: 8

************* Module pandera.api.pandas.container
pandera/api/pandas/container.py:32:4: R0914: Too many local variables (16/15) (too-many-locals)

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, -0.00)

mypy.....................................................................Passed
- hook id: mypy
- duration: 10.1s

Success: no issues found in 125 source files

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
        except SchemaError as exc:
            error_handler.collect_error(exc.reason_code, exc)
    elif schema.coerce:
        check_obj[schema.name] = self.coerce_dtype(

@cosmicBboy (Collaborator) left a comment on the lines above:

we should add a test case for this part of the conditional: basically make sure coercing works in the case the check_obj is not a column/index
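
A hypothetical test sketch for that branch (validating a DataFrame rather than a standalone Column/Index so the coercion path runs); not the test ultimately added in this PR:

import pandas as pd
from pandera import Column, DataFrameSchema

def test_coerce_with_add_missing_columns_on_dataframe():
    schema = DataFrameSchema(
        columns={"a": Column(int, default=9), "b": Column(int)},
        coerce=True,
        add_missing_columns=True,
    )
    # "a" is added from its default and "b" is coerced from strings; both
    # should come out as int64 when validating the full DataFrame.
    validated = schema.validate(pd.DataFrame({"b": ["1"]}))
    assert validated["a"].dtype == "int64"
    assert validated["b"].dtype == "int64"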

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
@cosmicBboy (Collaborator)

This PR's in good shape! Last thing, @derinwalters ... would you mind updating this docs page with a subheader showing how this feature works?

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
@cosmicBboy (Collaborator)

I got to it first 😉 congrats on your first contribution to pandera @derinwalters!

@cosmicBboy merged commit 631089c into unionai-oss:main Jun 30, 2023
41 checks passed
@derinwalters (Contributor, Author)

@cosmicBboy That was incredibly kind of you to close out the remaining issues. I was away from home for a few days. Thank you!

@derinwalters deleted the feature/687 branch July 1, 2023 23:32