Add add_missing_columns DataFrame schema config per enhancement #687 #1186
Conversation
Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
Codecov Report — patch coverage:

```
@@            Coverage Diff             @@
##             main    #1186      +/-   ##
==========================================
+ Coverage   97.25%   97.31%    +0.05%
==========================================
  Files          65       65
  Lines        5106     5140       +34
==========================================
+ Hits         4966     5002       +36
+ Misses        140      138        -2
==========================================
```
Hey @derinwalters, thanks for this PR; just getting to review it right now, the last few weeks have been super busy! This looks great overall; I can take a look at the pylint thing.
One question I have based on the docstring:

> add missing column names with either default value, if specified in column schema, or NaN if column is nullable

What happens in the case that the default value is None and the column is not nullable? It may be best to raise a SchemaInitError exception in this case, as we know beforehand that this is an invalid case.
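The invalid case being described can be sketched in plain Python. The helper name and the dict-shaped column descriptions below are assumptions for illustration, not pandera's API: the point is just that a column with no default and nullable=False can never be filled in automatically, so the schema itself is misconfigured.

```python
# Hypothetical construction-time check (names are illustrative, not pandera's):
# reject any column that could not be auto-added when missing.
def check_add_missing_columns_config(columns: dict) -> None:
    """Raise for columns that have neither a default nor nullable=True."""
    for name, col in columns.items():
        if col.get("default") is None and not col.get("nullable", False):
            raise ValueError(
                f"column {name!r} requires a default value or nullable=True "
                "when add_missing_columns is enabled"
            )

# Valid: every column can be filled one way or another.
check_add_missing_columns_config({"a": {"default": 9}, "b": {"nullable": True}})

# Invalid: 'c' has no default and is not nullable.
try:
    check_add_missing_columns_config({"c": {}})
except ValueError as err:
    print(err)
```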
…ng_columns is enabled and non-nullable columns without a default are added, per enhancement #687 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
Fantastic idea for SchemaInitError. I added it to pandera/api/pandas/container.py along with related tests. How about this?

```python
import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={i: Column(int, default=9) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output:
#    a  b  c
# 0  1  9  3

schema = DataFrameSchema(
    columns={i: Column(int, nullable=True) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output:
#    a    b  c
# 0  1  NaN  3

schema = DataFrameSchema(
    columns={i: Column(int, nullable=False) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
# Output:
# pandera.errors.SchemaInitError: column 'a' requires a default value when non-nullable add_missing_columns is enabled
```
@cosmicBboy actually, after thinking about this more, I believe it would be better to dynamically validate each missing column in the backend along with all the other checks; I'd suggest a SchemaError instead. The reason is that doing the check in DataFrameSchema essentially forces an all-or-nothing approach, meaning all columns must be declared with either a default value or as nullable. At least in my use case, which I believe would be common, there are some columns where I would be okay with adding a missing column using a default value, and some where I would not and would instead want an exception raised for the missing column. Does this make sense?
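The per-column behavior argued for here can be sketched in plain Python. The helper name and dict-shaped column specs are assumptions for illustration, not pandera internals: each missing column is resolved independently, so an error surfaces only when a column that lacks both a default and nullability is actually missing from the data.

```python
import math

def resolve_missing_columns(present: set, schema_columns: dict) -> dict:
    """Return {column: fill value} for absent columns, failing per column."""
    fills = {}
    for name, col in schema_columns.items():
        if name in present:
            continue  # column already in the dataframe; nothing to do
        if col.get("default") is not None:
            fills[name] = col["default"]
        elif col.get("nullable", False):
            fills[name] = math.nan
        else:
            raise ValueError(
                f"column {name!r} requires a default value when non-nullable "
                "add_missing_columns is enabled"
            )
    return fills

# Only the missing column 'b' needs a fill; 'a' is present and unconstrained.
print(resolve_missing_columns({"a"}, {"a": {}, "b": {"default": 9}}))
# → {'b': 9}
```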
…issing non-nullable columns without a default are found, per enhancement #687 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
I reverted the SchemaInitError from this pull request in favor of the SchemaError approach. Testing is doing what I expected, except that column "b" in the second validation example is getting assigned 9, which is the default value for column "a", not NaN as I expected. I confirmed that the add_missing_columns parser is returning the correct dataframe with NaN, but the value is getting modified downstream in run_schema_component_checks. Please give me a little time to investigate before we proceed further.

```python
import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={
        "a": Column(int, default=9),
        "b": Column(int, nullable=True),
        "c": Column(int),
    },
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[2, 3]], columns=["b", "c"])))
# Output
#    a  b  c
# 0  9  2  3

print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output
#    a  b  c
# 0  1  9  3

print(schema.validate(pd.DataFrame(data=[[1, 2]], columns=["a", "b"])))
# Output
# pandera.errors.SchemaError: column 'c' in DataFrameSchema {'a': <Schema Column(name=a, type=DataType(int64))>, 'b': <Schema Column(name=b, type=DataType(int64))>, 'c': <Schema Column(name=c, type=DataType(int64))>} requires a default value when non-nullable add_missing_columns is enabled
```
… fills the entire dataframe, even in unrelated columns, per issue #1193 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
Turns out the strange default assignment mentioned above is related to issue #1193. I included a fix for that too. Now we have something that behaves as it should.

```python
import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={
        "a": Column(int, default=9),
        "b": Column(pd.Int64Dtype(), nullable=True),
        "c": Column(int, nullable=True),
        "d": Column(int),
    },
    strict=True,
    coerce=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[2, 3, 4]], columns=["b", "c", "d"])))
# Output
#    a  b  c  d
# 0  9  2  3  4

print(schema.validate(pd.DataFrame(data=[[1, 3, 4]], columns=["a", "c", "d"])))
# Output
#    a     b  c  d
# 0  1  <NA>  3  4

try:
    print(schema.validate(pd.DataFrame(data=[[1, 2, 4]], columns=["a", "b", "d"])))
except Exception as e:
    print(e)
# Output
# Error while coercing 'c' to type int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type int64:
#    index failure_case
# 0      0          NaN

try:
    print(schema.validate(pd.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "c"])))
except Exception as e:
    print(e)
# Output
# column 'd' in DataFrameSchema {'a': <Schema Column(name=a, type=DataType(int64))>, 'b': <Schema Column(name=b, type=DataType(Int64))>, 'c': <Schema Column(name=c, type=DataType(int64))>, 'd': <Schema Column(name=d, type=DataType(int64))>} requires a default value when non-nullable add_missing_columns is enabled
```
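The root cause described above (issue #1193, a default leaking into unrelated columns) can be illustrated in plain pandas, independent of pandera: a frame-level fillna with a scalar touches every column, while a column-level fillna leaves the others intact.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 1.0], "b": [np.nan, 2.0]})

# Frame-level fill: the default intended for 'a' leaks into 'b' as well.
overfilled = df.fillna(9)

# Column-level fill: only 'a' receives the default; 'b' keeps its NaN.
fixed = df.copy()
fixed["a"] = fixed["a"].fillna(9)

print(overfilled["b"].tolist())  # [9.0, 2.0] — unrelated column was filled
print(fixed["b"].tolist())       # [nan, 2.0] — unrelated column untouched
```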
Thanks @derinwalters! Looks like there are failing linter and unit tests. Check out the contributing guide for instructions on how to run linters/tests locally.
Hmmm, I read the instructions and have been following them correctly, as far as I know. Perhaps I did something goofy when patching the default value issue and didn't properly check. I'll look again; apologies for that. For lint, is there something else other than "pre-commit run --all"? It's showing clean for me.
Weird, I'm able to reproduce the CI failures:
```python
except SchemaError as exc:
    error_handler.collect_error(exc.reason_code, exc)
elif schema.coerce:
    check_obj[schema.name] = self.coerce_dtype(
```
We should add a test case for this part of the conditional: basically, make sure coercing works in the case where the check_obj is not a column/index.
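A pandas-only stand-in for the suggested test might look like the following. The real test would go through DataFrameSchema with coerce=True; this sketch only checks the shape the comment is about, where the object being validated is a whole DataFrame rather than a single Series or Index.

```python
import pandas as pd

# check_obj is a full DataFrame, not a Series/Index; the coerce branch assigns
# the coerced column back by name, mirroring
#   check_obj[schema.name] = self.coerce_dtype(...)
check_obj = pd.DataFrame({"a": ["1", "2"], "b": [0.5, 1.5]})
check_obj["a"] = check_obj["a"].astype("int64")  # stand-in for coerce_dtype

assert str(check_obj["a"].dtype) == "int64"    # target column coerced
assert str(check_obj["b"].dtype) == "float64"  # other columns untouched
```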
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
This PR's in good shape! Last thing @derinwalters ... would you mind updating this docs page with a subheader showing how this feature works?
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
I got to it first 😉 congrats on your first contribution to pandera, @derinwalters!
@cosmicBboy That was incredibly kind of you to close out the remaining issues. I was away from home for a few days. Thank you!
I took a crack at adding the add_missing_columns DataFrame schema config, including tests. One thing I could use some advice on: I got missing-attribute pylint issues in pandera/backends/pandas/container.py before I even changed anything. They look related to the schema method argument not having a typing annotation, so the referenced column schema attributes 'required', 'dtype', and 'coerce' are unknown. The typing annotations surrounding this variable look rather complicated and I was afraid to make changes, so I marked the lines with "# type: ignore[union-attr]" in this pull request.