Add add_missing_columns DataFrame schema config per enhancement #687 #1186
Conversation
Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
Codecov Report — patch coverage:

```
@@            Coverage Diff             @@
##             main    #1186      +/-   ##
==========================================
+ Coverage   97.25%   97.31%    +0.05%
==========================================
  Files          65       65
  Lines        5106     5140       +34
==========================================
+ Hits         4966     5002       +36
+ Misses        140      138        -2
==========================================
```
Hey @derinwalters, thanks for this PR; just getting to review it right now, the last few weeks have been super busy! This looks great overall; I can take a look at the pylint thing.
One question I have based on the docstring:

> add missing column names with either default value, if specified in column schema, or NaN if column is nullable

What happens in the case that the default value is None and the column is not nullable? It may be best to raise a SchemaInitError exception in this case, as we know beforehand that this is an invalid case.
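The invalid case being described can be sketched in plain Python. The helper name and the dict-shaped column descriptions below are assumptions for illustration, not pandera's API: the point is just that a column with no default and nullable=False can never be filled in automatically, so the schema itself is misconfigured.

```python
# Hypothetical construction-time check (names are illustrative, not pandera's):
# reject any column that could not be auto-added when missing.
def check_add_missing_columns_config(columns: dict) -> None:
    """Raise for columns that have neither a default nor nullable=True."""
    for name, col in columns.items():
        if col.get("default") is None and not col.get("nullable", False):
            raise ValueError(
                f"column {name!r} requires a default value or nullable=True "
                "when add_missing_columns is enabled"
            )

# Valid: every column can be filled one way or another.
check_add_missing_columns_config({"a": {"default": 9}, "b": {"nullable": True}})

# Invalid: 'c' has no default and is not nullable.
try:
    check_add_missing_columns_config({"c": {}})
except ValueError as err:
    print(err)
```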
…ng_columns is enabled and non-nullable columns without a default are added, per enhancement #687 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
Fantastic idea for SchemaInitError. I added it to pandera/api/pandas/container.py along with related tests. How about this?

```python
import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={i: Column(int, default=9) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output:
#    a  b  c
# 0  1  9  3

schema = DataFrameSchema(
    columns={i: Column(int, nullable=True) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output:
#    a    b  c
# 0  1  NaN  3

schema = DataFrameSchema(
    columns={i: Column(int, nullable=False) for i in ["a", "b", "c"]},
    strict=True,
    add_missing_columns=True,
)
# Output:
# pandera.errors.SchemaInitError: column 'a' requires a default value when non-nullable add_missing_columns is enabled
```
@cosmicBboy actually, after thinking about this more, I believe it would be better to dynamically validate each missing column in the backend along with all the other checks; I'd suggest a SchemaError instead. The reason is that doing the check in DataFrameSchema essentially forces an all-or-nothing approach, meaning all columns must be declared with either a default value or as nullable. At least in my use case, which I believe would be common, there are some columns where I would be okay with adding a missing column using a default value, and some where I would not and would instead want an exception raised for the missing column. Does this make sense?
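The per-column behavior argued for here can be sketched in plain Python. The helper name and dict-shaped column specs are assumptions for illustration, not pandera internals: each missing column is resolved independently, so an error surfaces only when a column that lacks both a default and nullability is actually missing from the data.

```python
import math

def resolve_missing_columns(present: set, schema_columns: dict) -> dict:
    """Return {column: fill value} for absent columns, failing per column."""
    fills = {}
    for name, col in schema_columns.items():
        if name in present:
            continue  # column already in the dataframe; nothing to do
        if col.get("default") is not None:
            fills[name] = col["default"]
        elif col.get("nullable", False):
            fills[name] = math.nan
        else:
            raise ValueError(
                f"column {name!r} requires a default value when non-nullable "
                "add_missing_columns is enabled"
            )
    return fills

# Only the missing column 'b' needs a fill; 'a' is present and unconstrained.
print(resolve_missing_columns({"a"}, {"a": {}, "b": {"default": 9}}))
# → {'b': 9}
```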
…issing non-nullable columns without a default are found, per enhancement #687 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
I reverted the SchemaInitError from this pull request in favor of the SchemaError approach. Testing is doing what I expected, except that column "b" in the second validation example is getting assigned 9, which is the default value for column "a", not NaN as I expected. I confirmed that the add_missing_columns parser is returning the correct dataframe with NaN, but the value is getting modified downstream in run_schema_component_checks. Please give me a little time to investigate before we proceed further.

```python
import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={
        "a": Column(int, default=9),
        "b": Column(int, nullable=True),
        "c": Column(int),
    },
    strict=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[2, 3]], columns=["b", "c"])))
# Output
#    a  b  c
# 0  9  2  3

print(schema.validate(pd.DataFrame(data=[[1, 3]], columns=["a", "c"])))
# Output
#    a  b  c
# 0  1  9  3

print(schema.validate(pd.DataFrame(data=[[1, 2]], columns=["a", "b"])))
# Output
# pandera.errors.SchemaError: column 'c' in DataFrameSchema {'a': <Schema Column(name=a, type=DataType(int64))>, 'b': <Schema Column(name=b, type=DataType(int64))>, 'c': <Schema Column(name=c, type=DataType(int64))>} requires a default value when non-nullable add_missing_columns is enabled
```
… fills the entire dataframe, even in unrelated columns, per issue #1193 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
Turns out the strange default assignment mentioned above is related to issue #1193. I included a fix for that too. Now we have something that behaves as it should.

```python
import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={
        "a": Column(int, default=9),
        "b": Column(pd.Int64Dtype(), nullable=True),
        "c": Column(int, nullable=True),
        "d": Column(int),
    },
    strict=True,
    coerce=True,
    add_missing_columns=True,
)
print(schema.validate(pd.DataFrame(data=[[2, 3, 4]], columns=["b", "c", "d"])))
# Output
#    a  b  c  d
# 0  9  2  3  4

print(schema.validate(pd.DataFrame(data=[[1, 3, 4]], columns=["a", "c", "d"])))
# Output
#    a     b  c  d
# 0  1  <NA>  3  4

try:
    print(schema.validate(pd.DataFrame(data=[[1, 2, 4]], columns=["a", "b", "d"])))
except Exception as e:
    print(e)
# Output
# Error while coercing 'c' to type int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type int64:
#    index failure_case
# 0      0          NaN

try:
    print(schema.validate(pd.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "c"])))
except Exception as e:
    print(e)
# Output
# column 'd' in DataFrameSchema {'a': <Schema Column(name=a, type=DataType(int64))>, 'b': <Schema Column(name=b, type=DataType(Int64))>, 'c': <Schema Column(name=c, type=DataType(int64))>, 'd': <Schema Column(name=d, type=DataType(int64))>} requires a default value when non-nullable add_missing_columns is enabled
```
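The root cause described above (issue #1193, a default leaking into unrelated columns) can be illustrated in plain pandas, independent of pandera: a frame-level fillna with a scalar touches every column, while a column-level fillna leaves the others intact.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 1.0], "b": [np.nan, 2.0]})

# Frame-level fill: the default intended for 'a' leaks into 'b' as well.
overfilled = df.fillna(9)

# Column-level fill: only 'a' receives the default; 'b' keeps its NaN.
fixed = df.copy()
fixed["a"] = fixed["a"].fillna(9)

print(overfilled["b"].tolist())  # [9.0, 2.0] — unrelated column was filled
print(fixed["b"].tolist())       # [nan, 2.0] — unrelated column untouched
```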
Thanks @derinwalters! Looks like there are failing linter and unit tests. Check out the contributing guide for instructions on how to run linters/tests locally.
Hmmm, I read the instructions and have been following them correctly, as far as I know. Perhaps I did something goofy when patching the default value issue and didn't properly check. I'll look again; apologies for that. For lint, is there something else other than "pre-commit run --all"? It's showing clean for me.
Weird, I'm able to reproduce the CI failures:
```python
except SchemaError as exc:
    error_handler.collect_error(exc.reason_code, exc)
elif schema.coerce:
    check_obj[schema.name] = self.coerce_dtype(
```
We should add a test case for this part of the conditional: basically, make sure coercing works in the case where the check_obj is not a column/index.
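A pandas-only stand-in for the suggested test might look like the following. The real test would go through DataFrameSchema with coerce=True; this sketch only checks the shape the comment is about, where the object being validated is a whole DataFrame rather than a single Series or Index.

```python
import pandas as pd

# check_obj is a full DataFrame, not a Series/Index; the coerce branch assigns
# the coerced column back by name, mirroring
#   check_obj[schema.name] = self.coerce_dtype(...)
check_obj = pd.DataFrame({"a": ["1", "2"], "b": [0.5, 1.5]})
check_obj["a"] = check_obj["a"].astype("int64")  # stand-in for coerce_dtype

assert str(check_obj["a"].dtype) == "int64"    # target column coerced
assert str(check_obj["b"].dtype) == "float64"  # other columns untouched
```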
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
This PR's in good shape! Last thing @derinwalters ... would you mind updating this docs page with a subheader showing how this feature works?
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
I got to it first 😉 congrats on your first contribution to pandera, @derinwalters!
@cosmicBboy That was incredibly kind of you to close out the remaining issues. I was away from home for a few days. Thank you!
I took a crack at adding the add_missing_columns DataFrame schema config, including tests. One thing I could use some advice on: I got missing-attribute pylint issues in pandera/backends/pandas/container.py before I even changed anything. They look related to the schema method argument not having a typing annotation, so the referenced column schema attributes 'required', 'dtype', and 'coerce' are unknown. The typing annotations surrounding this variable look rather complicated and I was afraid to make changes, so I marked the lines with "# type: ignore[union-attr]" in this pull request.