Improve warning message when using the same Column object multiple times in DataFrameSchema #511

Anders-E · 2021-06-04T20:52:25Z

Question about pandera

When using Pandera, I've tried to reuse Column objects for multiple rows. However, since there are side effects on any Column objects used in the DataFrameSchema constructor, I find myslef having to do a workaround using lambdas.

For example, the following code triggers a warning:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    'positive': [1],
    'also_positive': [4],
    'negative': [-1]
})

positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
negative_check = pa.Column(pa.Int, pa.Check.less_than(0))

print(positive_check)

schema = pa.DataFrameSchema({
    'positive': positive_check,
    'also_positive': positive_check,
    'negative': negative_check
})

print(positive_check)

Output

Before creating schema: <Schema Column(name=None, type=int)>
/usr/local/lib/python3.9/site-packages/pandera/schemas.py:220: UserWarning: resetting column for <Schema Column(name=positive, type=int)> to 'also_positive'.
  warnings.warn(
After creating schema: <Schema Column(name=also_positive, type=int)>

Whereas the following code runs with no issues:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    'positive': [1, 2, 3],
    'also_positive': [4, 5, 6],
    'negative': [-1, -2, -3]
})

positive_check = lambda: pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
negative_check = lambda: pa.Column(pa.Int, pa.Check.less_than(0))

schema = pa.DataFrameSchema({
    'positive': positive_check(),
    'also_positive': positive_check(),
    'negative': negative_check()
})

I find it to be a common use case to reuse checks. Is this really the intended behavior of the DataFrameSchema constructor? I realize that changing this would be a breaking change but is it something you would consider for a future v1.0?

The text was updated successfully, but these errors were encountered:

cosmicBboy · 2021-06-06T15:50:16Z

hey @Anders-E, the user warning you're seeing is because checks should be reusable but by assigning a Column object to a variable

positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))

You're basically re-using the entire column definition for two different columns (a Column is a schema-like object that holds metadata about the column, not a Check). There are several ways of applying the same set of checks to a group of columns. I'll list them in order of my personal preference on how I use pandera:

Specify a different `Column` object for each column

For simple cases where the column definition doesn't involve a lot of checks/option kwargs

schema = pa.DataFrameSchema({
    'positive': pa.Column(pa.Int, pa.Check.ge(0)),  # ge is the shorter hand for greater_than_or_equal_to
    'also_positive': pa.Column(pa.Int, pa.Check.ge(0)),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

# here you can even define a variable for the common arguments
positive_col_args = [pa.Int, pa.Check.ge(0)]

schema = pa.DataFrameSchema({
    'positive': pa.Column(*positive_col_args),
    'also_positive': pa.Column(*positive_col_args),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Use `regex=True`

This applies a single column definition to a set of columns, which are identified at validation time with the column key by a regular expression.

schema = pa.DataFrameSchema({
    '(positive|also_positive)': pa.Column(pa.Int, pa.Check.ge(0), regex=True),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Use `partial`

This is similar to the lambda solution: it guarantees that the column objects are unique, while allowing you to differentiate positive and also_positive with different kwarg options.

from functools import partial

PositiveColumn = partial(Column, pa.Int, pa.Check.ge(0))

schema = pa.DataFrameSchema({
    'positive': PositiveColumn(coerce=True),
    'also_positive': PositiveColumn(coerce=False),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Reusing `Check`s

As a related but similar topic, reusing checks is supported by pandera, just keep in mind they need to be supplied to different Column object instances

reusable_checks = [
    pa.Check.ge(0),
    ... # a long list of checks
]

schema = pa.DataFrameSchema({
    'positive': pa.Column(pa.Int, reusable_checks),
    'also_positive': pa.Column(pa.Int, reusable_checks),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Let me know if you have any other questions!

Anders-E · 2021-06-08T20:46:43Z

Thank you for the very detailed reply @cosmicBboy !

This is indeed very helpful for my use cases, and I particularly like the solution using functools.partial.

So long story short, Column objects are not supposed to be reused. Do you think it would be a good idea to elaborate on this in the user warning?

Anders-E · 2021-07-06T13:07:37Z

@cosmicBboy Do you think the user warning could be improved (see my comment above)? If not I think this issue can be closed.

cosmicBboy · 2021-07-06T13:52:38Z

hey @Anders-E, a contribution on that front would be very welcome. I think better than improving the warning message, we can make this case invalid:

in DataFrameSchema.__init__, check whether a Column object is used more than once
if so, raise a SchemaInitError saying something to the effect of "you cannot use a Column object multiple times in the same DataFrameSchema.".

Anders-E · 2021-07-06T14:21:19Z

I think that's actually a better idea yes, cause I really can't see why you would want to proceed with using the same Column multiple times.

Do you mind if I go ahead and implement it?

cosmicBboy · 2021-07-06T18:02:55Z

yes, please do! make sure to base your changes off of the dev branch and make a PR against that branch when it's ready!

Also be sure to check out the contributing guide for recommended ways of setting up your local env, running pre-commit hooks, and tests.

Let me know if you have any other questions!

Anders-E · 2021-07-07T07:06:34Z

Thank you very much and thank you for the quick replies!

NickCrews · 2022-04-14T14:39:13Z

Hi! I just looked up this bug after running into this warning. Thanks for your work on all this @cosmicBboy

Can we make it so that on DataFrameSchema init, the Column is copied into a new Column with a modified name, instead of modifying the Column in place? This was very surprising behavior to me. This would also be in line with how pandas does it, where the names of Series is not modified when you pass them into a Dataframe:

I'd guess the cost of copying a Column wouldn't be very much?

Or, I think this would be even more disruptive, but I think it would make more sense, consider if Column's didn't even have a name attribute, all they did was define what was INSIDE a column, but didn't know anything about where the data is actually stored.

Thank you!

Anders-E · 2022-05-04T08:57:52Z

I completely agree with you @NickCrews , the behavior feels strange to me as it goes against the immutable style found in pandas.

This is obviously not my project or anything, but I would highly prefer having the Columns copied. As to dropping the name attribute, I don't really know what it's used for under the hood so can't really comment on that.

cosmicBboy · 2022-05-04T13:51:56Z

Agreed on copying the Column (I don't think the cost of copying would be high)

I think for this issue, let's do the following:

Change Column.set_name so that it makes a copy of the Column object
Remove the warning, since the original column is no longer modified in place.

As to dropping the name attribute, I don't really know what it's used for under the hood so can't really comment on that.

consider if Column's didn't even have a name attribute, all they did was define what was INSIDE a column, but didn't know anything about where the data is actually stored.

So schema components actually have slightly different semantics than schemas... they take in a data structure and validate some part of it, so you can actually use Column objects to validate a dataframe by itself (without having to use it within a DataFrameSchema), see these docs. This is the reason it has a name attribute.

cosmicBboy · 2022-05-04T13:52:12Z

@Anders-E @NickCrews do either of you want to make a PR for ^^?

NickCrews · 2022-05-04T16:44:18Z

I could do a PR for this. That change to Column.set_name would be breaking for users. Do you want any sort of warning or deprecation schedule? ie leave it as is, add a warning, and in 6 months actually do the change? IDK what Pandera's policy on this is.

That makes sense why Columns need to store the name, so that they can lookup the right column from a DF, thanks! I think with this change a lot of my friction should disappear.

cosmicBboy · 2022-05-04T19:44:02Z

That change to Column.set_name would be breaking for users

would it? how?

cosmicBboy · 2022-05-04T19:45:44Z

thanks @NickCrews !

I'm not super concerned that Column.set_name will be a breaking change... as you and @Anders-E pointed out, in-plce mutation of the Column when it's fed into a DataFrameSchema is a bug, in that it's sort of unexpected/unintuitive behavior.

NickCrews · 2022-05-04T23:40:36Z

I was thinking that someone might be directly calling Column.set_name, since that is part of the public API, outside of the context of a DataFreSchema that we've been talking about

cosmicBboy · 2022-05-06T13:25:36Z

I not super concerned that this will have a big impact, since I'm not sure how many people using Columns by themselves rely on the fact that it's mutated in the context of a DataFrameSchema... but you're right, to be safe let's manually to a copy.deepcopy of the Column before calling set_name here: https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L282-L289

NickCrews · 2022-05-06T17:00:02Z

Oops sorry I didn't see your comment "how?". I wasn't thinking of people relying on Column getting mutated in DataFrameSchema, I was thinking of this:

c1 = Column(name="name1")
c1.set_name("name2")
# keep doing stuff with c1, expecting it to be named `name2`.
# After this change though, it will still be named `name1`

How do you want to deal with this?

cosmicBboy · 2022-05-06T20:54:43Z

gotcha, agreed, so let's do what I mentioned in #511 (comment):

to be safe let's manually to a copy.deepcopy of the Column before calling set_name here: https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L282-L289

NickCrews · 2022-12-14T22:54:07Z

@cosmicBboy MANY months later 🤣 , but how's that PR look?

cosmicBboy · 2022-12-15T19:06:34Z

Thanks @NickCrews, and no worries about the delay!

Looks like the test tests/core/test_schemas.py::test_dataframe_reset_column_name needs to be modified, I don't think a warning is raised anymore.

Also, working on some CI issues: #1056, may need to rebase on that once it's merged

Fixes unionai-oss#511 Also restructures so that the conversion is a pure function, which is easier to reason about and is less prone to breaking in the future from unintentional side effects from self. Also move all verifications to top of __init__, so all the assignments to self happen in one block. Also contains a small typing fix from previous commit. Signed-off-by: Nick Crews <nicholas.b.crews@gmail.com>

Anders-E added the question Further information is requested label Jun 4, 2021

cosmicBboy changed the title ~~Is DataFrameSchema supposed to have side-effects on Column objects?~~ Improve warning message when using the same Column object multiple times in DataFrameSchema Jul 12, 2021

cosmicBboy added enhancement New feature or request and removed question Further information is requested labels Jul 12, 2021

NickCrews mentioned this issue Dec 14, 2022

Feature/511: Copy columns in DataFrameSchema.__init__() #1055

Merged

cosmicBboy closed this as completed in #1055 Dec 16, 2022

cosmicBboy closed this as completed in 0301dec Dec 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve warning message when using the same Column object multiple times in DataFrameSchema #511

Improve warning message when using the same Column object multiple times in DataFrameSchema #511

Anders-E commented Jun 4, 2021

cosmicBboy commented Jun 6, 2021

Anders-E commented Jun 8, 2021

Anders-E commented Jul 6, 2021

cosmicBboy commented Jul 6, 2021

Anders-E commented Jul 6, 2021

cosmicBboy commented Jul 6, 2021

Anders-E commented Jul 7, 2021

NickCrews commented Apr 14, 2022 •

edited

Loading

Anders-E commented May 4, 2022

cosmicBboy commented May 4, 2022

cosmicBboy commented May 4, 2022

NickCrews commented May 4, 2022

cosmicBboy commented May 4, 2022

cosmicBboy commented May 4, 2022

NickCrews commented May 4, 2022

cosmicBboy commented May 6, 2022

NickCrews commented May 6, 2022

cosmicBboy commented May 6, 2022

NickCrews commented Dec 14, 2022

cosmicBboy commented Dec 15, 2022

Improve warning message when using the same Column object multiple times in DataFrameSchema #511

Improve warning message when using the same Column object multiple times in DataFrameSchema #511

Comments

Anders-E commented Jun 4, 2021

Question about pandera

Output

cosmicBboy commented Jun 6, 2021

Specify a different Column object for each column

Use regex=True

Use partial

Reusing Checks

Anders-E commented Jun 8, 2021

Anders-E commented Jul 6, 2021

cosmicBboy commented Jul 6, 2021

Anders-E commented Jul 6, 2021

cosmicBboy commented Jul 6, 2021

Anders-E commented Jul 7, 2021

NickCrews commented Apr 14, 2022 • edited Loading

Anders-E commented May 4, 2022

cosmicBboy commented May 4, 2022

cosmicBboy commented May 4, 2022

NickCrews commented May 4, 2022

cosmicBboy commented May 4, 2022

cosmicBboy commented May 4, 2022

NickCrews commented May 4, 2022

cosmicBboy commented May 6, 2022

NickCrews commented May 6, 2022

cosmicBboy commented May 6, 2022

NickCrews commented Dec 14, 2022

cosmicBboy commented Dec 15, 2022

Specify a different `Column` object for each column

Use `regex=True`

Use `partial`

Reusing `Check`s

NickCrews commented Apr 14, 2022 •

edited

Loading