Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve warning message when using the same Column object multiple times in DataFrameSchema #511

Closed
Anders-E opened this issue Jun 4, 2021 · 20 comments · Fixed by #1055
Closed
Labels
enhancement New feature or request

Comments

@Anders-E
Copy link

Anders-E commented Jun 4, 2021

Question about pandera

When using Pandera, I've tried to reuse Column objects for multiple rows. However, since there are side effects on any Column objects used in the DataFrameSchema constructor, I find myslef having to do a workaround using lambdas.

For example, the following code triggers a warning:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    'positive': [1],
    'also_positive': [4],
    'negative': [-1]
})

positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
negative_check = pa.Column(pa.Int, pa.Check.less_than(0))

print(positive_check)

schema = pa.DataFrameSchema({
    'positive': positive_check,
    'also_positive': positive_check,
    'negative': negative_check
})

print(positive_check)

Output

Before creating schema: <Schema Column(name=None, type=int)>
/usr/local/lib/python3.9/site-packages/pandera/schemas.py:220: UserWarning: resetting column for <Schema Column(name=positive, type=int)> to 'also_positive'.
  warnings.warn(
After creating schema: <Schema Column(name=also_positive, type=int)>

Whereas the following code runs with no issues:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    'positive': [1, 2, 3],
    'also_positive': [4, 5, 6],
    'negative': [-1, -2, -3]
})

positive_check = lambda: pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))
negative_check = lambda: pa.Column(pa.Int, pa.Check.less_than(0))

schema = pa.DataFrameSchema({
    'positive': positive_check(),
    'also_positive': positive_check(),
    'negative': negative_check()
})

I find it to be a common use case to reuse checks. Is this really the intended behavior of the DataFrameSchema constructor? I realize that changing this would be a breaking change but is it something you would consider for a future v1.0?

@Anders-E Anders-E added the question Further information is requested label Jun 4, 2021
@cosmicBboy
Copy link
Collaborator

hey @Anders-E, the user warning you're seeing is because checks should be reusable but by assigning a Column object to a variable

positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0))

You're basically re-using the entire column definition for two different columns (a Column is a schema-like object that holds metadata about the column, not a Check). There are several ways of applying the same set of checks to a group of columns. I'll list them in order of my personal preference on how I use pandera:

Specify a different Column object for each column

For simple cases where the column definition doesn't involve a lot of checks/option kwargs

schema = pa.DataFrameSchema({
    'positive': pa.Column(pa.Int, pa.Check.ge(0)),  # ge is the shorter hand for greater_than_or_equal_to
    'also_positive': pa.Column(pa.Int, pa.Check.ge(0)),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

# here you can even define a variable for the common arguments
positive_col_args = [pa.Int, pa.Check.ge(0)]

schema = pa.DataFrameSchema({
    'positive': pa.Column(*positive_col_args),
    'also_positive': pa.Column(*positive_col_args),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Use regex=True

This applies a single column definition to a set of columns, which are identified at validation time with the column key by a regular expression.

schema = pa.DataFrameSchema({
    '(positive|also_positive)': pa.Column(pa.Int, pa.Check.ge(0), regex=True),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Use partial

This is similar to the lambda solution: it guarantees that the column objects are unique, while allowing you to differentiate positive and also_positive with different kwarg options.

from functools import partial

PositiveColumn = partial(Column, pa.Int, pa.Check.ge(0))

schema = pa.DataFrameSchema({
    'positive': PositiveColumn(coerce=True),
    'also_positive': PositiveColumn(coerce=False),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Reusing Checks

As a related but similar topic, reusing checks is supported by pandera, just keep in mind they need to be supplied to different Column object instances

reusable_checks = [
    pa.Check.ge(0),
    ... # a long list of checks
]

schema = pa.DataFrameSchema({
    'positive': pa.Column(pa.Int, reusable_checks),
    'also_positive': pa.Column(pa.Int, reusable_checks),
    'negative': pa.Column(pa.Int, pa.Check.less_than(0)),
})

Let me know if you have any other questions!

@Anders-E
Copy link
Author

Anders-E commented Jun 8, 2021

Thank you for the very detailed reply @cosmicBboy !

This is indeed very helpful for my use cases, and I particularly like the solution using functools.partial.

So long story short, Column objects are not supposed to be reused. Do you think it would be a good idea to elaborate on this in the user warning?

@Anders-E
Copy link
Author

Anders-E commented Jul 6, 2021

@cosmicBboy Do you think the user warning could be improved (see my comment above)? If not I think this issue can be closed.

@cosmicBboy
Copy link
Collaborator

hey @Anders-E, a contribution on that front would be very welcome. I think better than improving the warning message, we can make this case invalid:

  • in DataFrameSchema.__init__, check whether a Column object is used more than once
  • if so, raise a SchemaInitError saying something to the effect of "you cannot use a Column object multiple times in the same DataFrameSchema.".

@Anders-E
Copy link
Author

Anders-E commented Jul 6, 2021

I think that's actually a better idea yes, cause I really can't see why you would want to proceed with using the same Column multiple times.

Do you mind if I go ahead and implement it?

@cosmicBboy
Copy link
Collaborator

yes, please do! make sure to base your changes off of the dev branch and make a PR against that branch when it's ready!

Also be sure to check out the contributing guide for recommended ways of setting up your local env, running pre-commit hooks, and tests.

Let me know if you have any other questions!

@Anders-E
Copy link
Author

Anders-E commented Jul 7, 2021

Thank you very much and thank you for the quick replies!

@cosmicBboy cosmicBboy changed the title Is DataFrameSchema supposed to have side-effects on Column objects? Improve warning message when using the same Column object multiple times in DataFrameSchema Jul 12, 2021
@cosmicBboy cosmicBboy added enhancement New feature or request and removed question Further information is requested labels Jul 12, 2021
@NickCrews
Copy link
Contributor

NickCrews commented Apr 14, 2022

Hi! I just looked up this bug after running into this warning. Thanks for your work on all this @cosmicBboy

Can we make it so that on DataFrameSchema init, the Column is copied into a new Column with a modified name, instead of modifying the Column in place? This was very surprising behavior to me. This would also be in line with how pandas does it, where the names of Series is not modified when you pass them into a Dataframe:

image

I'd guess the cost of copying a Column wouldn't be very much?

Or, I think this would be even more disruptive, but I think it would make more sense, consider if Column's didn't even have a name attribute, all they did was define what was INSIDE a column, but didn't know anything about where the data is actually stored.

Thank you!

@Anders-E
Copy link
Author

Anders-E commented May 4, 2022

I completely agree with you @NickCrews , the behavior feels strange to me as it goes against the immutable style found in pandas.

This is obviously not my project or anything, but I would highly prefer having the Columns copied. As to dropping the name attribute, I don't really know what it's used for under the hood so can't really comment on that.

@cosmicBboy
Copy link
Collaborator

Agreed on copying the Column (I don't think the cost of copying would be high)

I think for this issue, let's do the following:

  • Change Column.set_name so that it makes a copy of the Column object
  • Remove the warning, since the original column is no longer modified in place.

As to dropping the name attribute, I don't really know what it's used for under the hood so can't really comment on that.

consider if Column's didn't even have a name attribute, all they did was define what was INSIDE a column, but didn't know anything about where the data is actually stored.

So schema components actually have slightly different semantics than schemas... they take in a data structure and validate some part of it, so you can actually use Column objects to validate a dataframe by itself (without having to use it within a DataFrameSchema), see these docs. This is the reason it has a name attribute.

@cosmicBboy
Copy link
Collaborator

@Anders-E @NickCrews do either of you want to make a PR for ^^?

@NickCrews
Copy link
Contributor

I could do a PR for this. That change to Column.set_name would be breaking for users. Do you want any sort of warning or deprecation schedule? ie leave it as is, add a warning, and in 6 months actually do the change? IDK what Pandera's policy on this is.

That makes sense why Columns need to store the name, so that they can lookup the right column from a DF, thanks! I think with this change a lot of my friction should disappear.

@cosmicBboy
Copy link
Collaborator

That change to Column.set_name would be breaking for users

would it? how?

@cosmicBboy
Copy link
Collaborator

thanks @NickCrews !

I'm not super concerned that Column.set_name will be a breaking change... as you and @Anders-E pointed out, in-plce mutation of the Column when it's fed into a DataFrameSchema is a bug, in that it's sort of unexpected/unintuitive behavior.

@NickCrews
Copy link
Contributor

I was thinking that someone might be directly calling Column.set_name, since that is part of the public API, outside of the context of a DataFreSchema that we've been talking about

@cosmicBboy
Copy link
Collaborator

I not super concerned that this will have a big impact, since I'm not sure how many people using Columns by themselves rely on the fact that it's mutated in the context of a DataFrameSchema... but you're right, to be safe let's manually to a copy.deepcopy of the Column before calling set_name here: https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L282-L289

@NickCrews
Copy link
Contributor

Oops sorry I didn't see your comment "how?". I wasn't thinking of people relying on Column getting mutated in DataFrameSchema, I was thinking of this:

c1 = Column(name="name1")
c1.set_name("name2")
# keep doing stuff with c1, expecting it to be named `name2`.
# After this change though, it will still be named `name1`

How do you want to deal with this?

@cosmicBboy
Copy link
Collaborator

gotcha, agreed, so let's do what I mentioned in #511 (comment):

to be safe let's manually to a copy.deepcopy of the Column before calling set_name here: https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L282-L289

@NickCrews
Copy link
Contributor

@cosmicBboy MANY months later 🤣 , but how's that PR look?

@cosmicBboy
Copy link
Collaborator

Thanks @NickCrews, and no worries about the delay!

Looks like the test tests/core/test_schemas.py::test_dataframe_reset_column_name needs to be modified, I don't think a warning is raised anymore.

Also, working on some CI issues: #1056, may need to rebase on that once it's merged

NickCrews added a commit to NickCrews/pandera that referenced this issue Dec 15, 2022
Fixes unionai-oss#511

Also restructures so that the conversion is a pure function,
which is easier to reason about
and is less prone to breaking in the future
from unintentional side effects from
self.

Also move all verifications to top of __init__, so all the
assignments to self happen in one block.

Also contains a small typing fix from previous commit.

Signed-off-by: Nick Crews <nicholas.b.crews@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants