Improve warning message when using the same Column object multiple times in DataFrameSchema #511
Comments
hey @Anders-E, the user warning you're seeing is because checks should be reusable, but by assigning positive_check = pa.Column(pa.Int, pa.Check.greater_than_or_equal_to(0)) you're basically re-using the entire column definition for two different columns (a Specify a different
|
Thank you for the very detailed reply @cosmicBboy ! This is indeed very helpful for my use cases, and I particularly like the solution using So long story short, |
@cosmicBboy Do you think the user warning could be improved (see my comment above)? If not I think this issue can be closed. |
hey @Anders-E, a contribution on that front would be very welcome. I think better than improving the warning message, we can make this case invalid:
|
I think that's actually a better idea yes, cause I really can't see why you would want to proceed with using the same Do you mind if I go ahead and implement it? |
yes, please do! make sure to base your changes off of the Also be sure to check out the contributing guide for recommended ways of setting up your local env, running pre-commit hooks, and tests. Let me know if you have any other questions! |
Thank you very much and thank you for the quick replies! |
Hi! I just looked up this bug after running into this warning. Thanks for your work on all this, @cosmicBboy! Can we make it so that on DataFrameSchema init, the Column is copied into a new Column with a modified name, instead of modifying the Column in place? This was very surprising behavior to me. This would also be in line with how pandas does it, where the names of Series are not modified when you pass them into a DataFrame: I'd guess the cost of copying a Column wouldn't be very much? Or, I think this would be even more disruptive, but I think it would make more sense, consider if Columns didn't even have a Thank you! |
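The pandas behavior referred to in this comment can be illustrated with a short sketch. This is not the commenter's original snippet; the Series name and values are arbitrary.

```python
import pandas as pd

# One named Series used under two different column keys.
s = pd.Series([1, 2, 3], name="original")
df = pd.DataFrame({"a": s, "b": s})

# The DataFrame labels its columns by the dict keys,
# while the Series passed in keeps its own name.
print(s.name)
print(list(df.columns))
```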
I completely agree with you @NickCrews , the behavior feels strange to me as it goes against the immutable style found in pandas. This is obviously not my project or anything, but I would highly prefer having the Columns copied. As to dropping the |
Agreed on copying the Column (I don't think the cost of copying would be high). I think for this issue, let's do the following:
So schema components actually have slightly different semantics than schemas... they take in a data structure and validate some part of it, so you can actually use |
@Anders-E @NickCrews do either of you want to make a PR for ^^? |
I could do a PR for this. That change to That makes sense why Columns need to store the name, so that they can lookup the right column from a DF, thanks! I think with this change a lot of my friction should disappear. |
would it? how? |
thanks @NickCrews ! I'm not super concerned that |
I was thinking that someone might be directly calling |
I'm not super concerned that this will have a big impact, since I'm not sure how many people using |
Oops sorry I didn't see your comment "how?". I wasn't thinking of people relying on Column getting mutated in DataFrameSchema, I was thinking of this:
How do you want to deal with this? |
gotcha, agreed, so let's do what I mentioned in #511 (comment):
|
@cosmicBboy MANY months later 🤣 , but how's that PR look? |
Thanks @NickCrews, and no worries about the delay! Looks like the test Also, working on some CI issues: #1056, may need to rebase on that once it's merged |
Fixes unionai-oss#511 Also restructures so that the conversion is a pure function, which is easier to reason about and is less prone to breaking in the future from unintentional side effects from self. Also move all verifications to top of __init__, so all the assignments to self happen in one block. Also contains a small typing fix from previous commit. Signed-off-by: Nick Crews <nicholas.b.crews@gmail.com>
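The idea of a pure conversion function that copies rather than mutates can be sketched with the standard library alone. The `Column` dataclass and the `with_names` function below are hypothetical stand-ins for illustration; they are not pandera's classes and do not reproduce the code in the PR.

```python
from copy import copy
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Column:
    """Hypothetical stand-in for a schema column (not pandera's Column)."""
    dtype: type
    name: Optional[str] = None


def with_names(columns: Dict[str, Column]) -> Dict[str, Column]:
    """Return renamed copies; the input Column objects are left untouched."""
    named = {}
    for key, col in columns.items():
        new_col = copy(col)  # shallow copy, so the caller's object is not mutated
        new_col.name = key
        named[key] = new_col
    return named


shared = Column(int)
named = with_names({"a": shared, "b": shared})
```

Because the function reads only its argument and returns a new dict, the same `shared` object can be passed under several keys without any one key's name leaking into the others.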
Question about pandera
When using Pandera, I've tried to reuse Column objects for multiple rows. However, since there are side effects on any Column objects used in the DataFrameSchema constructor, I find myself having to do a workaround using lambdas.
For example, the following code triggers a warning:
Output
Whereas the following code runs with no issues:
I find it to be a common use case to reuse checks. Is this really the intended behavior of the DataFrameSchema constructor? I realize that changing this would be a breaking change, but is it something you would consider for a future v1.0?