Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/922 add other ways to report unique errors as an argument #914

Merged
merged 13 commits into from
Sep 22, 2022

Conversation

ng-henry
Copy link
Contributor

This implements feature request #902 by allowing the unique parameter to take in more arguments, like "first", "last", "all". These control how unique values are reported (all unique values? all except for the first occurrence? all except for the last occurrence?).

@codecov
Copy link

codecov bot commented Aug 12, 2022

Codecov Report

Base: 96.55% // Head: 96.56% // Increases project coverage by +0.01% 🎉

Coverage data is based on head (d115973) compared to base (d6538a5).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #914      +/-   ##
==========================================
+ Coverage   96.55%   96.56%   +0.01%     
==========================================
  Files          43       43              
  Lines        4179     4198      +19     
==========================================
+ Hits         4035     4054      +19     
  Misses        144      144              
Impacted Files Coverage Δ
pandera/strategies.py 98.25% <ø> (ø)
pandera/dtypes.py 94.64% <100.00%> (+0.04%) ⬆️
pandera/model_components.py 95.61% <100.00%> (+0.07%) ⬆️
pandera/schema_components.py 99.07% <100.00%> (+<0.01%) ⬆️
pandera/schemas.py 99.26% <100.00%> (+0.01%) ⬆️
pandera/decorators.py 98.80% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@cosmicBboy
Copy link
Collaborator

thanks @ng-henry ! this will be the last PR before the 0.12.0 release.

Please take a look at #913, which is a WIP PR for overhauling pandera internals to be more extensible.

I think the other features you'd like to support (like partitioning coercible and uncoercible cells) will be easier to reason about and implement with the new internals #906, although I think there's still a bit of design work to figure out how to implement it.

Copy link
Collaborator

@cosmicBboy cosmicBboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, actually one last thing to do here is to update the docs and docstrings in the relevant areas:

Docstrings

The __init__ args for the following:

  • DataFrameSchema, Column, Index, MultiIndex, Field

Docs

@cosmicBboy
Copy link
Collaborator

@ng-henry after considering this PR a little more, I think conflating the unique property with the keep argument is confusing. In pandera, options should intuitively map to properties that are validated at runtime.

I think it would make more sense to either:

  1. introduce a new kwarg like report_duplicates or something like this, where options are "all", "exclude_first", "exclude_last" to make it super clear what's happening.
  2. only have series.duplicated(keep=False) to report all duplicates... I might be missing something, but I think there's a strong use case for identifying all duplicates and a pretty weak use case for only reporting all except the first/last duplicate value.

We can merge this as part of the 0.13.0 release, before the major 0.14.0 overhaul of the backend.

What do you think?

@ng-henry
Copy link
Contributor Author

Yep I think that's a good idea. BTW we are not setting keep = False when using unique = True on the Column constructor. We are only doing that in the DataFrameSchema constructor. See issue #902 for reference.

@ng-henry
Copy link
Contributor Author

I'll just continue work on the report_duplicates argument on this branch, and make a new branch to fix the duplicate issue when using the Column constructor.

@ng-henry
Copy link
Contributor Author

@cosmicBboy codecov says there's zero coverage on pandera/mypy.py but not quite sure why that is. I don't think I changed anything on the mypy side of things.

@cosmicBboy
Copy link
Collaborator

codecov says there's zero coverage on pandera/mypy.py but not quite sure why that is. I don't think I changed anything on the mypy side of things.

That's not a problem.

introduce a new kwarg like report_duplicates or something like this, where options are "all", "exclude_first", "exclude_last" to make it super clear what's happening.

Can we stick to this naming? I think unique_keep_setting is too pandas-specific (in a way that's confusing). report_duplicates with the "all", "exclude_first", "exclude_last" options are a lot clearer in what's happening: pandera will report "all" duplicates, "exclude the first" or "exclude the last" duplicate.

@ng-henry
Copy link
Contributor Author

@cosmicBboy implemented but codecov is lower. Do you want to add more tests?

elif unique == "all":
keep_argument = False
else:
raise ValueError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @ng-henry do you mind adding a test case for raising this ValueError ?

@cosmicBboy
Copy link
Collaborator

awesome, thanks @ng-henry !

@cosmicBboy cosmicBboy merged commit 206c883 into unionai-oss:dev Sep 22, 2022
cosmicBboy pushed a commit that referenced this pull request Oct 5, 2022
* add unique "keep" setting

* add test

* add tests + fix pre-commit errors

* move UniqueSettings to dtypes.py and fix errors

* fix mypy type error

* fix mypy type error

* a

* add keep settings argument

* new changes

* update documentation

* update

* redo CI

* add ValueError test for invalid unique settings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants