
Add support for dropping invalid rows for pyspark backend #1639

Open · wants to merge 6 commits into base: main
Conversation

@zaheerabbas21 commented May 12, 2024

Solves issue #1540.

Tasks to be completed as per this comment:

  • Introduce a PANDERA_FULL_TABLE_VALIDATION configuration. By default it should be None and should be set depending on the validation backend: True for the pandas check backend but False for the pyspark backend.
  • Modify all of the pyspark builtin checks to have two execution modes (see the sketch after this list):
    • PANDERA_FULL_TABLE_VALIDATION=False is the current behavior.
    • PANDERA_FULL_TABLE_VALIDATION=True should return a boolean column indicating which elements in the column passed the check.
  • Make any additional changes to the pyspark backend to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
  • Add support for the drop_invalid_rows option.
  • Add info logging at validation time to let the user know whether full table validation is happening.
  • Add documentation discussing the performance implications of turning on full table validation.
  • Add unit test cases to the testing pipeline to cover the PANDERA_FULL_TABLE_VALIDATION config and the drop_invalid_rows option.
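A rough sketch of what the two execution modes could look like for one builtin check. This is only an illustration: the `should_validate_full_table` flag mirrors the current diff, the `.dataframe`/`.column_name` attributes of the column object are assumed, and the aggregate count-based check is a stand-in, not the final implementation.

```python
from typing import Any, Union

import pyspark.sql.functions as F
from pyspark.sql import Column


def equal_to(
    data,  # PysparkDataframeColumnObject: assumed to expose `.dataframe` and `.column_name`
    value: Any,
    should_validate_full_table: bool,
) -> Union[bool, Column]:
    """Illustrative two-mode version of the `equal_to` builtin check."""
    column = F.col(data.column_name)
    if should_validate_full_table:
        # PANDERA_FULL_TABLE_VALIDATION=True: return an element-wise boolean
        # column so the backend can identify (and optionally drop) invalid rows.
        return column == value
    # PANDERA_FULL_TABLE_VALIDATION=False (current behavior): return a single
    # aggregate boolean for the whole column.
    return data.dataframe.filter(column != value).limit(1).count() == 0
```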

PS: I'm new to the repo 😄, so please call out if I am not following repo guidelines or code style. Appreciate your help!

- Add full table validation support for pyspark backend

Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
…alidation

Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
def equal_to(
    data: PysparkDataframeColumnObject,
    value: Any,
    should_validate_full_table: bool,
@cosmicBboy (Collaborator) commented on this diff:
instead of passing this in as an argument, you can use pandera.config.get_config_context to get the full_table_validation configuration value. This is so that the API for each check is consistent across the different backends.
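For illustration, a minimal sketch of that suggestion, assuming the new `full_table_validation` field is exposed on the object returned by `get_config_context()` (this PR adds the field to the config); the check body follows the pattern sketched earlier and is not the exact implementation.

```python
from typing import Any, Union

import pyspark.sql.functions as F
from pyspark.sql import Column

from pandera.config import get_config_context


def equal_to(data, value: Any) -> Union[bool, Column]:
    """Same check, reading the mode from the config context instead of an argument."""
    column = F.col(data.column_name)
    if get_config_context().full_table_validation:
        # Element-wise boolean column for full table validation.
        return column == value
    # Aggregate boolean (current behavior).
    return data.dataframe.filter(column != value).limit(1).count() == 0
```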

@zaheerabbas21 (Author) replied:
Thanks @cosmicBboy. Will make the recommended change.

Also, can you suggest a way to keep the PANDERA_FULL_TABLE_VALIDATION config value False when the backend is pyspark and True when the backend is pandas? I did not find a good way to do this, hence asking for a suggestion 😅.

@cosmicBboy (Collaborator) commented May 18, 2024:
You can use the config_context context manager in the validate methods for each backend to control this behavior: https://github.com/unionai-oss/pandera/blob/main/pandera/config.py#L71

for example this is used in the polars backend:

with config_context(validation_depth=get_validation_depth(check_obj)):
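Applying the same pattern here might look roughly like the sketch below. It assumes config_context gains a full_table_validation keyword as part of this PR; the class and helper names are placeholders rather than the actual backend code.

```python
from pandera.config import config_context, get_config_context


class PysparkSchemaBackend:  # placeholder for the real pyspark backend class
    def validate(self, check_obj, schema, **kwargs):
        # Default full table validation to False for the pyspark backend unless the
        # user set it explicitly (the pandas backend would default it to True).
        configured = get_config_context().full_table_validation
        full_table = configured if configured is not None else False
        with config_context(full_table_validation=full_table):
            # Placeholder for the backend's actual validation steps.
            return self._run_schema_checks(check_obj, schema, **kwargs)
```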

@cosmicBboy (Collaborator) commented:
thanks @nk4456542, this is awesome!

Looks like some of the tests are broken

=========================== short test summary info ============================
FAILED tests/core/test_pandas_config.py::TestPandasDataFrameConfig::test_disable_validation - AssertionError: assert {'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>, 'cache_dataframe': False, 'keep_cached_dataframe': False, 'full_table_validation': None} == {'cache_dataframe': False, 'keep_cached_dataframe': False, 'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>}

see https://github.com/unionai-oss/pandera/actions/runs/9054025532/job/24909236981?pr=1639.

You can run these tests locally with pytest tests/pyspark.
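For reference, the assertion above fails because the expected config dict in tests/core/test_pandas_config.py does not yet contain the new key. A plausible fix is to include it; the dict below is reconstructed from the error message and is only a guess at the test's shape.

```python
from pandera.config import ValidationDepth

# Reconstructed from the assertion error; the actual test may build this differently.
expected_config = {
    "validation_enabled": False,
    "validation_depth": ValidationDepth.SCHEMA_AND_DATA,
    "cache_dataframe": False,
    "keep_cached_dataframe": False,
    "full_table_validation": None,  # new key introduced by this PR
}
```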

- Remove unused decorators

Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- Will help to use the flag in backend validate functions

Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- More tests to come for full_table_validation config for built_in_checks after adding support in pyspark backend

Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
@zaheerabbas21 (Author) commented Jun 2, 2024

I was a bit busy for the past two weeks; I will continue working on this from this week.

@cosmicBboy (Collaborator) commented:
hi @nk4456542 friendly ping on progress here, let me know if you need any help!

@zaheerabbas21 (Author) commented Jun 29, 2024

@cosmicBboy - Apologies for dropping this; I will pick it up this week. I work at a startup 😅, so I had my work cut out for me on one of our feature launches.

I will reach out in the comments if I need help on this PR.

@cosmicBboy (Collaborator) commented:
thanks for the update @nk4456542, totally understand what it's like to be at a startup 👍

@zaheerabbas21 (Author) commented Jul 9, 2024

I have been caught up in work again 😞, but I would really like to keep working on this 😬. I will update here again when I can pick it up.

Apologies again for not being clear on the timelines.

@cosmicBboy marked this pull request as ready for review on September 1, 2024, 19:08
codecov bot commented Sep 1, 2024

Codecov Report

Attention: Patch coverage is 5.63380% with 67 lines in your changes missing coverage. Please review.

Project coverage is 74.00%. Comparing base (812b2a8) to head (86a34a4).
Report is 141 commits behind head on main.

Files with missing lines                      Patch %   Lines
pandera/backends/pyspark/builtin_checks.py      0.00%   57 Missing ⚠️
pandera/backends/pyspark/utils.py               0.00%    9 Missing ⚠️
pandera/config.py                              80.00%    1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (812b2a8) and HEAD (86a34a4). Click for more details.

HEAD has 7 uploads fewer than BASE

Flag    BASE (812b2a8)    HEAD (86a34a4)
        48                41
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1639       +/-   ##
===========================================
- Coverage   94.28%   74.00%   -20.28%     
===========================================
  Files          91      120       +29     
  Lines        7013     9190     +2177     
===========================================
+ Hits         6612     6801      +189     
- Misses        401     2389     +1988     

☔ View full report in Codecov by Sentry.
