Add support for dropping invalid rows for pyspark backend #1639

zaheerabbas21 · 2024-05-12T20:25:36Z

Solves issue - #1540

Tasks to be completed as per this comment:

Introduce PANDERA_FULL_TABLE_VALIDATION configuration. By default, it should be None and should be set depending on the validation backend. It should be True for the pandas check backend but False for the pyspark backend.
Modify all of the pyspark builtin checks to have two execution modes:
- PANDERA_FULL_TABLE_VALIDATION=False is the current behavior
- PANDERA_FULL_TABLE_VALIDATION=True should return a boolean column indicating which element in the column passed the check.
Make any additional changes to the pyspark backend to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
Add support for the drop_invalid_rows option
Add info logging at validation time to let the user know if full table validation is happening or not
Add documentation discussing the performance implications of turning on full table validation.
Add unit test cases in the testing pipeline to support the PANDERA_FULL_TABLE_VALIDATION config and drop_invalid_rows option

PS: New to the repo 😄 , so please call out if I am not following repo guidelines or code style. Appreciate your help!

- Add full table validation support for pyspark backend Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

…alidation Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

cosmicBboy · 2024-05-14T01:46:37Z

pandera/backends/pyspark/builtin_checks.py

+def equal_to(
+    data: PysparkDataframeColumnObject,
+    value: Any,
+    should_validate_full_table: bool,


instead of passing this in as an argument, you can use pandera.config.get_config_context to get the full_table_validation configuration value. This is so that the API for each check is consistent across the different backends.

Thanks @cosmicBboy. Will make the recommended change.

Also, is there a way that you can suggest to keep the PANDERA_FULL_TABLE_VALIDATION config value to be False when the backend is pyspark and True when the backend is pandas? Did not find a good way to do this, hence asking for a suggestion 😅.

You can use the config_context context manager in the validate methods for each backend to control this behavior: https://github.com/unionai-oss/pandera/blob/main/pandera/config.py#L71

for example this is used in the polars backend:

pandera/pandera/api/polars/container.py

Line 53 in c24dda9

with config_context(validation_depth=get_validation_depth(check_obj)):

cosmicBboy · 2024-05-14T01:46:51Z

thanks @nk4456542, this is awesome!

Looks like some of the tests are broken

=========================== short test summary info ============================
FAILED tests/core/test_pandas_config.py::TestPandasDataFrameConfig::test_disable_validation - AssertionError: assert {'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>, 'cache_dataframe': False, 'keep_cached_dataframe': False, 'full_table_validation': None} == {'cache_dataframe': False, 'keep_cached_dataframe': False, 'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>}

see https://github.com/unionai-oss/pandera/actions/runs/9054025532/job/24909236981?pr=1639.

You can run these tests locally with pytest tests/pyspark.

- Remove unused decorators Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

- Will help to use the flag in backend validate functions Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

- More tests to come for full_table_validation config for built_in_checks after adding support in pyspark backend Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

zaheerabbas21 · 2024-06-02T17:07:51Z

Was a bit busy for the past two weeks, will continue working on this from this week

cosmicBboy · 2024-06-28T16:23:37Z

hi @nk4456542 friendly ping on progress here, let me know if you need any help!

zaheerabbas21 · 2024-06-29T12:02:28Z

@cosmicBboy - Apologies for dropping this, will pick this up this week. I work at a startup 😅, so I had my work cut out for one of the feature launches.

I will contact you in the comments if I need help on this PR.

cosmicBboy · 2024-06-30T17:42:31Z

thanks for the update @nk4456542, totally understand what it's like to be at a startup 👍

zaheerabbas21 · 2024-07-09T08:47:13Z

I have been caught up in work again 😞 . But would really like to work on this 😬 , would update here again when I can pick up this again.

Apologies again for not being clear on the timelines

codecov · 2024-09-01T19:08:27Z

Codecov Report

Attention: Patch coverage is 5.63380% with 67 lines in your changes missing coverage. Please review.

Project coverage is 74.00%. Comparing base (812b2a8) to head (86a34a4).
Report is 141 commits behind head on main.

Files with missing lines	Patch %	Lines
pandera/backends/pyspark/builtin_checks.py	0.00%	57 Missing ⚠️
pandera/backends/pyspark/utils.py	0.00%	9 Missing ⚠️
pandera/config.py	80.00%	1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (812b2a8) and HEAD (86a34a4). Click for more details.

HEAD has 7 uploads less than BASE

Flag BASE (812b2a8) HEAD (86a34a4)

48 41

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1639       +/-   ##
===========================================
- Coverage   94.28%   74.00%   -20.28%     
===========================================
  Files          91      120       +29     
  Lines        7013     9190     +2177     
===========================================
+ Hits         6612     6801      +189     
- Misses        401     2389     +1988

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

zaheerabbas21 added 2 commits May 13, 2024 01:46

Add PANDERA_FULL_TABLE_VALIDATION config for full table validation

ba02957

- Add full table validation support for pyspark backend Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

Modify all builtin_checks for pyspark backend to support full table v…

eed319c

…alidation Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

cosmicBboy reviewed May 14, 2024

View reviewed changes

zaheerabbas21 added 3 commits May 18, 2024 23:36

Use the same function signature for builtin_checks across backends

cd2f76c

- Remove unused decorators Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

Add full_table_validation as a flag to config context func

1f952d1

- Will help to use the flag in backend validate functions Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

Fix broken tests and add new tests for full_table_validation config

14513a6

- More tests to come for full_table_validation config for built_in_checks after adding support in pyspark backend Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>

Merge branch 'main' into feature/pyspark-full-table-validation

86a34a4

cosmicBboy marked this pull request as ready for review September 1, 2024 19:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for dropping invalid rows for pyspark backend #1639

Add support for dropping invalid rows for pyspark backend #1639

zaheerabbas21 commented May 12, 2024 •

edited

Loading

cosmicBboy May 14, 2024

zaheerabbas21 May 18, 2024

cosmicBboy May 18, 2024 •

edited

Loading

cosmicBboy commented May 14, 2024

zaheerabbas21 commented Jun 2, 2024 •

edited

Loading

cosmicBboy commented Jun 28, 2024

zaheerabbas21 commented Jun 29, 2024 •

edited

Loading

cosmicBboy commented Jun 30, 2024

zaheerabbas21 commented Jul 9, 2024 •

edited

Loading

codecov bot commented Sep 1, 2024 •

edited

Loading

Add support for dropping invalid rows for pyspark backend #1639

Are you sure you want to change the base?

Add support for dropping invalid rows for pyspark backend #1639

Conversation

zaheerabbas21 commented May 12, 2024 • edited Loading

cosmicBboy May 14, 2024

Choose a reason for hiding this comment

zaheerabbas21 May 18, 2024

Choose a reason for hiding this comment

cosmicBboy May 18, 2024 • edited Loading

Choose a reason for hiding this comment

cosmicBboy commented May 14, 2024

zaheerabbas21 commented Jun 2, 2024 • edited Loading

cosmicBboy commented Jun 28, 2024

zaheerabbas21 commented Jun 29, 2024 • edited Loading

cosmicBboy commented Jun 30, 2024

zaheerabbas21 commented Jul 9, 2024 • edited Loading

codecov bot commented Sep 1, 2024 • edited Loading

Codecov Report

zaheerabbas21 commented May 12, 2024 •

edited

Loading

cosmicBboy May 18, 2024 •

edited

Loading

zaheerabbas21 commented Jun 2, 2024 •

edited

Loading

zaheerabbas21 commented Jun 29, 2024 •

edited

Loading

zaheerabbas21 commented Jul 9, 2024 •

edited

Loading

codecov bot commented Sep 1, 2024 •

edited

Loading