Enhanced Validation Reporting for PySpark DataFrames in Pandera #1540
Comments
We are also interested in this enhancement
We are waiting on this enhancement for PySpark DataFrames, to filter invalid records after schema validation
I am also interested in this enhancement
I am interested as well!
This seems like a useful feature! I'd support this effort. @NeerajMalhotra-QB @jaskaransinghsidana @filipeo2-mck any thoughts on this? It would incur significant compute cost, since right now the pyspark check queries are limited to the first invalid value: https://github.com/unionai-oss/pandera/blob/main/pandera/backends/pyspark/builtin_checks.py A couple of thoughts:
Okay, so high-level steps for this issue:
To avoid further complexity we shouldn't support randomly sampling data values; we either do full-table validation or single-value validation (the current pyspark behavior)
Happy to help review and take a PR over the finish line if someone wants to take this task on @zaheerabbas-prodigal
Thanks for laying out the high-level details needed for this feature @cosmicBboy, I'll be happy to take this up 😄
Hey, this feature would be really, really nice. I can easily sell my Databricks team on Pandera if this were to be implemented. Are you still working on this, or should I not get my hopes up?
Is your feature request related to a problem? Please describe.
Hello, I am new to PySpark and data engineering in general. I am looking to validate a PySpark DataFrame against a schema, and came across Pandera, which suits my needs best.
Currently, as I understand it, because of the distributed nature of Spark, all Pandera validation errors are compiled into Pandera's error report. But these error reports do not show which records are invalid.
I am looking for a way to run validation of a PySpark DataFrame through Pandera and get the indices of invalid rows, so that I can post-process them, dump them into a separate corrupt-records database, or at least drop the invalid rows.
There is a Drop Invalid Rows feature, but it only supports pandas for now, if my understanding is correct. When I add `drop_invalid_rows`, it throws a `BackendNotFound` exception. I also came across this Stack Overflow response, but I believe it is also only supported for pandas.
Describe the solution you'd like
Get the indices of invalid rows after running the Pandera `validate()` function.
Describe alternatives you've considered
Drop invalid rows in a PySpark DataFrame that do not match the defined schema. This is supported for pandas but not for PySpark DataFrames, as per my experimentation.
Additional context
NA