Add StructType and DDL extraction from Pandera schemas #1570
Conversation
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main    #1570       +/-   ##
===========================================
- Coverage   94.29%   83.15%   -11.14%
===========================================
  Files          91      114       +23
  Lines        7024     8505     +1481
===========================================
+ Hits         6623     7072      +449
- Misses        401     1433     +1032
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
@filipeo2-mck yeah I can manually restart these. Need to figure out why the hashes don't match... I see this fairly often with other PRs.
@NeerajMalhotra-QB @jaskaransinghsidana would appreciate your review on this PR!
pandera/api/pyspark/container.py (outdated)

```python
:returns: StructType object with current schema fields.
"""
fields = [
    StructField(column, self.columns[column]._dtype.type, True)
```
Would this be able to handle a complex/nested struct?
Yes, it does, and I have just added a test for it. When we get the annotation from the schema/model, the entire set of annotated `pyspark.sql.types` is copied to the `StructType` object output.
Very minor comment 🙂 @filipeo2-mck I think it's better to access the public attribute `dtype`, since `_dtype` is a protected attribute and meant to be used inside the `ColumnSchema` class only.
Suggested change:

```diff
- StructField(column, self.columns[column]._dtype.type, True)
+ StructField(column, self.columns[column].dtype.type, True)
```
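The convention being pointed out here can be sketched in plain Python (a minimal illustrative stand-in, not pandera's actual `ColumnSchema`): the underscore-prefixed attribute is internal state by convention, and a public `dtype` property is the supported access path:

```python
class ColumnSchema:
    """Minimal stand-in showing a protected attribute with a public accessor."""

    def __init__(self, dtype):
        self._dtype = dtype  # protected: internal use only, by convention

    @property
    def dtype(self):
        # Public read-only accessor; code outside the class should use this.
        return self._dtype


col = ColumnSchema("string")
print(col.dtype)  # prints "string"
```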
Thanks for noting that, I just changed it :)
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
LGTM!
sorry @cosmicBboy, I missed the GitHub notification about this message. I will try to review it soon. Thanks
this looks great, @filipeo2-mck
…rk_schema_generation Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Hi @NeerajMalhotra-QB!
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
…rk_schema_generation Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
LGTM
Hello @cosmicBboy! Approvals were granted, happy if you can evaluate and/or merge it :)
hey @filipeo2-mck would you mind rebasing on
…rk_schema_generation Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
…mck/pandera into feat/pyspark_schema_generation Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Done, @cosmicBboy, I hope that everything is OK now :)
Looks like a test is failing. You can test this locally by running the nox test:
(You need
…rk_schema_generation Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Hi @cosmicBboy! Sorry for the delay.
Would you mind rerunning the CI from the start one more time, just to make sure it's not a transient issue with the GH Windows runners, please?
```python
    return schema


def test_pyspark_read(schema_with_simple_datatypes, tmp_path, spark):
```
@filipeo2-mck can we mark this as skipped if the OS is Windows? Looks like we do this elsewhere in the tests:
```python
@pytest.mark.skipif(
    platform.system() == "Windows",
    reason="skipping due to issues with opening file names for temp files.",
)
```
Done! Thank you for checking it. The Windows jobs are running now.
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Thanks for the contribution @filipeo2-mck!
Using Pandera schemas for data quality checks is great, but we need to get the PySpark DataFrame loaded correctly first, with the column types we expect to validate later.
Relying on Spark's automatic `inferSchema = True` when loading data files (CSV and Parquet, for example) is not reliable, so this PR addresses that by allowing the extraction of a PySpark schema from existing Pandera schemas/models in two ways: a `StructType` object, or a more compact/simple DDL-like schema string.
Both extractions above can be used to create or read files in Spark, as in these examples:
Creating a dataframe:
Reading an existing file:
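As a hedged sketch of that usage (the extraction method names and column mapping below are illustrative assumptions, not confirmed API from this PR; the Spark calls are left as comments because they require a running SparkSession), a DDL-like schema is just a comma-separated list of `name type` pairs:

```python
# Hypothetical stand-in for the columns of a Pandera schema and the
# DDL string its extraction would produce; names/types are illustrative.
columns = {"name": "STRING", "age": "INT", "score": "DOUBLE"}

# A DDL-like schema is a comma-separated list of "column type" pairs.
ddl = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
print(ddl)  # prints "name STRING, age INT, score DOUBLE"

# With a SparkSession available, both extractions could then be used like:
#   df = spark.createDataFrame(data, schema=pandera_schema_as_structtype)
#   df = spark.read.schema(ddl).csv("data.csv")  # deterministic, no inference
```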
Specific tests were added for these, covering the most common scenarios/datatypes. The output of the unit test `test_pyspark_read` shows the difference between reading a sample CSV file with schema inference (non-deterministic) and using the approach enabled by this PR (deterministic).
This PR addresses both open issues: #1327 and #1434.