
Add StructType and DDL extraction from Pandera schemas #1570

Merged

Conversation

filipeo2-mck
Contributor

@filipeo2-mck filipeo2-mck commented Apr 12, 2024

Using Pandera schemas for data quality checks is great, but we first need the PySpark DataFrame loaded correctly, with the column types we expect to validate later.
Relying on Spark's automatic `inferSchema = True` when loading data files (CSV and Parquet, for example) is not reliable, so this PR addresses that by allowing the extraction of a PySpark schema from existing Pandera schemas/models, in two ways:

  • A StructType object

  • A more compact/simple DDL-like schema:

    binary BINARY,byte TINYINT,text STRING

Both extractions above can be used to create or read files in Spark, as in these examples:

  1. Creating a dataframe:

    spark.createDataFrame([], schema)  # where `schema` is a StructType or a DDL-like string
  2. Reading an existing file:

    customSchema = StructType([
        StructField("IDGC", StringType(), True),        
        StructField("SEARCHNAME", StringType(), True),
        StructField("PRICE", DoubleType(), True)
    ])
    df = spark.read.load('/file.csv', format="csv", schema=customSchema)
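
As a rough, hypothetical sketch (not the PR's actual implementation or API), the compact DDL form above is just comma-joined "name TYPE" pairs; the function name and input shape here are illustrative only:

```python
# Hypothetical sketch: serialize column-name -> Spark SQL type-name pairs
# into the compact DDL form shown above. Illustrative only; the real
# extraction in this PR works from Pandera schema/model annotations.
def to_ddl(columns: dict) -> str:
    """Join "name TYPE" pairs with commas, mirroring Spark's DDL strings."""
    return ",".join(f"{name} {dtype}" for name, dtype in columns.items())

print(to_ddl({"binary": "BINARY", "byte": "TINYINT", "text": "STRING"}))
# -> binary BINARY,byte TINYINT,text STRING
```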

Specific tests were added for these, covering the most common scenarios/datatypes. The output of the unit test test_pyspark_read shows the difference between reading a sample CSV file with schema inference (non-deterministic) and using the approach enabled by this PR (deterministic):
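As a toy illustration (not Spark's actual inference logic) of why inferred types are not deterministic across files: the guessed type of a column flips as soon as the sampled data happens to contain one non-numeric value.

```python
# Toy type inference, illustrative only: the same logical column gets a
# different guessed type depending on the values present in the file,
# which is why declaring the schema up front is more reliable.
def infer_type(values):
    try:
        for v in values:
            int(v)
        return "INT"
    except ValueError:
        return "STRING"

print(infer_type(["1", "2", "3"]))    # -> INT
print(infer_type(["1", "2", "n/a"]))  # -> STRING: same column, new guess
```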

This PR tries to address both open issues: #1327 and #1434.

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>

codecov bot commented Apr 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.15%. Comparing base (4df61da) to head (f011ea7).
Report is 73 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1570       +/-   ##
===========================================
- Coverage   94.29%   83.15%   -11.14%     
===========================================
  Files          91      114       +23     
  Lines        7024     8505     +1481     
===========================================
+ Hits         6623     7072      +449     
- Misses        401     1433     +1032     


@filipeo2-mck
Contributor Author

filipeo2-mck commented Apr 12, 2024

Hey @cosmicBboy, not sure why the CI broke. I don't have permission to restart it from the failed run:

@filipeo2-mck filipeo2-mck marked this pull request as ready for review April 12, 2024 18:32
@cosmicBboy
Collaborator

@filipeo2-mck yeah I can manually restart these. Need to figure out why the hashes don't match... I see this fairly often with other PRs.

@cosmicBboy
Collaborator

@NeerajMalhotra-QB @jaskaransinghsidana would appreciate your review on this PR!

:returns: StructType object with current schema fields.
"""
fields = [
    StructField(column, self.columns[column]._dtype.type, True)
Contributor

Would this be able to handle a complex/nested struct type?

Contributor Author

@filipeo2-mck filipeo2-mck Apr 15, 2024

Yes, it does and I have just added a test for it.
When we get the annotation from the schema/model, the entire set of annotated pyspark.sql.types is copied to the StructType object output.

@AbhishekBhatiaQB AbhishekBhatiaQB Apr 22, 2024

Very minor comment 🙂

@filipeo2-mck I think it's better to access the public attribute dtype, since _dtype is a protected attribute and is meant to be used only inside the ColumnSchema class

Suggested change
StructField(column, self.columns[column]._dtype.type, True)
StructField(column, self.columns[column].dtype.type, True)

Contributor Author

Thanks for noting that, I just changed it :)

@jaskaransinghsidana
Contributor

> @NeerajMalhotra-QB @jaskaransinghsidana would appreciate your review on this PR!

LGTM!

@NeerajMalhotra-QB
Collaborator

> @NeerajMalhotra-QB @jaskaransinghsidana would appreciate your review on this PR!

Sorry @cosmicBboy, I missed the GitHub notification about this message. I will try to review it soon. Thanks

@NeerajMalhotra-QB
Collaborator

This looks great, @filipeo2-mck!
As discussed, please add negative and positive tests with dummy data to explain the situation you are fixing.

@filipeo2-mck
Contributor Author

Hi @NeerajMalhotra-QB!
The suggested test cases were added, showing the Pandera usage I'm trying to enable with this PR, along with negative test cases. A screenshot was also added to the PR description.
Thanks for your suggestions 👍

Collaborator

@NeerajMalhotra-QB NeerajMalhotra-QB left a comment

LGTM

@filipeo2-mck
Contributor Author

Hello @cosmicBboy! Approvals have been granted; I'd be happy if you could evaluate and/or merge it :)
Thank you!

@cosmicBboy
Collaborator

hey @filipeo2-mck would you mind rebasing on main? It should address the failing unit test

@filipeo2-mck
Contributor Author

> hey @filipeo2-mck would you mind rebasing on main? It should address the failing unit test

Done, @cosmicBboy, I hope everything is OK now :)
Thank you!

@cosmicBboy
Collaborator

Looks like a test is failing:
https://github.com/unionai-oss/pandera/actions/runs/8815578745/job/24197845359?pr=1570#step:15:1540

You can test this locally by running the nox test:

nox -db mamba --envdir .nox-mamba -s "tests(extra='pyspark', pydantic='2.3.0', python='3.10', pandas='2.2.0')"

(You need nox and mamba installed)

@filipeo2-mck
Contributor Author

> Looks like test is failing: https://github.com/unionai-oss/pandera/actions/runs/8815578745/job/24197845359?pr=1570#step:15:1540
>
> You can test this locally by running the nox test:
>
>     nox -db mamba --envdir .nox-mamba -s "tests(extra='pyspark', pydantic='2.3.0', python='3.10', pandas='2.2.0')"
>
> (You need nox and mamba installed)

Hi @cosmicBboy! Sorry for the delay.
I took a look at the CI run:

  • It looks like it's happening only on the Windows runners (the Linux and macOS jobs ran fine with this config).

  • The Windows task hung for over an hour and ended with a HADOOP_HOME unset error, probably an issue with the Spark installation.

  • I don't have a Windows machine to test it locally and, as I'm using pytest's tmp_path functionality to save the temporary file, I don't see what could be wrong with the PR code.

Would you mind rerunning the CI from the start once, just to make sure it's not a transient issue with the GitHub Windows runners, please?

return schema


def test_pyspark_read(schema_with_simple_datatypes, tmp_path, spark):
Collaborator

@filipeo2-mck can we mark this as skipped if the os is windows? Looks like we do this elsewhere in the tests

@pytest.mark.skipif(
    platform.system() == "Windows",
    reason="skipping due to issues with opening file names for temp files.",
)

Contributor Author

Done! Thank you for checking it. The Windows jobs are running now.

@cosmicBboy
Collaborator

Thanks for the contribution @filipeo2-mck !

@cosmicBboy cosmicBboy merged commit cf09ae2 into unionai-oss:main Apr 27, 2024
67 of 68 checks passed