Raise SchemaInitError when for unique field is specified in pyspark field and a test showing the issue as well as config when it works. #1592

zippeurfou · 2024-04-19T22:03:45Z

Following #1344, I am adding some edited documentation.
I wasn't able to raise SchemaInitError as @cosmicBboy suggested: it turns out that when you use pa.Field(unique=True), the value is seen as None at that point, which is most likely the issue.
The screenshot that follows shows the behavior when I added the code where @cosmicBboy suggested.

As the ghost text shows, the value is None when I put a breakpoint there, so I did not add the check.

cosmicBboy · 2024-04-21T19:55:55Z

"if you use pa.Field(unique=True) it is seen as None which most likely is the issue."

How is this the case? Specifying pa.Field(unique=True) should translate to unique being True when the DataFrameModel is translated into a DataFrameSchema.

(Also, see here for how to sign your commits for the DCO check.)

cosmicBboy · 2024-04-21T20:21:04Z

Okay so there are two issues here:

1. Specifying unique at the dataframe level:

import pandera.pyspark as pa

class Model(pa.DataFrameModel):
    class Config:
        unique = ["col1", "col2"]  # col1 and col2 should be jointly unique

2. Specifying pa.Field(unique=True) at the column level:

class Model(pa.DataFrameModel):
    col: int = pa.Field(unique=True)  # values in col need to be unique

For 1 the SchemaInitError here still makes sense: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/container.py#L148

For 2, a SchemaInitError here needs to be added: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/model_components.py#L184. This is because the underlying Column definition doesn't even support the unique argument: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/components.py#L15-L27
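
As a rough illustration of the guard requested for case 2, here is a minimal standalone sketch. It does not import pandera: `SchemaInitError` below is a local stand-in for the real exception class, and `field`, its parameters, and the error message are hypothetical simplifications, not the library's actual `Field` signature or wording.

```python
class SchemaInitError(Exception):
    """Local stand-in for pandera's SchemaInitError (assumption: not the real class)."""


def field(*, nullable: bool = False, unique: bool = False) -> dict:
    """Hypothetical, simplified stand-in for a pyspark Field factory."""
    if unique:
        # Fail at schema-definition time instead of silently ignoring the flag.
        raise SchemaInitError(
            "unique is not supported at the column level for pyspark"
        )
    return {"nullable": nullable}


# Defining a field with unique=True fails immediately.
try:
    field(unique=True)
except SchemaInitError as exc:
    message = str(exc)
else:
    message = None

# A field without unique is built as usual.
plain = field(nullable=True)
```

The point of raising in the field factory is that the user sees the problem when the model is defined, rather than discovering later that the uniqueness constraint was never checked.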

zippeurfou · 2024-04-22T19:54:47Z

Thanks @cosmicBboy,
Let me rephrase things a bit:
For 1, the Config approach does work as you described. I added a unit test with only one column, but I can add a second test with two columns.
For 2, I can add it where you mentioned. I haven't had time to look closely at how the internals work, so I appreciate the direction.
My guess is that, given 1 was implemented and works as expected, 2 should not be impossible to implement, but sadly I don't have the bandwidth right now, as I would need extra time to understand the internals of the library.
For the DCO, I will try to do it as well when I have time. I appreciate the direction.

zippeurfou · 2024-04-23T16:31:25Z

@cosmicBboy I updated the PR according to my understanding.

zippeurfou · 2024-04-25T15:06:44Z

I am not sure why the linter didn't execute here.

cosmicBboy · 2024-04-27T01:52:25Z

pandera/api/pyspark/model_components.py

@@ -177,6 +178,10 @@ def Field(
        else:
            check_ = check_constructor(arg_value, **check_kwargs)
        checks.append(check_)
+    if unique is True:


nit: this can just be if unique:
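
To make the nit concrete, here is a small illustration in plain Python (not pandera code; the helper names are hypothetical). It compares the identity check `unique is True` with the truthiness check `if unique:` for a few candidate values.

```python
def strict_check(unique) -> bool:
    # Mirrors `if unique is True:`; only the bool singleton True passes.
    return unique is True


def truthy_check(unique) -> bool:
    # Mirrors `if unique:`; any truthy value passes.
    return bool(unique)


candidates = [True, False, None, 1, ["col1"]]

# Map each candidate's repr to (strict result, truthy result).
results = {repr(v): (strict_check(v), truthy_check(v)) for v in candidates}

for value, (strict, truthy) in results.items():
    print(f"{value:>10}: is True -> {strict}, truthy -> {truthy}")
```

The two checks agree for True, False and None, and differ only for other truthy values such as 1 or a non-empty list, which the shorter form also catches.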

codecov · 2024-04-27T03:45:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.16%. Comparing base (4df61da) to head (4fffae2).
Report is 75 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1592       +/-   ##
===========================================
- Coverage   94.29%   83.16%   -11.14%     
===========================================
  Files          91      114       +23     
  Lines        7024     8504     +1480     
===========================================
+ Hits         6623     7072      +449     
- Misses        401     1432     +1031

cosmicBboy · 2024-04-28T02:16:34Z

Thanks @zippeurfou and congrats on your first contribution to pandera! 🚀

docs(pyspark): Adding warning for unique in pyspark field and a test …

349f65d

…showing the issue as well as config when it works. Signed-off-by: Marc Ferradou <mferradou@RWG7599T39.grubhub.local> Signed-off-by: zippeurfou <zippeurfou@gmail.com>

zippeurfou force-pushed the pyspark_unique branch from 6e25dfb to 349f65d Compare April 23, 2024 16:29

Merge branch 'main' into pyspark_unique

2f9f685

cosmicBboy reviewed Apr 27, 2024

cosmicBboy approved these changes Apr 27, 2024

cosmicBboy reviewed Apr 27, 2024

Update pandera/api/pyspark/model_components.py

4fffae2

cosmicBboy approved these changes Apr 28, 2024

cosmicBboy merged commit dbf1831 into unionai-oss:main Apr 28, 2024
67 of 68 checks passed

cosmicBboy changed the title ~~Adding warning for unique in pyspark field and a test showing the issue as well as config when it works.~~ Raise SchemaInitError when for unique field is specified in pyspark field and a test showing the issue as well as config when it works. May 7, 2024

cosmicBboy mentioned this pull request May 7, 2024

unique Field argument not yet implemented for pyspark #1624

Closed
