Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise SchemaInitError when for unique field is specified in pyspark field and a test showing the issue as well as config when it works. #1592

Merged
merged 3 commits into from
Apr 28, 2024

Conversation

zippeurfou
Copy link
Contributor

Following #1344 I am adding a bit of edited documentation.
I wasn't able to raise SchemaInitError as @cosmicBboy suggested as it turns out that if you use pa.Field(unique=True) it is seen as None which most likely is the issue.
In the follow up screenshot you can see the behavior when I did add the code where @cosmicBboy suggested.
Screenshot 2024-04-19 at 5 39 04 PM
As the ghost text show it is None when I put a breakpoint there so I did not add it.

@cosmicBboy
Copy link
Collaborator

if you use pa.Field(unique=True) it is seen as None which most likely is the issue.

How is this the case? Specifying pa.Field(unique=True) should translate to unique is True when the DataFrameModel is translated into a DataFrameSchema

(Also see here to signing your commits for the DCO check)

@cosmicBboy
Copy link
Collaborator

Okay so there are two issues here:

  1. specifying unique at the dataframe-level
import pandera.pyspark as pa

class Model(pa.DataFrameModel):
    class Config:
        unique = ["col1", "col2"]  # col1 and col2 should be jointly unique
  1. specifying pa.Field(unique=True) at the column level.
class Model(pa.DataFrameModel):
    col: int = pa.Field(unique=True)  # values in col need to be unique

For 1 the SchemaInitError here still makes sense: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/container.py#L148

For 2, a SchemaInitError here needs to be added: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/model_components.py#L184. This is because the underlying Column definition doesn't even support the unique argument: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pyspark/components.py#L15-L27

@zippeurfou
Copy link
Contributor Author

zippeurfou commented Apr 22, 2024

Thanks @cosmicBboy,
Let me rephrase things a bit:
For 1 with config it does work as you expressed. I added a unit test with only one column but I can add a second one with 2 columns.
For 2, I can add it where you mentioned. I hadn't had the time to look too much at how internal works so I appreciate the direction.
My guess is given 1 was implemented and works as expected, 2 should not be impossible to implement but right now I don't have the bandwidth to do it sadly as I would need extra time to understand the internal of the library.
For the DCO I will try to do it as well when I have time. I appreciate the direction.

…showing the issue as well as config when it works.

Signed-off-by: Marc Ferradou <mferradou@RWG7599T39.grubhub.local>
Signed-off-by: zippeurfou <zippeurfou@gmail.com>
@zippeurfou
Copy link
Contributor Author

@cosmicBboy updated the PR according to my understanding.

@zippeurfou
Copy link
Contributor Author

I am not sure why the linter didn't execute here.

@@ -177,6 +178,10 @@ def Field(
else:
check_ = check_constructor(arg_value, **check_kwargs)
checks.append(check_)
if unique is True:
Copy link
Collaborator

@cosmicBboy cosmicBboy Apr 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this can just be if unique:

Copy link

codecov bot commented Apr 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.16%. Comparing base (4df61da) to head (4fffae2).
Report is 75 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1592       +/-   ##
===========================================
- Coverage   94.29%   83.16%   -11.14%     
===========================================
  Files          91      114       +23     
  Lines        7024     8504     +1480     
===========================================
+ Hits         6623     7072      +449     
- Misses        401     1432     +1031     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cosmicBboy
Copy link
Collaborator

Thanks @zippeurfou and congrats on your first contribution to pandera! 🚀

@cosmicBboy cosmicBboy merged commit dbf1831 into unionai-oss:main Apr 28, 2024
67 of 68 checks passed
@cosmicBboy cosmicBboy changed the title Adding warning for unique in pyspark field and a test showing the issue as well as config when it works. Raise SchemaInitError when for unique field is specified in pyspark field and a test showing the issue as well as config when it works. May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants