Add support for subclassing Series in DataFrameModels #1092

Open
weery opened this issue Feb 16, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

weery commented Feb 16, 2023

Is your feature request related to a problem? Please describe.
I like that I can define both data-quality requirements and feature behaviour directly in pandera schema classes, so that all the logic for how my features should behave lives in one schema class. In particular, I want a schema class whose attributes can be separated into groups based on how they were defined. For example:

class TestSchema(pa.DataFrameModel):
    # Feature group 1
    feature_1_g1: Series[int] = pa.Field()
    feature_2_g1: Series[int] = pa.Field()

    # Feature group 2
    feature_1_g2: Series[int] = pa.Field()
    feature_2_g2: Series[int] = pa.Field()

where I would like to be able to separate these features into their groups without hardcoding their names. The grouping cannot always be inferred directly from, e.g., the annotation.

Describe the solution you'd like
One way to solve this would be to allow columns in a schema class to be annotated with subclasses of Series, e.g.

class FeatureGroup1Series(Series, Generic[GenericDtype]):
    pass

class FeatureGroup2Series(Series, Generic[GenericDtype]):
    pass

class TestSchema(pa.DataFrameModel):
    # Feature group 1
    feature_1_g1: FeatureGroup1Series[int] = pa.Field()
    feature_2_g1: FeatureGroup1Series[int] = pa.Field()

    # Feature group 2
    feature_1_g2: FeatureGroup2Series[int] = pa.Field()
    feature_2_g2: FeatureGroup2Series[int] = pa.Field()

Then it should be simple to separate the different attributes by looking at the annotations.

The current problem with this solution is that TestSchema.to_schema() no longer works, because it requires every column to be annotated with Series[T] itself rather than with a subclass of Series.
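As a rough illustration of the annotation-based grouping described above, here is a minimal sketch in plain Python. It deliberately uses stand-in classes rather than pandera (the `Series` below is a hypothetical placeholder, not `pandera.typing.Series`), and the helper name `group_by_annotation` is made up for this example. It only shows that the grouping is recoverable from the annotations; it says nothing about how `to_schema()` would need to change.

```python
from collections import defaultdict
from typing import Generic, TypeVar, get_origin, get_type_hints

T = TypeVar("T")


# Stand-in for illustration only; this is NOT pandera.typing.Series.
class Series(Generic[T]):
    pass


class FeatureGroup1Series(Series[T]):
    pass


class FeatureGroup2Series(Series[T]):
    pass


# Plain class standing in for a DataFrameModel (assumption for this sketch).
class TestSchema:
    feature_1_g1: FeatureGroup1Series[int]
    feature_2_g1: FeatureGroup1Series[int]
    feature_1_g2: FeatureGroup2Series[int]
    feature_2_g2: FeatureGroup2Series[int]


def group_by_annotation(model):
    """Map each Series subclass to the attribute names annotated with it."""
    groups = defaultdict(list)
    for name, annotation in get_type_hints(model).items():
        # FeatureGroup1Series[int] has origin FeatureGroup1Series.
        origin = get_origin(annotation)
        if isinstance(origin, type) and issubclass(origin, Series):
            groups[origin].append(name)
    return dict(groups)


groups = group_by_annotation(TestSchema)
```

With these stand-ins, `groups[FeatureGroup1Series]` holds the two `_g1` attribute names and `groups[FeatureGroup2Series]` the two `_g2` names, without any name being hardcoded in the helper.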

Describe alternatives you've considered
I've considered attaching more information to, e.g., the Field object, but have not come up with a good solution.

Additional context
I'll put up a PR with a potential solution.

@nathanjmcdougall
Contributor

I like to handle this kind of grouping with inheritance, i.e.

class Group1Model(pa.DataFrameModel):
    feature_1_g1: Series[int] = pa.Field()
    feature_2_g1: Series[int] = pa.Field()

class Group2Model(pa.DataFrameModel):
    feature_1_g2: Series[int] = pa.Field()
    feature_2_g2: Series[int] = pa.Field()

# The combined schema inherits the columns of both groups.
class TestSchema(Group1Model, Group2Model):
    pass
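To show how the groups could be recovered with this inheritance layout, here is a minimal plain-Python sketch. The classes are stand-ins rather than real pandera models, the helper name `fields_by_base` is hypothetical, and it assumes each group model declares its fields directly and has no annotated parents of its own.

```python
from typing import get_type_hints


# Plain-Python stand-ins for illustration; these are NOT pandera models.
class Group1Model:
    feature_1_g1: int
    feature_2_g1: int


class Group2Model:
    feature_1_g2: int
    feature_2_g2: int


class TestSchema(Group1Model, Group2Model):
    pass


def fields_by_base(model):
    """Map each direct base class name to the field names it declares."""
    return {
        base.__name__: list(get_type_hints(base))
        for base in model.__bases__
    }


field_groups = fields_by_base(TestSchema)
```

Here the group membership comes from which base class declared a field, so no extra annotation type is needed. The trade-off is that a field can belong to only one group per base class.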
