Add support for subclassing Series in DataFrameModels #1092

weery · 2023-02-16T09:38:24Z

Is your feature request related to a problem? Please describe.
I like being able to define my required data quality and feature behaviour directly in Pandera schema classes, so that all the logic for how my features should behave lives in one schema class. For example, I want to define a schema class whose attributes can be separated into groups based on how they were defined:

class TestSchema(pa.DataFrameModel):
    # Feature group 1
    feature_1_g1: Series[int] = pa.Field()
    feature_2_g1: Series[int] = pa.Field()

    # Feature group 2
    feature_1_g2: Series[int] = pa.Field()
    feature_2_g2: Series[int] = pa.Field()

where I would like to be able to separate these features without hardcoding their names. The grouping cannot always be inferred directly from e.g. the annotation.

Describe the solution you'd like
One way to solve this would be to allow defining a schema class like this:

class FeatureGroup1Series(Series, Generic[GenericDtype]):
    pass

class FeatureGroup2Series(Series, Generic[GenericDtype]):
    pass

class TestSchema(pa.DataFrameModel):
    # Feature group 1
    feature_1_g1: FeatureGroup1Series[int] = pa.Field()
    feature_2_g1: FeatureGroup1Series[int] = pa.Field()

    # Feature group 2
    feature_1_g2: FeatureGroup2Series[int] = pa.Field()
    feature_2_g2: FeatureGroup2Series[int] = pa.Field()

Then it should be simple to separate the different attributes by looking at the annotations.
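
To illustrate what "looking at the annotations" could mean, here is a minimal sketch using only the standard library. The `Series` class below is a plain stand-in, not `pandera.typing.Series`, and `group_by_series_type` is a hypothetical helper; with the real pandera classes this would still run into the `to_schema()` limitation this issue describes.

```python
from collections import defaultdict
from typing import Generic, TypeVar, get_origin, get_type_hints

T = TypeVar("T")


class Series(Generic[T]):
    """Stand-in for pandera.typing.Series (assumption, not the real class)."""


class FeatureGroup1Series(Series[T]):
    pass


class FeatureGroup2Series(Series[T]):
    pass


class TestSchema:
    # Feature group 1
    feature_1_g1: FeatureGroup1Series[int]
    feature_2_g1: FeatureGroup1Series[int]

    # Feature group 2
    feature_1_g2: FeatureGroup2Series[int]
    feature_2_g2: FeatureGroup2Series[int]


def group_by_series_type(model):
    """Map each Series subclass to the attribute names annotated with it."""
    groups = defaultdict(list)
    for name, hint in get_type_hints(model).items():
        # For FeatureGroup1Series[int], get_origin returns FeatureGroup1Series.
        origin = get_origin(hint)
        if isinstance(origin, type) and issubclass(origin, Series):
            groups[origin].append(name)
    return dict(groups)


groups = group_by_series_type(TestSchema)
```

The grouping key is the annotation's origin class, so no attribute names need to be hardcoded.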

The current problem with this solution is that TestSchema.to_schema() no longer works, as it requires every column to be annotated with Series[T] rather than a subclass of Series.

Describe alternatives you've considered
I've considered adding more information to e.g. a field object, but have not come up with a good solution.

Additional context
I'll put up a PR with a potential solution.


nathanjmcdougall · 2023-08-05T11:05:30Z

The way I like to do something like this is to use inheritance, i.e.

class Group1Model(pa.DataFrameModel):
    feature_1_g1: Series[int] = pa.Field()
    feature_2_g1: Series[int] = pa.Field()

class Group2Model(pa.DataFrameModel):
    feature_1_g2: Series[int] = pa.Field()
    feature_2_g2: Series[int] = pa.Field()

class TestSchema(Group1Model, Group2Model):
    pass

weery added the enhancement New feature or request label Feb 16, 2023

weery mentioned this issue Feb 16, 2023

Allow subclassed series when using to_schema #1093

Open
