Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parametrized type annotations are broken for polars DataFrameModels #1580

Closed
2 of 3 tasks
r-bar opened this issue Apr 16, 2024 · 1 comment · Fixed by #1596
Closed
2 of 3 tasks

Parametrized type annotations are broken for polars DataFrameModels #1580

r-bar opened this issue Apr 16, 2024 · 1 comment · Fixed by #1596
Labels
bug Something isn't working

Comments

@r-bar
Copy link

r-bar commented Apr 16, 2024

Description

Pandera DataFrameModels do not support parameterized types for polars, while DataFrameSchemas do.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Example

Here is an example of a working DataFrameSchema and several variations of broken DataFrameModels.

from typing import Annotated

import pandera.polars as pa
import polars as pl

from pandera.typing import Series
from pandera.errors import SchemaInitError


df = pl.DataFrame({
    "id": [1, 2, 3],
    "lists": [["a"], ["a", "b"], ["a", "b", "c"]],
})


# works!
schema = pa.DataFrameSchema(
    columns={
        "id": pa.Column(int),
        "lists": pa.Column(list[str]),
    }
)
schema.validate(df)
print("DataFrameSchema validation passed")


class Lists(pa.DataFrameModel):
    """Most basic, expected form given the working schema above."""
    id: int
    lists: list[str]


try:
    Lists.validate(df)
except SchemaInitError as e:
    print("\nLists validation failed")
    print(e)
else:
    print("\nLists validation passed")


class ListsSeries(pa.DataFrameModel):
    """Using series as a wrapper around basic data types like the id column here
    will not work. Examples of this appear in the pandera documentation.
    https://pandera.readthedocs.io/en/latest/dataframe_models.html#dtype-aliases
    """
    id: Series[int]
    lists: Series[list[str]]


try:
    ListsSeries.validate(df)
except SchemaInitError as e:
    print("\nListsSeries validation failed")
    print(e)
else:
    print("\nListsSeries validation passed")


class AlternateListsSeries(pa.DataFrameModel):
    """Demonstrating using Series as a type wrapper around only lists to avoid
    the initialization error on id."""
    id: int
    lists: Series[list[str]]


try:
    AlternateListsSeries.validate(df)
except SchemaInitError as e:
    print("\nAlternateListsSeries validation failed")
    print(e)
else:
    print("\nAlternateListsSeries validation passed")


class ListsAnnotated(pa.DataFrameModel):
    """Parameterized form using Annotated as suggested at
    https://pandera.readthedocs.io/en/latest/polars.html#nested-types
    """
    id: int
    lists: Series[Annotated[list, str]]


try:
    ListsAnnotated.validate(df)
except TypeError as e:
    print("\nListsAnnotated validation failed")
    print(e)
else:
    print("\nListsAnnotated validation passed")


class ListsAnnotatedStr(pa.DataFrameModel):
    """Alternate parameterized form using Annotated as seen in the examples here:
    https://pandera.readthedocs.io/en/latest/dataframe_models.html#annotated
    """
    id: int
    lists: Series[Annotated[list, "str"]]

try:
    ListsAnnotatedStr.validate(df)
except TypeError as e:
    print("\nListsAnnotatedStr validation failed")
    print(e)
else:
    print("\nListsAnnotatedStr validation passed")

When run with the following python / library versions:

  • python==3.11.9
  • polars==0.20.19
  • pandera[polars]==0.19.0b1

the above script produces:

DataFrameSchema validation passed

Lists validation failed
Invalid annotation 'lists: list[str]'

ListsSeries validation failed
Invalid annotation 'id: pandera.typing.pandas.Series[int]'

AlternateListsSeries validation failed
Invalid annotation 'lists: pandera.typing.pandas.Series[list[str]]'

ListsAnnotated validation failed
Annotation 'Annotated' requires all positional arguments ['args', 'kwargs'].

ListsAnnotatedStr validation failed
Annotation 'Annotated' requires all positional arguments ['args', 'kwargs'].

Expected behavior

I would expect any column types that are valid to pass to DataFrameSchema's constructor to also be valid as annotations for DataFrameModel.

Desktop (please complete the following information):

  • OS: MacOS 14.4.1 (M2 Max)
  • Browser: Chrome
  • Version: 0.19.0b1
@r-bar r-bar added the bug Something isn't working label Apr 16, 2024
@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Apr 22, 2024

Thanks for reporting this @r-bar FYI Series[Type] annotations is currently not supported in the polars API, see #1588 and ongoing discussion here: #1594.

Looking into this, planning on supporting:

class Lists(pa.DataFrameModel):
    """Most basic, expected form given the working schema above."""
    id: int
    lists: list[str]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants