Series[pd.Int64Dtype] cannot coerce columns with mixed string/int values #1037

Open
3 tasks done
dantheand opened this issue Nov 29, 2022 · 0 comments
Labels
bug Something isn't working

Describe the bug
It appears that Series[pd.Int64Dtype] cannot coerce columns with mixed string/int values. Combined with the fact that Series[int] cannot coerce nullable values, this means there is currently no integer column dtype that can coerce a column containing both mixed dtypes and null values.

It makes sense that Series[int] doesn't work with null values, since Python's int apparently maps to a non-nullable dtype.

I documented this bug and a workaround using Series[pd.Float64Dtype] in the test below.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
import pandera as pa
import pytest

def test_coerce_type_bug():
    """This test documents a bug in pandera coercion methods and provides workarounds"""

    class SchemaNullableIntType(pa.SchemaModel):
        col: pa.typing.Series[pd.Int64Dtype] = pa.Field(nullable=True)

        class Config:
            coerce = True

    class SchemaIntType(pa.SchemaModel):
        col: pa.typing.Series[int] = pa.Field(nullable=True)

        class Config:
            coerce = True

    class SchemaNullableFloat(pa.SchemaModel):
        col: pa.typing.Series[pd.Float64Dtype] = pa.Field(nullable=True)

        class Config:
            coerce = True

    mixed_type_df = pd.DataFrame({"col": [1, "1"]})
    nans_df = pd.DataFrame({"col": [1, None]})
    mixed_w_nans_df = pd.DataFrame({"col": [1, "1", None]})

    # Test mixed types
    with pytest.raises(pa.errors.SchemaError):
        # Nullable int type fails with mixed dtypes.
        SchemaNullableIntType.validate(mixed_type_df)
    coerced_df = SchemaIntType.validate(mixed_type_df)
    assert mixed_type_df.dtypes["col"] == np.dtype(object)
    assert coerced_df.dtypes["col"] == np.dtype(np.int64), "Non-nullable int type can coerce mixed types."

    # Test null types
    with pytest.raises(pa.errors.SchemaError):
        # Non-nullable int type cannot coerce nan values
        SchemaIntType.validate(nans_df)
    coerced_df = SchemaNullableIntType.validate(nans_df)
    assert nans_df.dtypes["col"] == np.dtype(np.float64)
    assert coerced_df.dtypes["col"] == pd.Int64Dtype(), "Nullable int type can coerce nan values."

    # Test null and mixed
    with pytest.raises(pa.errors.SchemaError):
        # Non-nullable int type cannot coerce mixed dtypes + nan values.
        SchemaIntType.validate(mixed_w_nans_df)
    with pytest.raises(pa.errors.SchemaError):
        # Nullable int type cannot coerce mixed dtypes + nan values.
        SchemaNullableIntType.validate(mixed_w_nans_df)

    # Offer float type for mixed values
    coerced_df = SchemaNullableFloat.validate(mixed_w_nans_df)
    assert mixed_w_nans_df.dtypes["col"] == np.dtype(object)
    assert coerced_df.dtypes["col"] == pd.Float64Dtype(), "If you have integer-valued columns with null and mixed dtypes, you currently need to use the pd.Float64Dtype"

Expected behavior

I expected a SchemaModel with a column of Series[pd.Int64Dtype] to be able to coerce both:

  • a column with [1, "1"]
  • a column with [1, "1", None]
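
For comparison, here is a minimal pandas-only sketch of the coercion I would expect for those two columns. It does not use pandera at all, and `coerce_to_nullable_int` is a hypothetical helper name made up for illustration. It assumes every non-null value is an integer or an integer-valued numeric string; anything else would raise.

```python
import pandas as pd


def coerce_to_nullable_int(series: pd.Series) -> pd.Series:
    """Hypothetical helper: coerce mixed int/str/None values to nullable Int64."""
    # Parse numeric strings first; None becomes NaN and the result is
    # int64 or float64 depending on whether nulls are present.
    numeric = pd.to_numeric(series, errors="raise")
    # Cast to the nullable integer dtype; NaN becomes <NA>.
    return numeric.astype("Int64")


mixed = pd.Series([1, "1"], dtype=object)
mixed_w_nans = pd.Series([1, "1", None], dtype=object)

coerced_mixed = coerce_to_nullable_int(mixed)
coerced_mixed_w_nans = coerce_to_nullable_int(mixed_w_nans)
```

Running something like this on the column before validation might serve as an alternative to the Float64Dtype workaround, at the cost of doing the coercion outside the schema.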

Desktop (please complete the following information):

  • OS: macOS 12.4
  • Browser: Chrome