BUG?: creating Categorical from pandas Index/Series with "object" dtype infers string #61778

Open
@jorisvandenbossche

When creating a pandas Series/Index/DataFrame, I think we generally differentiate between passing a pandas object with object dtype and a numpy array with object dtype:

>>> pd.options.future.infer_string = True
>>> pd.Index(pd.Series(["foo", "bar", "baz"], dtype="object"))
Index(['foo', 'bar', 'baz'], dtype='object')
>>> pd.Index(np.array(["foo", "bar", "baz"], dtype="object"))
Index(['foo', 'bar', 'baz'], dtype='str')

So for pandas objects we preserve the dtype, whereas for numpy arrays of object dtype we essentially treat the input as a sequence of Python objects and infer the dtype (@jbrockmendel is that also your understanding?)

But for categorical that doesn't seem to happen:

>>> pd.options.future.infer_string = True
>>> pd.Categorical(pd.Series(["foo", "bar", "baz"], dtype="object"))
['foo', 'bar', 'baz']
Categories (3, str): [bar, baz, foo]   # <--- categories inferred as str

So do we want to preserve the dtype for the categories here as well?
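For anyone who wants to check this on their own install, here is a minimal sketch that collects the resulting dtypes programmatically instead of reading them off the repr. The helper names `report_dtypes` and `categorical_with_object_categories` are made up for illustration. The second helper is only a possible workaround: it assumes that an explicitly passed object-dtype `Index` of categories is kept as-is, which is not something confirmed in this issue.

```python
import numpy as np
import pandas as pd


def report_dtypes():
    """Return the resulting dtype for each constructor/input combination.

    Uses whatever value pd.options.future.infer_string currently has,
    so set that option first to reproduce the behaviour described above.
    """
    obj_series = pd.Series(["foo", "bar", "baz"], dtype="object")
    obj_array = np.array(["foo", "bar", "baz"], dtype="object")
    return {
        "Index(Series[object])": pd.Index(obj_series).dtype,
        "Index(ndarray[object])": pd.Index(obj_array).dtype,
        "Categorical(Series[object]).categories": pd.Categorical(
            obj_series
        ).categories.dtype,
    }


def categorical_with_object_categories(values):
    """Hypothetical workaround: build the categories as an object-dtype Index.

    Assumption: a pandas Index passed as `categories` is not re-inferred.
    """
    categories = pd.Index(sorted(set(values)), dtype="object")
    return pd.Categorical(values, categories=categories)


if __name__ == "__main__":
    for label, dtype in report_dtypes().items():
        print(f"{label}: {dtype}")
    ser = pd.Series(["foo", "bar", "baz"], dtype="object")
    print(categorical_with_object_categories(ser).categories.dtype)
```

Passing the categories explicitly sidesteps the inference question rather than answering it, so it does not settle what the default should be.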

Labels: Categorical (Categorical Data Type), Dtype Conversions (Unexpected or buggy dtype conversions)
