DOC: add pandas 3.0 migration guide for the string dtype #61705

Open: wants to merge 6 commits into base: main

@jorisvandenbossche jorisvandenbossche commented Jun 25, 2025

This PR starts adding a migration guide with some typical issues one might run into regarding the new string dtype when upgrading to pandas 3.0 (or when enabling it in pandas 2.3).

(for now I just added it to the user guide, which is already a long list of pages, so we might need to think about better organizing this or putting it elsewhere)

Closes #59328

@jorisvandenbossche jorisvandenbossche added this to the 2.3.1 milestone Jun 25, 2025
@jorisvandenbossche jorisvandenbossche added Docs Strings String extension data type and string data labels Jun 25, 2025
@simonjayhawkins simonjayhawkins left a comment

@jorisvandenbossche I'll post these few now rather than doing too many in a batch, but feel free to wait until I'm done, whichever is more convenient for you.

not yet been made the default, and uses the ``pd.NA`` scalar to represent
missing values.

Pandas 3.0 changes the default dtype for strings to a new string data type,
Member

Suggested change
- Pandas 3.0 changes the default dtype for strings to a new string data type,
+ Pandas 3.0 changes the default inferred dtype for strings to a new string data type,
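
To illustrate why "inferred" matters in the suggestion above, here is a minimal sketch. It assumes only that pandas is installed; the sample values are made up, and the printed dtype depends on the pandas version and options in use, so no output is shown.

```python
import pandas as pd

# No dtype given: pandas infers one from the values.
ser = pd.Series(["a", "b", None])

# Under pandas < 3.0 with default options this is object dtype;
# under pandas 3.0 it is the new string dtype.
print(pd.__version__, ser.dtype)

# An explicitly requested dtype is not subject to inference.
explicit = pd.Series(["a", "b", None], dtype=object)
print(explicit.dtype)
```

Only the inference step changes; code that passes ``dtype=object`` explicitly still gets object dtype.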

.. - Breaking changes:
.. - dtype is no longer object dtype
.. - None gets coerced to NaN
.. - setitem raises an error for non-string data
Member

the above is not rendered?

Member Author

No, these are comments; it was my outline when writing it (I can remove it in the end).

Member

no problem.

@simonjayhawkins simonjayhawkins left a comment

cool. Thanks @jorisvandenbossche

True

One caveat: this function works both on scalars and on array-likes, and in the
latter case it will return an array of boolean dtype. When using it in a boolean
Member

Suggested change
- latter case it will return an array of boolean dtype. When using it in a boolean
+ latter case it will return an array of Boolean dtype. When using it in a Boolean

Not to be confused with the pandas nullable type. Should this be capitalized, since it is named after George Boole?

Member Author

NumPy uses "boolean" as well, so I would rather leave it like this, or I can make it an "array of bools".

Member

sure
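
For readers of this thread, the scalar versus array-like caveat quoted above can be sketched as follows. This is an illustration, not text from the PR; it uses ``pd.isna`` with made-up values, and the variable names are arbitrary.

```python
import numpy as np
import pandas as pd

# Scalar input: a single truth value comes back.
print(pd.isna(None))   # True
print(pd.isna("a"))    # False

# Array-like input: an element-wise boolean array comes back.
mask = pd.isna(np.array(["a", None], dtype=object))
print(mask)

# A multi-element array has no single truth value, so using it
# directly in an `if` raises; reduce it with .any() or .all().
try:
    if mask:
        pass
except ValueError as exc:
    print("ValueError:", exc)

has_missing = bool(mask.any())
print(has_missing)     # True
```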

.. code-block:: python

>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser[1] = 2.5
Member

I notice you can do ``ser[1] = pd.NA``, so we are accepting this as a missing value. Should we disallow this, or instead encourage it to make migration to the pd.NA variant simpler?

Member Author

Yeah, I am not a super big fan of already allowing to assign pd.NA for dtypes that don't use pd.NA, although I am also fine with keeping it as is.
But this also works this way for other dtypes (such as numpy float64 or datetime64, coercing NA to NaN or NaT respectively, similar to how we also coerce None for those dtypes), so changing that is a bigger discussion, not just about the string dtype.

Member

> Yeah, I am not a super big fan of already allowing to assign pd.NA for dtypes that don't use pd.NA, although I am also fine with keeping it as is.

sure. It doesn't create any issues really like with object dtype.
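
The coercion described in this thread for the existing NumPy-backed dtypes can be sketched as below. This is an illustration under the assumption of a reasonably recent pandas; the Series contents are made up, and it deliberately uses float64 and datetime64 rather than the string dtype so that it does not depend on the pandas version.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, 2.0, 3.0])
floats.iloc[1] = pd.NA   # stored as NaN, dtype stays float64
floats.iloc[2] = None    # None is coerced the same way
print(floats)

dates = pd.Series(pd.to_datetime(["2025-01-01", "2025-01-02"]))
dates.iloc[1] = pd.NA    # stored as NaT, dtype stays datetime64
print(dates)
```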

@jorisvandenbossche
Member Author

@simonjayhawkins thanks a lot for the proofreading!

jorisvandenbossche and others added 2 commits June 27, 2025 13:18
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
@jorisvandenbossche
Member Author

/preview

@jorisvandenbossche
Member Author

Added three more sections based on the items listed in #59328

Comment on lines +99 to +100
This new string dtype should otherwise work the same as how you have been
using pandas with string data today. For example, all string-specific methods
Member

Suggested change
- This new string dtype should otherwise work the same as how you have been
- using pandas with string data today. For example, all string-specific methods
+ This new string dtype should otherwise behave the same as the existing ``object`` dtype users are used to. For example, all string-specific methods
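
As a rough illustration of the claim about string-specific methods, the sketch below should run the same way whether the Series ends up with object dtype or the new string dtype. The sample values are made up; the result dtypes may differ between versions and are not asserted here.

```python
import pandas as pd

ser = pd.Series(["apple", "banana", None])

# The .str accessor is available for both object-dtype strings
# and the new string dtype.
upper = ser.str.upper()
lengths = ser.str.len()

print(upper)
print(lengths)
```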

...
TypeError: Cannot perform reduction 'prod' with string dtype

For existing users of the nullable ``StringDtype``
Member

If you really want to keep writing I have no objection, but by construction these are advanced users who I don't think need as much hand-holding.

Member Author

@jorisvandenbossche jorisvandenbossche Jun 30, 2025

I mostly want to briefly mention in the docs (as I don't think we really do that anywhere except in the PDEP) that we kept this backwards compatible: if you were using "string", that should keep working, except that we also switched the default storage from "python" to "pyarrow".

(And maybe mention that if you were using it to get the faster pyarrow one, but don't care about the missing value sentinel, you could also just use the default dtype now. But that might be a bit subjective/controversial to say, and indeed at that point they probably understand that themselves as well.)
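
A small sketch of the back-compat point for existing users of the nullable dtype, assuming only that pandas is installed. The reported storage depends on the pandas version and on whether pyarrow is available, so it is printed rather than stated.

```python
import pandas as pd

# Explicitly opting in to the nullable StringDtype via the "string" alias.
ser = pd.Series(["a", None], dtype="string")

# Missing values in this dtype are represented by pd.NA.
print(ser.iloc[1] is pd.NA)   # True

# Either "python" or "pyarrow", depending on version and environment.
print(ser.dtype.storage)
```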

@jorisvandenbossche jorisvandenbossche mentioned this pull request Jul 1, 2025
5 tasks
Labels
Docs Strings String extension data type and string data
Development

Successfully merging this pull request may close these issues.

String dtype: overview of breaking behaviour changes
3 participants