DOC: add pandas 3.0 migration guide for the string dtype #61705

jorisvandenbossche · 2025-06-25T14:32:04Z

This PR starts adding a migration guide with some typical issues one might run into regarding the new string dtype when upgrading to pandas 3.0 (or when enabling it in pandas 2.3).

(for now I just added it to the user guide, which is already a long list of pages, so we might need to think about better organizing this or putting it elsewhere)

Closes #59328

simonjayhawkins

@jorisvandenbossche i'll post these few now rather than doing too many in a batch, but feel free to wait until i'm done, whatever is more convenient for you.

simonjayhawkins · 2025-06-25T14:54:13Z

doc/source/user_guide/migration-3-strings.rst

+not yet been made the default, and uses the ``pd.NA`` scalar to represent
+missing values.
+
+Pandas 3.0 changes the default dtype for strings to a new string data type,


Suggested change

-Pandas 3.0 changes the default dtype for strings to a new string data type,
+Pandas 3.0 changes the default inferred dtype for strings to a new string data type,
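
For illustration, a minimal sketch of the distinction behind the word "inferred": only the dtype pandas picks when none is passed changes, while an explicit ``dtype=object`` request is not inference. The variable names are hypothetical, and the inferred dtype is printed rather than assumed because it depends on the pandas version and options.

```python
import pandas as pd

# Inferred: no dtype is passed, so pandas chooses one from the data.
# Which dtype that is depends on the pandas version and options,
# so it is printed here rather than hard-coded.
inferred = pd.Series(["a", "b"])
print(pd.__version__, inferred.dtype)

# Explicit: asking for object dtype bypasses inference entirely.
explicit = pd.Series(["a", "b"], dtype=object)
print(explicit.dtype)
```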

simonjayhawkins · 2025-06-25T15:09:18Z

doc/source/user_guide/migration-3-strings.rst

+.. - Breaking changes:
+..    - dtype is no longer object dtype
+..    - None gets coerced to NaN
+..    - setitem raises an error for non-string data


the above is not rendered?

No, these are comments; it was my outline when writing it (I can remove this in the end)

no problem.

doc/source/user_guide/migration-3-strings.rst

simonjayhawkins

cool. Thanks @jorisvandenbossche

simonjayhawkins · 2025-06-25T16:42:41Z

doc/source/user_guide/migration-3-strings.rst

+   True
+
+One caveat: this function works both on scalars and on array-likes, and in the
+latter case it will return an array of boolean dtype. When using it in a boolean


Suggested change

-latter case it will return an array of boolean dtype. When using it in a boolean
+latter case it will return an array of Boolean dtype. When using it in a Boolean

To avoid confusion with the pandas nullable type, should this be capitalized, since it is named after George Boole?

numpy uses "boolean" as well, so would rather leave it like this, or can make it an "array of bools"
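
A small sketch of the caveat in the quoted hunk, assuming the function under discussion is ``pd.isna`` (the hunk itself does not name it): a scalar input gives a single bool, while an array-like input gives a NumPy array of bools that cannot be used directly in an ``if``.

```python
import numpy as np
import pandas as pd

values = np.array(["a", None], dtype=object)

# Scalar input: a single truth value, safe to use in an `if`.
scalar_result = pd.isna(None)

# Array-like input: an element-wise NumPy array with dtype bool.
array_result = pd.isna(values)
print(array_result.dtype, array_result)

# Using the array in a boolean context is ambiguous and raises.
try:
    if array_result:
        pass
    raised = False
except ValueError:
    raised = True
print(raised)
```

In a boolean context, pass a scalar, or reduce the array explicitly with ``.any()`` or ``.all()``.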

simonjayhawkins · 2025-06-25T16:48:05Z

doc/source/user_guide/migration-3-strings.rst

+.. code-block:: python
+
+   >>> ser = pd.Series(["a", "b", None], dtype="str")
+   >>> ser[1] = 2.5


I notice you can do ser[1] = pd.NA, so we are accepting this as a missing value. Should we disallow this, or encourage it instead to make migration to the pd.NA variant simpler?

Yeah, I am not a super big fan of already allowing to assign pd.NA for dtypes that don't use pd.NA, although I am also fine with keeping it as is.
But this also works this way for other dtypes (such as numpy float64 or datetime64, coercing NA to NaN or NaT respectively, similar to how we also coerce None for those dtypes), so changing that is a bigger discussion, not just about the string dtype.

> Yeah, I am not a super big fan of already allowing to assign pd.NA for dtypes that don't use pd.NA, although I am also fine with keeping it as is.

sure. It doesn't create any issues really like with object dtype.
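
A hedged sketch of the behaviour described above for other dtypes, assuming a recent pandas release: assigning ``pd.NA`` into a NumPy-backed float64 or datetime64 Series is coerced to that dtype's own missing value instead of raising or changing the dtype. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# float64: pd.NA is coerced to NaN and the dtype stays float64.
floats = pd.Series([1.0, 2.0])
floats.iloc[1] = pd.NA
print(floats.dtype, floats.iloc[1])

# datetime64: pd.NA is coerced to NaT and the dtype stays datetime64.
dates = pd.Series(pd.to_datetime(["2025-06-25", "2025-06-27"]))
dates.iloc[1] = pd.NA
print(dates.dtype, dates.iloc[1])
```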

jorisvandenbossche · 2025-06-27T11:06:24Z

@simonjayhawkins thanks a lot for the proofreading!

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

jorisvandenbossche · 2025-06-27T11:23:08Z

/preview

jorisvandenbossche · 2025-06-27T12:33:38Z

Added three more sections based on the items listed in #59328

jbrockmendel · 2025-06-30T15:08:42Z

doc/source/user_guide/migration-3-strings.rst

+This new string dtype should otherwise work the same as how you have been
+using pandas with string data today. For example, all string-specific methods


Suggested change

-This new string dtype should otherwise work the same as how you have been
-using pandas with string data today. For example, all string-specific methods
+This new string dtype should otherwise behave the same as the existing ``object`` dtype users are used to. For example, all string-specific methods

jbrockmendel · 2025-06-30T15:13:25Z

doc/source/user_guide/migration-3-strings.rst

+   ...
+   TypeError: Cannot perform reduction 'prod' with string dtype
+
+For existing users of the nullable ``StringDtype``


if you really want to keep writing i have no objection, but by construction these are advanced users who i dont think need as much hand-holding

I mostly want to briefly mention in the docs (as I don't think we really do that anywhere, except in the PDEP) that we kept this backwards compatible: if you were using "string", that should keep working, except that we also switched the default storage from "python" to "pyarrow".

(and maybe mention that if you were using it for getting the faster pyarrow one, but don't care about the missing value sentinel, you could also just use the default dtype now. But that might be a bit subjective/controversial to say, and indeed at that point they probably understand that themselves as well)
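
As a rough illustration of the backwards-compatibility point, a sketch assuming only that pandas is installed: requesting ``"string"`` still gives the nullable dtype with ``pd.NA`` as its missing value. The storage is printed rather than assumed, since it depends on the pandas version and on whether pyarrow is available.

```python
import pandas as pd

# Explicitly request the nullable string dtype, as existing users do.
ser = pd.Series(["a", None], dtype="string")

# The missing value sentinel for this dtype is pd.NA.
print(ser.iloc[1] is pd.NA)

# Storage backend: "python" or "pyarrow", depending on version and setup.
print(ser.dtype.storage)
```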

DOC: add pandas 3.0 migration guide for the string dtype

975dea1

jorisvandenbossche added this to the 2.3.1 milestone Jun 25, 2025

jorisvandenbossche added Docs Strings labels Jun 25, 2025

fixup title underline

db42937

simonjayhawkins reviewed Jun 25, 2025

simonjayhawkins mentioned this pull request Jun 26, 2025

WEB: add note to PDEP-10 about delayed timeline for requiring pyarrow #61706

Open

jorisvandenbossche and others added 2 commits June 27, 2025 13:18

Apply suggestions from code review

8c0b883

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

further edits to address feedback

1bc84ca

jorisvandenbossche mentioned this pull request Jun 27, 2025

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

44 tasks

jorisvandenbossche added 2 commits June 27, 2025 13:26

fix underling length

e4a764d

add sections about invalid unicode, astype(str) and prod()

9760fee

jbrockmendel reviewed Jun 30, 2025

jorisvandenbossche mentioned this pull request Jul 1, 2025

RLS: 2.3.1 #61590

Open

5 tasks
