
ENH: Need API support and __repr__ to discover the storage used for strings #59342

arnaudlegout opened this issue Jul 29, 2024 · 10 comments


arnaudlegout commented Jul 29, 2024

Originally raised in #58551 (comment)

Problem Description

With PDEP-14, developers need to be able to tell which storage is used for strings. Indeed, the storage can have a large impact on performance, for instance:

  • pyarrow storage
    • pros: compact (optimal memory footprint), fast (vectorized operations)
    • cons: immutable (so any modification creates a new pyarrow ChunkedArray)
  • python storage
    • pros: mutable
    • cons: highest memory footprint (each string is a separate Python object), slow (no vectorization); a rough timing sketch follows this list
  • numpy 2.0 string storage (I don't have a good knowledge of these new strings, and never tested them)
    • pros: compact, vectorized, mutable (my understanding is that it takes more space and is slower than pyarrow strings)
    • cons: different representations depending on the string size, which makes reasoning about performance harder
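
A rough timing sketch of the pyarrow vs. python tradeoff described above (an illustration only, assuming pyarrow is installed; actual numbers depend on the machine and pandas version):

import timeit

import pandas as pd

data = [f"item-{i}" for i in range(1_000_000)]
py = pd.Series(data, dtype="string[python]")
pa = pd.Series(data, dtype="string[pyarrow]")

# Vectorized string methods tend to favour the pyarrow backend.
print("upper, python :", timeit.timeit(lambda: py.str.upper(), number=5))
print("upper, pyarrow:", timeit.timeit(lambda: pa.str.upper(), number=5))

# Element-wise mutation tends to favour the python (object) backend,
# since pyarrow-backed arrays are immutable and have to be rebuilt.
print("setitem, python :", timeit.timeit(lambda: py.copy().__setitem__(0, "x"), number=5))
print("setitem, pyarrow:", timeit.timeit(lambda: pa.copy().__setitem__(0, "x"), number=5))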

Feature Description

I would like to have two ways to discover the storage:

  • the goal of __repr__ is to give information on the internals of an object; one option suggested by @jorisvandenbossche is to display <pandas.StringDtype(storage=...)> instead of string[storage]
  • a .get_storage method that returns the storage (not sure what is possible with the current implementation; a class would be best, otherwise a string). The API would be useful to check, before running time-consuming code, that we have the correct storage (see the sketch below).
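
For illustration only, a sketch of how such a check could be used before an expensive step; the .get_storage() call below is the proposed (hypothetical) API from the bullet above, not something that exists in pandas today:

import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="string[pyarrow]")

# Hypothetical usage of the proposed API: fail fast if the Series is not
# backed by pyarrow before launching a long, vectorization-heavy computation.
if ser.dtype.get_storage() != "pyarrow":   # .get_storage() is the proposal, not an existing method
    raise TypeError("expected pyarrow-backed strings for this workload")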


@jorisvandenbossche (Member)

@arnaudlegout thanks for opening the issue!

First quick note: the numpy 2.0 string dtype is not supported in pd.StringDtype at the moment (but could be in the future), so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype).

Then, the API to inspect and discover the storage is actually already available, as the .storage attribute on the StringDtype instance:

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype
str
>>> ser.dtype.storage
'pyarrow'

So I think the main discussion is what the __repr__ should look like.


WillAyd commented Aug 22, 2024

I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. By putting it into the repr instead it becomes a little clearer to developers what they might need to consider
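
As a short sketch of what inspecting those attributes looks like today (assuming a pandas version where StringDtype exposes both storage and na_value):

import pandas as pd

dtype = pd.StringDtype(storage="python")  # "pyarrow" also works when pyarrow is installed
print(dtype.storage)   # 'python'
print(dtype.na_value)  # <NA> for the NA-variant of the dtype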

@arnaudlegout (Contributor, Author)

@WillAyd right, I was not aware of the .storage attribute, and indeed getting information on the na_value is interesting.

I did not find .storage in the pandas documentation, so it would be great to also update the documentation to show the available attributes for inspecting the storage properties.


pantheraleo-7 commented Sep 21, 2024

so right now the two options to consider are "pyarrow" and "python" (i.e. object-dtype)

I think the name "python" for the fallback storage option is not future proof? If I'm reading the PDEP-14 right, the fallback is a numpy array of python str objects. So the fallback storage option name should be "numpy".

  • when numpy 2.0 strings are implemented as a fallback, the name "python" won't make sense anymore
  • it doesn't quite make sense even right now, because we are storing those objects in a numpy array anyway
  • also, the names "pyarrow" and "numpy" would complement each other better, I guess

numpy 2.0 string dtype is not supported in the pd.StringDtype at the moment (but could be in the future)

Do we have a timeline on this? It seems like PDEP-10 will be reverted by PDEP-15, so pyarrow is going to stay an optional dependency. Forcing users who just want the vectorisation speed benefits (and nothing more) to install pyarrow will practically lessen the importance of the numpy 2.0 string implementation, as they would have already moved to pyarrow in pandas 3.0.

pandas 3.0 is a golden opportunity to incorporate the numpy 2.0 string dtype, as users who shift to a newer major version of pandas would also most likely shift to a newer major version of numpy.

But if we still don't want to require numpy 2.0, we could have an intermediate fallback, no? (sketched below)
use pyarrow if installed >>> use numpy 2.0 str dtype if numpy>=2.0 is installed >>> use numpy object dtype
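
A minimal sketch of that resolution order (illustrative only; the "numpy" backend name and the idea of a StringDType-based storage are hypothetical, not something pandas implements):

import importlib.util

import numpy as np

def pick_string_storage() -> str:
    # Prefer pyarrow, then NumPy 2.0's native StringDType, then the object-dtype fallback.
    if importlib.util.find_spec("pyarrow") is not None:
        return "pyarrow"
    if np.lib.NumpyVersion(np.__version__) >= "2.0.0":
        return "numpy"   # hypothetical backend built on numpy.dtypes.StringDType
    return "python"      # today's numpy object-dtype fallback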

Basically I'm saying we should fast-track the numpy 2.0 string implementation xD


WillAyd commented Sep 21, 2024

I think a numpy 2.0 string data type would need its own PDEP. We already have a proliferation of string data types in pandas, so it needs some discussion to define what value we see in adding another and what its semantics would be.

@simonjayhawkins (Member)

I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. By putting it into the repr instead it becomes a little clearer to developers what they might need to consider

xref #60305

The outcome of that discussion could impact what should be done here. If we want dtypes that differ only in storage to be considered equal, and to hide the implementation detail from users, then I guess we would not want to update the repr to display the storage?

If the dtypes for the array with numpy semantics and the array using pd.NA are not considered equal in the equality checks, then it may be that the na_value should be included in the repr.

@arnaudlegout (Contributor, Author)

I really disagree with the willingness to "hide implementation details". The underlying implementation should not be treated as a mere detail. It has strong performance implications (both in space and speed), and considering performance as a detail for a regular pandas user is, in my opinion, a mistake.

The impact of the dtype implementations should even be better documented: why is a numpy boolean 8 bits, a nullable numpy boolean 16 bits, and a nullable pyarrow boolean 2 bits? Why is changing a string in a large string Series fast with "string[python]" but slow with "string[pyarrow]"? Why does pyarrow produce a memory overflow in some cases? These implementation issues are not details and should not be hidden from the user (even the beginner).
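
As a concrete way to see the space tradeoff (a sketch, assuming pyarrow is installed; deep=True is needed so that the size of the individual Python string objects is counted for the object-backed variant):

import pandas as pd

s = pd.Series(["some reasonably long string"] * 1_000_000)

# Memory footprint of the same data under the two backends, in bytes.
print(s.astype("string[python]").memory_usage(deep=True))
print(s.astype("string[pyarrow]").memory_usage(deep=True))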


jorisvandenbossche commented Mar 20, 2025

At the community dev meeting last week we discussed once more how we can make the repr more informative, so one can actually see the difference in dtype when inspecting data.
The proposal to move forward here is to distinguish __repr__ and __str__ more (right now, they are largely the same for StringDtype).

__str__ is generally used when showing the dtype as part of another representation, such as the dtype shown in the footer of a Series representation, or in the values of df.dtypes. __repr__ is then typically used when actually interactively inspecting the dtype (e.g. checking obj.dtype, but not printing it), or e.g. in the error messages of assert methods.

That strikes a balance between keeping the repr short in those DataFrame/Series representations and having the full information available when actually looking at the dtype object itself.

The concrete proposal is then to keep __str__ the same (this is currently "str" or "string" for NaN vs NA semantics). And to turn __repr__ into a more verbose repr like <StringDtype(storage='python', na_value=nan)>.
(this brings it somewhat in line with Categorical dtype, where the str is "category" but the repr is a full CategoricalDtype(..) with categories and other attributes)

Generally the repr would then show all the arguments of the constructor, except that we were thinking we could omit defaults to make it a bit less verbose for those cases. In this case that would mean only show storage='..' if it is "python", i.e. (typically, if you have pyarrow installed) the opt-in variant. But for na_value, although this has a default, we would still show it always, given this has a much bigger impact on behaviour semantics.
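
A minimal sketch of that repr policy (illustrative only, not pandas' actual implementation; it assumes the default storage is "pyarrow" when pyarrow is installed):

def string_dtype_repr(storage: str, na_value, default_storage: str = "pyarrow") -> str:
    parts = []
    if storage != default_storage:            # omit the default storage to keep the repr short
        parts.append(f"storage={storage!r}")
    parts.append(f"na_value={na_value!r}")    # always shown: it changes behaviour semantics
    return f"<StringDtype({', '.join(parts)})>"

# string_dtype_repr("python", float("nan"))  -> "<StringDtype(storage='python', na_value=nan)>"
# string_dtype_repr("pyarrow", float("nan")) -> "<StringDtype(na_value=nan)>"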

These implementation issues are not details and should not be hidden from the user (even the beginner).

This is indeed where we disagree... I really think that beginners should almost never have to think about those details, and ideally not even get exposed to them. If there are common cases where they actually need to know right now, I think that is rather a sign that we need to fix or improve something than that newcomers should be aware of it.
(for example, the memory overflow issues with pyarrow should simply not happen; that should be considered a bug we have to fix, and one I think is also more or less fixed at the moment because we always use the large string variant under the hood)

But anyway, I hope the proposal here is a compromise between not showing the dtype "details" constantly and still making them very accessible (and shown by default when inspecting the dtype itself).

@jorisvandenbossche (Member)

I have a PR trying to implement the above proposal: #61148

One thing I realized while doing that PR is that we currently have a special case for StringDtype to use its repr instead of its str when it is displayed in other (Series/DataFrame) representations. This ensures that we currently show string[pyarrow] or string[python] (and not just string) in those places (e.g. in df.dtypes). But when removing that special case (since we don't want to show the verbose repr in those places), that will become just string. So one option is to accept that; another option would be to actually change __str__ to include the [storage] parametrization for the NA variant.

@arnaudlegout (Contributor, Author)

At the community dev meeting last week we discussed once more how we can make the repr more informative, so one can actually see the difference in dtype when inspecting data. The proposal to move forward here is to distinguish __repr__ and __str__ more (right now, they are largely the same for StringDtype).

__str__ is generally used when showing the dtype as part of another representation, such as the dtype shown in the footer of a Series representation, or in the values of df.dtypes. __repr__ is then typically used when actually interactively inspecting the dtype (e.g. checking obj.dtype, but not printing it), or e.g. in the error messages of assert methods.

That strikes a balance between keeping the repr short in those DataFrame/Series representations and having the full information available when actually looking at the dtype object itself.

The concrete proposal is then to keep __str__ the same (this is currently "str" or "string" for NaN vs NA semantics). And to turn __repr__ into a more verbose repr like <StringDtype(storage='python', na_value=nan)>. (this brings it somewhat in line with Categorical dtype, where the str is "category" but the repr is a full CategoricalDtype(..) with categories and other attributes)

sounds good

Generally the repr would then show all the arguments of the constructor, except that we were thinking we could omit defaults to make it a bit less verbose for those cases. In this case that would mean only show storage='..' if it is "python", i.e. (typically, if you have pyarrow installed) the opt-in variant. But for na_value, although this has a default, we would still show it always, given this has a much bigger impact on behaviour semantics.

I disagree with this one. Defaults change quite often from one pandas version to the next, and in the case of strings they might even change based on installed dependencies.
Hiding default parameters in __repr__ will make it much harder to find out which representation you are using and to debug code.

These implementation issues are not details and should not be hidden from the user (even the beginner).

This is indeed where we disagree... I really think that beginners should almost never have to think about those details, and ideally not even get exposed to them. If there are common cases where they actually need to know right now, I think that is rather a sign that we need to fix or improve something than that newcomers should be aware of it. (for example, the memory overflow issues with pyarrow should simply not happen; that should be considered a bug we have to fix, and one I think is also more or less fixed at the moment because we always use the large string variant under the hood)

It depends on how you use pandas. My use case (as a researcher) is to work on large data structures (for instance, graphs with 1B nodes and 10B edges, or Series of 1B strings). In my case, the implementation "details" of dtypes are central to the performance of my code. I constantly trade off memory space vs. speed vs. implemented functionality, and I regularly change the dtype depending on the computation I need to perform. Interns working on my projects should be aware of these issues from day 1 (and they always struggle to find the correct dtype backend).

I am not sure we are that far from an agreement. When I say the backend should not be hidden, I do not claim it should be displayed all the time. Displaying it with __repr__ only and not with __str__ is perfectly fine. But having no easy and consistent way to know which backend your dtype is using would be a problem.

But anyway, I hope the proposal here is a compromise between not showing the dtype "details" constantly and still making them very accessible (and shown by default when inspecting the dtype itself).

Having a __repr__ showing the internal representation including default parameters would perfectly match my expectation.
