ENH: Need API support and __repr__ to discover the storage used for strings #59342
Comments
@arnaudlegout thanks for opening the issue! First quick note: at the moment numpy 2.0 string dtype is not supported in the Then, the API to inspect and discover the storage is actually already available, as the
>>> import pandas as pd
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype
str
>>> ser.dtype.storage
'pyarrow'
So I think the main discussion is how the |
I think it makes sense to have the storage and na_value as part of the repr. While @jorisvandenbossche is correct that you can inspect this with attributes, that also assumes developers know in advance what those attributes are. By putting it into the repr instead it becomes a little clearer to developers what they might need to consider |
@WillAyd right, I was not aware of the I did not find the |
I think the name "python" for the fallback storage option is not future proof? If I'm reading the PDEP-14 right, the fallback is a numpy array of python
do we have a timeline on this? It seems like PDEP-10 will be reverted by PDEP-15, so pyarrow is going to stay an optional dependency. So forcing users who just want the vectorisation speed benefits (and nothing more) to install pyarrow will practically lessen the importance of the numpy 2.0 string implementation, as they would have already moved to pyarrow in pandas 3.0. pandas 3.0 is a golden opportunity to incorporate the numpy 2.0 string dtype, as users who shift to a newer major version of pandas would also most likely shift to a newer major version of numpy. but if we still don't want to force numpy 2.0, we could have an intermediate fallback, no? basically I'm saying we should fast track the numpy 2.0 string implementation xD |
I think a numpy 2.0 string data type would need its own PDEP. We already have a proliferation of string data types in pandas, so it needs some discussion to define what value we see in adding another, and to define what its semantics are. |
xref #60305 The outcome of that discussion could impact what should be done here. If we want the dtypes that differ only in storage to be considered equal, and to hide the implementation detail from users, then I guess we would not want to update the repr to display the storage? If the dtypes for the array with numpy semantics and the array using pd.NA are not considered equal in the equality checks, then it may be that the na_value should be included in the repr. |
I really disagree with the willingness to "hide implementation details". The underlying implementation should not be considered a detail. It has strong performance implications (both in space and speed), and treating performance as a detail for a regular pandas user is, in my opinion, a mistake. The impact of the dtype implementations should be even better documented. Why is a numpy boolean 8 bits, a nullable numpy boolean 16 bits, and a nullable pyarrow boolean 2 bits? Why is changing a string in a large string Series fast with "string[python]" and slow with "string[pyarrow]"? Why does pyarrow produce a memory overflow in some cases? These implementation issues are not details and should not be hidden from the user (even the beginner). |
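The per-element sizes quoted in this comment can be checked with simple arithmetic. The following is only a back-of-the-envelope sketch: it assumes one byte per value for a numpy boolean, an extra one-byte-per-element mask for the nullable numpy-backed boolean, and one bit each for value and validity in the pyarrow layout. The names `LAYOUT_BITS` and `estimated_bytes` are made up for illustration, and real arrays carry additional overhead.

```python
# Bits per element as (value bits, missing-value mask bits).
# These mirror the 8 / 16 / 2 bits quoted in the comment; they are
# simplifying assumptions, not measurements of real arrays.
LAYOUT_BITS = {
    "numpy bool": (8, 0),
    "nullable numpy bool": (8, 8),
    "nullable pyarrow bool": (1, 1),
}


def estimated_bytes(layout: str, n_elements: int) -> int:
    """Estimate buffer size in bytes for n_elements booleans."""
    value_bits, mask_bits = LAYOUT_BITS[layout]
    # Values and mask live in separate buffers, each rounded up to whole bytes.
    value_bytes = (value_bits * n_elements + 7) // 8
    mask_bytes = (mask_bits * n_elements + 7) // 8
    return value_bytes + mask_bytes


for layout in LAYOUT_BITS:
    print(f"{layout}: {estimated_bytes(layout, 1_000_000_000) / 1e9:.2f} GB")
```

For a Series of 1B booleans this gives roughly 1 GB, 2 GB and 0.25 GB respectively, which is the kind of gap the comment argues users should be able to see.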
We discussed this once more at the community dev meeting last week, how we can make the repr more informative, so one can actually see the difference in dtype when inspecting data.
That strikes a balance between keeping the repr short in the DataFrame/Series representation and having the full information available when actually looking at the dtype object itself. The concrete proposal is then to keep Generally the repr would then show all the arguments of the constructor, except that we were thinking we could omit defaults to make it a bit less verbose for those cases. In this case that would mean only show
This is indeed where we disagree .. I really think that beginners should almost never have to think about those details, and ideally not even get exposed to those details. If there are common cases where they actually need to know right now, I think that is rather a sign that we need to fix or improve something, than that this newcomer should be aware of it. But anyway, I hope the proposal here is some compromise between not showing the dtype "details" constantly, while making it still very accessible (and shown by default when inspecting the dtype itself) |
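To make the proposal above concrete, here is a minimal sketch of a dtype-like object whose `str()` stays short while its `repr()` lists constructor arguments and skips those left at their default. `ExampleStringDtype` is a hypothetical stand-in, not the pandas `StringDtype`, and the defaults it uses are assumptions chosen only for this illustration.

```python
class ExampleStringDtype:
    """Hypothetical stand-in used to illustrate the repr proposal."""

    # Assumed defaults for this sketch only; not taken from pandas.
    DEFAULT_STORAGE = "pyarrow"
    DEFAULT_NA_VALUE = None

    def __init__(self, storage=DEFAULT_STORAGE, na_value=DEFAULT_NA_VALUE):
        self.storage = storage
        self.na_value = na_value

    def __str__(self):
        # Short form, suitable for a DataFrame/Series display.
        return "string"

    def __repr__(self):
        # Full form: constructor arguments, omitting any left at its default.
        args = []
        if self.storage != self.DEFAULT_STORAGE:
            args.append(f"storage={self.storage!r}")
        if self.na_value is not self.DEFAULT_NA_VALUE:
            args.append(f"na_value={self.na_value!r}")
        return f"<{type(self).__name__}({', '.join(args)})>"


print(str(ExampleStringDtype(storage="python")))   # short form
print(repr(ExampleStringDtype(storage="python")))  # shows the non-default storage
print(repr(ExampleStringDtype()))                  # defaults are omitted
```

One consequence of omitting defaults is that the same repr can mean different things if the defaults themselves change, so a reader has to know the defaults to interpret an empty argument list.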
I have a PR trying to implement the above proposal: #61148 One thing I realized while doing that PR is that we currently have a special case for StringDtype to use its |
sounds good
I disagree with this one. Defaults change quite often from one pandas version to the next, and in the case of strings they might even change based on installed dependencies.
it depends on how you use pandas. My use case (as a researcher) is to work on large data structures (for instance, a graph with 1B nodes and 10B edges, or a series of 1B strings). In my case, the implementation "details" of dtypes are central to the performance of my code. I constantly trade off memory space, speed, and implemented functionality. I regularly change the dtype depending on the computation I need to perform. Interns working on my projects should be aware of these issues from day 1 (and they always struggle to find the correct dtype backend). I am not sure we are that far from an agreement. When I say the backend should not be hidden, I do not claim it should be displayed all the time. Displaying with
Having a |
Originally raised in #58551 (comment)
Problem Description
With PDEP-14 there is the need for developers to be aware of the storage used for strings. Indeed, the storage might have a lot of impact on performance, for instance
- pyarrow storage (ChunkedArray)
- python storage
- numpy 2.0 strings storage (I don't have a good knowledge of these new strings, and never tested them)
Feature Description
I would like to have two ways to discover the storage
- __repr__: its goal is to give information on the internals of an object; one option suggested by @jorisvandenbossche is to display <pandas.StringDtype(storage=...)> instead of string[storage]
- .get_storage: returns the storage (not sure what is possible with the current implementation; a class would be best, otherwise a string). The API is useful to check that we have the correct storage before running time-consuming code.
Alternative Solutions
.
Additional Context
No response
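The "check before running time-consuming code" use case from the Feature Description could look roughly like the sketch below. It relies only on the `storage` attribute shown in the thread (`ser.dtype.storage`); the helpers `get_storage` and `require_storage` and the `FakeDtype` stand-in are hypothetical names introduced here, not pandas API. With pandas installed, one would pass `ser.dtype` instead of the stand-in.

```python
def get_storage(dtype):
    """Return the dtype's storage name, or None if it exposes no storage."""
    return getattr(dtype, "storage", None)


def require_storage(dtype, expected):
    """Raise early if the dtype does not use the expected string storage."""
    storage = get_storage(dtype)
    if storage != expected:
        raise TypeError(
            f"expected {expected!r} string storage, got {storage!r}"
        )


class FakeDtype:
    # Stand-in so the sketch runs without pandas; mimics dtype.storage only.
    def __init__(self, storage):
        self.storage = storage


# Guard placed before an expensive computation.
require_storage(FakeDtype("pyarrow"), "pyarrow")  # passes silently
```

Returning `None` for dtypes without a storage attribute keeps the guard usable on non-string columns, at the cost of not distinguishing "no storage" from "unknown".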