String dtype: more informative repr (keeping brief __str__) #61148
I did not realize during the community meeting that we'd be using This has me rethinking the naming convention discussed for PDEP-14. While I appreciate the historical reasons we associate Proposal: We add the
so that users can specify dtypes with these. We can then use these as the
Ah, I think perhaps just
This specific idea indeed hasn't come up in the PDEP discussions. As I recall, I think I proposed from the beginning to never include the My first thoughts why I would rather not go that route:
One way to address the concerns raised about debugging and unexpected changes to the backend storage after operations might be to display the storage only when it differs from the global option. The constructor keyword and string aliases to set the storage backend were, after all, a convenience to avoid context managers in development and testing? This would allow us to hide the storage implementation detail in most situations, which seems appropriate given that a change of storage backend is lossless for the object values and hence for construction. The downside could be that this argument gets extended to the choice of missing value indicator, which would again be lossless for the StringDtype on conversion between storage backends. This would not be the case for the nullable floats if np.nan is allowed alongside the pd.NA missing value indicator.
The parameterization is explicitly needed here to qualify the values.
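To make the "only display the storage when it differs from the global option" idea concrete, here is a minimal sketch in plain Python. It does not use pandas: the function `display_name` and the `global_storage` argument are hypothetical names invented for illustration, and the `"string[...]"` output format simply reuses the string aliases mentioned in this thread.

```python
def display_name(storage: str, global_storage: str) -> str:
    """Return a short dtype label (hypothetical helper, not a pandas API).

    The storage is shown only when it differs from the globally
    configured default, so the common case stays brief.
    """
    if storage == global_storage:
        # Storage matches the default: hide the implementation detail.
        return "string"
    # Storage was set explicitly to something else: surface it.
    return f"string[{storage}]"


# Assume the global option is set to "pyarrow" for this example.
default_label = display_name("pyarrow", global_storage="pyarrow")
explicit_label = display_name("python", global_storage="pyarrow")
print(default_label)
print(explicit_label)
```

One consequence of this design, under these assumptions, is that the same dtype would display differently depending on the global option in effect, which is part of what the debugging concern is about.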
Thanks for the response @jorisvandenbossche
Other nullable types all use NA. It's only the strings that can be NA or NaN.
Just because it runs does not mean it's not portable. The behavior changes in subtle ways depending on whether pyarrow is installed. Code should always fail loudly in that case.
Agreed - we should not do this.
Yea, this is the root of the disagreement. Hiding important information from advanced users to make things easier for beginners seems wrong to me. We're pushing users down a path of changing the storage when we know there are differences in performance and behavior, all while saying "you shouldn't care! Any differences in behavior we'll call bugs and get to eventually". But this does nothing to alleviate what I fear will be pain points. The very least we could do is tell the user what dtypes they actually have. And of course the difference in behavior for NA-semantics will always exist. At least we're indicating that to the user (although I have to imagine that, from their perspective, in a quite bizarre way). I really do hope that I'm overblowing this, and that users really can not care. Let's move forward and see what happens.
Attempt to address #59342
With the current version of the PR, the reprs for the different dtype variants are:
Some questions to decide on:
- Use <...> or not? (We are somewhat inconsistent internally for similar reprs; e.g. the Index repr does not use it, the ExtensionArray repr does.)
  - The <..> makes it clearer that it is not necessarily exactly executable code, I think.
- Keep __str__ as is (i.e. just "str" or "string"), or include the storage for the "string" case (to preserve the current repr behaviour)? I.e. give it the options "str", "string[pyarrow]" or "string[python]".
  - Change the .dtype.name attribute or not (which right now is defined to be "str" or "string")?
  - Some dtypes do include their parameters in .name (e.g. "datetime64[s, UTC]"), while for CategoricalDtype we do not (there it is just "category").
  - [...] "string[python]", while we still allow that as a string alias for dtype arguments (e.g. in constructors or in astype()).
- The na_value is shown using the repr of pd.NA and np.nan, which means they are displayed as <NA> and nan.
  - An alternative is to show them as pd.NA and np.nan. This makes it a more "executable" repr, which could be nice, but on the other hand I also don't want to encourage that too much.
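The split between a brief __str__ and a more informative repr, as discussed in the questions above, can be sketched with a toy class. This is plain Python, not pandas' StringDtype: the class name `SketchStringDtype`, its `na_is_nan` flag, and the exact repr layout are assumptions made for illustration. The sketch also assumes that "str" names the NaN variant and "string" names the NA variant.

```python
class SketchStringDtype:
    """Toy stand-in for a string dtype (hypothetical, not pandas' class)."""

    def __init__(self, storage: str, na_is_nan: bool) -> None:
        self.storage = storage
        self.na_is_nan = na_is_nan

    def __str__(self) -> str:
        # Brief form: no storage, only the NA-semantics variant.
        return "str" if self.na_is_nan else "string"

    def __repr__(self) -> str:
        # Informative form: wrapped in <...> to signal that it is not
        # meant to be executable code; the missing value indicator is
        # shown the way its own repr would appear (<NA> or nan).
        na = "nan" if self.na_is_nan else "<NA>"
        return f"<StringDtype(storage={self.storage!r}, na_value={na})>"


nan_variant = SketchStringDtype("pyarrow", na_is_nan=True)
na_variant = SketchStringDtype("python", na_is_nan=False)

for dtype in (nan_variant, na_variant):
    # str() stays short; repr() exposes storage and missing value indicator.
    print(str(dtype), "|", repr(dtype))
```

Swapping `"nan"` and `"<NA>"` for `"np.nan"` and `"pd.NA"` in `__repr__` would give the more "executable" alternative raised in the last question.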