Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: more informative repr (keeping brief __str__) #61148

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Mar 19, 2025

Attempt to address #59342

With the current version of the PR, the reprs for the different dtype variants are:

# default NaN variants
<StringDtype(na_value=nan)>
<StringDtype(storage='python', na_value=nan)>
# nullable NA variants
<StringDtype(na_value=<NA>)>
<StringDtype(storage='python', na_value=<NA>)>

Some questions to decide on:

  • Do we use surrounding <...> or not? (we are somewhat inconsistent internally for similar reprs; e.g. the Index repr does not use it, the ExtensionArray repr does)
    • Including the <..> makes it clearer that it is not necessarily exactly executable code, I think
  • Do we keep the current __str__ as is (i.e. just "str" or "string"), or do we include the storage for the "string" case (to preserve the current repr behaviour). i.e. make it to have the options "str", "string[pyarrow]" or "string[python]".
    • Essentially, this comes down to changing the dtype.name attribute or not (which right now is defined to be "str" or "string")
    • As comparison, for DatetimeTZDtype we do include the [] parametrization in .name (e.g. "datetime64[s, UTC]"), while for CategoricalDtype we do not (there it is just "category")
    • If we don't use it in the name/str, we actually never show "string[python]", while we still allow that as string alias for dtype arguments (e.g. in constructors or in astype())
  • Currently I just use their own repr for pd.NA and np.nan, which means they are displayed as <NA> and nan
    • But we could also make it to look like pd.NA and np.nan. This makes it a more "executable" repr, which could be nice, but on the other hand I also don't want to encourage that too much.

@jorisvandenbossche jorisvandenbossche added Output-Formatting __repr__ of pandas objects, to_string Strings String extension data type and string data labels Mar 19, 2025
@rhshadrach
Copy link
Member

rhshadrach commented Mar 23, 2025

I did not realize during the community meeting that we'd be using __str__ and not __repr__ when printing the dtypes of a Series or df.dtypes. To me this is a vast majority of cases where users will see the information, and I have to imagine many users are not aware of the difference between str(...) and repr(...).

This has me rethinking the naming convention discussed for PDEP-14. While I appreciate the historical reasons we associate str -> NaN and string -> NA, users do not. If we were to go forward with this, it seems to me to be a very odd quirk and we have the chance of fixing it now. While I realize this is coming late, I didn't see this in the discussion of PDEP-14, and wonder what we think about the following.

Proposal: We add the dtype= aliases:

"str[nan,python]", "str[NA,python]", "str[nan,pyarrow]", "str[NA,pyarrow]"

so that users can specify dtypes with these. We can then use these as the __str__ result of the four dtypes here. Thus users who specify dtype=str or dtype="str" will get the default arguments, but they can also override them in a natural way with the values above. While these would be the preferred aliases, we can use .lower() to allow accepting na or NaN, etc. This is in addition to PDEP-14, so all of the previously agreed to syntax will work as is, namely "string", for historical reasons.

@rhshadrach
Copy link
Member

Ah, I think perhaps just "str[nan]" and "str[NA]" since while I prefer adding the storage, I am okay with omitting it.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Mar 25, 2025

This specific idea indeed hasn't come up in the PDEP discussions. As I recall, I think I proposed from the beginning to never include the na_value in the string alias (for not including the storage in the alias, I provided some argumentation in the PDEP, but not including na_value seemingly has not been discussed that explicitly?), and I initially proposed to use "string" also for the new future-default dtype, but with the back-compat concerns about this already existing, we opted for "str" instead to differentiate.

My first thoughts why I would rather not go that route:

  • My preference to not include this NA vs NaN in the __str__, is exactly because it is so visible. Beginners who are just using the defaults of pandas, start with reading a csv file, and check the info or dtypes of the resulting dataframe, should IMO not be confronted with this.
    (but, this is very subjective, and I think this is the part that will be hard to get real agreement on, in the end this is very much a judgement call and depending on the audience)
  • While I certainly agree that str vs string is quite arbitrary and mostly so for historical reasons which users are also not aware of, it is somewhat consistent with the other dtypes we already have. We do have bool vs boolean already, and we also have int64 vs Int64 (and this is not spelled as int64[NA] or float64[NaN] and float64[NA]).
    (I think I have mentioned in the past that I am also fine with writing string as String for now to keep that consistency in capitals for the nullable dtypes)
    If we do this for string, then we should also do this for the other nullable dtypes? But then that involves again the larger discussion around logical dtypes etc as well, and the current string dtype proposal was a way to already get this specific dtype while leaving most of those larger discussions for after 3.0 (which is of course never entirely possible, as essentially this discussion also shows, because how we do things now for string, ideally matches with what we would decide later for the other dtypes in general ...).
  • While showing it in repr/str is one thing, the other aspect of this is then also allowing this as an alias for specifying the dtype. That is certainly something I have specifically argued against in the PDEP-14 discussions, i.e. to not introduce dtype aliases that are so specific (like str[NA], although I mostly argued for that in context of including storage in the alias). The reason that I would prefer not allowing this as a string alias is that I think this will indirectly encourage users to actually use this (while if we only allow "str"/"string" or pd.StringDtype(storage=.., ..), I think more users will tend to use the short generic string aliases, instead of the constructor with specifying the storage and na_value explicitly).
    And the reason that I would personally like to avoid this as much as possible, is because I think 1) most people should not hardcode this in their code (although this only applies to storage of course, since the na_value already gets hardcoded indirectly right now as well through str vs string), to avoid thinking too much about it and to write more portable code, and 2) because this will be easier to update later (this one does apply to na_value as well): if at some point we want to move to nullable dtypes by default, and most people have been using "str" to specify the dtype, then we can make that an alias for the nullable version (initially behind an option of course) without otherwise having to change their code. While otherwise if people having been using "str[NaN]", they would have to change that and we would have to deprecate that string alias (in that future world where we only have nullable dypes (that I would still like to get to at some point), we can keep "str" and "string" as identical aliases for a long time or indefinitely, while keeping "str[NaN]" working as an alias would be very strange). But I know this last argument is very much dependent on the outcome of other discussions ..
    We could of course only show it in __str__ and not actually allow it as an alias for dtype specifications (if there is agreement about the alias part), but that is probably only more confusing?
  • This would also increase the options on the short term, because we have to keep the existing"string" working. Of course if we long term want to end up with str[NaN/NA], that is worth such transition period, but so personally I am not convinced this is the best end state

@simonjayhawkins
Copy link
Member

To address the concerns that have been raised about debugging and unexpected changes to the backend storage used after operations is to maybe consider only displaying the storage when it differs from the global option? The constructor keyword and string aliases to set the storage backend were after all a convenience to avoid context managers in development and testing?

This would allow us to hide the storage implementation detail in most situations which seems appropriate given that a change of storage backend is lossless for the object values and hence construction.

The downside could be that this argument is extended to the choice of missing value indicator which would again be lossless for the StringDtype on conversion between storage backends. This would not be the case for the nullable floats if np.nan is allowed alongside the pd.NA missing value indicator.

  • As comparison, for DatetimeTZDtype we do include the [] parametrization in .name (e.g. "datetime64[s, UTC]")

The parameterization is explicitly needed here to qualify the values.

@rhshadrach
Copy link
Member

Thanks for the response @jorisvandenbossche

If we do this for string, then we should also do this for the other nullable dtypes?

Other nullable types all use NA. It's only the strings that can be NA or NaN.

to avoid thinking too much about it and to write more portable code

Just because it runs does not mean it's not portable. The behavior changes in subtle ways depending on whether pyarrow is installed. Code should always fail loudly in that case.

We could of course only show it in __str__ and not actually allow it as an alias for dtype specifications (if there is agreement about the alias part), but that is probably only more confusing?

Agreed - we should not do this.

My preference to not include this NA vs NaN in the __str__, is exactly because it is so visible. Beginners who are just using the defaults of pandas, start with reading a csv file, and check the info or dtypes of the resulting dataframe, should IMO not be confronted with this.
(but, this is very subjective, and I think this is the part that will be hard to get real agreement on, in the end this is very much a judgement call and depending on the audience)

Yea, this is the root of the disagreement. Hiding important information from advanced users to make things easier for beginners seems to me to be wrong. We're pushing users down a path of changing the storage when we know there are differences in performance and behavior, all while saying "you shouldn't care! Any differences in behavior we'll call bugs and get to eventually". But this does nothing to alleviate what I fear will be pain points, the very least we could do is tell the user what dtypes they actually have. And of course the difference in behavior for NA-semantics will always exist - at least we're indicating that to the user (although I have to imagine from their perspective in a quite bizarre way).

I really do hope that I'm making this overblown - that users can really not care. Let's move forward and see what happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants