-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/API: sum of a string column with all-NaN or empty #60229
Comments
The above is in a groupby context, but there is of course also the normal column / Series case for empty or all-NA data. The current behaviour for our StringDtype variants: >>> pd.Series([np.nan], dtype=pd.StringDtype("pyarrow", na_value=np.nan)).sum()
''
>>> pd.Series([np.nan], dtype=pd.StringDtype("python", na_value=np.nan)).sum()
0 So for the default pyarrow backend, it does give the empty string. Now, I don't remember explicitly implementing or testing that behaviour in #59853, so I assume this is just "by accident" because of how it is implemented. |
I think the pyarrow-backed behavior here makes sense to use generically |
I agree with Will. If doing a |
take |
At the newcomers meeting, we only looked into the case of But just to note that those two cases require separate fixes in the code, so best to consider that as separate issues / PRs. |
Will work on this fix and a separate fix for the series sum as well. Thanks @jorisvandenbossche |
@Thanushri16 are you still working on this fix? If not I'd like to take it. |
@jorisvandenbossche What are your thoughts on setting the sum to None instead of empty string for all-nan object values? I see two advantages of doing so -
So we either keep it consistent and set it to None in both cases (i.e. all nan values as well as fewer values than min_count), or change the behavior for both which could have wider consequences.
If this makes sense to you, here's a PR with the fix: #60458 |
@snitish it seems your PR is addressing the On the question of empty string vs None, very quick drive-by comment: I think it is more consistent with numeric data to return an empty string (like we return 0) instead of None, and also otherwise the |
@jorisvandenbossche Apologies for jumping the gun on this. I'll let @Thanushri16 work on the |
I do wonder about the answer here in the context of the logical type system proposal #58455 Having the return value dependent upon the semantics of each data type seems kind of tricky. In an ideal world would these all just return a missing value indicator? |
I think there are two common scenarios that users may want and we should support.
On all NA data, this is accomplished by returning the identity ( On object dtype, do not know what the identity is (if it exists at all), so NA (which for object dtype is np.nan currently) makes sense to me. It looks like currently we default to 0. |
Sorry still a bit confused here and would like for some clarification. For object type, currently it defaults to 0. Are we expecting to return NA which is np.nan or an empty string ''? |
I do not think we should associate object dtype with strings. Although that is currently a very common case, we hope that in the near future it will not be. As such, I do not think we should use the empty string as the result of an empty sum on object dtype. The only value that makes sense to me is the NA value for object dtype, which is currently from pandas.core.dtypes.missing import na_value_for_dtype
ser = pd.Series(dtype=object)
print(na_value_for_dtype(ser.dtype))
# nan |
Got it. I'll create a PR with the np.nan fix for Series sum. |
I agree with you, BUT if we implement this for v3.0, wouldn't this be a change in behavior? Not sure if it would cause existing user code to break or not. |
Agreed with your concerns here @Dr-Irv. @Thanushri16 - I think we should make sure there is a consensus here. It is also something we should maybe think about changing well after pandas 3.0 with the improvements to the string dtypes. |
To be clear, this issue is about string dtype, not object dtype. We can certainly discuss the best behaviour for object dtype as well (and I agree that 0 most of the time will also not make sense for object dtype), but that is not critical to change for 3.0 (and let's also leave that for a separate issue to discuss?). For 2.3 / 3.0, we want to fix |
This looks like one of the last blockers for the 2.3 release. Just so we are on the same page, do we all agree that the default value returned here should be |
I agree. Worth noting - I ran into a similar issue using |
We decided to allow the
sum
operation for the future string dtype (PR in #59853, based on discussion in #59328).But I ran into a strange case in groupby where the end result contains
"0"
in case of an empty or all-NaN group.Reproducible example:
Currently, you get this:
So the "sum" operation has introduced a
0
. Not very useful I think in context of strings, but at least it is object dtype and can contain anything.However, with
pd.options.future.infer_string = True
enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:So the integer
0
has been converted to a string.I think we certainly should not introduce this
"0"
string, and returning object dtype with0
is also quite useless I think (but at least not inventing a new string in your data).But if we have to return something, the empty string is probably the equivalent of 0 in case of string or the "sum" operation?
cc @rhshadrach @WillAyd @Dr-Irv
The text was updated successfully, but these errors were encountered: