
BUG/API: sum of a string column with all-NaN or empty #60229

Closed
Tracked by #54792
jorisvandenbossche opened this issue Nov 7, 2024 · 20 comments · Fixed by #60936

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 7, 2024

We decided to allow the sum operation for the future string dtype (PR in #59853, based on discussion in #59328).

But I ran into a strange case in groupby where the end result contains "0" in case of an empty or all-NaN group.

Reproducible example:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": [1, 2, 2, 3, 3, 3],
        "col1": [np.nan, 2, np.nan, 4, 5, 6],
        "col2": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)
result = df.groupby("key").sum()

Currently, you get this:

>>> result
     col1 col2
key           
1     0.0    0
2     2.0    b
3    15.0  def

>>> result["col2"].values
array([0, 'b', 'def'], dtype=object)

So the "sum" operation has introduced a 0. Not very useful, I think, in the context of strings, but at least it is object dtype and can contain anything.

However, with pd.options.future.infer_string = True enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:

>>> result["col2"].values
<ArrowStringArrayNumpySemantics>
['0', 'b', 'def']
Length: 3, dtype: str

So the integer 0 has been converted to a string.

I think we certainly should not introduce this "0" string, and returning object dtype with a 0 is also quite useless, I think (but at least it is not inventing a new string in your data).
But if we have to return something, the empty string is probably the string equivalent of 0 for the "sum" operation?

cc @rhshadrach @WillAyd @Dr-Irv

@jorisvandenbossche jorisvandenbossche added Bug Groupby Strings String extension data type and string data labels Nov 7, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 7, 2024
@jorisvandenbossche
Member Author

The above is in a groupby context, but there is of course also the normal column / Series case for empty or all-NA data. The current behaviour for our StringDtype variants:

>>> pd.Series([np.nan], dtype=pd.StringDtype("pyarrow", na_value=np.nan)).sum()
''

>>> pd.Series([np.nan], dtype=pd.StringDtype("python", na_value=np.nan)).sum()
0

So for the default pyarrow backend, it does give the empty string. Now, I don't remember explicitly implementing or testing that behaviour in #59853, so I assume this is just "by accident" because of how it is implemented.
(Checking the implementation, this is indeed the case: it uses the join algorithm pc.binary_join(data_list, "") (where data_list is the data with missing values filtered out), which is essentially the equivalent of the Python logic "".join(data), so if there is nothing to join, you get an empty string.)
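That join-based logic can be sketched in pure Python (a minimal illustration; `join_sum` is a hypothetical name, not the actual pandas code path):

```python
def join_sum(values):
    # Equivalent of pc.binary_join(data_list, ""): drop the missing
    # values, then concatenate the rest with an empty separator.
    data_list = [v for v in values if isinstance(v, str)]
    return "".join(data_list)

print(repr(join_sum([float("nan")])))   # -> ''  (nothing to join)
print(repr(join_sum(["d", "e", "f"])))  # -> 'def'
```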

@WillAyd
Member

WillAyd commented Nov 7, 2024

I think the pyarrow-backed behavior here makes sense to use generically

@Dr-Irv
Contributor

Dr-Irv commented Nov 7, 2024

> I think the pyarrow-backed behavior here makes sense to use generically

I agree with Will. If doing a sum on strings, the default value is the empty string when all values are missing. Just like the default value is 0 when the dtype is int or float.
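For comparison, this is the long-standing pandas behaviour for numeric dtypes that the analogy refers to:

```python
import numpy as np
import pandas as pd

# The empty or all-NaN numeric sum returns the additive identity, 0.
print(pd.Series([], dtype="float64").sum())  # -> 0.0
print(pd.Series([np.nan]).sum())             # -> 0.0
```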

@Thanushri16

take

@jorisvandenbossche
Member Author

At the newcomers meeting, we only looked into the case of Series.sum() (the example in the second post above), but groupby().sum() has the same issue (as shown in the top post).

But just to note that those two cases require separate fixes in the code, so it is best to consider them as separate issues / PRs.

@Thanushri16

> At the newcomers meeting, we only looked into the case of Series.sum() (the example in the second post above), but groupby().sum() has the same issue (as shown in the top post).
>
> But just to note that those two cases require separate fixes in the code, so it is best to consider them as separate issues / PRs.

Will work on this fix and a separate fix for the series sum as well. Thanks @jorisvandenbossche

@snitish
Member

snitish commented Dec 1, 2024

@Thanushri16 are you still working on this fix? If not I'd like to take it.

@snitish
Member

snitish commented Dec 1, 2024

@jorisvandenbossche What are your thoughts on setting the sum to None instead of the empty string for all-NaN object values? I see two advantages of doing so:

  1. None is probably more appropriate for object values as opposed to the empty string
  2. We already default to None for object dtype in cases where we have fewer values to aggregate than min_count

So we either keep it consistent and set it to None in both cases (i.e. all-NaN values as well as fewer values than min_count), or change the behavior for both, which could have wider consequences.
The None solution also works when pd.options.future.infer_string = True, i.e. the output looks like this:

     col1 col2
key           
1     0.0  NaN
2     2.0    b
3    15.0  def

If this makes sense to you, here's a PR with the fix: #60458

@jorisvandenbossche
Member Author

@snitish it seems your PR is addressing the groupby().sum() case, which is good, but then I would propose that @Thanushri16 can still work on the Series.sum() case (or at least give them a bit more time to answer)

On the question of empty string vs None, a very quick drive-by comment: I think it is more consistent with numeric data to return an empty string (like we return 0) instead of None, and otherwise the min_count argument would no longer have any effect. But I will try to take a closer look later.

@snitish
Member

snitish commented Dec 1, 2024

@jorisvandenbossche Apologies for jumping the gun on this. I'll let @Thanushri16 work on the Series.sum() fix.

@WillAyd
Member

WillAyd commented Dec 6, 2024

I do wonder about the answer here in the context of the logical type system proposal #58455

Having the return value dependent upon the semantics of each data type seems kind of tricky. In an ideal world would these all just return a missing value indicator?

@rhshadrach
Member

> What are your thoughts on setting the sum to None instead of the empty string for all-NaN object values? I see two advantages of doing so:
>
>   1. None is probably more appropriate for object values as opposed to the empty string
>
>   2. We already default to None for object dtype in cases where we have fewer values to aggregate than min_count

I think there are two common scenarios that users may want and we should support.

  1. I want the empty sum to be the identity of the monoid.
  2. I want NA when there aren't enough observations.

On all NA data, this is accomplished by returning the identity (0 for numeric/timedelta, "" for string) when min_count = 0 and NA when min_count > 0.

On object dtype, we do not know what the identity is (if it exists at all), so NA (which for object dtype is currently np.nan) makes sense to me. It looks like we currently default to 0.
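For numeric dtypes, those two scenarios already map onto the min_count argument (illustrating current stable behaviour):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# Scenario 1: with the default min_count=0, the empty sum is the identity.
print(s.sum())             # -> 0.0

# Scenario 2: with min_count > 0 and too few observations, the result is NA.
print(s.sum(min_count=1))  # -> nan
```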

@Thanushri16

> > What are your thoughts on setting the sum to None instead of the empty string for all-NaN object values? I see two advantages of doing so:
> >
> >   1. None is probably more appropriate for object values as opposed to the empty string
> >   2. We already default to None for object dtype in cases where we have fewer values to aggregate than min_count
>
> I think there are two common scenarios that users may want and we should support.
>
>   1. I want the empty sum to be the identity of the monoid.
>   2. I want NA when there aren't enough observations.
>
> On all-NA data, this is accomplished by returning the identity (0 for numeric/timedelta, "" for string) when min_count = 0 and NA when min_count > 0.
>
> On object dtype, we do not know what the identity is (if it exists at all), so NA (which for object dtype is currently np.nan) makes sense to me. It looks like we currently default to 0.

Sorry, still a bit confused here and would like some clarification. For object dtype, it currently defaults to 0. Are we expecting to return NA, which is np.nan, or an empty string ''?

@rhshadrach
Member

> For object type, currently it defaults to 0. Are we expecting to return NA which is np.nan or an empty string ''?

I do not think we should associate object dtype with strings. Although that is currently a very common case, we hope that in the near future it will not be. As such, I do not think we should use the empty string as the result of an empty sum on object dtype. The only value that makes sense to me is the NA value for object dtype, which is currently np.nan.

import pandas as pd

from pandas.core.dtypes.missing import na_value_for_dtype

ser = pd.Series(dtype=object)
print(na_value_for_dtype(ser.dtype))
# nan

@Thanushri16

> > For object type, currently it defaults to 0. Are we expecting to return NA which is np.nan or an empty string ''?
>
> I do not think we should associate object dtype with strings. Although that is currently a very common case, we hope that in the near future it will not be. As such, I do not think we should use the empty string as the result of an empty sum on object dtype. The only value that makes sense to me is the NA value for object dtype, which is currently np.nan.
>
> import pandas as pd
>
> from pandas.core.dtypes.missing import na_value_for_dtype
>
> ser = pd.Series(dtype=object)
> print(na_value_for_dtype(ser.dtype))
> # nan

Got it. I'll create a PR with the np.nan fix for Series sum.

@Dr-Irv
Contributor

Dr-Irv commented Dec 9, 2024

> I do not think we should associate object dtype with strings. Although that is currently a very common case, we hope that in the near future it will not be. As such, I do not think we should use the empty string as the result of an empty sum on object dtype. The only value that makes sense to me is the NA value for object dtype, which is currently np.nan.

I agree with you, BUT if we implement this for v3.0, wouldn't this be a change in behavior? Not sure if it would cause existing user code to break or not.

@rhshadrach
Member

Agreed with your concerns here @Dr-Irv. @Thanushri16 - I think we should make sure there is a consensus here. It is also something we should maybe think about changing well after pandas 3.0 with the improvements to the string dtypes.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Dec 10, 2024

To be clear, this issue is about string dtype, not object dtype. We can certainly discuss the best behaviour for object dtype as well (and I agree that 0 most of the time will also not make sense for object dtype), but that is not critical to change for 3.0 (and let's also leave that for a separate issue to discuss?).

For 2.3 / 3.0, we want to fix sum() for the string dtype, and I think the agreed fix is to return an empty string instead of 0 for an empty/all-NaN series.

@WillAyd
Member

WillAyd commented Jan 28, 2025

This looks like one of the last blockers for the 2.3 release. Just so we are on the same page, do we all agree that the default value returned here should be "" when min_count=0 and the na_value of the associated string data type when min_count>0?
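Those proposed semantics can be sketched as a plain-Python reduction (`string_sum` is a hypothetical helper, not the pandas implementation; the NA value is shown as float("nan") for the NaN-backed string dtype):

```python
def string_sum(values, min_count=0):
    # Concatenate the non-missing strings; if fewer than min_count
    # non-missing values remain, return the dtype's NA value instead.
    vals = [v for v in values if isinstance(v, str)]
    if len(vals) < min_count:
        return float("nan")  # na_value of the NaN-backed string dtype
    return "".join(vals)

print(repr(string_sum([])))                     # -> ''  (min_count=0)
print(string_sum([float("nan")], min_count=1))  # -> nan
print(string_sum(["d", "e", "f"]))              # -> def
```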

@Dr-Irv
Contributor

Dr-Irv commented Jan 28, 2025

> This looks like one of the last blockers for the 2.3 release. Just so we are on the same page, do we all agree that the default value returned here should be "" when min_count=0 and the na_value of the associated string data type when min_count>0?

I agree.

Worth noting: I ran into a similar issue using min and max on a Series of strings that had np.nan as the NA sentinel with object dtype. See #4147 (that's a pretty old issue!). If you use the new string dtype, the problem originally posted there goes away for strings (but not necessarily for other Series with object dtype where the objects are not strings).
