
BUG/API: sum of a string column with all-NaN or empty #60229

Closed
Tracked by #54792
jorisvandenbossche opened this issue Nov 7, 2024 · 20 comments · Fixed by #60936

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 7, 2024

We decided to allow the sum operation for the future string dtype (PR in #59853, based on discussion in #59328).

But I ran into a strange case in groupby where the end result contains "0" in case of an empty or all-NaN group.

Reproducible example:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": [1, 2, 2, 3, 3, 3],
        "col1": [np.nan, 2, np.nan, 4, 5, 6],
        "col2": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)
result = df.groupby("key").sum()

Currently, you get this:

>>> result
     col1 col2
key           
1     0.0    0
2     2.0    b
3    15.0  def

>>> result["col2"].values
array([0, 'b', 'def'], dtype=object)

So the "sum" operation has introduced a 0. Not very useful, I think, in the context of strings, but at least it is object dtype and can contain anything.

However, with pd.options.future.infer_string = True enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:

>>> result["col2"].values
<ArrowStringArrayNumpySemantics>
['0', 'b', 'def']
Length: 3, dtype: str

So the integer 0 has been converted to a string.

I think we certainly should not introduce this "0" string, and returning object dtype with a 0 is also quite useless, I think (but at least it is not inventing a new string in your data).
But if we have to return something, the empty string is probably the string equivalent of 0 for the "sum" operation?

cc @rhshadrach @WillAyd @Dr-Irv

@jorisvandenbossche jorisvandenbossche added Bug Groupby Strings String extension data type and string data labels Nov 7, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 7, 2024
@jorisvandenbossche
Member Author

The above is in a groupby context, but there is of course also the normal column / Series case for empty or all-NA data. The current behaviour for our StringDtype variants:

>>> pd.Series([np.nan], dtype=pd.StringDtype("pyarrow", na_value=np.nan)).sum()
''

>>> pd.Series([np.nan], dtype=pd.StringDtype("python", na_value=np.nan)).sum()
0

So for the default pyarrow backend, it does give the empty string. Now, I don't remember explicitly implementing or testing that behaviour in #59853, so I assume this is just "by accident" because of how it is implemented.
(Checking the implementation, this is indeed the case: it uses the join algorithm pc.binary_join(data_list, "") (where data_list is the data with missing values filtered out), which is essentially the equivalent of the Python logic "".join(data), so if there is nothing to join, you get an empty string.)
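That join-based logic can be sketched in pure Python (a minimal illustration; `join_sum` is a hypothetical name, not the actual pandas code path):

```python
def join_sum(values):
    # Equivalent of pc.binary_join(data_list, ""): drop the missing
    # values, then concatenate the rest with an empty separator.
    data_list = [v for v in values if isinstance(v, str)]
    return "".join(data_list)

print(repr(join_sum([float("nan")])))   # -> ''  (nothing to join)
print(repr(join_sum(["d", "e", "f"])))  # -> 'def'
```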

@WillAyd
Member

WillAyd commented Nov 7, 2024

I think the pyarrow-backed behavior here makes sense to use generically

@Dr-Irv
Contributor

Dr-Irv commented Nov 7, 2024

> I think the pyarrow-backed behavior here makes sense to use generically

I agree with Will. If doing a sum on strings, the default value is the empty string when all values are missing. Just like the default value is 0 when the dtype is int or float.
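For comparison, this is the long-standing pandas behaviour for numeric dtypes that the analogy refers to:

```python
import numpy as np
import pandas as pd

# The empty or all-NaN numeric sum returns the additive identity, 0.
print(pd.Series([], dtype="float64").sum())  # -> 0.0
print(pd.Series([np.nan]).sum())             # -> 0.0
```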

@Thanushri16

take

@jorisvandenbossche
Member Author

At the newcomers meeting, we only looked into the case of Series.sum() (the example in the second post above), but groupby().sum() has the same issue (as shown in the top post).

But just to note that those two cases require separate fixes in the code, so it is best to consider them as separate issues / PRs.

@Thanushri16

> At the newcomers meeting, we only looked into the case of Series.sum() (the example in the second post above), but groupby().sum() has the same issue (as shown in the top post).
>
> But just to note that those two cases require separate fixes in the code, so it is best to consider them as separate issues / PRs.

Will work on this fix and a separate fix for the series sum as well. Thanks @jorisvandenbossche

@snitish
Member

snitish commented Dec 1, 2024

@Thanushri16 are you still working on this fix? If not I'd like to take it.

@snitish
Member

snitish commented Dec 1, 2024

@jorisvandenbossche What are your thoughts on setting the sum to None instead of the empty string for all-NaN object values? I see two advantages of doing so:

  1. None is probably more appropriate for object values as opposed to the empty string
  2. We already default to None for object dtype in cases where we have fewer values to aggregate than min_count

So we either keep it consistent and set it to None in both cases (i.e. all-NaN values as well as fewer values than min_count), or change the behavior for both, which could have wider consequences.
The None solution also works when pd.options.future.infer_string = True, i.e. the output looks like this:

     col1 col2
key           
1     0.0  NaN
2     2.0    b
3    15.0  def

If this makes sense to you, here's a PR with the fix: #60458

@jorisvandenbossche
Member Author

@snitish it seems your PR is addressing the groupby().sum() case, which is good, but then I would propose that @Thanushri16 can still work on the Series.sum() case (or at least give them a bit more time to answer)

On the question of empty string vs None, a very quick drive-by comment: I think it is more consistent with numeric data to return an empty string (like we return 0) instead of None, and otherwise the min_count argument would no longer have any effect. But I will try to take a closer look later.

@snitish
Member

snitish commented Dec 1, 2024

@jorisvandenbossche Apologies for jumping the gun on this. I'll let @Thanushri16 work on the Series.sum() fix.

@WillAyd
Member

WillAyd commented Dec 6, 2024

I do wonder about the answer here in the context of the logical type system proposal #58455

Having the return value dependent upon the semantics of each data type seems kind of tricky. In an ideal world would these all just return a missing value indicator?

@rhshadrach
Member

> What are your thoughts on setting the sum to None instead of the empty string for all-NaN object values? I see two advantages of doing so:
>
>   1. None is probably more appropriate for object values as opposed to the empty string
>
>   2. We already default to None for object dtype in cases where we have fewer values to aggregate than min_count

I think there are two common scenarios that users may want and we should support.

  1. I want the empty sum to be the identity of the monoid.
  2. I want NA when there aren't enough observations.

On all NA data, this is accomplished by returning the identity (0 for numeric/timedelta, "" for string) when min_count = 0 and NA when min_count > 0.

On object dtype, we do not know what the identity is (if it exists at all), so NA (which for object dtype is currently np.nan) makes sense to me. It looks like we currently default to 0.
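For numeric dtypes, those two scenarios already map onto the min_count argument (illustrating current stable behaviour):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# Scenario 1: with the default min_count=0, the empty sum is the identity.
print(s.sum())             # -> 0.0

# Scenario 2: with min_count > 0 and too few observations, the result is NA.
print(s.sum(min_count=1))  # -> nan
```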

@Thanushri16

> > What are your thoughts on setting the sum to None instead of the empty string for all-NaN object values? I see two advantages of doing so:
> >
> >   1. None is probably more appropriate for object values as opposed to the empty string
> >   2. We already default to None for object dtype in cases where we have fewer values to aggregate than min_count
>
> I think there are two common scenarios that users may want and we should support.
>
>   1. I want the empty sum to be the identity of the monoid.
>   2. I want NA when there aren't enough observations.
>
> On all-NA data, this is accomplished by returning the identity (0 for numeric/timedelta, "" for string) when min_count = 0 and NA when min_count > 0.
>
> On object dtype, we do not know what the identity is (if it exists at all), so NA (which for object dtype is currently np.nan) makes sense to me. It looks like we currently default to 0.

Sorry, still a bit confused here and would like some clarification. For object dtype, it currently defaults to 0. Are we expecting to return NA, which is np.nan, or an empty string ''?

@rhshadrach
Member

> For object type, currently it defaults to 0. Are we expecting to return NA which is np.nan or an empty string ''?

I do not think we should associate object dtype with strings. Although that is currently a very common case, we hope that in the near future it will not be. As such, I do not think we should use the empty string as the result of an empty sum on object dtype. The only value that makes sense to me is the NA value for object dtype, which is currently np.nan.

import pandas as pd

from pandas.core.dtypes.missing import na_value_for_dtype

ser = pd.Series(dtype=object)
print(na_value_for_dtype(ser.dtype))
# nan

@Thanushri16

> > For object type, currently it defaults to 0. Are we expecting to return NA which is np.nan or an empty string ''?
>
> I do not think we should associate object dtype with strings. Although that is currently a very common case, we hope that in the near future it will not be. As such, I do not think we should use the empty string as the result of an empty sum on object dtype. The only value that makes sense to me is the NA value for object dtype, which is currently np.nan.
>
> import pandas as pd
>
> from pandas.core.dtypes.missing import na_value_for_dtype
>
> ser = pd.Series(dtype=object)
> print(na_value_for_dtype(ser.dtype))
> # nan

Got it. I'll create a PR with the np.nan fix for Series sum.

@Dr-Irv
Contributor

Dr-Irv commented Dec 9, 2024

> I do not think we should associate object dtype with strings. Although that is currently a very common case, we hope that in the near future it will not be. As such, I do not think we should use the empty string as the result of an empty sum on object dtype. The only value that makes sense to me is the NA value for object dtype, which is currently np.nan.

I agree with you, BUT if we implement this for v3.0, wouldn't this be a change in behavior? Not sure if it would cause existing user code to break or not.

@rhshadrach
Member

Agreed with your concerns here @Dr-Irv. @Thanushri16 - I think we should make sure there is a consensus here. It is also something we should maybe think about changing well after pandas 3.0 with the improvements to the string dtypes.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Dec 10, 2024

To be clear, this issue is about string dtype, not object dtype. We can certainly discuss the best behaviour for object dtype as well (and I agree that 0 most of the time will also not make sense for object dtype), but that is not critical to change for 3.0 (and let's also leave that for a separate issue to discuss?).

For 2.3 / 3.0, we want to fix sum() for the string dtype, and I think the agreed fix is to return an empty string instead of 0 for an empty/all-NaN series.

@WillAyd
Member

WillAyd commented Jan 28, 2025

This looks like one of the last blockers for the 2.3 release. Just so we are on the same page, do we all agree that the default value returned here should be "" when min_count=0 and the na_value of the associated string data type when min_count>0?
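Those proposed semantics can be sketched as a plain-Python reduction (`string_sum` is a hypothetical helper, not the pandas implementation; the NA value is shown as float("nan") for the NaN-backed string dtype):

```python
def string_sum(values, min_count=0):
    # Concatenate the non-missing strings; if fewer than min_count
    # non-missing values remain, return the dtype's NA value instead.
    vals = [v for v in values if isinstance(v, str)]
    if len(vals) < min_count:
        return float("nan")  # na_value of the NaN-backed string dtype
    return "".join(vals)

print(repr(string_sum([])))                     # -> ''  (min_count=0)
print(string_sum([float("nan")], min_count=1))  # -> nan
print(string_sum(["d", "e", "f"]))              # -> def
```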

@Dr-Irv
Contributor

Dr-Irv commented Jan 28, 2025

> This looks like one of the last blockers for the 2.3 release. Just so we are on the same page, do we all agree that the default value returned here should be "" when min_count=0 and the na_value of the associated string data type when min_count>0?

I agree.

Worth noting: I ran into a similar issue using min and max on a Series of strings that had np.nan as the NA sentinel with object dtype. See #4147 (that's a pretty old issue!). If you use the new string dtype, the problem originally posted there goes away for strings (but not necessarily for other Series with object dtype where the objects are not strings).
