[FEA] Support lists as groupby keys #8039

randerzander · 2021-04-22T18:26:08Z

I'd like to be able to use lists as keys in a groupby:

import cudf

df = cudf.DataFrame({
    'id': [0, 1],
    'id_lst': [[0, 0], [1, 1]],
    'val': [0, 1]
})

df.groupby(['id', 'id_lst']).val.sum()

Result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/conda/envs/rapids/lib/python3.8/site-packages/pandas/core/arrays/categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    339             try:
--> 340                 codes, categories = factorize(values, sort=True)
    341             except TypeError as err:

~/conda/envs/rapids/lib/python3.8/site-packages/pandas/core/algorithms.py in factorize(values, sort, na_sentinel, size_hint)
    721 
--> 722         codes, uniques = factorize_array(
    723             values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value

~/conda/envs/rapids/lib/python3.8/site-packages/pandas/core/algorithms.py in factorize_array(values, na_sentinel, size_hint, na_value, mask)
    527     table = hash_klass(size_hint or len(values))
--> 528     uniques, codes = table.factorize(
    529         values, na_sentinel=na_sentinel, na_value=na_value, mask=mask

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'numpy.ndarray'

One workaround, once string list concatenation merges, is to convert id_lst to a tokenized string and use that string representation as the grouping key.
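A rough sketch of that workaround, written in plain Python rather than cuDF so it runs anywhere. The helper name `tokenize` and the `|` separator are assumptions made for illustration, not cuDF APIs. Each list is joined into one string, and that string stands in for the list in the grouping key:

```python
from collections import defaultdict

def tokenize(values, sep="|"):
    """Join list elements into one string usable as a grouping key."""
    return sep.join(str(v) for v in values)

# Same shape as the DataFrame above, plus one extra row so a group has two members.
rows = [
    {"id": 0, "id_lst": [0, 0], "val": 0},
    {"id": 1, "id_lst": [1, 1], "val": 1},
    {"id": 1, "id_lst": [1, 1], "val": 5},
]

# Group by (id, tokenized id_lst) and sum val.
sums = defaultdict(int)
for row in rows:
    key = (row["id"], tokenize(row["id_lst"]))
    sums[key] += row["val"]

result = dict(sums)
```

The separator must not occur inside any element, otherwise two different lists could produce the same string.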

github-actions · 2021-05-22T19:13:47Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Related to #8039 and #10181. Contributes to #10186.

This PR updates `groupby::hash` to use new row operators. It gets rid of the current "flattened nested column" logic and allows `groupby::hash` to handle `LIST` and `STRUCT` keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

It becomes a breaking PR since the updated `groupby::hash` will treat inner nulls as equal when top-level nulls are excluded while the current behavior treats inner nulls as **unequal**.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Nghia Truong (https://github.com/ttnghia)
- Devavret Makkar (https://github.com/devavret)

URL: #10770
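To make the inner-null change concrete, here is a small pure-Python model of the two behaviours. It is not libcudf code: `None` stands in for a null, and the function `group_keys` and its `inner_nulls_equal` flag are hypothetical names chosen for this illustration.

```python
def group_keys(keys, inner_nulls_equal):
    """Assign a group id to each list key, excluding top-level nulls.

    With inner_nulls_equal=True, lists with nulls in the same positions
    share a group. With False, a key containing an inner null never
    matches any other row, so it forms its own group.
    """
    groups = {}       # lookup key -> group id
    assignment = []   # group id per input row (None if the row is excluded)
    for row, key in enumerate(keys):
        if key is None:               # top-level null: excluded from grouping
            assignment.append(None)
            continue
        has_inner_null = any(v is None for v in key)
        if has_inner_null and not inner_nulls_equal:
            lookup = ("unique", row)  # cannot collide with another row
        else:
            lookup = ("value", tuple(key))
        if lookup not in groups:
            groups[lookup] = len(groups)
        assignment.append(groups[lookup])
    return assignment

keys = [[1, None], [1, None], None, [2, 3], [2, 3]]
new_behaviour = group_keys(keys, inner_nulls_equal=True)
old_behaviour = group_keys(keys, inner_nulls_equal=False)
```

Under this model the first two rows share a group with the new semantics and are split into separate groups with the old ones.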

bdice · 2022-06-06T19:33:45Z

This is partially resolved by #10770. The remaining work to close this issue is to implement lexicographical comparators (`<`) to enable sort-based groupby.
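As a sketch of why a lexicographical `<` is what sort-based groupby needs, the following plain-Python example (an illustration only, not the libcudf implementation) sorts rows by their list key and then forms groups from adjacent equal keys. Python lists already compare lexicographically, so they play the role of the comparator here; the sample data is made up.

```python
from itertools import groupby

keys = [[1, 2], [1, 1], [1, 2]]
vals = [1, 2, 5]

# Sort row indices by key; list comparison in Python is lexicographical.
order = sorted(range(len(keys)), key=lambda i: keys[i])

# After sorting, equal keys are adjacent, so one pass can form the groups.
means = []
for key, idx in groupby(order, key=lambda i: keys[i]):
    group = [vals[i] for i in idx]
    means.append((key, sum(group) / len(group)))
```

A hash-based groupby only needs equality and a hash on the keys, which is why it could be enabled first.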

GregoryKimball · 2022-10-03T18:20:56Z

This looks closer after merging #11129 but not quite working yet.

Sorting on list columns works:

>>> import cudf
>>> df = cudf.DataFrame({'a':[[1,2],[1,1]], 'b':[1,2]})
>>> df.sort_values('a')
        a  b
1  [1, 1]  2
0  [1, 2]  1

But groupby does not, due to a cuDF error. Perhaps list types aren't supported in the index, as groupby would require.

>>> df.groupby('a').mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/mixins/mixin_factory.py", line 11, in wrapper
    return method(self, *args1, *args2, **kwargs1, **kwargs2)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 530, in _reduce
    return self.agg(op)
  File "/opt/conda/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 458, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "/opt/conda/envs/rapids/lib/python3.8/functools.py", line 967, in __get__
    val = self.func(instance)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 378, in _groupby
    [*self.grouping.keys._columns], dropna=self._dropna
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 1745, in keys
    return cudf.core.index.as_index(
  File "/opt/conda/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/index.py", line 2848, in as_index
    return _index_from_data({kwargs.get("name", None): arbitrary})
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/index.py", line 116, in _index_from_data
    return index_class_type._from_data(data, name)
UnboundLocalError: local variable 'index_class_type' referenced before assignment

PointKernel · 2022-10-03T18:22:56Z

> This looks closer after merging #11129 but not quite working yet.

The issue would be resolved by #11792.

shwina · 2022-10-03T18:47:59Z

> Perhaps list types aren't supported in the index as groupby would require

Correct -- there's no ListIndex type in cuDF (or Pandas for that matter). There are two ways we could resolve this:

1. Add a ListIndex type.
2. Support grouping on a list column only when as_index=False. At the same time, we could improve the error message when as_index=True.

For now I think (2) would be the simpler/faster option.
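A minimal sketch of what option (2) could look like as a validation step. Everything here is hypothetical: the function `check_groupby_keys`, the dtype strings, and the message wording are assumptions for illustration and are not taken from the cuDF code base.

```python
def check_groupby_keys(key_dtypes, as_index):
    """Hypothetical guard: allow list-typed keys only when as_index=False.

    key_dtypes maps key column names to a dtype label such as "list" or "int64".
    Returns the names of the list-typed key columns.
    """
    list_keys = [name for name, dtype in key_dtypes.items() if dtype == "list"]
    if list_keys and as_index:
        # A clear message instead of an internal UnboundLocalError.
        raise TypeError(
            f"Grouping by list column(s) {list_keys} requires as_index=False, "
            "because there is no index type that can hold list values."
        )
    return list_keys
```

With a guard like this, `df.groupby('a').mean()` on a list column would fail early with an actionable message, while `as_index=False` would proceed.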

GregoryKimball · 2022-10-11T02:01:43Z

Note: Implementing the ListIndex type would also solve #6932

shwina · 2022-10-11T14:35:49Z

Understood. I'd be hesitant to expand our API surface area by introducing a new ListIndex type until we have compelling use cases. This is especially true around indexes, as we are concurrently pushing for Pandas to rely less on indexes (pandas-dev/pandas#48880).

shwina · 2022-11-01T14:51:44Z

@randerzander, is this request still a high priority for you? @PointKernel and I have been experimenting with exposing list-groupby in Python and have run into a few different issues.

Notably, even Pandas does not support the example snippet first posted, so we have nothing to test against:

import pandas as pd

df = pd.DataFrame({
    'id': [0, 1],
    'id_lst': [[0, 0], [1, 1]],
    'val': [0, 1]
})

df.groupby(['id', 'id_lst']).val.sum()
# TypeError: unhashable type: 'list'
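The pandas error comes from Python lists being unhashable. A short standard-library sketch (not pandas or cuDF code, with made-up sample data) shows the failure and the usual workaround of converting each list to a tuple before using it as a key:

```python
# Lists are mutable and therefore unhashable.
try:
    hash([0, 0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# A tuple with the same contents is hashable and can key a dict.
totals = {}
for key, val in [([0, 0], 0), ([1, 1], 1), ([1, 1], 2)]:
    totals[tuple(key)] = totals.get(tuple(key), 0) + val
```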

Supporting grouping by lists in Python will need us to define and document semantics different from Pandas, and involves some non-trivial refactoring of our code. I'm trying to understand if there's a pressing need to do that.

randerzander · 2022-11-01T17:10:46Z

> @randerzander, is this request still a high priority for you?

It's not high priority, no

PointKernel · 2022-11-01T20:01:17Z

#11792 partially solves the issue and now lists can be used as groupby keys in libcudf.

Further work to close this issue will be tracked via #12037

randerzander added the feature request ("New feature or request"), libcudf ("Affects libcudf (C++/CUDA) code.") and Python ("Affects Python cuDF API.") labels Apr 22, 2021

github-actions bot added the inactive-30d label May 22, 2021

jrhemstad added the 0 - Backlog ("In queue waiting for assignment") label and removed the inactive-30d label May 24, 2021

beckernick added this to the List and Struct data types and operations milestone Jul 14, 2021

This was referenced Feb 1, 2022

[FEA] group by aggregations that include a list of strings as the grouping type #10181

Closed

[FEA] Story - Supporting row operators on nested types #10186

Closed

revans2 mentioned this issue Mar 10, 2022

[FEA] Support Group-By on Array[String] NVIDIA/spark-rapids#4656

Closed

sameerz mentioned this issue Apr 4, 2022

[FEA] Support GroupBy Array[INT] NVIDIA/spark-rapids#5096

Closed

ttnghia self-assigned this Apr 5, 2022

jrhemstad assigned PointKernel and unassigned ttnghia Apr 29, 2022

PointKernel mentioned this issue May 10, 2022

Update groupby::hash to use new row operators for keys #10770

Merged

PointKernel mentioned this issue Sep 27, 2022

Support nested types as groupby keys in libcudf #11792

Merged

GregoryKimball mentioned this issue Oct 4, 2022

[FEA] Implement full support for nested types #11844

Closed

GregoryKimball mentioned this issue Apr 3, 2023

Aggregation Unique supports the list format. #12437

Closed

PointKernel closed this as completed Sep 30, 2024
