Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support lists as groupby keys #8039

Open
randerzander opened this issue Apr 22, 2021 · 10 comments
Open

[FEA] Support lists as groupby keys #8039

randerzander opened this issue Apr 22, 2021 · 10 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment cuDF (Python) Affects Python cuDF API. feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@randerzander
Copy link
Contributor

I'd like to be able to use lists as keys in a groupby:

import cudf

df = cudf.DataFrame({
    'id': [0, 1],
    'id_lst': [[0, 0], [1, 1]],
    'val': [0, 1]
})

df.groupby(['id', 'id_lst']).val.sum()

Result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/conda/envs/rapids/lib/python3.8/site-packages/pandas/core/arrays/categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    339             try:
--> 340                 codes, categories = factorize(values, sort=True)
    341             except TypeError as err:

~/conda/envs/rapids/lib/python3.8/site-packages/pandas/core/algorithms.py in factorize(values, sort, na_sentinel, size_hint)
    721 
--> 722         codes, uniques = factorize_array(
    723             values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value

~/conda/envs/rapids/lib/python3.8/site-packages/pandas/core/algorithms.py in factorize_array(values, na_sentinel, size_hint, na_value, mask)
    527     table = hash_klass(size_hint or len(values))
--> 528     uniques, codes = table.factorize(
    529         values, na_sentinel=na_sentinel, na_value=na_value, mask=mask

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'numpy.ndarray'

One workaround is once string list concatenation merges, converting id_lst to a tokenized string and using the string representation as the grouping key.

@randerzander randerzander added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. cuDF (Python) Affects Python cuDF API. labels Apr 22, 2021
@kkraus14 kkraus14 added this to Needs prioritizing in Feature Planning via automation Apr 22, 2021
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels May 24, 2021
@ttnghia ttnghia self-assigned this Apr 5, 2022
@jrhemstad jrhemstad assigned PointKernel and unassigned ttnghia Apr 29, 2022
rapids-bot bot pushed a commit that referenced this issue May 25, 2022
Related to #8039 and #10181 

Contributes to #10186 

This PR updates `groupby::hash` to use new row operators. It gets rid of the current "flattened nested column" logic and allows `groupby::hash` to handle `LIST` and `STRUCT` keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

It becomes a breaking PR since the updated `groupby::hash` will treat inner nulls as equal when top-level nulls are excluded 
 while the current behavior treats inner nulls as **unequal**.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Nghia Truong (https://github.com/ttnghia)
  - Devavret Makkar (https://github.com/devavret)

URL: #10770
@bdice
Copy link
Contributor

bdice commented Jun 6, 2022

This is partially resolved by #10770. The remaining work to close this issue is to implement lexicographical comparators < to enable sort-based groupby.

@GregoryKimball
Copy link
Contributor

This looks closer after merging #11129 but not quite working yet.

sorting on list columns works:

>>> import cudf
>>> df = cudf.DataFrame({'a':[[1,2],[1,1]], 'b':[1,2]})
>>> df.sort_values('a')
        a  b
1  [1, 1]  2
0  [1, 2]  1

but groupby does not due to a cuDF error. Perhaps list types aren't supported in the index as groupby would require

>>> df.groupby('a').mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/mixins/mixin_factory.py", line 11, in wrapper
    return method(self, *args1, *args2, **kwargs1, **kwargs2)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 530, in _reduce
    return self.agg(op)
  File "/opt/conda/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 458, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "/opt/conda/envs/rapids/lib/python3.8/functools.py", line 967, in __get__
    val = self.func(instance)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 378, in _groupby
    [*self.grouping.keys._columns], dropna=self._dropna
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 1745, in keys
    return cudf.core.index.as_index(
  File "/opt/conda/envs/rapids/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/index.py", line 2848, in as_index
    return _index_from_data({kwargs.get("name", None): arbitrary})
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/core/index.py", line 116, in _index_from_data
    return index_class_type._from_data(data, name)
UnboundLocalError: local variable 'index_class_type' referenced before assignment

@PointKernel
Copy link
Member

This looks closer after merging #11129 but not quite working yet.

The issue would be resolved by #11792.

@shwina
Copy link
Contributor

shwina commented Oct 3, 2022

Perhaps list types aren't supported in the index as groupby would require

Correct -- there's no ListIndex type in cuDF (or Pandas for that matter). There are two ways we could resolve this:

  1. Add a ListIndex type
  2. Support grouping on a list column only when as_index=False. At the same time, we could improve the error message when as_index=True.

For now I think (2) would be the simpler/faster option.

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Oct 11, 2022

Note: Implementing the ListIndex type would also solve #6932

@shwina
Copy link
Contributor

shwina commented Oct 11, 2022

Understood. I'd be hesitant to expand our API surface area by introducing a new ListIndex type until we have compelling use cases. This is especially true around indexes, as we are concurrently pushing for Pandas to rely less on indexes (pandas-dev/pandas#48880).

@shwina
Copy link
Contributor

shwina commented Nov 1, 2022

@randerzander, is this request still a high priority for you? @PointKernel and I have been experimenting with exposing list-groupby in Python and have run into a few different issues.

Notably, even Pandas does not support the example snippet first posted, so we have nothing to test against:

import pandas as pd

df = pd.DataFrame({
    'id': [0, 1],
    'id_lst': [[0, 0], [1, 1]],
    'val': [0, 1]
})

df.groupby(['id', 'id_lst']).val.sum()
# TypeError: unhashable type: 'list'

Supporting grouping by lists in Python will need us to define and document semantics different from Pandas, and involves some non-trivial refactoring of our code. I'm trying to understand if there's a pressing need to do that.

@randerzander
Copy link
Contributor Author

@randerzander, is this request still a high priority for you?

It's not high priority, no

@PointKernel
Copy link
Member

PointKernel commented Nov 1, 2022

#11792 partially solves the issue and now lists can be used as groupby keys in libcudf.

Further work to close this issue will be tracked via #12037

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment cuDF (Python) Affects Python cuDF API. feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
No open projects
Feature Planning
Needs prioritizing
Development

No branches or pull requests

8 participants