[BUG] sort_values fails on groupby aggregation #14794

jacobtomlinson · 2024-01-19T11:27:04Z

Describe the bug
This bug was found while playing around with the One Billion Row Challenge.

Steps/Code to reproduce bug

import cudf
import cupy as cp

# Generate some sample data similar to the data in 1BRC
stations = cudf.Series(['Ankara', 'Kampala', 'Tallinn', 'Gjoa Haven', 'Luanda', 'Cairo', 'Phnom Penh', 'Thessaloniki', 'Split', 'Palermo', 'Ouarzazate', 'Mandalay'])
df = cudf.DataFrame({        
        "station": cp.random.randint(0, len(stations)-1, 10_000_000),
        "measure": cp.random.normal(0, 10.0, 10_000_000)
    })
df.station = df.station.map(stations)

# Perform the grouby aggregation
df = df.groupby("station").agg(["min", "max", "mean"])
df.columns = df.columns.droplevel()

# Sort
df.sort_values("station") # <------------ Raises a KeyError

# It works in Pandas
df.to_pandas().sort_values("station")  # Works!

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.sort_values("station")

File /opt/conda/lib/python3.10/site-packages/cudf/core/indexed_frame.py:2459, in IndexedFrame.sort_values(self, by, axis, ascending, inplace, kind, na_position, ignore_index)
   2454     return self
   2456 # argsort the `by` column
   2457 out = self._gather(
   2458     GatherMap.from_column_unchecked(
-> 2459         self._get_columns_by_label(by)._get_sorted_inds(
   2460             ascending=ascending, na_position=na_position
   2461         ),
   2462         len(self),
   2463         nullify=False,
   2464     ),
   2465     keep_index=not ignore_index,
   2466 )
   2467 if (
   2468     isinstance(self, cudf.core.dataframe.DataFrame)
   2469     and self._data.multiindex
   2470 ):
   2471     out.columns = self._data.to_pandas_index()

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs)
    112 @wraps(func)
    113 def inner(*args, **kwargs):
    114     libnvtx_push_range(self.attributes, self.domain.handle)
--> 115     result = func(*args, **kwargs)
    116     libnvtx_pop_range(self.domain.handle)
    117     return result

File /opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py:1959, in DataFrame._get_columns_by_label(self, labels, downcast)
   1950 @_cudf_nvtx_annotate
   1951 def _get_columns_by_label(
   1952     self, labels, *, downcast=False
   1953 ) -> Self | Series:
   1954     """
   1955     Return columns of dataframe by `labels`
   1956 
   1957     If downcast is True, try and downcast from a DataFrame to a Series
   1958     """
-> 1959     ca = self._data.select_by_label(labels)
   1960     if downcast:
   1961         if is_scalar(labels):

File /opt/conda/lib/python3.10/site-packages/cudf/core/column_accessor.py:381, in ColumnAccessor.select_by_label(self, key)
    379     if any(isinstance(k, slice) for k in key):
    380         return self._select_by_label_with_wildcard(key)
--> 381 return self._select_by_label_grouped(key)

File /opt/conda/lib/python3.10/site-packages/cudf/core/column_accessor.py:536, in ColumnAccessor._select_by_label_grouped(self, key)
    535 def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
--> 536     result = self._grouped_data[key]
    537     if isinstance(result, cudf.core.column.ColumnBase):
    538         return self.__class__({key: result}, multiindex=self.multiindex)

KeyError: 'station'

Expected behavior
The same code works in Pandas/Numpy.

Environment overview (please complete the following information)

docker run --gpus all --pull always --rm -it \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/notebooks:23.12-cuda12.0-py3.10

Additional context
Add any other context about the problem here.

GregoryKimball · 2024-08-07T05:01:29Z

This is still an issue in 24.10. If you look at the dataframe that throws the key error, it is because the index name is station and cudf only checks the columns. Pandas must also check the index for the sort key if it is a named index.

                    min        max      mean
station                                     
Cairo        -47.980218  49.310769 -0.002941
Phnom Penh   -46.659628  45.804911  0.014362
Ankara       -49.188951  51.845989 -0.011023
Gjoa Haven   -49.149847  46.392483 -0.000982
Tallinn      -45.047023  45.827091 -0.015824
Thessaloniki -46.644071  49.220242  0.005804
Ouarzazate   -52.082950  49.220931 -0.023041
Split        -46.843045  45.643914 -0.017685
Kampala      -49.175427  47.710701 -0.010595
Luanda       -46.135836  49.233791 -0.009899
Palermo      -48.390119  46.636100  0.000921

You can also add df = df.reset_index() to get the right answer from cudf.. but we still have a pandas matching issue.

Matt711 · 2024-08-09T13:12:51Z

xref #16507

jacobtomlinson added bug Something isn't working Needs Triage Need team to review and classify labels Jan 19, 2024

github-project-automation bot added this to cuDF/Dask/Numba/UCX Jan 19, 2024

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Jan 19, 2024

bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024

GregoryKimball added this to the Pandas API Alignment and Coverage milestone Aug 7, 2024

Matt711 self-assigned this Aug 8, 2024

mroeschke mentioned this issue Aug 9, 2024

Allow DataFrame.sort_values(by=) to select an index level #16519

Merged

3 tasks

rapids-bot bot closed this as completed in #16519 Aug 9, 2024

rapids-bot bot closed this as completed in 16aa0ea Aug 9, 2024

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Aug 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] sort_values fails on groupby aggregation #14794

[BUG] sort_values fails on groupby aggregation #14794

jacobtomlinson commented Jan 19, 2024

GregoryKimball commented Aug 7, 2024 •

edited

Loading

Matt711 commented Aug 9, 2024

[BUG] sort_values fails on groupby aggregation #14794

[BUG] sort_values fails on groupby aggregation #14794

Comments

jacobtomlinson commented Jan 19, 2024

GregoryKimball commented Aug 7, 2024 • edited Loading

Matt711 commented Aug 9, 2024

GregoryKimball commented Aug 7, 2024 •

edited

Loading