Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] sort_values fails on groupby aggregation #14794

Closed
jacobtomlinson opened this issue Jan 19, 2024 · 2 comments · Fixed by #16519
Closed

[BUG] sort_values fails on groupby aggregation #14794

jacobtomlinson opened this issue Jan 19, 2024 · 2 comments · Fixed by #16519
Assignees
Labels
bug Something isn't working

Comments

@jacobtomlinson
Copy link
Member

Describe the bug
This bug was found while playing around with the One Billion Row Challenge.

Steps/Code to reproduce bug

import cudf
import cupy as cp

# Generate some sample data similar to the data in 1BRC
stations = cudf.Series(['Ankara', 'Kampala', 'Tallinn', 'Gjoa Haven', 'Luanda', 'Cairo', 'Phnom Penh', 'Thessaloniki', 'Split', 'Palermo', 'Ouarzazate', 'Mandalay'])
df = cudf.DataFrame({        
        "station": cp.random.randint(0, len(stations)-1, 10_000_000),
        "measure": cp.random.normal(0, 10.0, 10_000_000)
    })
df.station = df.station.map(stations)

# Perform the grouby aggregation
df = df.groupby("station").agg(["min", "max", "mean"])
df.columns = df.columns.droplevel()

# Sort
df.sort_values("station") # <------------ Raises a KeyError

# It works in Pandas
df.to_pandas().sort_values("station")  # Works!
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.sort_values("station")

File /opt/conda/lib/python3.10/site-packages/cudf/core/indexed_frame.py:2459, in IndexedFrame.sort_values(self, by, axis, ascending, inplace, kind, na_position, ignore_index)
   2454     return self
   2456 # argsort the `by` column
   2457 out = self._gather(
   2458     GatherMap.from_column_unchecked(
-> 2459         self._get_columns_by_label(by)._get_sorted_inds(
   2460             ascending=ascending, na_position=na_position
   2461         ),
   2462         len(self),
   2463         nullify=False,
   2464     ),
   2465     keep_index=not ignore_index,
   2466 )
   2467 if (
   2468     isinstance(self, cudf.core.dataframe.DataFrame)
   2469     and self._data.multiindex
   2470 ):
   2471     out.columns = self._data.to_pandas_index()

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs)
    112 @wraps(func)
    113 def inner(*args, **kwargs):
    114     libnvtx_push_range(self.attributes, self.domain.handle)
--> 115     result = func(*args, **kwargs)
    116     libnvtx_pop_range(self.domain.handle)
    117     return result

File /opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py:1959, in DataFrame._get_columns_by_label(self, labels, downcast)
   1950 @_cudf_nvtx_annotate
   1951 def _get_columns_by_label(
   1952     self, labels, *, downcast=False
   1953 ) -> Self | Series:
   1954     """
   1955     Return columns of dataframe by `labels`
   1956 
   1957     If downcast is True, try and downcast from a DataFrame to a Series
   1958     """
-> 1959     ca = self._data.select_by_label(labels)
   1960     if downcast:
   1961         if is_scalar(labels):

File /opt/conda/lib/python3.10/site-packages/cudf/core/column_accessor.py:381, in ColumnAccessor.select_by_label(self, key)
    379     if any(isinstance(k, slice) for k in key):
    380         return self._select_by_label_with_wildcard(key)
--> 381 return self._select_by_label_grouped(key)

File /opt/conda/lib/python3.10/site-packages/cudf/core/column_accessor.py:536, in ColumnAccessor._select_by_label_grouped(self, key)
    535 def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
--> 536     result = self._grouped_data[key]
    537     if isinstance(result, cudf.core.column.ColumnBase):
    538         return self.__class__({key: result}, multiindex=self.multiindex)

KeyError: 'station'

Expected behavior
The same code works in Pandas/Numpy.

Environment overview (please complete the following information)

docker run --gpus all --pull always --rm -it \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/notebooks:23.12-cuda12.0-py3.10

Additional context
Add any other context about the problem here.

@jacobtomlinson jacobtomlinson added bug Something isn't working Needs Triage Need team to review and classify labels Jan 19, 2024
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Aug 7, 2024

This is still an issue in 24.10. If you look at the dataframe that throws the key error, it is because the index name is station and cudf only checks the columns. Pandas must also check the index for the sort key if it is a named index.

                    min        max      mean
station                                     
Cairo        -47.980218  49.310769 -0.002941
Phnom Penh   -46.659628  45.804911  0.014362
Ankara       -49.188951  51.845989 -0.011023
Gjoa Haven   -49.149847  46.392483 -0.000982
Tallinn      -45.047023  45.827091 -0.015824
Thessaloniki -46.644071  49.220242  0.005804
Ouarzazate   -52.082950  49.220931 -0.023041
Split        -46.843045  45.643914 -0.017685
Kampala      -49.175427  47.710701 -0.010595
Luanda       -46.135836  49.233791 -0.009899
Palermo      -48.390119  46.636100  0.000921

You can also add df = df.reset_index() to get the right answer from cudf.. but we still have a pandas matching issue.

@Matt711
Copy link
Contributor

Matt711 commented Aug 9, 2024

xref #16507

@rapids-bot rapids-bot bot closed this as completed in 16aa0ea Aug 9, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Aug 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants