You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
importcudfimportcupyascp# Generate some sample data similar to the data in 1BRCstations=cudf.Series(['Ankara', 'Kampala', 'Tallinn', 'Gjoa Haven', 'Luanda', 'Cairo', 'Phnom Penh', 'Thessaloniki', 'Split', 'Palermo', 'Ouarzazate', 'Mandalay'])
df=cudf.DataFrame({
"station": cp.random.randint(0, len(stations)-1, 10_000_000),
"measure": cp.random.normal(0, 10.0, 10_000_000)
})
df.station=df.station.map(stations)
# Perform the grouby aggregationdf=df.groupby("station").agg(["min", "max", "mean"])
df.columns=df.columns.droplevel()
# Sortdf.sort_values("station") # <------------ Raises a KeyError# It works in Pandasdf.to_pandas().sort_values("station") # Works!
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.sort_values("station")
File /opt/conda/lib/python3.10/site-packages/cudf/core/indexed_frame.py:2459, in IndexedFrame.sort_values(self, by, axis, ascending, inplace, kind, na_position, ignore_index)
2454 return self
2456 # argsort the `by` column
2457 out = self._gather(
2458 GatherMap.from_column_unchecked(
-> 2459 self._get_columns_by_label(by)._get_sorted_inds(
2460 ascending=ascending, na_position=na_position
2461 ),
2462 len(self),
2463 nullify=False,
2464 ),
2465 keep_index=not ignore_index,
2466 )
2467 if (
2468 isinstance(self, cudf.core.dataframe.DataFrame)
2469 and self._data.multiindex
2470 ):
2471 out.columns = self._data.to_pandas_index()
File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:115, in annotate.__call__.<locals>.inner(*args, **kwargs)
112@wraps(func)
113definner(*args, **kwargs):
114 libnvtx_push_range(self.attributes, self.domain.handle)
--> 115 result = func(*args, **kwargs)
116 libnvtx_pop_range(self.domain.handle)
117return result
File /opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py:1959, in DataFrame._get_columns_by_label(self, labels, downcast)
1950 @_cudf_nvtx_annotate
1951 def _get_columns_by_label(
1952 self, labels, *, downcast=False
1953 ) -> Self | Series:
1954 """
1955 Return columns of dataframe by `labels`
1956
1957 If downcast is True, try and downcast from a DataFrame to a Series
1958 """
-> 1959 ca = self._data.select_by_label(labels)
1960 if downcast:
1961 if is_scalar(labels):
File /opt/conda/lib/python3.10/site-packages/cudf/core/column_accessor.py:381, in ColumnAccessor.select_by_label(self, key)
379ifany(isinstance(k, slice) for k in key):
380returnself._select_by_label_with_wildcard(key)
--> 381 return self._select_by_label_grouped(key)
File /opt/conda/lib/python3.10/site-packages/cudf/core/column_accessor.py:536, in ColumnAccessor._select_by_label_grouped(self, key)
535def_select_by_label_grouped(self, key: Any) -> ColumnAccessor:
--> 536 result = self._grouped_data[key]
537ifisinstance(result, cudf.core.column.ColumnBase):
538returnself.__class__({key: result}, multiindex=self.multiindex)
KeyError: 'station'
Expected behavior
The same code works in Pandas/Numpy.
Environment overview (please complete the following information)
This is still an issue in 24.10. If you look at the dataframe that throws the key error, it is because the index name is station and cudf only checks the columns. Pandas must also check the index for the sort key if it is a named index.
min max mean
station
Cairo -47.980218 49.310769 -0.002941
Phnom Penh -46.659628 45.804911 0.014362
Ankara -49.188951 51.845989 -0.011023
Gjoa Haven -49.149847 46.392483 -0.000982
Tallinn -45.047023 45.827091 -0.015824
Thessaloniki -46.644071 49.220242 0.005804
Ouarzazate -52.082950 49.220931 -0.023041
Split -46.843045 45.643914 -0.017685
Kampala -49.175427 47.710701 -0.010595
Luanda -46.135836 49.233791 -0.009899
Palermo -48.390119 46.636100 0.000921
You can also add df = df.reset_index() to get the right answer from cudf.. but we still have a pandas matching issue.
Describe the bug
This bug was found while playing around with the One Billion Row Challenge.
Steps/Code to reproduce bug
Expected behavior
The same code works in Pandas/Numpy.
Environment overview (please complete the following information)
Additional context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: