DataFrame.nlargest/nsmallest fail on sliced frame #50

Closed
jcrist opened this issue Aug 4, 2017 · 1 comment

Comments


jcrist commented Aug 4, 2017

nlargest works fine if the frame is sliced from the start, but raises NotImplementedError if the slice starts in the middle of the frame:

In [29]: import pandas as pd, pygdf as gd

In [30]: df = pd.DataFrame({'x': range(100), 'y': list(map(float, range(100)))})

In [31]: gdf = gd.DataFrame.from_pandas(df)

In [32]: gdf[:10].nlargest(5, 'x')
Out[32]:
     x    y
9    9  9.0
8    8  8.0
7    7  7.0
6    6  6.0
5    5  5.0

In [33]: gdf[10:20].nlargest(5, 'x')
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-33-51dda14cf36d> in <module>()
----> 1 gdf[10:20].nlargest(5, 'x')

~/Code/pygdf/pygdf/dataframe.py in nlargest(self, n, columns, keep)
    439         * Only a single column is supported in *columns*
    440         """
--> 441         return self._n_largest_or_smallest('nlargest', n, columns, keep)
    442
    443     def nsmallest(self, n, columns, keep='first'):

~/Code/pygdf/pygdf/dataframe.py in _n_largest_or_smallest(self, method, n, columns, keep)
    465                 df[k] = sorted_series
    466             else:
--> 467                 df[k] = self[k].take(df.index.gpu_values)
    468         return df
    469

~/Code/pygdf/pygdf/dataframe.py in __setitem__(self, name, col)
    144             self._cols[name] = self._prepare_series_for_add(col)
    145         else:
--> 146             self.add_column(name, col)
    147
    148     def __delitem__(self, name):

~/Code/pygdf/pygdf/dataframe.py in add_column(self, name, data)
    304         if name in self._cols:
    305             raise NameError('duplicated column name {!r}'.format(name))
--> 306         series = self._prepare_series_for_add(data)
    307         self._cols[name] = series
    308

~/Code/pygdf/pygdf/dataframe.py in _prepare_series_for_add(self, col)
    290             return series
    291         else:
--> 292             raise NotImplementedError("join needed")
    293
    294     def add_column(self, name, data):

NotImplementedError: join needed

jcrist commented Aug 4, 2017

Looks like this is related to what happens to the index when sort_values is called on a series. On a dataframe the index is kept, but on a series it's dropped and replaced with 0..len(series).

In [1]: import pandas as pd, pygdf as gd

In [2]: df = pd.DataFrame({'x': range(100), 'y': list(map(float, range(100)))})

In [3]: gdf = gd.DataFrame.from_pandas(df)

In [4]: p = gdf[10:20]

In [5]: p
Out[5]:
      x    y
10   10 10.0
11   11 11.0
12   12 12.0
13   13 13.0
14   14 14.0
15   15 15.0
16   16 16.0
17   17 17.0
18   18 18.0
19   19 19.0

In [6]: p.sort_values('x')
Out[6]:
      x    y
10   10 10.0
11   11 11.0
12   12 12.0
13   13 13.0
14   14 14.0
15   15 15.0
16   16 16.0
17   17 17.0
18   18 18.0
19   19 19.0

In [7]: x = p.x

In [8]: x.sort_values()
Out[8]:

 0   10
 1   11
 2   12
 3   13
 4   14
 5   15
 6   16
 7   17
 8   18
 9   19
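
To make the difference between the two behaviours concrete, here is a toy model in plain Python. It is only a sketch: `sort_keep_index` and `sort_drop_index` are hypothetical helpers written for illustration, not pygdf functions, and they only mimic the outputs of `Out[6]` and `Out[8]` above. Whether this is exactly how the sliced `nlargest` call ends up at the "join needed" branch is an assumption.

```python
# Toy model in plain Python; none of these helpers are pygdf APIs.

def sort_keep_index(index, values):
    """Sort values ascending and carry each original index label along."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [index[i] for i in order], [values[i] for i in order]


def sort_drop_index(values):
    """Sort values ascending and relabel the result 0..len(values)-1."""
    out = sorted(values)
    return list(range(len(out))), out


# Same labels and values as column x of gdf[10:20] in the session above.
index = list(range(10, 20))
values = list(range(10, 20))

kept_index, kept_values = sort_keep_index(index, values)
dropped_index, dropped_values = sort_drop_index(values)

# The original labels survive only in the first variant.
labels_kept = set(kept_index) == set(index)
labels_dropped = set(dropped_index) == set(index)

print(kept_index)
print(dropped_index)
print(labels_kept, labels_dropped)
```

With a slice that starts at row 0 the two variants produce the same set of labels, which would be consistent with `gdf[:10]` working while `gdf[10:20]` does not.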

sklam added a commit to sklam/pygdf that referenced this issue Aug 23, 2017
raydouglass pushed a commit that referenced this issue Nov 7, 2023
Seeing how much this impacts the pandas test suite pass rate related to documenting how we test

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers: None

URL: rapidsai/cudf-private#50