DataFrame.nlargest/nsmallest fail on sliced frame #50

jcrist · 2017-08-04T13:42:40Z

This works fine if the frame is sliced from the start, but fails if the slice is in the middle:

In [29]: import pandas as pd, pygdf as gd

In [30]: df = pd.DataFrame({'x': range(100), 'y': list(map(float, range(100)))})

In [31]: gdf = gd.DataFrame.from_pandas(df)

In [32]: gdf[:10].nlargest(5, 'x')
Out[32]:
     x    y
9    9  9.0
8    8  8.0
7    7  7.0
6    6  6.0
5    5  5.0

In [33]: gdf[10:20].nlargest(5, 'x')
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-33-51dda14cf36d> in <module>()
----> 1 gdf[10:20].nlargest(5, 'x')

~/Code/pygdf/pygdf/dataframe.py in nlargest(self, n, columns, keep)
    439         * Only a single column is supported in *columns*
    440         """
--> 441         return self._n_largest_or_smallest('nlargest', n, columns, keep)
    442
    443     def nsmallest(self, n, columns, keep='first'):

~/Code/pygdf/pygdf/dataframe.py in _n_largest_or_smallest(self, method, n, columns, keep)
    465                 df[k] = sorted_series
    466             else:
--> 467                 df[k] = self[k].take(df.index.gpu_values)
    468         return df
    469

~/Code/pygdf/pygdf/dataframe.py in __setitem__(self, name, col)
    144             self._cols[name] = self._prepare_series_for_add(col)
    145         else:
--> 146             self.add_column(name, col)
    147
    148     def __delitem__(self, name):

~/Code/pygdf/pygdf/dataframe.py in add_column(self, name, data)
    304         if name in self._cols:
    305             raise NameError('duplicated column name {!r}'.format(name))
--> 306         series = self._prepare_series_for_add(data)
    307         self._cols[name] = series
    308

~/Code/pygdf/pygdf/dataframe.py in _prepare_series_for_add(self, col)
    290             return series
    291         else:
--> 292             raise NotImplementedError("join needed")
    293
    294     def add_column(self, name, data):

NotImplementedError: join needed
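For reference, here is the same call in plain pandas, which is presumably the behaviour pygdf is aiming for. This is only a sketch of the expected result, not pygdf code; the name `expected` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': range(100), 'y': list(map(float, range(100)))})

# Same slice as In [33], but on the pandas frame.
# pandas keeps the original index labels of the selected rows.
expected = df[10:20].nlargest(5, 'x')
print(expected)
```

In pandas the result keeps the labels of the original rows (19 down to 15), so the sliced position of the frame does not matter.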

jcrist · 2017-08-04T14:05:53Z

Looks like this is related to what happens to the index when sort_values is called on a series. On a dataframe the index is kept, but on a series it's dropped and replaced with 0..len(series)-1.

In [1]: import pandas as pd, pygdf as gd

In [2]: df = pd.DataFrame({'x': range(100), 'y': list(map(float, range(100)))})

In [3]: gdf = gd.DataFrame.from_pandas(df)

In [4]: p = gdf[10:20]

In [5]: p
Out[5]:
      x    y
10   10 10.0
11   11 11.0
12   12 12.0
13   13 13.0
14   14 14.0
15   15 15.0
16   16 16.0
17   17 17.0
18   18 18.0
19   19 19.0

In [6]: p.sort_values('x')
Out[6]:
      x    y
10   10 10.0
11   11 11.0
12   12 12.0
13   13 13.0
14   14 14.0
15   15 15.0
16   16 16.0
17   17 17.0
18   18 18.0
19   19 19.0

In [7]: x = p.x

In [8]: x.sort_values()
Out[8]:

 0   10
 1   11
 2   12
 3   13
 4   14
 5   15
 6   16
 7   17
 8   18
 9   19
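A rough pandas-only sketch of why a reset index would matter here. It emulates the dropped index with reset_index(drop=True); that emulation, and the guess that the mismatch is what ends up in the "join needed" branch, are assumptions on my part, not something read from the pygdf source. The variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'x': range(100), 'y': list(map(float, range(100)))})
p = df[10:20]

# pandas keeps the index labels when sorting a series.
sorted_x = p.x.sort_values(ascending=False)
kept_labels = list(sorted_x.index[:5])

# Emulate the behaviour seen in Out[8]: index replaced by 0..len-1.
dropped = sorted_x.reset_index(drop=True)
dropped_labels = list(dropped.index[:5])

# Labels 0..4 do not exist in a frame indexed 10..19 ...
overlap_middle = set(dropped_labels) & set(p.index)
# ... but they do exist in a frame sliced from the start.
overlap_start = set(dropped_labels) & set(df[:10].index)

print(kept_labels, dropped_labels, overlap_middle, overlap_start)
```

If that reading is right, it would also explain why gdf[:10] works: the reset labels happen to fall inside that frame's own index.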


sklam added a commit to sklam/pygdf that referenced this issue Aug 23, 2017

Add tests for rapidsai#50

0fa61b8

sklam added a commit to sklam/pygdf that referenced this issue Aug 23, 2017

Fix issue rapidsai#50

c1447df

seibert closed this as completed in 0b80f29 Aug 23, 2017

manojkumardas7 mentioned this issue Apr 8, 2019

[BUG] cannot load library 'librmm.so' #1283

Closed
