Make groupby transform-like op order match original data order #8720

isVoid · 2021-07-12T23:49:20Z

This PR makes transform-like groupby ops return results in the same row order as the input. For example, with groupby.shift:

In [21]: df.head(8)
Out[21]:
   key  val1
0    1    70
1    1    86
2    0    18
3    1    91
4    1    74
5    1    97
6    0    43
7    0    48

In [22]: df.groupby('key').shift(1).head(8)
Out[22]:
   val1
0  <NA>
1    70
2  <NA>
3    86
4    91
5    74
6    18
7    43

This affects groupby.scan and groupby.shift.
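
For readers without a GPU at hand, the ordering described above can be sketched in plain Python. This is only an illustration of the intended semantics, not cuDF's implementation; `grouped_shift` is a hypothetical helper name, and `None` stands in for `<NA>`.

```python
def grouped_shift(keys, values, periods=1, fill_value=None):
    """Shift values within each group; output rows follow the input row order.

    Illustrative only: assumes periods >= 1 and uses None in place of <NA>.
    """
    if periods < 1:
        raise ValueError("this sketch only handles periods >= 1")
    history = {}  # key -> values already seen for that group, in input order
    out = []
    for key, value in zip(keys, values):
        seen = history.setdefault(key, [])
        # The shifted value is the one seen `periods` rows earlier in the same group.
        out.append(seen[-periods] if len(seen) >= periods else fill_value)
        seen.append(value)
    return out


# Same data as df.head(8) in the example above.
keys = [1, 1, 0, 1, 1, 1, 0, 0]
val1 = [70, 86, 18, 91, 74, 97, 43, 48]
print(grouped_shift(keys, val1))
# [None, 70, None, 86, 91, 74, 18, 43]
```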

codecov · 2021-07-13T02:48:20Z

Codecov Report

Merging #8720 (dccabeb) into branch-21.10 (18f7c01) will decrease coverage by 0.06%.
The diff coverage is n/a.

❗ Current head dccabeb differs from pull request most recent head d827744. Consider uploading reports for the commit d827744 to get more accurate results

@@               Coverage Diff                @@
##           branch-21.10    #8720      +/-   ##
================================================
- Coverage         10.67%   10.61%   -0.07%     
================================================
  Files               110      116       +6     
  Lines             18271    19003     +732     
================================================
+ Hits               1951     2017      +66     
- Misses            16320    16986     +666

Impacted Files	Coverage Δ
python/cudf/cudf/__init__.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/__init__.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/categorical.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/column.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/lists.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/methods.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/numerical.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/string.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/struct.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/dataframe.py	`0.00% <ø> (ø)`
... and 75 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f704d6...d827744.

shwina · 2021-07-15T19:16:30Z

python/cudf/cudf/core/groupby/groupby.py

+            Table(value_columns._data), periods, fill_value
+        )
+        result = self.obj.__class__._from_table(result)
+        result = self._mimic_pandas_order(result)


This does a multi-column sort, which we can avoid by appending a column 0...N to the dataframe before the groupby and then sorting by that single column later.

isVoid · 2021-07-15

Caught up offline: this is certainly a good optimization over our current approach. To achieve it, libcudf would have to perform a "no-op" on the sequence column. However, a "no-op" doesn't fit into the current libcudf aggregation framework, because aggregations there are required to be binary (reduction) ops.

We discussed alternatives but agreed it is best to merge what we have so far and raise an issue to track the optimization ideas, so that more people can join the discussion.
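
For the tracking issue, the single-column idea could look roughly like the plain-Python sketch below. It is not libcudf or cuDF code: `shift_via_row_ids` is a hypothetical name, the stable sort by key stands in for a sort-based groupby, and the row-id column is simply carried through untouched, which is the "no-op" the aggregation framework does not currently offer.

```python
def shift_via_row_ids(keys, values, periods=1, fill_value=None):
    """Segmented shift in group-sorted order, then restore the input order
    by sorting on one row-id column instead of on all key columns."""
    n = len(keys)
    row_ids = list(range(n))  # the appended 0...N-1 sequence column

    # Stand-in for the sort-based groupby: rows become contiguous per group.
    order = sorted(row_ids, key=lambda i: keys[i])  # stable sort
    sorted_keys = [keys[i] for i in order]
    sorted_vals = [values[i] for i in order]
    sorted_ids = [row_ids[i] for i in order]  # passed through unchanged ("no-op")

    # Segmented shift: only take a value from the same group.
    shifted = []
    for pos in range(n):
        src = pos - periods
        same_group = 0 <= src < n and sorted_keys[src] == sorted_keys[pos]
        shifted.append(sorted_vals[src] if same_group else fill_value)

    # Single-column sort on the row ids restores the original row order.
    pairs = sorted(zip(sorted_ids, shifted), key=lambda pair: pair[0])
    restored = [value for _, value in pairs]
    return shifted, restored


keys = [1, 1, 0, 1, 1, 1, 0, 0]
val1 = [70, 86, 18, 91, 74, 97, 43, 48]
group_order, input_order = shift_via_row_ids(keys, val1)
print(group_order)  # [None, 18, 43, None, 70, 86, 91, 74]
print(input_order)  # [None, 70, None, 86, 91, 74, 18, 43]
```

The final sort touches only the row-id column, regardless of how many key columns the groupby uses.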

galipremsagar · 2021-07-20T02:42:10Z

@beckernick The issue (#8714) this PR fixes was scoped to 21.10. Was that intentional? If so, I think we need to retarget this PR to 21.10, as it currently targets 21.08.

harrism · 2021-07-21T22:17:02Z

Going to go ahead and move it.

isVoid · 2021-07-27T23:44:21Z

rerun tests

shwina

LGTM pending merge conflicts


isVoid · 2021-08-04T00:30:03Z

@gpucibot merge

Initial, groupby shift

77df03a

github-actions bot added the cuDF (Python) label Jul 12, 2021

isVoid added the feature request and breaking labels Jul 12, 2021

caryr35 added this to PR-WIP in v21.08 Release via automation Jul 13, 2021

scan_agg ordering

45940e7

isVoid marked this pull request as ready for review July 14, 2021 01:40

isVoid requested a review from a team as a code owner July 14, 2021 01:40

isVoid requested review from charlesbluca and marlenezw July 14, 2021 01:40

shwina reviewed Jul 15, 2021

charlesbluca approved these changes Jul 19, 2021

v21.08 Release automation moved this from PR-WIP to PR-Reviewer approved Jul 19, 2021

harrism removed this from PR-Reviewer approved in v21.08 Release Jul 21, 2021

harrism added this to PR-WIP in v21.10 Release via automation Jul 21, 2021

harrism changed the base branch from branch-21.08 to branch-21.10 July 21, 2021 22:16

rnyak mentioned this pull request Aug 2, 2021

[REVIEW] Ecom-rees preproc with NVTabular NVIDIA-Merlin/Transformers4Rec#26

Closed

shwina approved these changes Aug 3, 2021

v21.10 Release automation moved this from PR-WIP to PR-Reviewer approved Aug 3, 2021

Merge branch 'branch-21.10' of https://github.com/rapidsai/cudf into …

d827744

…8714

rapids-bot bot merged commit fdf47af into rapidsai:branch-21.10 Aug 4, 2021

v21.10 Release automation moved this from PR-Reviewer approved to Done Aug 4, 2021

isVoid mentioned this pull request Aug 6, 2021

[BUG] Groupby scans and segmented shift should preserve original index #8715

Closed
