Dask-cuDF cumulative groupby ops #10889
Conversation
During #10889 I found that the result was wrong for `cumcount` in the case of more than a single partition. Digging in, I found that this was because cuDF Python always resets the index of `cumcount` operations, meaning the index of the reassembled result would be wrong. The temporary object it groups on also needs to have the original object's index in order for the post-processing functions to correctly set the index. This PR fixes it as such and adds a test.

Example of the old behavior:

```python
>>> import pandas as pd
>>> import cudf
>>> df = pd.DataFrame({
...     'a': [1, 2, 3, 4, 5, 6]
... }, index=[1, 2, 3, 4, 5, 6]
... )
>>> df
   a
1  1
2  2
3  3
4  4
5  5
6  6
>>> df.groupby('a').cumcount()
1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64
>>> cudf.from_pandas(df).groupby('a').cumcount()
0    0
1    0
2    0
3    0
4    0
5    0
dtype: int32
```

Authors:
- https://github.com/brandon-b-miller

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #11188
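To make the index-restoring fix concrete, here is a minimal sketch written against plain pandas; the helper name is illustrative and this is not the actual cuDF internals:

```python
import pandas as pd

def cumcount_keep_index(df, by):
    # cumcount may come back with a reset RangeIndex (the old cuDF behavior
    # described above); re-attach the caller's original index so the
    # reassembled per-partition results line up correctly.
    result = df.groupby(by).cumcount()
    result.index = df.index
    return result

df = pd.DataFrame({"a": [1, 1, 2]}, index=[10, 20, 30])
print(cumcount_keep_index(df, "a"))  # index stays [10, 20, 30]
```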
```python
@pytest.mark.parametrize("aggregation", CUMULATIVE_AGGS)
def test_groupby_cumulative(aggregation, pdf):
```
Should we also be testing `Series` groupbys here?
So I added `Series` tests here and encountered what I think might be a bug in upstream Dask. Here's a reproducer with no cuDF, modeled off of these tests:
```python
import pandas as pd
import numpy as np
import dask.dataframe as dd

np.random.seed(0)
size = 10
npartitions = 2
pdf = pd.DataFrame(
    {
        "xx": np.random.randint(0, 5, size=size),
        "x": np.random.normal(size=size),
        "y": np.random.normal(size=size),
    }
)
ddf = dd.from_pandas(pdf, npartitions=npartitions)
pdf_grouped = pdf.groupby("xx").xx
ddf_grouped = ddf.groupby("xx").xx
pdf_grouped.cumsum()
ddf_grouped.cumsum().compute()
```
It's a little hard for me to reason about what the result "should" be here (we're aggregating one column of a dataframe groupby and taking the cumulative sum of that?), but the above yields different results for the last two lines. What do you think the best thing to do here is? I could file an issue and fix it before merging this, I could xfail the test, etc.
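Continuing from the reproducer above, one way to surface the mismatch programmatically is with dask's own testing helper; this is a hedged sketch that is expected to raise an `AssertionError` on versions exhibiting the bug:

```python
from dask.dataframe.utils import assert_eq

# Compares the dask result against the pandas result, aligning indexes;
# should fail while the upstream discrepancy is present.
assert_eq(ddf_grouped.cumsum(), pdf_grouped.cumsum())
```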
Thanks for catching this! IMO I would add the tests and xfail them; this is what I've done for other tests that would otherwise fail here due to upstream Dask issues. For example, in `cudf/python/dask_cudf/dask_cudf/tests/test_groupby.py` (lines 288 to 294 in 97adac5):
```python
pytest.param(
    False,
    ["a", "b"],
    marks=pytest.mark.xfail(
        reason="https://github.com/dask/dask/issues/8817"
    ),
),
```
Raised dask/dask#9313.
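Following the xfail pattern above, the new `Series` test could be marked against that issue. A minimal sketch, assuming hypothetical test and fixture names (the actual test in this PR may be structured differently):

```python
import pandas as pd
import pytest

@pytest.mark.xfail(reason="https://github.com/dask/dask/issues/9313")
@pytest.mark.parametrize("aggregation", ["cumsum", "cumcount"])
def test_groupby_series_cumulative(aggregation, pdf, ddf):
    # pdf/ddf are assumed fixtures mirroring the reproducer above.
    expect = getattr(pdf.groupby("xx").xx, aggregation)()
    got = getattr(ddf.groupby("xx").xx, aggregation)().compute()
    pd.testing.assert_series_equal(expect, got, check_dtype=False)
```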
```diff
@@ -679,7 +695,7 @@ def test_groupby_agg_redirect(aggregations):
     ],
 )
 def test_is_supported(arg, supported):
-    assert _aggs_supported(arg, SUPPORTED_AGGS) is supported
+    assert _aggs_supported(arg, AGGS) is supported
```
Nitpicky, but we might want to keep this as `SUPPORTED_AGGS` to make sure we don't eventually mess something up with support for new aggregations down the line:

```diff
-    assert _aggs_supported(arg, AGGS) is supported
+    assert _aggs_supported(arg, SUPPORTED_AGGS) is supported
```
Actually, disregard this; I am forgetting that `_aggs_supported` really only needs to be tested for different groupby agg structures 😅 I think that a reasonable way to check that all aggregations are actually "supported" (i.e. use dask-cudf's groupby codepath) is to add the layer check I proposed in #10853
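For context, here is a rough sketch of the kind of "layer check" being referred to; the exact mechanics proposed in #10853 may differ, and the layer-name substring below is an assumption, not the real dask-cudf layer name:

```python
def used_dask_cudf_groupby(result, substring="groupby_agg"):
    # `result.dask` is the collection's HighLevelGraph; its `layers` mapping
    # is keyed by layer name, so checking for an (assumed) substring in a
    # layer name indicates which groupby codepath produced the result.
    return any(substring in name for name in result.dask.layers)
```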
Codecov Report

```
@@           Coverage Diff            @@
##           branch-22.08    #10889   +/-   ##
===============================================
  Coverage              ?    86.38%
===============================================
  Files                 ?       143
  Lines                 ?     22767
  Branches              ?         0
===============================================
  Hits                  ?     19668
  Misses                ?      3099
  Partials              ?         0
```

Continue to review the full report at Codecov.
@gpucibot merge
Closes #10296

These should actually just work if the following PRs get merged, after which this diff might be really small:

- #10815
- #10838
- dask/dask#9074