Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask-cuDF cumulative groupby ops #10889

Merged

Conversation

brandon-b-miller
Copy link
Contributor

Closes #10296

These should actually just work if the following PRs get merged, after which this diff might be really small:

#10815
#10838
dask/dask#9074

@brandon-b-miller brandon-b-miller added feature request New feature or request Python Affects Python cuDF API. 0 - Blocked Cannot progress due to external reasons non-breaking Non-breaking change labels May 18, 2022
@brandon-b-miller brandon-b-miller changed the base branch from branch-22.06 to branch-22.08 May 25, 2022 18:55
rapids-bot bot pushed a commit that referenced this pull request Jul 8, 2022
During #10889 I found that the result was wrong for `cumcount` in the case of more than one single partition. Digging I found that this was because cuDF python always resets the index of `cumcount` operations meaning the index of the reassembled result would be wrong. It also needs the temporary object it groups on to have to original objects index in order for the post-processing functions to correctly set the index. This PR fixes it as such and adds a test.

example old behavior:
```python
>>> import pandas as pd
>>> import cudf
>>> df = pd.DataFrame({
...     'a':[1,2,3,4,5,6]
... }, index=[1,2,3,4,5,6]
... )
>>> df
   a
1  1
2  2
3  3
4  4
5  5
6  6
>>> df.groupby('a').cumcount()
1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64
>>> cudf.from_pandas(df).groupby('a').cumcount()
0    0
1    0
2    0
3    0
4    0
5    0
dtype: int32
```

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11188
@brandon-b-miller brandon-b-miller marked this pull request as ready for review July 20, 2022 20:58
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner July 20, 2022 20:58
@brandon-b-miller brandon-b-miller removed the 0 - Blocked Cannot progress due to external reasons label Jul 20, 2022
Comment on lines 60 to 61
@pytest.mark.parametrize("aggregation", CUMULATIVE_AGGS)
def test_groupby_cumulative(aggregation, pdf):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also be testing on series groupbys here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I added Series tests here and encountered what I think might be a bug in upstream dask. Here's a reproducer with no cuDF, modeled off of these tests:

import pandas as pd
import numpy as np
import dask.dataframe as dd

np.random.seed(0)
size=10
npartitions=2


pdf = pd.DataFrame(
    {
        "xx": np.random.randint(0, 5, size=size),
        "x": np.random.normal(size=size),
        "y": np.random.normal(size=size),
    }
)

ddf = dd.from_pandas(pdf, npartitions=npartitions)

pdf_grouped = pdf.groupby('xx').xx
ddf_grouped = ddf.groupby('xx').xx

pdf_grouped.cumsum()
ddf_grouped.cumsum().compute()

It's a little hard for me to reason about what the result "should" be here (we're aggregating one column of a dataframe groupby and taking...the cumulative sum of that?) but the above nets me different results for the last two lines. What do you think the best thing to do here is? I could file an issue and solve it before merging this, I could xfail this, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! IMO I would add the tests and xfail, this is what I've done for other tests that would otherwise fail here due to upstream Dask issues, for example:

pytest.param(
False,
["a", "b"],
marks=pytest.mark.xfail(
reason="https://github.com/dask/dask/issues/8817"
),
),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -679,7 +695,7 @@ def test_groupby_agg_redirect(aggregations):
],
)
def test_is_supported(arg, supported):
assert _aggs_supported(arg, SUPPORTED_AGGS) is supported
assert _aggs_supported(arg, AGGS) is supported
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky, but we might want to keep this as SUPPORTED_AGGS to make sure we don't eventually mess something up with support for new aggregations down the line:

Suggested change
assert _aggs_supported(arg, AGGS) is supported
assert _aggs_supported(arg, SUPPORTED_AGGS) is supported

Copy link
Member

@charlesbluca charlesbluca Jul 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, disregard this, I am forgetting that _aggs_supported really only needs to be tested for different groupby agg structures 😅 I think that a reasonable way to check that all aggregations are actually "supported" (i.e. use dask-cudf's groupby codepath) is to add the layer check I proposed in #10853

python/dask_cudf/dask_cudf/tests/test_groupby.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jul 26, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@3f7bb6b). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #10889   +/-   ##
===============================================
  Coverage                ?   86.38%           
===============================================
  Files                   ?      143           
  Lines                   ?    22767           
  Branches                ?        0           
===============================================
  Hits                    ?    19668           
  Misses                  ?     3099           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f7bb6b...9992e2e. Read the comment docs.

@brandon-b-miller
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4d2211e into rapidsai:branch-22.08 Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support for Dask groupby cumulative sum, count
2 participants