Dask-cuDF cumulative groupby ops #10889
Conversation
During #10889 I found that the result was wrong for `cumcount` in the case of more than a single partition. Digging in, I found that this was because cuDF Python always resets the index of `cumcount` operations, meaning the index of the reassembled result would be wrong. The temporary object it groups on also needs to have the original object's index in order for the post-processing functions to correctly set the index. This PR fixes it as such and adds a test.

Example of the old behavior:

```python
>>> import pandas as pd
>>> import cudf
>>> df = pd.DataFrame({
...     'a': [1, 2, 3, 4, 5, 6]
... }, index=[1, 2, 3, 4, 5, 6]
... )
>>> df
   a
1  1
2  2
3  3
4  4
5  5
6  6
>>> df.groupby('a').cumcount()
1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64
>>> cudf.from_pandas(df).groupby('a').cumcount()
0    0
1    0
2    0
3    0
4    0
5    0
dtype: int32
```

Authors:
- https://github.com/brandon-b-miller

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #11188
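To make the index-restoring fix concrete, here is a minimal sketch written against plain pandas; the helper name is illustrative and this is not the actual cuDF internals:

```python
import pandas as pd

def cumcount_keep_index(df, by):
    # cumcount may come back with a reset RangeIndex (the old cuDF behavior
    # described above); re-attach the caller's original index so the
    # reassembled per-partition results line up correctly.
    result = df.groupby(by).cumcount()
    result.index = df.index
    return result

df = pd.DataFrame({"a": [1, 1, 2]}, index=[10, 20, 30])
print(cumcount_keep_index(df, "a"))  # index stays [10, 20, 30]
```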
```python
@pytest.mark.parametrize("aggregation", CUMULATIVE_AGGS)
def test_groupby_cumulative(aggregation, pdf):
```
Should we also be testing `Series` groupbys here?
So I added `Series` tests here and encountered what I think might be a bug in upstream Dask. Here's a reproducer with no cuDF, modeled off of these tests:
```python
import pandas as pd
import numpy as np
import dask.dataframe as dd

np.random.seed(0)
size = 10
npartitions = 2
pdf = pd.DataFrame(
    {
        "xx": np.random.randint(0, 5, size=size),
        "x": np.random.normal(size=size),
        "y": np.random.normal(size=size),
    }
)
ddf = dd.from_pandas(pdf, npartitions=npartitions)
pdf_grouped = pdf.groupby("xx").xx
ddf_grouped = ddf.groupby("xx").xx
pdf_grouped.cumsum()
ddf_grouped.cumsum().compute()
```
It's a little hard for me to reason about what the result "should" be here (we're aggregating one column of a dataframe groupby and taking the cumulative sum of that?), but the above yields different results for the last two lines. What do you think the best thing to do here is? I could file an issue and fix it before merging this, I could xfail the test, etc.
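Continuing from the reproducer above, one way to surface the mismatch programmatically is with dask's own testing helper; this is a hedged sketch that is expected to raise an `AssertionError` on versions exhibiting the bug:

```python
from dask.dataframe.utils import assert_eq

# Compares the dask result against the pandas result, aligning indexes;
# should fail while the upstream discrepancy is present.
assert_eq(ddf_grouped.cumsum(), pdf_grouped.cumsum())
```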
Thanks for catching this! IMO I would add the tests and xfail them; this is what I've done for other tests that would otherwise fail here due to upstream Dask issues. For example, in `cudf/python/dask_cudf/dask_cudf/tests/test_groupby.py` (lines 288 to 294 in 97adac5):
```python
pytest.param(
    False,
    ["a", "b"],
    marks=pytest.mark.xfail(
        reason="https://github.com/dask/dask/issues/8817"
    ),
),
```
Raised dask/dask#9313.
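Following the xfail pattern above, the new `Series` test could be marked against that issue. A minimal sketch, assuming hypothetical test and fixture names (the actual test in this PR may be structured differently):

```python
import pandas as pd
import pytest

@pytest.mark.xfail(reason="https://github.com/dask/dask/issues/9313")
@pytest.mark.parametrize("aggregation", ["cumsum", "cumcount"])
def test_groupby_series_cumulative(aggregation, pdf, ddf):
    # pdf/ddf are assumed fixtures mirroring the reproducer above.
    expect = getattr(pdf.groupby("xx").xx, aggregation)()
    got = getattr(ddf.groupby("xx").xx, aggregation)().compute()
    pd.testing.assert_series_equal(expect, got, check_dtype=False)
```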
```diff
@@ -679,7 +695,7 @@ def test_groupby_agg_redirect(aggregations):
     ],
 )
 def test_is_supported(arg, supported):
-    assert _aggs_supported(arg, SUPPORTED_AGGS) is supported
+    assert _aggs_supported(arg, AGGS) is supported
```
Nitpicky, but we might want to keep this as `SUPPORTED_AGGS` to make sure we don't eventually mess something up with support for new aggregations down the line:

```diff
-    assert _aggs_supported(arg, AGGS) is supported
+    assert _aggs_supported(arg, SUPPORTED_AGGS) is supported
```
Actually, disregard this; I am forgetting that `_aggs_supported` really only needs to be tested for different groupby agg structures 😅 I think that a reasonable way to check that all aggregations are actually "supported" (i.e. use dask-cudf's groupby codepath) is to add the layer check I proposed in #10853
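For context, here is a rough sketch of the kind of "layer check" being referred to; the exact mechanics proposed in #10853 may differ, and the layer-name substring below is an assumption, not the real dask-cudf layer name:

```python
def used_dask_cudf_groupby(result, substring="groupby_agg"):
    # `result.dask` is the collection's HighLevelGraph; its `layers` mapping
    # is keyed by layer name, so checking for an (assumed) substring in a
    # layer name indicates which groupby codepath produced the result.
    return any(substring in name for name in result.dask.layers)
```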
Codecov Report

```
@@           Coverage Diff            @@
##           branch-22.08    #10889   +/-   ##
===============================================
  Coverage              ?    86.38%
===============================================
  Files                 ?       143
  Lines                 ?     22767
  Branches              ?         0
===============================================
  Hits                  ?     19668
  Misses                ?      3099
  Partials              ?         0
```

Continue to review the full report at Codecov.
@gpucibot merge
Closes #10296

These should actually just work if the following PRs get merged, after which this diff might be really small:

- #10815
- #10838
- dask/dask#9074