Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support named aggregations in df.groupby().agg() #15967

Closed
GregoryKimball opened this issue Jun 11, 2024 · 9 comments · Fixed by #16528
Closed

[FEA] Support named aggregations in df.groupby().agg() #15967

GregoryKimball opened this issue Jun 11, 2024 · 9 comments · Fixed by #16528
Assignees
Labels
External Issues or PRs created by external contributors feature request New feature or request good first issue Good for newcomers Python Affects Python cuDF API.

Comments

@GregoryKimball
Copy link
Contributor

Is your feature request related to a problem? Please describe.
I wish cudf python supported column name keywords in df.groupby().agg()

See pandas docs here

Here the output has one column for each element in **kwargs. The name of the column is keyword, whereas the value determines the aggregation used to compute the values in the column.

Describe the solution you'd like
Let's add support for column name keywords in df.groupby().agg(). This should be a python-only change for API convenience. It will prevent host fallback in cudf.pandas

Describe alternatives you've considered
Instead of:
df.groupby('a').agg(mycol=('b','sum'))
we can say
df.groupby('a').agg({'b': 'sum'}).rename(columns={'b': 'named_agg'})
and renaming columns doesn't work with a list of aggregations on the same column
df.groupby('a').agg({'b': ['sum', 'mean']}).rename(columns={???: 'sum'})

It seems like we could write a convenience API here to capture any kwargs and their tuples to make new columns.

Additional context
Many nice pandas queries for TPC-H were published in the fireducks-dev repo, and the agg column name keywords style is common.

@GregoryKimball GregoryKimball added feature request New feature or request Python Affects Python cuDF API. labels Jun 11, 2024
@github-actions github-actions bot added the External Issues or PRs created by external contributors label Jun 11, 2024
@GregoryKimball GregoryKimball added the good first issue Good for newcomers label Jun 11, 2024
@NeilGeorge1
Copy link

Hi! First timer here.

I'm eager to contribute and would like to take on this issue. Could you please assign it to me?

Thanks!

@wence-
Copy link
Contributor

wence- commented Jun 11, 2024

Related request: #15118

@Matt711
Copy link
Contributor

Matt711 commented Jun 11, 2024

Hi @NeilGeorge1, thanks for your interest in this issue! I'll assign it to you now. Also, take a look at our contributing guidelines here.

@NeilGeorge1
Copy link

Hi @Matt711
Thank you for assigning me this issue. I shall look into the contributing guidelines as well

AyodeAwe pushed a commit that referenced this issue Jun 11, 2024
## Description
This PR resolves two bugs in the recent pr #15945

## external issue labeling
Recent runs show that it is labeling [issues
created](#15967) by team members
as `External`

Using graphQL to explore the authorAssociation shows
`"authorAssociation": "MEMBER"` - I've updated the permissions to be
specific to the job in an attempt to ensure that we have the permissions
we need. Testing this action in personal repos shows it works as
expected so not 100% on what's going on.

A PR was also unable to run due to the token only having read
permissions, so hopefully this is a two birds one stone fix.

It may be beneficial to re-run
https://github.com/rapidsai/cudf/actions/runs/9462546964/job/26065765728
with debug mode on to see if `author_association` is different to the
action (which would be concerning)

*edit test*

## project automation
This fixes the workflow incorrectly calling my personal workflows for
testing.


## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [ ] ~New or existing tests cover these changes.~
- [ ] ~The documentation is up to date with these changes.~
@NeilGeorge1
Copy link

Hi,

I've successfully implemented the feature as described above. However, while running pytest to verify the functionality across the project, I encountered an issue: 'module cudf is missing' , which is necessary for the tests to run properly.

Could you please provide guidance on how to resolve this? It seems that cudf should have been included or configured differently in the environment setup.

Should i manually install cudf?

Thank you for your assistance.

@wence-
Copy link
Contributor

wence- commented Jun 24, 2024

Hi,

I've successfully implemented the feature as described above. However, while running pytest to verify the functionality across the project, I encountered an issue: 'module cudf is missing' , which is necessary for the tests to run properly.

Could you please provide guidance on how to resolve this? It seems that cudf should have been included or configured differently in the environment setup.

Should i manually install cudf?

Thank you for your assistance.

Having cloned the repo, have you built cudf? If not, that is probably your issue. The simplest (IMO) way to do this is to use the devcontainer setup in the repository and then (inside the devcontainer) run build-cudf.

@NeilGeorge1
Copy link

Hi,
I've successfully implemented the feature as described above. However, while running pytest to verify the functionality across the project, I encountered an issue: 'module cudf is missing' , which is necessary for the tests to run properly.
Could you please provide guidance on how to resolve this? It seems that cudf should have been included or configured differently in the environment setup.
Should i manually install cudf?
Thank you for your assistance.

Having cloned the repo, have you built cudf? If not, that is probably your issue. The simplest (IMO) way to do this is to use the devcontainer setup in the repository and then (inside the devcontainer) run build-cudf.

Oh yes I did figure that out eventually. However, I'm facing errors after running pytest related to cuDF, likely because the installed NVIDIA drivers are not able to function properly. Since I'm using a pc without a gpu is this the root of the issue?

@wence-
Copy link
Contributor

wence- commented Jun 25, 2024

Since I'm using a pc without a gpu is this the root of the issue?

Yes, you can build cudf without a gpu, but you won't be able to run it.

Best thing to do is to push a branch and open a PR, we can then review and run things through CI.

@NeilGeorge1
Copy link

Ok thank you, I'll push the branch soon and open a PR. Looking forward to your review and CI validation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
External Issues or PRs created by external contributors feature request New feature or request good first issue Good for newcomers Python Affects Python cuDF API.
Projects
Status: Done
4 participants