Skip to content

feat(api): make topk() and value_counts() more flexible #10928

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 26, 2025

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Mar 3, 2025

This PR changes many things. I bet we don't want all of them, but I thought it would be easiest to just put up a laundry list and then I will prune out the things we don't like.

  • Add a Table.topk(). I want this frequently to check for duplicates. I don't think it makes sense to provide a by argument here?
  • Add "See Also" links to the docstrings
  • made the "k" param to topk() be optional. If you don't supply one, it just ranks the values in descending order without limiting them. I do this often in interactive analysis, I like less typing, the .head(10) during the repr in interactive mode takes care of it for me. This isn't breaking for anyone.
  • Makes the default column name for Column.topk() be {column_name}_count, which is the same as for Column.value_counts(). This default is better because it is more consistent, and because it allows you to use .column syntax in subsequent expressions, eg col.topk().col_n.max(), where the current default with the parenthesis makes this impossible. For more complex by clauses, this suffix doesn't make sense, so I left the current behavior as is. This IS a breaking change for people relying on the old generation scheme, but IDK, they probably shouldn't have been relying on it. We could in fact add a note saying "consider this unstable" going forward, so we are more free later to change this again.
  • improved the top-line docstring for Column.topk(). The old Return a "top k" expression. is self-referential and fairly useless.

With all these changes, Column and Table both have .topk() and .value_counts(), and they all behave consistently, except Column.topk() has a by param, and Table.topk() does not.

@NickCrews NickCrews force-pushed the improve-topk-value-counts branch from 75204d0 to f2284f7 Compare March 3, 2025 04:17
@NickCrews NickCrews added the ux User experience related issues label Mar 3, 2025
@cpcloud cpcloud added the feature Features or general enhancements label Mar 26, 2025
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to all the changes here. I'm guessing that someone may eventually ask for a custom by with the table topk (not entirely sure for what at the moment though), but we can leave it as just count for now.

Thanks for working through this!

@cpcloud cpcloud force-pushed the improve-topk-value-counts branch from f2284f7 to d7f2d66 Compare March 26, 2025 12:19
@cpcloud cpcloud added the polars The polars backend label Mar 26, 2025
@github-actions github-actions bot added the tests Issues or PRs related to tests label Mar 26, 2025
@cpcloud cpcloud added the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Mar 26, 2025
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Mar 26, 2025
@cpcloud cpcloud force-pushed the improve-topk-value-counts branch from d7f2d66 to 148a540 Compare March 26, 2025 12:55
@cpcloud cpcloud merged commit 329ad7c into ibis-project:main Mar 26, 2025
106 of 107 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements polars The polars backend tests Issues or PRs related to tests ux User experience related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants