feat(api): make topk() and value_counts() more flexible #10928

NickCrews · 2025-03-03T04:07:08Z

This PR changes many things. I bet we don't want all of them, but I thought it would be easiest to just put up a laundry list and then I will prune out the things we don't like.

Add a Table.topk(). I want this frequently to check for duplicates. I don't think it makes sense to provide a by argument here?
Add "See Also" links to the docstrings
Make the "k" param to topk() optional. If you don't supply one, it just ranks the values in descending order without limiting them. I do this often in interactive analysis: it's less typing, and the .head(10) applied by the repr in interactive mode takes care of the limit for me. This isn't a breaking change for anyone.
Make the default count column name for Column.topk() be {column_name}_count, the same as for Column.value_counts(). This default is better because it is more consistent, and because it allows you to use .column syntax in subsequent expressions, e.g. col.topk().col_count.max(), which the current default, with its parentheses, makes impossible. For more complex by clauses, this suffix doesn't make sense, so I left the current behavior as is. This IS a breaking change for people relying on the old name generation scheme, but they probably shouldn't have been relying on it. We could add a note saying "consider this unstable" going forward, so we are freer to change this again later.
Improve the top-line docstring for Column.topk(). The old one, Return a "top k" expression., is self-referential and fairly useless.

With all these changes, Column and Table both have .topk() and .value_counts(), and they all behave consistently, except Column.topk() has a by param, and Table.topk() does not.
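To make the intended semantics concrete, here is a rough plain-Python model of the behavior described above. It is only a sketch, not the ibis implementation: the helper names column_topk and table_topk are hypothetical stand-ins, and the table version's "count" key is an assumption made for illustration, since the description does not state what Table.topk() names its count column.

```python
from collections import Counter


def column_topk(values, k=None, name="col"):
    """Hypothetical model of Column.topk(): count values, rank by descending count."""
    counts = Counter(values)
    # most_common(None) returns every distinct value, so omitting k means "no limit".
    ranked = counts.most_common(k)
    # The count column follows the {column_name}_count scheme from the description.
    count_name = f"{name}_count"
    return [{name: value, count_name: n} for value, n in ranked]


def table_topk(rows, k=None):
    """Hypothetical model of Table.topk(): count whole rows, no `by` argument."""
    counts = Counter(tuple(row) for row in rows)
    # "count" is an assumed key name for this sketch only.
    return [{"row": row, "count": n} for row, n in counts.most_common(k)]


rows = column_topk(["a", "b", "a", "c", "a", "b"], name="col")
limited = column_topk(["a", "b", "a", "c", "a", "b"], k=1, name="col")

# Duplicate check: any whole row that appears more than once.
dupes = [r for r in table_topk([("x", 1), ("y", 2), ("x", 1)]) if r["count"] > 1]
```

With no k, every distinct value comes back ranked; with k=1, only the most frequent one does. The duplicate check at the end is the use case that motivated Table.topk().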

cpcloud

+1 to all the changes here. I'm guessing that someone may eventually ask for a custom by with the table topk (not entirely sure for what at the moment though), but we can leave it as just count for now.

Thanks for working through this!

NickCrews force-pushed the improve-topk-value-counts branch from 75204d0 to f2284f7 March 3, 2025 04:17

NickCrews added the ux User experience related issues label Mar 3, 2025

cpcloud added the feature Features or general enhancements label Mar 26, 2025

cpcloud approved these changes Mar 26, 2025

cpcloud force-pushed the improve-topk-value-counts branch from f2284f7 to d7f2d66 March 26, 2025 12:19

cpcloud added the polars The polars backend label Mar 26, 2025

github-actions bot added the tests Issues or PRs related to tests label Mar 26, 2025

cpcloud added the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Mar 26, 2025

ibis-docs-bot bot removed the ci-run-cloud Run BigQuery, Snowflake, Databricks, and Athena backend tests label Mar 26, 2025

NickCrews and others added 2 commits March 26, 2025 08:55

feat(api): make topk() and value_counts() more flexible

0b52990

feat(polars): support new topk features

148a540

cpcloud force-pushed the improve-topk-value-counts branch from d7f2d66 to 148a540 March 26, 2025 12:55

cpcloud merged commit 329ad7c into ibis-project:main Mar 26, 2025
106 of 107 checks passed
