Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): make topk() and value_counts() more flexible #10928

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Mar 3, 2025

This PR changes many things. I bet we don't want all of them, but I thought it would be easiest to just put up a laundry list and then I will prune out the things we don't like.

  • Add a Table.topk(). I want this frequently to check for duplicates. I don't think it makes sense to provide a by argument here?
  • Add "See Also" links to the docstrings
  • made the "k" param to topk() be optional. If you don't supply one, it just ranks the values in descending order without limiting them. I do this often in interactive analysis, I like less typing, the .head(10) during the repr in interactive mode takes care of it for me. This isn't breaking for anyone.
  • Makes the default column name for Column.topk() be {column_name}_count, which is the same as for Column.value_counts(). This default is better because it is more consistent, and because it allows you to use .column syntax in subsequent expressions, eg col.topk().col_n.max(), where the current default with the parenthesis makes this impossible. For more complex by clauses, this suffix doesn't make sense, so I left the current behavior as is. This IS a breaking change for people relying on the old generation scheme, but IDK, they probably shouldn't have been relying on it. We could in fact add a note saying "consider this unstable" going forward, so we are more free later to change this again.
  • improved the top-line docstring for Column.topk(). The old Return a "top k" expression. is self-referential and fairly useless.

With all these changes, Column and Table both have .topk() and .value_counts(), and they all behave consistently, except Column.topk() has a by param, and Table.topk() does not.

@NickCrews NickCrews force-pushed the improve-topk-value-counts branch from 75204d0 to f2284f7 Compare March 3, 2025 04:17
@NickCrews NickCrews added the ux User experience related issues label Mar 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ux User experience related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant