feat(api): make topk() and value_counts() more flexible #10928
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR changes many things. I bet we don't want all of them, but I thought it would be easiest to just put up a laundry list and then I will prune out the things we don't like.
by
argument here?{column_name}_count
, which is the same as for Column.value_counts(). This default is better because it is more consistent, and because it allows you to use .column syntax in subsequent expressions, egcol.topk().col_n.max()
, where the current default with the parenthesis makes this impossible. For more complexby
clauses, this suffix doesn't make sense, so I left the current behavior as is. This IS a breaking change for people relying on the old generation scheme, but IDK, they probably shouldn't have been relying on it. We could in fact add a note saying "consider this unstable" going forward, so we are more free later to change this again.Return a "top k" expression.
is self-referential and fairly useless.With all these changes, Column and Table both have .topk() and .value_counts(), and they all behave consistently, except Column.topk() has a
by
param, and Table.topk() does not.