Open
Description
Is it possible to implement method DataFrame.nsorted that equivalent to df.sort_values(by=["col1", "col2"], ascending=[True, False]).head(k)
, but with time complexity O(n log k)? I know about nlargest
and nsmallest
methods but unfortunately they work only with numeric columns.
Example:
df.nsorted(k=6, by=["col1", "col2"], ascending=[True, False])
Activity
rhshadrach commentedon Mar 22, 2025
Thanks for the request. I see no reason
nlargest
andnsmallest
cannot be made to work on object dtype. We'd need to add in a Cython implementation ofnlargest
as currently onlynsmallest
is implemented and we apply this to-values
which will not work with object dtype. Certainly this would be preferred over adding a new method that does the same thing.Further investigations and PRs to implement are welcome!
[-]ENH: nsorted method[/-][+]ENH: Enable nsmallest/nlargest on object dtype[/+]Gri72 commentedon Mar 22, 2025
Thank you for the reply. As far as I understood correctly, in the end it might look something like this?
rhshadrach commentedon Mar 22, 2025
I do not think we should add an
ascending
argument; this can be accomplished by sorting the result.RutujaGhodake commentedon Mar 24, 2025
take
MartinBraquet commentedon Mar 25, 2025
@rhshadrach I'd like to push back a little bit to make sure the issue is clearly understood, as it appears to me that the request extends beyond simply enabling dtype objects in nsmallest/nlargest.
Indeed, I believe there is a second ENH request which solely concerns sorting, irrespective of the type (that is, it also concerns numeric values), which I'll try to explain below.
What is currently missing in those nlargest and nsmallest methods is a way to sort values where the order is ascending for one column and descending for another column. One can see it as a combination of nlargest and smallest, depending on the column, which is what is being proposed by OP by a new nsorted method.
Here is an example that, I believe, can only be achieved via sort_values, which OP would like to get more efficiently, and which none of nlargest or smallest can perform, irrespective of a post-hoc sorting.
For instance, nmallest returns a different result, which irreversibly (regardless of subsequent operations) prevents the user from obtaining the desired result above:
Naturally, most users have managed their way through this issue by negating the columns for which they want nsmallest and then using nlargest on that modified dataframe; but this hack only works for numeric types, which, I assume, is the reason for the OP posting this ENH request.
Please let me know if I missed something.
If not, perhaps we should indeed have that nsorted method, with its
ascending
parameter, which would not only allow for dtypes but also have a cleaner code on the user end (without ad hoc negation of some columns).rhshadrach commentedon Mar 25, 2025
Agreed that this is not doable in the API today. However given the input
I do not think it makes sense to call the result:
neither "smallest" nor "largest". In other words, I think it'd be trying to force the ability to do an operation into a method where it does not belong.
MartinBraquet commentedon Mar 26, 2025
Right. Hence our suggestion from OP and I to make a new method called nsorted.
rhshadrach commentedon Mar 26, 2025
Ah, indeed, thanks! It does seem rather straightforward to parameterize
SelectN
to support a by-column ascending flag. Note this doesn't quite satisfy the OP, we'd also need to add Cython for object dtype. I'm +1 on addingnsorted
, and havingnsmallest
andnlargest
become convenience methods (e.g.nlargest
is justascending=False
).cc @pandas-dev/pandas-core
MartinBraquet commentedon Mar 26, 2025
Alright. So, to clarify, there are two independent upgrades needed, which can thus be addressed in two different PRs:
libalgos.kth_smallest
, a Cython function, to dtypesHow should we handle two PRs required for the same issue? Shouldn't we create a new issue to close after upgrade 1 is merged, and then close the issue here when upgrade 2 is merged?
rhshadrach commentedon Mar 26, 2025
That proposal has my support.
No - a PR need not close any issue. One can add the line
Part of #XYZ
in the OP of a PR and merging the PR will not close the issue.15 remaining items