Skip to content

ENH: Enable nsmallest/nlargest on object dtype #61166

Open
@Gri72

Description

@Gri72

Is it possible to implement method DataFrame.nsorted that equivalent to df.sort_values(by=["col1", "col2"], ascending=[True, False]).head(k), but with time complexity O(n log k)? I know about nlargest and nsmallest methods but unfortunately they work only with numeric columns.

Example:

df.nsorted(k=6, by=["col1", "col2"], ascending=[True, False])

Activity

rhshadrach

rhshadrach commented on Mar 22, 2025

@rhshadrach
Member

Thanks for the request. I see no reason nlargest and nsmallest cannot be made to work on object dtype. We'd need to add in a Cython implementation of nlargest as currently only nsmallest is implemented and we apply this to -values which will not work with object dtype. Certainly this would be preferred over adding a new method that does the same thing.

Further investigations and PRs to implement are welcome!

changed the title [-]ENH: nsorted method[/-] [+]ENH: Enable nsmallest/nlargest on object dtype[/+] on Mar 22, 2025
Gri72

Gri72 commented on Mar 22, 2025

@Gri72
Author

Thank you for the reply. As far as I understood correctly, in the end it might look something like this?

df.nlargest(n=6, columns=['col1', 'col2'], ascending=[True, False])
rhshadrach

rhshadrach commented on Mar 22, 2025

@rhshadrach
Member

I do not think we should add an ascending argument; this can be accomplished by sorting the result.

RutujaGhodake

RutujaGhodake commented on Mar 24, 2025

@RutujaGhodake

take

MartinBraquet

MartinBraquet commented on Mar 25, 2025

@MartinBraquet
Contributor

@rhshadrach I'd like to push back a little bit to make sure the issue is clearly understood, as it appears to me that the request extends beyond simply enabling dtype objects in nsmallest/nlargest.

Indeed, I believe there is a second ENH request which solely concerns sorting, irrespective of the type (that is, it also concerns numeric values), which I'll try to explain below.

What is currently missing in those nlargest and nsmallest methods is a way to sort values where the order is ascending for one column and descending for another column. One can see it as a combination of nlargest and smallest, depending on the column, which is what is being proposed by OP by a new nsorted method.

Here is an example that, I believe, can only be achieved via sort_values, which OP would like to get more efficiently, and which none of nlargest or smallest can perform, irrespective of a post-hoc sorting.

import pandas as pd

df = pd.DataFrame({
    'c1': [1, 2, 2],
    'c2': [1, 2, 3],
})
   c1  c2
0   1   1
1   2   2
2   2   3
df.sort_values(['c1', 'c2'], ascending=[True, False]).head(2)
   c1  c2
0   1   1
2   2   3

For instance, nmallest returns a different result, which irreversibly (regardless of subsequent operations) prevents the user from obtaining the desired result above:

df.nsmallest(2, ['c1', 'c2'])
   c1  c2
0   1   1
1   2   2

Naturally, most users have managed their way through this issue by negating the columns for which they want nsmallest and then using nlargest on that modified dataframe; but this hack only works for numeric types, which, I assume, is the reason for the OP posting this ENH request.

Please let me know if I missed something.

If not, perhaps we should indeed have that nsorted method, with its ascending parameter, which would not only allow for dtypes but also have a cleaner code on the user end (without ad hoc negation of some columns).

rhshadrach

rhshadrach commented on Mar 25, 2025

@rhshadrach
Member

Agreed that this is not doable in the API today. However given the input

   c1  c2
0   1   1
1   2   2
2   2   3

I do not think it makes sense to call the result:

   c1  c2
0   1   1
2   2   3

neither "smallest" nor "largest". In other words, I think it'd be trying to force the ability to do an operation into a method where it does not belong.

MartinBraquet

MartinBraquet commented on Mar 26, 2025

@MartinBraquet
Contributor

Right. Hence our suggestion from OP and I to make a new method called nsorted.

rhshadrach

rhshadrach commented on Mar 26, 2025

@rhshadrach
Member

Ah, indeed, thanks! It does seem rather straightforward to parameterize SelectN to support a by-column ascending flag. Note this doesn't quite satisfy the OP, we'd also need to add Cython for object dtype. I'm +1 on adding nsorted, and having nsmallest and nlargest become convenience methods (e.g. nlargest is just ascending=False).

cc @pandas-dev/pandas-core

MartinBraquet

MartinBraquet commented on Mar 26, 2025

@MartinBraquet
Contributor

Alright. So, to clarify, there are two independent upgrades needed, which can thus be addressed in two different PRs:

  1. Add nsorted method to Dataframe and Series, where the bulk of the change should be done in pandas.core.methods.selectn.SelectNFrame
  2. Enable dtypes for nsorted (and, as a consequence, nlargest and nsmallest), where the bulk of the change should be done in pandas.core.methods.selectn.SelectNSeries, and more specifically extending libalgos.kth_smallest, a Cython function, to dtypes

How should we handle two PRs required for the same issue? Shouldn't we create a new issue to close after upgrade 1 is merged, and then close the issue here when upgrade 2 is merged?

rhshadrach

rhshadrach commented on Mar 26, 2025

@rhshadrach
Member

That proposal has my support.

How should we handle two PRs required for the same issue? Shouldn't we create a new issue to close after upgrade 1 is merged, and then close the issue here when upgrade 2 is merged?

No - a PR need not close any issue. One can add the line Part of #XYZ in the OP of a PR and merging the PR will not close the issue.

15 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

    Participants

    @snitish@simonjayhawkins@Dr-Irv@MartinBraquet@rhshadrach

    Issue actions

      ENH: Enable nsmallest/nlargest on object dtype · Issue #61166 · pandas-dev/pandas