-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: Save original index and remap after function completes #61116
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I'm seeing a performance regression when the index does not contain duplicates. Can we do this conditionally.
pandas/core/methods/selectn.py
Outdated
# Save index and reset to default index to avoid performance impact | ||
# from when index contains duplicates | ||
original_index: Index = self.obj.index | ||
cur_series = self.obj.reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of renaming cur_series -> noindex
and final_series -> result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about default_index
as it more accurately represents that the index is reset to default 1..n
.
I appreciate the review! Can you confirm the performance regression numbers that you are seeing? Note that the timings in the original ticket were from a different machine than the one I'm working on. ASV Benchmarks appear similar enough to be reasonable in the standard case. -- Original Function --
-- Function with my changes --
|
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Note: I'm new to this project, so this is my first PR.
Saves the index for SeriesNLargest at algorithm start and resets it before returning. This fixes performance issues when the index has many duplicate values.
Results:
Tests
The existing tests should cover this unless we want to add specific tests via the
asv_bench
.Addendum
I also modified the call to sort to use
sort(kind="stable")
to get consistent ordering which is what is currently happening in the equivalent Frame method (it was usingkind=mergesort
which is equivalent tokind=stable
, but kept for portability). I can remove this -- it may be better in another PR.https://numpy.org/doc/stable/reference/generated/numpy.sort.html#numpy.sort