
ENH: improve stats.nanmedian more by assuming nans are rare #3396

Merged: 2 commits, Mar 9, 2014

Conversation

juliantaylor (Contributor)

Move NaNs to the end of the array instead of creating a new array without them.
This is faster under the reasonable assumption that NaNs are rare.
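
A minimal sketch of the idea (my own illustration, assuming a 1-d float array; the merged scipy code differs in detail):

import numpy as np

def nanmedian_sketch(arr):
    # Illustration only: overwrite the NaN slots in the leading part of a
    # copy with the valid values from the tail, then take an ordinary
    # median of the all-valid prefix.
    a = np.asarray(arr, dtype=float).ravel().copy()
    nan_mask = np.isnan(a)
    n_nan = np.count_nonzero(nan_mask)
    if n_nan == a.size:
        return np.nan
    valid_len = a.size - n_nan
    if n_nan:
        # NaN positions inside the leading block of length valid_len ...
        misplaced = np.where(nan_mask[:valid_len])[0]
        # ... and valid positions inside the trailing block of length n_nan
        tail_valid = np.where(~nan_mask[valid_len:])[0] + valid_len
        a[misplaced] = a[tail_valid]   # both index arrays have equal length
        a = a[:valid_len]              # first valid_len entries are now valid
    return np.median(a, overwrite_input=True)

Only the (rare) NaN slots are touched, so the index arrays stay tiny compared to building an index over every valid element.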

@juliantaylor (Contributor Author)

As the 0.14.x branching is delayed I might as well improve this more :)
The difference to a normal median is now less than 30%, and it's even better with numpy 1.9.dev due to the improved np.where.

In [12]: d = np.arange(1000000.) # np 1.8.0
In [13]: np.random.shuffle(d)
In [15]: d[::20] = np.nan
In [16]: %timeit scipy.stats.nanmedian(d)
100 loops, best of 3: 19.6 ms per loop
In [17]: %timeit np.median(d)
100 loops, best of 3: 16.2 ms per loop

@juliantaylor (Contributor Author)

Do we want to add a warning in the all-NaN case?
numpy.nanmedian will likely do that.
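
For reference, a hedged sketch of what such a warning could look like (hypothetical helper name, not the code in this PR):

import warnings
import numpy as np

def _warn_all_nan_sketch(a):
    # Hypothetical: warn and return NaN when every element is NaN,
    # similar to what numpy.nanmedian does for all-NaN input.
    if np.isnan(a).all():
        warnings.warn("All-NaN slice encountered", RuntimeWarning)
        return np.nan
    return np.median(a[~np.isnan(a)])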

@rgommers (Member)

+1 for warning on all-nan input

@rgommers (Member)

This PR doesn't really help for readability. How much does this improve performance compared to current master?

@juliantaylor (Contributor Author)

about 30%-40% faster

@rgommers (Member)

Thanks. That's (just) above my "don't care" threshold, so +0.5 on this PR.

@juliantaylor (Contributor Author)

added warnings and improved tests a little

@josef-pkt (Member)

looks fine to me, even if it's a bit tricky.

I don't understand where the time savings really come from; we need to create a copy in both cases. Is np.compress so much slower than the new version?

@juliantaylor (Contributor Author)

The gains come from doing the reverse of what np.compress does.
np.compress runs np.where on the negated boolean condition to produce an index array and then runs np.take with it (which is fast, but slower than the memcpy used by copy()). It also involves allocating an index array as large as the original data minus the NaNs, which of course causes a lot of page faults.
The old way is faster if you have more NaNs than normal numbers, but I'm assuming that having fewer invalid numbers than valid ones is the more common case.
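
A rough sketch of the two code paths being compared, using made-up example data similar to the benchmark above (the np.compress internals here are my simplified reading, not numpy's actual source):

import numpy as np

d = np.random.rand(1_000_000)
d[::20] = np.nan
keep = ~np.isnan(d)

# old path, roughly what np.compress(keep, d) boils down to:
idx = np.where(keep)[0]    # index array almost as large as the data itself
valid = np.take(d, idx)    # gather through the index array (slower than memcpy)
assert np.array_equal(valid, np.compress(keep, d))

# new path: copy once, patch only the (rare) NaN slots, then truncate
a = d.copy()
n_nan = np.count_nonzero(~keep)
valid_len = a.size - n_nan
a[np.where(~keep[:valid_len])[0]] = a[np.where(keep[valid_len:])[0] + valid_len]
a = a[:valid_len]
assert np.median(a) == np.median(valid)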

@josef-pkt (Member)

Thanks for the explanation.
I can see that shuffling a few NaNs around can be faster than indexing large parts of the array.
The same might also apply to some masked array functions (those that cannot be computed by filling in a neutral element).

rgommers added a commit that referenced this pull request on Mar 9, 2014:
ENH: improve stats.nanmedian more by assuming nans are rare
rgommers merged commit aabdb6d into scipy:master on Mar 9, 2014
@rgommers (Member) commented Mar 9, 2014

OK time to merge this. Thanks Julian, Josef.

rgommers added this to the 0.15.0 milestone on Mar 9, 2014
Labels: enhancement (A new feature or improvement), scipy.stats