ENH: improve stats.nanmedian more by assuming nans are rare #3396

Merged
merged 2 commits into scipy:master from juliantaylor/nanmedian-improve2 on Mar 9, 2014

Conversation

Contributor

juliantaylor commented Feb 26, 2014

Move nans to end of array instead of creating a new array without them.
This is faster under the reasonable assumption that nans are rare.

ENH: improve stats.nanmedian more by assuming nans are rare
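A minimal sketch of the idea described above (not the exact code from this PR): instead of compressing the non-NaN values into a fresh array, overwrite the leading NaN slots with non-NaN values taken from the tail of the array and compute the median over the shortened prefix. The helper name `_nanmedian_1d` and the warning text are illustrative.

```python
import warnings
import numpy as np

def _nanmedian_1d(arr1d):
    # Sketch only: median of a 1-d float array, ignoring NaNs,
    # assuming NaNs are rare.  Modifies its input in place, so pass a copy.
    c = np.isnan(arr1d)
    s = np.where(c)[0]                  # indices of the NaNs, in ascending order
    if s.size == arr1d.size:
        warnings.warn("All-NaN slice encountered", RuntimeWarning)
        return np.nan
    if s.size == 0:
        return np.median(arr1d)
    # non-NaN values that happen to sit in the last s.size slots
    enonan = arr1d[-s.size:][~c[-s.size:]]
    # move them into the earliest NaN slots ...
    arr1d[s[:enonan.size]] = enonan
    # ... so the first (n - s.size) elements now hold exactly the valid values
    return np.median(arr1d[:-s.size])
```

Compared with the np.compress route, this only touches the (few) NaN slots and an equally short tail, rather than building a full-size index array and gathering every element.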
Contributor

juliantaylor commented Feb 26, 2014

As the 0.14.x branching is delayed I might as well improve it more :)
The difference from a normal median is now less than 30%, and it's even better with numpy 1.9.dev due to the improved np.where.

In [12]: d = np.arange(1000000.) # np 1.8.0
In [13]: np.random.shuffle(d)
In [15]: d[::20] = np.nan
In [16]: %timeit scipy.stats.nanmedian(d)
100 loops, best of 3: 19.6 ms per loop
In [17]: %timeit np.median(d)
100 loops, best of 3: 16.2 ms per loop
Contributor

juliantaylor commented Feb 26, 2014

Do we want to add a warning in the all-NaN case?
numpy.nanmedian will likely do that.

Owner

rgommers commented Feb 26, 2014

+1 for warning on all-nan input

Owner

rgommers commented Feb 26, 2014

This PR doesn't really help readability. How much does it improve performance compared to current master?

Contributor

juliantaylor commented Feb 26, 2014

about 30%-40% faster

Owner

rgommers commented Feb 26, 2014

Thanks. That's (just) above my "don't care" threshold, so +0.5 on this PR.

Contributor

juliantaylor commented Feb 26, 2014

added warnings and improved tests a little
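As a rough illustration only (not the actual tests added in this PR), the all-NaN case could be exercised like this, assuming scipy.stats.nanmedian of that era warns and returns NaN for such input:

```python
import warnings
import numpy as np
from scipy import stats

# Hypothetical check: an all-NaN input should emit a RuntimeWarning and return NaN.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    res = stats.nanmedian(np.array([np.nan, np.nan, np.nan]))

assert np.isnan(res)
assert any(issubclass(w.category, RuntimeWarning) for w in caught)
```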

Member

josef-pkt commented Feb 27, 2014

looks fine to me, even if it's a bit tricky.

I don't understand where the time savings really come from; we need to create a copy in both cases. Is np.compress so much slower than the new version?

Contributor

juliantaylor commented Feb 27, 2014

The gains come from doing the reverse of what np.compress does.
np.compress runs np.where on the negated boolean condition to produce an index array and then runs np.take with it (which is fast, but slower than the memcpy used by copy()). It also involves allocating an index array as large as the original data minus the NaNs, which of course causes a lot of page faults.
The old way is faster if you have more NaNs than regular numbers, but I'm assuming that having fewer invalid values than valid ones is the more common case.
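For reference, a rough model of the old code path (a sketch of the idea, not the previous scipy code or the np.compress internals):

```python
import numpy as np

def median_drop_nans_old(a):
    # Old-style approach: build a NaN-free copy, then take its median.
    # np.compress(cond, a) behaves roughly like np.take(a, np.where(cond)[0]):
    # it materializes an index array with one entry per valid element and
    # then gathers those elements into yet another new array.
    valid = ~np.isnan(a)
    idx = np.where(valid)[0]      # nearly as large as the data when NaNs are rare
    return np.median(np.take(a, idx))
```

When almost every element is valid, that index array is nearly as big as the data itself, which is where the extra allocations and page faults come from; moving the few NaNs out of the way avoids them.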

Member

josef-pkt commented Feb 27, 2014

Thanks for the explanation.
I can see that shuffling a few nans can be faster than indexing large parts of the array.
The same might also apply to some masked array functions (those that cannot be computed by filling in neutral elements).

rgommers added a commit that referenced this pull request Mar 9, 2014

Merge pull request #3396 from juliantaylor/nanmedian-improve2
ENH: improve stats.nanmedian more by assuming nans are rare

rgommers merged commit aabdb6d into scipy:master on Mar 9, 2014

1 check failed: the Travis CI build failed.
Owner

rgommers commented Mar 9, 2014

OK time to merge this. Thanks Julian, Josef.

rgommers added this to the 0.15.0 milestone on Mar 9, 2014
