ENH: improve stats.nanmedian more by assuming nans are rare #3396

Merged
merged 2 commits into scipy:master from juliantaylor/nanmedian-improve2 on Mar 9, 2014

Conversation

Contributor

juliantaylor commented Feb 26, 2014

Move nans to end of array instead of creating a new array without them.
This is faster under the reasonable assumption that nans are rare.

ENH: improve stats.nanmedian more by assuming nans are rare
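A minimal sketch of the idea described above (not the exact code from this PR): instead of compressing the non-NaN values into a fresh array, overwrite the leading NaN slots with non-NaN values taken from the tail of the array and compute the median over the shortened prefix. The helper name `_nanmedian_1d` and the warning text are illustrative.

```python
import warnings
import numpy as np

def _nanmedian_1d(arr1d):
    # Sketch only: median of a 1-d float array, ignoring NaNs,
    # assuming NaNs are rare.  Modifies its input in place, so pass a copy.
    c = np.isnan(arr1d)
    s = np.where(c)[0]                  # indices of the NaNs, in ascending order
    if s.size == arr1d.size:
        warnings.warn("All-NaN slice encountered", RuntimeWarning)
        return np.nan
    if s.size == 0:
        return np.median(arr1d)
    # non-NaN values that happen to sit in the last s.size slots
    enonan = arr1d[-s.size:][~c[-s.size:]]
    # move them into the earliest NaN slots ...
    arr1d[s[:enonan.size]] = enonan
    # ... so the first (n - s.size) elements now hold exactly the valid values
    return np.median(arr1d[:-s.size])
```

Compared with the np.compress route, this only touches the (few) NaN slots and an equally short tail, rather than building a full-size index array and gathering every element.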
Contributor

juliantaylor commented Feb 26, 2014

As the 0.14.x branching is delayed I might as well improve it more :)
The difference from a normal median is now less than 30%, and it's even better with numpy 1.9.dev due to the improved np.where.

In [12]: d = np.arange(1000000.) # np 1.8.0
In [13]: np.random.shuffle(d)
In [15]: d[::20] = np.nan
In [16]: %timeit scipy.stats.nanmedian(d)
100 loops, best of 3: 19.6 ms per loop
In [17]: %timeit np.median(d)
100 loops, best of 3: 16.2 ms per loop
Contributor

juliantaylor commented Feb 26, 2014

Do we want to add a warning in the all-NaN case?
numpy.nanmedian will likely do that.

Owner

rgommers commented Feb 26, 2014

+1 for warning on all-nan input

Owner

rgommers commented Feb 26, 2014

This PR doesn't really help readability. How much does it improve performance compared to current master?

Contributor

juliantaylor commented Feb 26, 2014

about 30%-40% faster

Owner

rgommers commented Feb 26, 2014

Thanks. That's (just) above my "don't care" threshold, so +0.5 on this PR.

Contributor

juliantaylor commented Feb 26, 2014

added warnings and improved tests a little
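As a rough illustration only (not the actual tests added in this PR), the all-NaN case could be exercised like this, assuming scipy.stats.nanmedian of that era warns and returns NaN for such input:

```python
import warnings
import numpy as np
from scipy import stats

# Hypothetical check: an all-NaN input should emit a RuntimeWarning and return NaN.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    res = stats.nanmedian(np.array([np.nan, np.nan, np.nan]))

assert np.isnan(res)
assert any(issubclass(w.category, RuntimeWarning) for w in caught)
```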

Member

josef-pkt commented Feb 27, 2014

looks fine to me, even if it's a bit tricky.

I don't understand where the time savings really come from; we need to create a copy in both cases. Is np.compress so much slower than the new version?

Contributor

juliantaylor commented Feb 27, 2014

The gains come from doing the reverse of what np.compress does.
np.compress runs np.where on the negated boolean condition to produce an index array and then runs np.take with it (which is fast, but slower than the memcpy used by copy()). It also involves allocating an index array as large as the original data minus the NaNs, which of course causes a lot of page faults.
The old way is faster if you have more NaNs than regular numbers, but I'm assuming that having fewer invalid values than valid ones is the more common case.
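For reference, a rough model of the old code path (a sketch of the idea, not the previous scipy code or the np.compress internals):

```python
import numpy as np

def median_drop_nans_old(a):
    # Old-style approach: build a NaN-free copy, then take its median.
    # np.compress(cond, a) behaves roughly like np.take(a, np.where(cond)[0]):
    # it materializes an index array with one entry per valid element and
    # then gathers those elements into yet another new array.
    valid = ~np.isnan(a)
    idx = np.where(valid)[0]      # nearly as large as the data when NaNs are rare
    return np.median(np.take(a, idx))
```

When almost every element is valid, that index array is nearly as big as the data itself, which is where the extra allocations and page faults come from; moving the few NaNs out of the way avoids them.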

Member

josef-pkt commented Feb 27, 2014

Thanks for the explanation.
I can see that shuffling a few nans can be faster than indexing large parts of the array.
The same might also apply to some masked array functions (those that cannot be computed by filling in neutral elements).

rgommers added a commit that referenced this pull request Mar 9, 2014

Merge pull request #3396 from juliantaylor/nanmedian-improve2
ENH: improve stats.nanmedian more by assuming nans are rare

rgommers merged commit aabdb6d into scipy:master on Mar 9, 2014

1 check failed: the Travis CI build failed.
Owner

rgommers commented Mar 9, 2014

OK time to merge this. Thanks Julian, Josef.

rgommers added this to the 0.15.0 milestone on Mar 9, 2014
