Cythonize stats.ks_2samp for a ~33% gain in speed. #5938

anntzer · 2016-03-07T17:54:23Z

larsoner · 2016-03-07T18:18:51Z

scipy/stats/stats.py

-    cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0*n2)
-    d = np.max(np.absolute(cdf1 - cdf2))
+    from . import _stats
+    d = _stats.ks_2samp(data1, data2)


It might actually be worth keeping the old lines and just commenting them out, saying your code implements something equivalent to those lines, but more efficiently.

To be honest I'm not even sure the old version is clearer than the new one. I don't really care though.

I agree it's not :) It can still be useful to see how the code could be done in Python land without loops if necessary, though.
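For reference, the loop-free "Python land" version discussed here can be sketched roughly as follows. Only the `cdf2` and `d` lines appear in the diff above; the sorting, the `data_all` concatenation, the symmetric `cdf1` line and the function name `ks_2samp_statistic` are assumptions filled in for illustration.

```python
import numpy as np


def ks_2samp_statistic(data1, data2):
    """Two-sample KS statistic using searchsorted, with no explicit loops."""
    data1 = np.sort(data1)
    data2 = np.sort(data2)
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    # Evaluate both empirical CDFs at every observed point.
    data_all = np.concatenate([data1, data2])
    cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
    cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
    # The statistic is the largest absolute gap between the two CDFs.
    return np.max(np.absolute(cdf1 - cdf2))


if __name__ == "__main__":
    print(ks_2samp_statistic([1, 2, 3], [4, 5, 6]))
```

With `side='right'`, `searchsorted` counts the elements less than or equal to each evaluation point, which is why dividing by the sample size gives the empirical CDF.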

pv · 2016-03-07T18:43:18Z

In setup.py, copy-paste the vonmises_cython lines and replace vonmises_cython with _stats.

ev-br · 2016-03-07T18:46:09Z

scipy/stats/_stats.pyx

+    double
+
+
+cpdef double ks_2samp(real[:] data1, real[:] data2):


As long as it's only called from Python space, the function can be just def.

I mostly made it a cpdef so that I can annotate the return type as double (mostly for documentation purposes -- "def" functions can't have a return type annotated, I think).

ev-br · 2016-03-07T19:14:54Z

scipy/stats/_stats.pyx

@@ -0,0 +1,31 @@
+#cython: boundscheck=False
+#cython: nonecheck=False


I think this is better done at a function level, especially in view of consolidating functions some of which may be user-facing.

josef-pkt · 2016-03-08T00:57:56Z

just a few general comments:

searchsorted is easy to read as cdf, after a bit of thinking. We use it in an analogous way in k-sample anderson darling test. (I'm not able to easily figure out algorithmic loops.)

I never thought of using kolmogorov-smirnov for anything except numbers. So I think strings and object arrays could be deprecated, if someone ever thought of using it that way.

about fusing types: The current code is combining the data arrays and so has to cast to a common dtype. Since the test is for equal distribution of two data arrays, I would assume that casting to a common dtype would be acceptable; I guess the results would differ only in some outlier use cases.
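A small illustration of the casting point, assuming the old code combines the samples with `np.concatenate` (the diff above only shows the resulting `data_all`). The array names and dtypes here are made up for the example.

```python
import numpy as np

# Two samples with different dtypes (hypothetical inputs).
a = np.array([1, 2, 3], dtype=np.int64)
b = np.array([0.5, 1.5], dtype=np.float32)

# Concatenating forces both samples into one common dtype.
combined = np.concatenate([a, b])
common = np.result_type(a, b)

print(combined.dtype, common)
```

Here both samples end up as float64, so comparisons in the combined array are made in a single dtype rather than per-input types.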

josef-pkt · 2016-03-08T01:17:50Z

one special case: What happens compared to before when there are nans in the array?

anntzer · 2016-03-08T02:13:41Z

Good catch regarding nans: the old code would return some nonsensical result; the current implementation fails because both d1i <= d2j and d1i >= d2j evaluate to False when one of them is nan, and thus the loop never ends.
I think the correct thing to do here is just to break that behavior and raise an exception on non-real or nan-containing inputs. Thoughts?
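To make the failure mode concrete, here is a pure-Python stand-in for a two-pointer loop over sorted inputs. The actual `_stats.pyx` loop is not shown in this thread, so `merge_ks` and its tie handling are hypothetical; the only part taken from the comment above is that progress depends on `d1i <= d2j` or `d1i >= d2j` being true.

```python
def merge_ks(data1, data2):
    """KS statistic over two already-sorted sequences (hypothetical sketch)."""
    n1, n2 = len(data1), len(data2)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        d1i, d2j = data1[i], data2[j]
        if not (d1i <= d2j or d1i >= d2j):
            # Only reachable when a nan is involved. Without this guard
            # neither index would advance and the loop would never end.
            raise ValueError("nan in input: neither pointer can advance")
        t = min(d1i, d2j)
        # Step past every value equal to t in both samples (handles ties).
        while i < n1 and data1[i] == t:
            i += 1
        while j < n2 and data2[j] == t:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    return d


nan = float("nan")
print(nan <= 1.0, nan >= 1.0)  # False False
```

The explicit guard is one way to turn the endless loop into an error; rejecting nan inputs before the loop, as proposed above, achieves the same thing earlier.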

josef-pkt · 2016-03-08T02:28:11Z

As far as I can figure out from an example and the behavior of sort and searchsorted (I only tried on older versions of numpy and scipy, which I had open): nans are sorted to the end and treated as equal.

Which may or may not be what anyone uses. I haven't seen a use case of it.

>>> x = np.array([1,2,3.5, np.nan, np.nan, 4, 5.5])
>>> stats.ks_2samp(x, np.sqrt(x))
(0.4285714285714286, 0.4232182945334888)

>>> xn = x.copy()
>>> xn[np.isnan(x)] = 10
>>> xn2 = np.sqrt(x)
>>> xn2[np.isnan(x)] = 10
>>> stats.ks_2samp(xn, xn2)
(0.4285714285714286, 0.4232182945334888)

>>> import scipy
>>> scipy.__version__
'0.13.3'
>>> import numpy
>>> numpy.__version__
'1.6.1'

josef-pkt · 2016-03-08T02:33:33Z

BTW: I couldn't think of any other special case that could cause problems. Ties and inf should all be unchanged, AFAICS.

codecov-io · 2016-03-08T03:02:50Z

@@            master   #5938   diff @@
======================================
  Files          238     238       
  Stmts        43803   43803       
  Branches      8211    8213     +2
  Methods          0       0       
======================================
- Hit          34230   34226     -4
- Partial       2603    2605     +2
- Missed        6970    6972     +2

Review entire Coverage Diff as of c74bd14

josef-pkt · 2016-03-08T13:23:07Z

I'm thinking about the nan issue again.

AFAICS, we can now get missing='drop' essentially for free. Because nans are sorted to the end, we can just stop at the first nan in each data array and adjust the number of observations, n, in the final statistic.
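A rough sketch of that idea, built on the searchsorted formulation rather than on the Cython loop. The function name `ks_stat_dropping_nans` is hypothetical, and counting nans with `np.isnan` is just one simple way to find where the valid data ends.

```python
import numpy as np


def ks_stat_dropping_nans(data1, data2):
    """KS statistic that ignores nans by truncating the sorted arrays."""
    data1 = np.sort(data1)
    data2 = np.sort(data2)
    # np.sort puts nans last, so the valid values form a leading slice.
    n1 = data1.size - np.count_nonzero(np.isnan(data1))
    n2 = data2.size - np.count_nonzero(np.isnan(data2))
    data1, data2 = data1[:n1], data2[:n2]
    # From here on n1 and n2 are the adjusted numbers of observations.
    # (A sample made only of nans would give n == 0; not handled here.)
    data_all = np.concatenate([data1, data2])
    cdf1 = np.searchsorted(data1, data_all, side='right') / n1
    cdf2 = np.searchsorted(data2, data_all, side='right') / n2
    return np.max(np.absolute(cdf1 - cdf2))
```

On the `x` and `np.sqrt(x)` arrays from the earlier example this should give 0.6 instead of 0.4286, because the two nans no longer count as observations.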

josef-pkt · 2016-03-08T14:40:16Z

scipy/stats/stats.py

+            not np.issubdtype(common_type, np.complexfloating)):
+        raise ValueError('ks_2samp only accepts real inputs')
+    if np.any(np.isnan(data1)) or np.any(np.isnan(data2)):
+        raise ValueError('ks_2samp only accepts non-nan inputs')


an optimization to avoid the overhead for the standard case of no nans

np.sort moves nans to the end, so only data1[-1] and data2[-1] need to be checked for isnan
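A minimal sketch of that shortcut, assuming the input is sorted first. The helper name `sorted_without_nans` and the empty-array guard are not from the PR; the error message is the one from the diff above.

```python
import numpy as np


def sorted_without_nans(data):
    """Sort data and reject nans by inspecting only the last element."""
    data = np.sort(data)
    # After sorting, any nan sits at the end, so one check is enough
    # instead of scanning the whole array with np.any(np.isnan(...)).
    if data.size and np.isnan(data[-1]):
        raise ValueError('ks_2samp only accepts non-nan inputs')
    return data
```

This keeps the common nan-free case at a single scalar test on top of the sort that is needed anyway.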

anntzer · 2016-03-08T16:46:37Z

@josef-pkt Included the nan-optimization.
I don't think including nan=drop semantics by default is a good thing (explicit better than implicit, etc.).

ev-br · 2016-03-14T04:11:51Z

scipy/stats/stats.py

+        raise ValueError('ks_2samp only accepts real inputs')
+    # nans, if any, are at the end after sorting.
+    if np.isnan(data1[-1]) or np.isnan(data2[-1]):
+        raise ValueError('ks_2samp only accepts non-nan inputs')


This is, strictly speaking, a back-compat break, so it needs a mention in the release notes.

Something like

stats.ks_2samp now only accepts real, non-nan inputs. It used to return nonsensical values for such inputs before.

?
If that looks good to you I'll add that and squash the commit history.

anntzer · 2016-03-16T05:27:14Z

Edited release notes and squashed commit history.

ev-br · 2016-03-26T15:50:59Z

doc/release/0.18.0-notes.rst

@@ -65,6 +65,9 @@ is now consistently added after the matrix is applied,
 independent of if the matrix is specified using a one-dimensional
 or a two-dimensional array.

+``stats.ks_2samp`` now only accepts real, non-nan inputs. It used to return
+nonsensical values for such inputs before.


It used to return nonsensical values for ... inputs which are not real and not nan? I'm just confused what "such" stands for here.

Yeah, that actually sounds horrible. What about:
stats.ks_2samp used to return nonsensical values if the input was not real or contained nans. It now raises an exception for such inputs.

Remove nonsensical output for non-real or nan-containing inputs, raise an exception instead for them.

ev-br · 2016-03-27T23:56:43Z

Looks good, Travis is green, merging. Thank you Antony

ev-br · 2016-03-28T08:40:39Z

Needs a rebase.

anntzer · 2016-03-28T16:31:49Z

You mean the other PR right?

ev-br · 2016-03-28T16:36:34Z

Yuk, yup, wrong link from the phone. Sorry for the noise

Since it's no longer used. It was added in gh-5938 for scipy 0.18.0 to get some speedup for ks_2samp, but then the addition was reverted in gh-6545, following the discussion in gh-6435: it gives different answers on different machines, it changes one ad hoc statistic to a different ad hoc statistic, and neither of them is clearly "correct".

Revert gh-5938, restore ks_2samp

larsoner reviewed Mar 7, 2016

ev-br reviewed Mar 7, 2016

ev-br added the scipy.stats and maintenance labels Mar 7, 2016

ev-br reviewed Mar 7, 2016

josef-pkt reviewed Mar 8, 2016

anntzer force-pushed the faster-ks2samp branch from d4800cb to 7a0c629 Compare March 8, 2016 16:44

ev-br reviewed Mar 14, 2016

anntzer force-pushed the faster-ks2samp branch from 7a0c629 to e244563 Compare March 16, 2016 05:26

ev-br reviewed Mar 26, 2016

Cythonize stats.ks_2samp for a ~33% gain in speed.

2f1720a

Remove nonsensical output for non-real or nan-containing inputs, raise an exception instead for them.

anntzer force-pushed the faster-ks2samp branch from e244563 to 2f1720a Compare March 27, 2016 22:58

ev-br merged commit fa9a6f5 into scipy:master Mar 27, 2016

ev-br added this to the 0.18.0 milestone Mar 27, 2016

ev-br mentioned this pull request Mar 27, 2016

faster implementation of ks_2samp #5936

Closed

anntzer deleted the faster-ks2samp branch March 28, 2016 00:04

josef-pkt mentioned this pull request Apr 25, 2016

ks_2samp accepts array of builtin_function_or_method and returns result #6099

Closed

pv mentioned this pull request Jul 30, 2016

scipy.stats.ks_2samp returns different values on different computers #6435

Closed

ev-br mentioned this pull request Sep 4, 2016

Revert gh-5938, restore ks_2samp #6545

Merged

rgommers added a commit that referenced this pull request Sep 13, 2016

Merge pull request #6545 from ev-br/revert_ks_2samp

2526df7

Revert gh-5938, restore ks_2samp

rgommers mentioned this pull request Sep 14, 2016

Behavior of stats.ks_2samp for ties needs investigating #6575

Open

This was referenced Nov 9, 2018

Cythonize ks_2samp. #9462

Closed

Fast implementation of kuiper_two astropy/astropy#8098

Closed
