Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cythonize stats.ks_2samp for a ~33% gain in speed. #5938

Merged
merged 1 commit into from
Mar 27, 2016

Conversation

anntzer
Copy link
Contributor

@anntzer anntzer commented Mar 7, 2016

See #5936.

cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0*n2)
d = np.max(np.absolute(cdf1 - cdf2))
from . import _stats
d = _stats.ks_2samp(data1, data2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might actually be worth keeping the old lines and just commenting them out, saying your code implements something equivalent to those lines, but more efficiently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I'm not even sure the old version is clearer than the new one. I don't really care though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's not :) It can still be useful to see how the code could be done in Python land without loops if necessary, though.

@pv
Copy link
Member

pv commented Mar 7, 2016

In setup.py, copypaste the vonmises_cython lines and replace vonmises_cython -> _stats.

double


cpdef double ks_2samp(real[:] data1, real[:] data2):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it's only called from Python space, the function can be just def.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly made it a cpdef so that I can annotate the return type as double (mostly for documentation purposes -- "def" functions can't have a return type annotated, I think).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@ev-br ev-br added scipy.stats maintenance Items related to regular maintenance tasks labels Mar 7, 2016
@@ -0,0 +1,31 @@
#cython: boundscheck=False
#cython: nonecheck=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is better done at a function level, especially in view of consolidating functions some of which may be user-facing.

@josef-pkt
Copy link
Member

just a few general comments:

searchsorted is easy to read as cdf, after a bit of thinking. We use it in an analogous way in k-sample anderson darling test. (I'm not able to easily figure out algorithmic loops.)

I never thought of using kolmogorov-smirnov for anything except numbers. So I think strings and object arrays could be deprecated, if someone ever thought of using it that way.

about fusing types: The current code is combining the data arrays and so has to cast to a common dtype. Since the test is for equal distribution of two data arrays, I would assume that casting to a common dtype would be acceptable, I guess they would differ only in some outlier use cases.

@josef-pkt
Copy link
Member

one special case: What happens compared to before when there are nans in the array?

@anntzer
Copy link
Contributor Author

anntzer commented Mar 8, 2016

Good catch regarding nans: the old code would return some nonsensical result; the current implementation fails because both d1i <= d2j and d1i >= d2j evaluate to False when one of them is None, and thus the loop never ends.
I think the correct thing to do here is just to break that behavior and raise an exception on non-real or nan-containing inputs. Thoughts?

@josef-pkt
Copy link
Member

As far as I can figure out from an example and the behavior of sort and searchsorted (I only tried on older versions of numpy and scipy, which I had open): nans are sorted to the end and treated as equal.

Which may or may not be what anyone uses. I haven't seen a use case of it.

>>> x = np.array([1,2,3.5, np.nan, np.nan, 4, 5.5])
>>> stats.ks_2samp(x, np.sqrt(x))
(0.4285714285714286, 0.4232182945334888)

>>> xn = x.copy()
>>> xn[np.isnan(x)] = 10
>>> xn2 = np.sqrt(x)
>>> xn2[np.isnan(x)] = 10
>>> stats.ks_2samp(xn, xn2)
(0.4285714285714286, 0.4232182945334888)

>>> import scipy
>>> scipy.__version__
'0.13.3'
>>> import numpy
>>> numpy.__version__
'1.6.1'
``

@josef-pkt
Copy link
Member

BTW: I couldn't think of any other special case that could cause problems. Ties and inf should all be unchanged, AFAICS.

@codecov-io
Copy link

@@            master   #5938   diff @@
======================================
  Files          238     238       
  Stmts        43803   43803       
  Branches      8211    8213     +2
  Methods          0       0       
======================================
- Hit          34230   34226     -4
- Partial       2603    2605     +2
- Missed        6970    6972     +2

Review entire Coverage Diff as of c74bd14

Powered by Codecov. Updated on successful CI builds.

@josef-pkt
Copy link
Member

I'm thinking about the nan issue again.

AFAICS, we can get now missing =drop essentially for free. Because nans are sorted to the end, we can just stop at the first nan in each data array and adjust the number of observations, n in the final statistic.

not np.issubdtype(common_type, np.complexfloating)):
raise ValueError('ks_2samp only accepts real inputs')
if np.any(np.isnan(data1)) or np.any(np.isnan(data2)):
raise ValueError('ks_2samp only accepts non-nan inputs')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an optimization to avoid the overhead for the standard case of no nans

np.sort moves nans at the end, so only data1[-1] and data2[-2] need to be checked for isnan

@anntzer
Copy link
Contributor Author

anntzer commented Mar 8, 2016

@josef-pkt Included the nan-optimization.
I don't think including nan=drop semantics by default is a good thing (explicit better than implicit, etc.).

raise ValueError('ks_2samp only accepts real inputs')
# nans, if any, are at the end after sorting.
if np.isnan(data1[-1]) or np.isnan(data2[-1]):
raise ValueError('ks_2samp only accepts non-nan inputs')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is, strictly speaking, a back-compat break, so it needs a mention in the release notes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

stats.ks_2samp now only accepts real, non-nan inputs. It used to return nonsensical values for such inputs before.

?
If that looks good to you I'll add that and squash the commit history.

@anntzer
Copy link
Contributor Author

anntzer commented Mar 16, 2016

Edited release notes and squashed commit history.

@@ -65,6 +65,9 @@ is now consistently added after the matrix is applied,
independent of if the matrix is specified using a one-dimensional
or a two-dimensional array.

``stats.ks_2samp`` now only accepts real, non-nan inputs. It used to return
nonsensical values for such inputs before.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It used to return nonsensical values for ... inputs which are not real and not nan? I'm just confused what "such" stands for here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that actually sounds horrible. What about:
stats.ks_2samp used to return nonsensical values if the input was not real or contained nans. It now raises an exception for such inputs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committed.

Remove nonsensical output for non-real or nan-containing inputs, raise
an exception instead for them.
@ev-br
Copy link
Member

ev-br commented Mar 27, 2016

Looks good, Travis is green, merging. Thank you Antony

@ev-br ev-br merged commit fa9a6f5 into scipy:master Mar 27, 2016
@ev-br ev-br added this to the 0.18.0 milestone Mar 27, 2016
@anntzer anntzer deleted the faster-ks2samp branch March 28, 2016 00:04
@ev-br
Copy link
Member

ev-br commented Mar 28, 2016

Needs a rebase.

@anntzer
Copy link
Contributor Author

anntzer commented Mar 28, 2016

You mean the other PR right?

@ev-br
Copy link
Member

ev-br commented Mar 28, 2016

Yuk, yup, wrong link from the phone. Sorry for the noise

ev-br added a commit to ev-br/scipy that referenced this pull request Sep 12, 2016
Since it's no longer used. It was added in scipygh-5938 for scipy 0.18.0
to get some speedup for ks_2samp, but then the addition was reverted
in scipygh-6545, following the discussion in scipygh-6435: it gives different
answers on different machines, it changes one ad hoc statistic to
a different ad hoc statistic, and neither of them are clearly "correct".
rgommers added a commit that referenced this pull request Sep 13, 2016
ev-br added a commit to ev-br/scipy that referenced this pull request Sep 13, 2016
Since it's no longer used. It was added in scipygh-5938 for scipy 0.18.0
to get some speedup for ks_2samp, but then the addition was reverted
in scipygh-6545, following the discussion in scipygh-6435: it gives different
answers on different machines, it changes one ad hoc statistic to
a different ad hoc statistic, and neither of them are clearly "correct".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants