-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cythonize stats.ks_2samp for a ~33% gain in speed. #5938
Conversation
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0*n2) | ||
d = np.max(np.absolute(cdf1 - cdf2)) | ||
from . import _stats | ||
d = _stats.ks_2samp(data1, data2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might actually be worth keeping the old lines and just commenting them out, saying your code implements something equivalent to those lines, but more efficiently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I'm not even sure the old version is clearer than the new one. I don't really care though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's not :) It can still be useful to see how the code could be done in Python land without loops if necessary, though.
In setup.py, copypaste the |
double | ||
|
||
|
||
cpdef double ks_2samp(real[:] data1, real[:] data2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as it's only called from Python space, the function can be just def
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly made it a cpdef so that I can annotate the return type as double (mostly for documentation purposes -- "def" functions can't have a return type annotated, I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@@ -0,0 +1,31 @@ | |||
#cython: boundscheck=False | |||
#cython: nonecheck=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is better done at a function level, especially in view of consolidating functions some of which may be user-facing.
just a few general comments: searchsorted is easy to read as cdf, after a bit of thinking. We use it in an analogous way in k-sample anderson darling test. (I'm not able to easily figure out algorithmic loops.) I never thought of using kolmogorov-smirnov for anything except numbers. So I think strings and object arrays could be deprecated, if someone ever thought of using it that way. about fusing types: The current code is combining the data arrays and so has to cast to a common dtype. Since the test is for equal distribution of two data arrays, I would assume that casting to a common dtype would be acceptable, I guess they would differ only in some outlier use cases. |
one special case: What happens compared to before when there are nans in the array? |
Good catch regarding nans: the old code would return some nonsensical result; the current implementation fails because both |
As far as I can figure out from an example and the behavior of sort and searchsorted (I only tried on older versions of numpy and scipy, which I had open): nans are sorted to the end and treated as equal. Which may or may not be what anyone uses. I haven't seen a use case of it.
|
BTW: I couldn't think of any other special case that could cause problems. Ties and inf should all be unchanged, AFAICS. |
@@ master #5938 diff @@
======================================
Files 238 238
Stmts 43803 43803
Branches 8211 8213 +2
Methods 0 0
======================================
- Hit 34230 34226 -4
- Partial 2603 2605 +2
- Missed 6970 6972 +2
|
I'm thinking about the nan issue again. AFAICS, we can get now missing =drop essentially for free. Because nans are sorted to the end, we can just stop at the first nan in each data array and adjust the number of observations, n in the final statistic. |
not np.issubdtype(common_type, np.complexfloating)): | ||
raise ValueError('ks_2samp only accepts real inputs') | ||
if np.any(np.isnan(data1)) or np.any(np.isnan(data2)): | ||
raise ValueError('ks_2samp only accepts non-nan inputs') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an optimization to avoid the overhead for the standard case of no nans
np.sort moves nans at the end, so only data1[-1] and data2[-2] need to be checked for isnan
@josef-pkt Included the nan-optimization. |
raise ValueError('ks_2samp only accepts real inputs') | ||
# nans, if any, are at the end after sorting. | ||
if np.isnan(data1[-1]) or np.isnan(data2[-1]): | ||
raise ValueError('ks_2samp only accepts non-nan inputs') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is, strictly speaking, a back-compat break, so it needs a mention in the release notes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like
stats.ks_2samp
now only accepts real, non-nan inputs. It used to return nonsensical values for such inputs before.
?
If that looks good to you I'll add that and squash the commit history.
Edited release notes and squashed commit history. |
@@ -65,6 +65,9 @@ is now consistently added after the matrix is applied, | |||
independent of if the matrix is specified using a one-dimensional | |||
or a two-dimensional array. | |||
|
|||
``stats.ks_2samp`` now only accepts real, non-nan inputs. It used to return | |||
nonsensical values for such inputs before. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It used to return nonsensical values for ... inputs which are not real and not nan? I'm just confused what "such" stands for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that actually sounds horrible. What about:
stats.ks_2samp used to return nonsensical values if the input was not real or contained nans. It now raises an exception for such inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Committed.
Remove nonsensical output for non-real or nan-containing inputs, raise an exception instead for them.
Looks good, Travis is green, merging. Thank you Antony |
Needs a rebase. |
You mean the other PR right? |
Yuk, yup, wrong link from the phone. Sorry for the noise |
Since it's no longer used. It was added in scipygh-5938 for scipy 0.18.0 to get some speedup for ks_2samp, but then the addition was reverted in scipygh-6545, following the discussion in scipygh-6435: it gives different answers on different machines, it changes one ad hoc statistic to a different ad hoc statistic, and neither of them are clearly "correct".
Revert gh-5938, restore ks_2samp
Since it's no longer used. It was added in scipygh-5938 for scipy 0.18.0 to get some speedup for ks_2samp, but then the addition was reverted in scipygh-6545, following the discussion in scipygh-6435: it gives different answers on different machines, it changes one ad hoc statistic to a different ad hoc statistic, and neither of them are clearly "correct".
See #5936.