ENH: Port cKDTree query methods to C++, allow pickling on Python 2, and add support for periodic queries #4890
Conversation
#include "ckdtree_cpp_rectangle.h"

static void
__count_neighbors_traverse(const ckdtree *self,
Aren't all double underscore names reserved in C++?
The C++ compilers do not complain about the double underscores, but I'll check. They are in the Cython code, but it is trivial to remove them.
Double underscores are gone. They served no purpose anyway in the C++ code.
Update: Found one typo, but now it returns 32 instead of 33. Bugger.
All query methods are now in C++. Added the formerly requested option of returning other container types.
Remaining issues:
Travis is happy with this, it seems.
Copying @pv's question that didn't get answered yet: does this help with the performance regressions at https://pv.github.io/scipy-bench/#regressions?sort=3&dir=desc ? I did some browsing through all the
I can do a systematic benchmark of cKDTree in 0.15.x, 0.16.x and this PR. That is clearly needed before this is ready to merge. I tried to read the benchmarks from @pv, but I am not quite sure what they measure and how to read them.

0.16.x provides two ways of constructing the kd-tree: sliding midpoint as before, and median as in scikit-learn. I would not be surprised if the new method gives a performance boost or regression depending on the query method. Since the median method makes kd-tree construction slower (but queries can be faster), the net effect can also depend on how much the kd-tree construction accounts for in the whole run-time (e.g. it matters relatively more for smaller data sets than for large ones). The effect on query time will also depend on the distribution of the data, though @jakevdp thinks the median method is only inferior in contrived situations. The effect on run-time from the changes here probably depends on the hardware as well, and maybe also the compiler.

Another thing to note is that speed is not everything. This PR also allows the GIL to be released while doing the queries. In 0.15.x the GIL is never released; in 0.16.x only
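For readers unfamiliar with the two construction rules, here is a toy one-dimensional sketch of how the split coordinate is chosen under each rule. This is purely illustrative and not the actual scipy implementation, which works on multi-dimensional hyperrectangles:

```python
def median_split(xs):
    """Median rule (as in scikit-learn): split at the median coordinate,
    which balances the point counts on the two sides."""
    s = sorted(xs)
    return s[len(s) // 2]

def sliding_midpoint_split(xs):
    """Sliding midpoint rule (the pre-0.16 behaviour): split at the middle
    of the bounding interval, sliding the split to the nearest point if
    one side would otherwise be empty."""
    lo, hi = min(xs), max(xs)
    mid = 0.5 * (lo + hi)
    left = [x for x in xs if x <= mid]
    right = [x for x in xs if x > mid]
    if not left:           # slide the split up to the smallest point
        mid = min(right)
    elif not right:        # slide the split down to the largest point
        mid = max(left)
    return mid

skewed = [0.0, 0.1, 0.2, 9.9]
print(median_split(skewed))            # 0.2
print(sliding_midpoint_split(skewed))  # 4.95
```

On the skewed data above, the median rule keeps the tree balanced while the sliding midpoint rule keeps cells geometrically fat; which one queries faster depends on the data distribution, as discussed above.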
I should probably set up a repo for the benchmarks. |
Repo for benchmarks: |
The only performance regression from 0.15.x to 0.16.x is in the kd-tree creation time. But that is on a time scale three orders of magnitude smaller than the duration of the query methods. This PR is equally fast or faster than 0.16.x, depending on the query method.
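As a rough illustration of how such a comparison can be set up, here is a minimal best-of-N timing harness. The commented usage lines are placeholders, not the benchmark code actually used for the numbers above:

```python
import time

def time_call(fn, *args, repeat=5):
    """Return the best wall-clock time of fn(*args) over `repeat` runs.
    Taking the minimum suppresses one-off noise from the OS scheduler."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical usage, comparing construction vs. query cost:
#   build_t = time_call(lambda: cKDTree(data))
#   query_t = time_call(lambda: tree.query(data, k=5))
```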
The existing benchmarks are in the scipy repository. The rendered benchmarks are at https://pv.github.io/scipy-bench/# . For a particular benchmark it can of course be the case that the benchmark is not realistic, and can therefore be ignored or should be adapted. Note that you can also run the benchmarks locally.
The output you'll get is:
In my tests I used the dummy data generator from scikit-learn to get a more realistic dataset than just random numbers. |
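A similar effect can be had with the standard library alone. Here is a minimal sketch in the spirit of scikit-learn's `make_blobs`; all parameters and the function name are illustrative, not the generator actually used in the tests above:

```python
import random

def make_blobs(n_samples=100, centers=((0.0, 0.0), (5.0, 5.0)),
               std=0.5, seed=0):
    """Draw 2-D points from isotropic Gaussians around the given centers,
    so the sample space is clustered rather than uniformly populated."""
    rng = random.Random(seed)
    points = []
    for i in range(n_samples):
        cx, cy = centers[i % len(centers)]   # alternate between centers
        points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points

data = make_blobs(n_samples=6)
```

Because the points cluster around the centers instead of filling the space uniformly, kd-tree query costs on such data behave more like they do on real classification datasets.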
sparse_distance_matrix has been in cKDTree since 882820e in 2012. Obviously, the regression is after that.
The first regression in sparse_distance_matrix is probably due to fixing the implementation to give correct results, probably a9565c0. The kdtree benchmarks currently under benchmarks/ are those that were previously under scipy/spatial/benchmarks --- I have no idea whether they are sensible or not. If you think they are not sensible, they should be replaced. (Given that KDTrees have various applications, it's not clear to me there is a good criterion for a "realistic" dataset.)
The more recent performance regression in the sparse_distance_matrix benchmarks is bea41ad, as indicated in the plots.
Did you look at the timings in the IPython notebook? There you can see the runtime of
What I mean by a "realistic dataset" is that the data has some structure, so that the n-dimensional sample space is not uniformly populated. This is at least the common situation in classification problems.
Note that it's not really "my" benchmark --- these are the old benchmarks written by the original ckdtree authors. The benchmarks are in the scipy repository --- you can try running them on your machine to see if you can reproduce the results.
I am not expecting the Travis build to succeed (I don't know how to shut it down), but now the merge conflicts are taken care of. The next step is to fix the build issue on SunOS and Cygwin due to
But on the positive side, we are up to date with scipy/master again :-)
As expected, all the builds are failing because the symbol
Bugger, I forgot about using prefixes on the commit messages :-(
Great, what do you think @rainwoodman? Are we happy now? Is there more that needs to be done? Fix my last commit messages perhaps (I don't know how), but apart from that, does this look good?
@rgommers Should we put a 0.17 milestone on this? This PR is basically done, unless @rainwoodman has anything else to add. (It remains to verify that it builds on Wintendo. My old PC just died of old age, but I am setting up a VirtualBox VM on my Mac. Thank God I kept the DVD.)
Summary of what is in this PR:
Sure. I am happy with this. Thanks!
WIP tag is removed. |
We will likely write code for 2-point galaxy clustering analysis in cosmology, extending the original Gray and Moore (2000) algorithm to the anisotropic case. This code will be based on the current ckdtree code in this branch, but would add these features to ckdtree:
To my eyes these two features are quite doable. What do you think, @sturlamolden @rgommers? We will be testing the full program (which would just blend some parallelism and an IO interface with the modified ckdtree) against the standard code used in SDSS galaxy clustering analysis (https://github.com/bareid/xi), and fix bugs along the way. I am guessing the timeline for a new ckdtree PR will be the next 2 to 3 months, probably not soon enough to make it into the next release?
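For reference, periodic queries of the kind this PR's title mentions typically reduce to the minimum-image convention for distances. A minimal sketch for an axis-aligned periodic box follows; this is an illustration of the convention, not the API implemented in the PR:

```python
import math

def periodic_distance(p, q, box):
    """Euclidean distance between points p and q in a periodic box,
    taking the shorter of the direct and the wrapped-around separation
    along each axis (the minimum-image convention)."""
    total = 0.0
    for a, b, side in zip(p, q, box):
        d = abs(a - b) % side       # wrap the separation into [0, side)
        d = min(d, side - d)        # choose the shorter image
        total += d * d
    return math.sqrt(total)

# In a box of side 10, points at 0.5 and 9.5 are only 1.0 apart,
# because the shortest path wraps around the boundary.
print(periodic_distance((0.5,), (9.5,), (10.0,)))  # 1.0
```

A kd-tree supporting periodic queries has to account for exactly this wrap-around when it decides which cells can be pruned.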
They are probably doable. I am not sure what the weights do, but now that we use templates it is easy to plug them in without producing a performance regression. 0.17 will probably be tagged and branched soon. It is that time of the year. Then 0.18 will be next summer, or something like that. :-)
Can someone please tag this PR with 0.17 milestone? |
I've verified that this builds and passes all tests with MinGW (Python 3.4, SSE3 build). So in it goes. Thanks @sturlamolden and @rainwoodman!
@rainwoodman a next release is never more than 6 months away, so no worries about the timing for the next PR.
I'll send an update for the release notes.
New WIP PR after rebase disaster.