ENH: add remove method to scipy.spatial.cKDTree #13050
# Conflicts: # scipy/spatial/ckdtree/src/remove.cxx
Mostly minor comments, but I also have some serious concerns about how you're updating size.
I would also say that not updating the bounding boxes seems like the wrong compromise. Shuffling the index array in the way you are doing is already an O(n) operation so I'm not convinced that an O(log(n)) update to the bounding boxes would impact performance noticeably.
indexing syntax Co-authored-by: peterbell10 <peterbell10@live.co.uk>
doc correction Co-authored-by: peterbell10 <peterbell10@live.co.uk>
indexing syntax Co-authored-by: peterbell10 <peterbell10@live.co.uk>
index syntax Co-authored-by: peterbell10 <peterbell10@live.co.uk>
c++ ranged loop syntax Co-authored-by: peterbell10 <peterbell10@live.co.uk>
benchmark correction (remove by index) Co-authored-by: peterbell10 <peterbell10@live.co.uk>
benchmarks/benchmarks/spatial.py (outdated)
        """
        Time to remove one point from cKDTree.
        """
        self.T.remove(0)
I ran this benchmark and the results were, I would say, "suspiciously" good: on the order of 300 ns, or only a few Python function calls. After investigating, the setup method isn't re-run between iterations of the timing loop, so the point is only actually removed from the tree once.
I'm not sure if there's a good way to run non-timed code (@pv?) but you really need to build a new tree for each call to time_remove:
tree = cKDTree(self.dataset, boxsize=boxsize, leafsize=leafsize)
tree.remove(0)
But then the timing is dominated by the time to build the tree, to the point where remove is lost in the measurement error. Calling remove in a loop can bring it above the noise floor though.
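One way to keep the build out of the timed region while still getting a fresh object for every timed call is the standard-library timeit with number=1, which runs the setup statement once per repeat. The following is only a sketch: a plain list stands in for the tree, and build_tree and remove_first are hypothetical helpers, not SciPy API.

```python
import timeit

build_calls = 0


def build_tree(n=10_000):
    """Hypothetical stand-in for building a tree: returns a fresh list."""
    global build_calls
    build_calls += 1
    return list(range(n))


def remove_first(tree):
    """Hypothetical stand-in for tree.remove(0): a mutating operation."""
    tree.pop(0)


timer = timeit.Timer(
    stmt="remove_first(tree)",
    setup="tree = build_tree()",  # untimed; executed once per repeat
    globals={"build_tree": build_tree, "remove_first": remove_first},
)

# number=1: each repeat runs setup once, then the timed statement once,
# so every sample measures a removal from a freshly built object.
samples = timer.repeat(repeat=50, number=1)
print(f"best single removal: {min(samples) * 1e6:.2f} us")
```

A single call may still sit close to the timer resolution, so the minimum over many repeats is reported here rather than a mean.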
Using that technique I estimate this is somewhere on the order of 100x faster than building a new tree from scratch. Somewhere in the 10s of microseconds.
It seems it can be done properly with pytest-benchmark (https://stackoverflow.com/questions/51288551/pytest-benchmark-run-setup-on-each-benchmark-iteration)
I've thought about initializing a list of cKDTrees in the setup method. In this case, time_remove has a little overhead from incrementing an index, but I don't know the number of iterations in advance.
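The pool idea can be sketched as follows, with the unknown iteration count handled by wrapping around the pool and removing a different point on each pass. This is an illustration only: RemoveBenchmark and pool_size are hypothetical names, and plain lists stand in for cKDTree objects.

```python
class RemoveBenchmark:
    """Hypothetical benchmark class; lists stand in for pre-built trees."""

    pool_size = 64
    n_points = 1000

    def setup(self):
        # Untimed: build every object up front.
        self.pool = [list(range(self.n_points)) for _ in range(self.pool_size)]
        self.calls = 0

    def time_remove(self):
        # Timed: pick the next object in the pool; on each full pass over
        # the pool, move on to the next point so every call removes something.
        tree = self.pool[self.calls % self.pool_size]
        point = self.calls // self.pool_size
        self.calls += 1
        tree.remove(point)


bench = RemoveBenchmark()
bench.setup()
for _ in range(130):
    bench.time_remove()
```

The sketch assumes the number of timed calls stays below pool_size * n_points; beyond that the stand-in objects run out of points to remove.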
Co-authored-by: peterbell10 <peterbell10@live.co.uk>
Formatting of remove.cxx should be cleaned up to adhere to NumPy (and SciPy) coding standards. It is a bit messy. Also only one function should be exported into the global namespace.
#include <math.h>

int
static int
avoid global namespace pollution
    return -1;
}

int subtree_size(ckdtreenode *root){
inline int (or static int)
The It does not even help to retain the GIL for You will therefore need to add a synchronization object to
(Possibly other things as well, but this is all I could think of now.) A simple mutex is not sufficient as it would serialize query calls against each other. The synchronization problem is nightmarish, but not undoable. Also we need test cases to make sure cKDTree remains thread-safe. We cannot consider merging this PR before the thread-safety issues are solved. It is absolutely mandatory that they be solved.
Shouldn't it be up to the user to not concurrently modify the tree? Or to provide their own reader-writer lock if this is important to them?
If we were just tampering with Python objects I would concur, but here we are mutating the underlying C++ objects. We do synchronize access to C/C++/Fortran data elsewhere in SciPy, either with the GIL or some other means (e.g. threading.RLock).
That is why we have this: (But here we need a different synchronization mechanism.)
I thought the reentrant locks were for Fortran functions that have internal shared state and so independent function calls are not thread safe. Whereas in this case, the user has to consciously mutate state that they've shared between multiple threads.
When we do concurrent read-write on C++ structures that contain pointers, then all bets are off as to what might happen. It is better to be safe than sorry and synchronize the access. It is not like it will be difficult to do the appropriate synchronization, nor will it affect the performance of cKDTree.
Thanks for this precise analysis.
If you are querying the data structure concurrently with modifying it, then your results are non-deterministic; even with a reader-writer lock. So, either way I would say the user's code is broken and they need to introduce higher-level synchronization into their program.
Well, locks aren't free. It's probably somewhere in the region of 10-100 ns overhead, so not huge, but still not free. For me the question is: is it okay for a user's broken multi-threaded code to segfault? And is it worth pessimising everyone's code to avoid that? @rgommers is there any standard approach to multi-threading in SciPy?
The tools in
the user shouldn't have to know that the underlying Fortran code isn't reentrant; the user expects to be able to make calls to a function like In this case it's clear that a I haven't followed the details of the discussion here, hope the comment is somewhat helpful.
So we should perhaps just document this in the docstring and state that the user has to provide the synchronization?
A notes section with a few warnings would be useful.
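If synchronization is left to the user, a reader-writer lock is the kind of object they would need: many concurrent queries, but exclusive access while the structure is mutated. The following is a minimal sketch built on threading.Condition. ReadWriteLock is a hypothetical helper, not part of SciPy or the standard library, and this simple version does not protect writers from starvation under a steady stream of readers.

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Hypothetical helper: many concurrent readers or one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of readers currently inside
        self._writer = False   # True while a writer holds the lock

    @contextmanager
    def read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()
```

Under this scheme, user code would wrap each query in `with lock.read():` and each mutating call in `with lock.write():`, so queries do not serialize against each other but never overlap with a modification.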
Reference issue #12897
What does this implement/fix?
A simple remove method for cKDTree. This removes a given point (if it belongs to the tree) and updates self.n.
If needed, leaves are collapsed into the parent node (when parent_node.children <= leafsize) and self.size is updated.
For performance reasons, this method doesn't remove the point from self.data and it doesn't update maxes and mins but, if it is required, I can do it.
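The collapse rule can be illustrated with a toy pure-Python k-d tree. This is not the PR's C++ implementation: Node, leaf_indices and remove are hypothetical names, leaves store explicit index lists instead of slices of a shared index array, and bounding boxes are ignored.

```python
class Node:
    """Toy k-d tree node; a leaf has `less is None` and owns `indices`."""

    def __init__(self, points, indices, leafsize, depth=0):
        self.count = len(indices)          # points below this node
        self.less = self.greater = None
        self.indices = list(indices)
        if self.count > leafsize:          # too big for a leaf: split at median
            self.dim = depth % len(points[0])
            ordered = sorted(indices, key=lambda i: points[i][self.dim])
            mid = self.count // 2
            self.split = points[ordered[mid]][self.dim]
            self.less = Node(points, ordered[:mid], leafsize, depth + 1)
            self.greater = Node(points, ordered[mid:], leafsize, depth + 1)
            self.indices = []


def leaf_indices(node):
    """Collect every point index stored below `node`."""
    if node.less is None:
        return list(node.indices)
    return leaf_indices(node.less) + leaf_indices(node.greater)


def remove(node, idx, points, leafsize):
    """Remove point `idx`; return True if it was found in the tree."""
    if node.less is None:
        if idx not in node.indices:
            return False
        node.indices.remove(idx)
        node.count -= 1
        return True
    v = points[idx][node.dim]
    # Ties with the split value may sit on either side, so try both.
    found = (v <= node.split and remove(node.less, idx, points, leafsize)) or (
        v >= node.split and remove(node.greater, idx, points, leafsize)
    )
    if not found:
        return False
    node.count -= 1
    if node.count <= leafsize:             # children now fit in a single leaf
        node.indices = leaf_indices(node)
        node.less = node.greater = None
    return True


points = [(0, 0), (1, 5), (2, 1), (3, 7), (4, 2), (5, 9), (6, 3), (7, 8)]
root = Node(points, range(len(points)), leafsize=2)
remove(root, 0, points, 2)
remove(root, 2, points, 2)   # root.less drops to 2 points and collapses
```

No rebalancing is attempted here either: repeated removals on one side leave the tree lopsided until nodes collapse.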
Additional information
The method accepts only one point and doesn't perform any kind of rebalancing. I'm working on these.