
ENH: add remove method to scipy.spatial.cKDTree #13050

Open · younik wants to merge 22 commits into base: main

Conversation


@younik younik commented Nov 8, 2020

Reference issue #12897

What does this implement/fix?

A simple remove method for cKDTree. This removes a given point (if it belongs to the tree) and updates self.n.
If needed, leaves are collapsed into their parent node (when parent_node.children <= leafsize) and self.size is updated.
For performance reasons, this method doesn't remove the point from self.data and doesn't update maxes and mins, but I can do that if required.

Additional information

The method accepts only one point and doesn't perform any kind of rebalancing. I'm working on these.
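The removal-with-collapse idea can be sketched in pure Python. This is a toy illustration only: the class and function names are invented for the sketch and are not the PR's C++ implementation, which works on index arrays rather than point lists.

```python
# Toy kd-tree: inner nodes split on one coordinate, leaves hold points.
class Node:
    def __init__(self, points=None, split_dim=0, split_val=0.0,
                 less=None, greater=None):
        self.points = points          # non-None => this node is a leaf
        self.split_dim = split_dim
        self.split_val = split_val
        self.less = less
        self.greater = greater

    @property
    def size(self):
        if self.points is not None:
            return len(self.points)
        return self.less.size + self.greater.size


def build(points, leafsize, depth=0):
    """Build a toy kd-tree, cycling the split dimension by depth."""
    if len(points) <= leafsize:
        return Node(points=list(points))
    dim = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    return Node(split_dim=dim, split_val=pts[mid][dim],
                less=build(pts[:mid], leafsize, depth + 1),
                greater=build(pts[mid:], leafsize, depth + 1))


def remove(node, point, leafsize):
    """Remove `point` if present; on the way back up, collapse children
    into their parent whenever the remaining subtree fits in one leaf."""
    if node.points is not None:                 # leaf
        if point in node.points:
            node.points.remove(point)
        return
    if point[node.split_dim] < node.split_val:
        remove(node.less, point, leafsize)
    else:
        remove(node.greater, point, leafsize)
    if node.size <= leafsize:                   # collapse into this node
        pts, stack = [], [node.less, node.greater]
        while stack:
            n = stack.pop()
            if n.points is not None:
                pts.extend(n.points)
            else:
                stack.extend([n.less, n.greater])
        node.points, node.less, node.greater = pts, None, None
```

With leafsize=1 and n=8, removing one point leaves a sibling pair of size 1, so the two leaves are merged into their parent, matching the description above.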


younik commented Nov 8, 2020

Here you can see a very simple example with leafsize=1 and n=8. A point is removed, so two leaves are collapsed into parent.

[screenshot of the example tree before and after removal]

@AtsushiSakai AtsushiSakai added enhancement A new feature or improvement scipy.spatial labels Nov 10, 2020

@peterbell10 peterbell10 left a comment


Mostly minor comments but I also have some serious concern about how you're updating size.

I would also say that not updating the bounding boxes seems like the wrong compromise. Shuffling the index array the way you are doing it is already an O(n) operation, so I'm not convinced that an O(log(n)) update to the bounding boxes would noticeably impact performance.

Review comments (now resolved) on:
  scipy/spatial/ckdtree/src/remove.cxx
  scipy/spatial/ckdtree.pyx
  benchmarks/benchmarks/spatial.py
younik and others added 9 commits November 10, 2020 16:55 (each co-authored by peterbell10 <peterbell10@live.co.uk>):

  • indexing syntax
  • doc correction
  • indexing syntax
  • index syntax
  • c++ ranged loop syntax
  • benchmark correction (remove by index)
"""
Time to remove one point from cKDTree.
"""
self.T.remove(0)
Member


I ran this benchmark and the results were, I would say, "suspiciously" good: on the order of 300 ns, or only a few Python function calls. After investigating, it turns out the setup method isn't re-run between iterations of the timing loop, so the point is only actually removed from the tree once.

I'm not sure if there's a good way to run non-timed code (@pv?) but you really need to build a new tree for each call to time_remove:

tree = cKDTree(self.dataset, boxsize=boxsize, leafsize=leafsize)
tree.remove(0)

But then the timing is dominated by the time to build the tree, to the point where remove is lost in the measurement error. Calling remove in a loop can bring it above the noise floor though.

Member


Using that technique I estimate this is somewhere on the order of 100x faster than building a new tree from scratch. Somewhere in the 10s of microseconds.

Author


It seems this can be done properly with pytest-benchmark (https://stackoverflow.com/questions/51288551/pytest-benchmark-run-setup-on-each-benchmark-iteration)

I've thought of initializing a list of cKDTrees in the setup method. In that case, time_remove has a little overhead from incrementing an index, but I don't know the number of iterations in advance.
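That setup-list idea could look roughly like this as an asv-style benchmark class. This is a sketch only: `N_ITERS` is a guessed upper bound on timing-loop iterations, and a trivial `FakeTree` stand-in replaces cKDTree so the snippet is self-contained.

```python
import copy


class FakeTree:
    """Stand-in for cKDTree so this sketch runs without SciPy."""
    def __init__(self, data):
        self.data = list(data)

    def remove(self, i):
        self.data.pop(i)


N_ITERS = 1000  # guessed upper bound; asv doesn't expose the real count


class RemoveBench:
    def setup(self):
        base = FakeTree(range(100))
        # One pre-built tree per timed call, so every call really removes.
        self.trees = [copy.deepcopy(base) for _ in range(N_ITERS)]
        self.i = 0

    def time_remove(self):
        self.trees[self.i].remove(0)
        self.i += 1
```

The per-call cost is then one list index plus one integer increment, which is small next to the removal itself, but the benchmark breaks if the timing loop exceeds `N_ITERS`.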

@younik younik changed the title Add remove method to scipy.spatial.cKDTree ENH: add remove method to scipy.spatial.cKDTree Nov 24, 2020
@sturlamolden
Contributor

Formatting of remove.cxx should be cleaned up to adhere to NumPy (and SciPy) coding standards. It is a bit messy.

Also only one function should be exported into the global namespace.

#include <math.h>


int
Contributor


static int
avoid global namespace pollution

return -1;
}

int subtree_size(ckdtreenode *root){
Contributor


inline int (or static int)

@sturlamolden
Contributor

The remove method of cKDTree is not threadsafe. It also renders the query methods no longer threadsafe.

It does not even help to retain the GIL for remove, as other threads might be doing a query on cKDTree in nogil context while it is being mutated. This will lead to race conditions or even a segfault. There is also the possibility of two concurrent calls to remove creating a race and possibly segfaulting.

You will therefore need to add a synchronization object to cKDTree that will:

  • Suspend a call to remove while query calls are running.
  • Suspend any query call while a call to remove is running.
  • Suspend a call to remove while another call to remove is running.
  • Suspend pickling of the cKDTree object while a call to remove is running.
  • Allow concurrent and reentrant query calls.
  • Suspend any access to the property cKDTree.tree while remove is running.
  • Prohibit any call to remove after the first access to cKDTree.tree.

(Possibly other things as well, but this is all I could think of now.)

A simple mutex is not sufficient as it would serialize query calls against each other. The synchronization problem is nightmarish, but not undoable.

Also we need test cases to make sure cKDTree remains threadsafe.

We cannot consider merging this PR before the thread-safety issues are solved. It is absolutely mandatory that they be solved.
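One possible shape for such a synchronization object is a readers-writer lock: many concurrent query calls, but exclusive access for remove. A minimal pure-Python sketch (illustrative only — it covers the query/remove requirements above but not pickling or the `tree` property):

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Allow many concurrent readers (queries) OR one exclusive writer (remove)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read(self):
        # Queries wait only while a remove is in progress;
        # they do not serialize against each other.
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write(self):
        # remove waits for all running queries and any other remove.
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()
```

Query methods would wrap their work in `read()` and remove in `write()`; unlike a plain mutex, this keeps concurrent queries running in parallel.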

@peterbell10
Member

We cannot consider to merge this PR before the threadsafety issues are solved.

Shouldn't it be up to the user not to modify the tree concurrently? Or to provide their own reader-writer lock if this is important to them?

@sturlamolden
Contributor

If we were just tampering with Python objects I would concur, but here we are mutating the underlying C++ objects. We do synchronize access to C/C++/Fortran data elsewhere in SciPy, either with the GIL or by some other means (e.g. threading.RLock).

@sturlamolden
Contributor

That is why we have this:

https://github.com/scipy/scipy/blob/5f4c4d802e5a56708d86909af6e5685cd95e6e66/scipy/_lib/_threadsafety.py

(But here we need a different synchronization mechanism.)

@peterbell10
Member

I thought the reentrant locks were for fortran functions that have internal shared state and so independent function calls are not thread safe. Whereas in this case, the user has to consciously mutate state that they've shared between multiple threads.

@sturlamolden
Contributor

sturlamolden commented Feb 15, 2021

I thought the reentrant locks were for fortran functions that have internal shared state and so independent function calls are not thread safe. Whereas in this case, the user has to consciously mutate state that they've shared between multiple threads.

When we do concurrent read-write on C++ structures that contain pointers, then all bets are off as to what might happen.

It is better to be safe than sorry and synchronize the access. It is not like it will be difficult to do the appropriate synchronization, nor will it affect the performance of cKDTree.

@younik
Author

younik commented Feb 15, 2021

The remove method of cKDTree is not threadsafe. It also renders the query methods no longer threadsafe. […]

Thanks for this precise analysis.
I suspect it will be hard for me to do this in a reasonable amount of time, but I will surely give it a shot.

@peterbell10
Member

When we do concurrent read-write on C++ structures that contain pointers, then all bets are off as to what might happen.

If you are querying the data structure concurrently with modifying it, then your results are non-deterministic even with a reader-writer lock. So either way, I would say the user's code is broken and they need to introduce higher-level synchronization into their program.

It is better to be safe than sorry and synchronize the access. It is not like it will be difficult to do the appropriate synchronization, nor will it affect the performance of cKDTree.

Well, locks aren't free. The overhead is probably somewhere in the region of 10-100 ns, so not huge, but still not free.

For me the question is: is it okay for a user's broken multi-threaded code to segfault? And is it worth pessimising everyone's code to avoid that?

@rgommers is there any standard approach to multi-threading in SciPy?
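The rough 10-100 ns figure for lock overhead is easy to sanity-check with timeit. A sketch; exact numbers vary by machine and Python version:

```python
import threading
import timeit

lock = threading.Lock()


def with_lock():
    # One uncontended acquire/release per call.
    with lock:
        pass


def without_lock():
    # Baseline: the cost of the call itself.
    pass


n = 100_000
t_locked = timeit.timeit(with_lock, number=n) / n
t_bare = timeit.timeit(without_lock, number=n) / n
overhead_ns = (t_locked - t_bare) * 1e9
print(f"per-call lock overhead: ~{overhead_ns:.0f} ns")
```

Note this measures the uncontended case only; under contention the cost is higher, which is part of the trade-off being discussed.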

@rgommers
Member

For me the question is: is okay for a user's broken multi-threaded code to segfault? And is it worth pessimising everyone's code to avoid that?

@rgommers is there any standard approach to multi-threading in SciPy?

The tools in scipy/_lib/_threadsafety.py are used in multiple modules; that's probably as standard as it gets. I'm not sure whether it's exactly the same issue or not. E.g. here:

# ARPACK is not threadsafe or reentrant (SAVE variables), so we need a
# lock and a re-entering check.
_ARPACK_LOCK = ReentrancyLock("Nested calls to eigs/eighs not allowed: "
                              "ARPACK is not re-entrant")

the user shouldn't have to know that the underlying Fortran code isn't reentrant; the user expects to be able to make calls to a function like eigs in multi-threaded/processing code.

In this case it's clear that a .remove call is modifying the data, so doing .remove plus a query in separate threads is arguably user error.

I haven't followed the details of the discussion here, hope the comment is somewhat helpful.

@sturlamolden
Contributor

In this case it's clear that a .remove call is modifying the data, so doing .remove plus a query in separate threads is arguably user error.

So we should perhaps just document this in the docstring and state that the user has to provide the synchronization?

@peterbell10
Member

A notes section with a few warnings would be useful:

  • Don't query concurrently; the tree may be in an invalid state.
  • Don't expect references to kdtree.tree obtained before the remove call to be up to date.
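In numpydoc terms, such a Notes section might read something like this (wording is a sketch, not final):

```rst
Notes
-----
This method mutates the tree in place and is not thread-safe:

* Do not call ``remove`` while queries are running in other threads;
  the tree may be observed in an invalid state. Provide your own
  synchronization (e.g. a reader-writer lock) if the tree is shared.
* References obtained from ``cKDTree.tree`` before a call to
  ``remove`` are not updated and must not be used afterwards.
```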

@lucascolley lucascolley added the needs-work Items that are pending response from the author label Dec 23, 2023
6 participants