
ENH: add remove method to scipy.spatial.cKDTree #13050

Open · younik wants to merge 22 commits into base: main

Conversation


@younik younik commented Nov 8, 2020

Reference issue #12897

What does this implement/fix?

A simple remove method for cKDTree. This removes a given point (if it belongs to the tree) and updates self.n.
If needed, leaves are collapsed into their parent node (when parent_node.children <= leafsize) and self.size is updated.
For performance reasons, this method doesn't remove the point from self.data and doesn't update maxes and mins, but I can do that if required.

Additional information

The method accepts only one point and doesn't perform any kind of rebalancing. I'm working on these.
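The removal-with-collapse idea can be sketched in pure Python. This is a toy illustration only: the class and function names are invented for the sketch and are not the PR's C++ implementation, which works on index arrays rather than point lists.

```python
# Toy kd-tree: inner nodes split on one coordinate, leaves hold points.
class Node:
    def __init__(self, points=None, split_dim=0, split_val=0.0,
                 less=None, greater=None):
        self.points = points          # non-None => this node is a leaf
        self.split_dim = split_dim
        self.split_val = split_val
        self.less = less
        self.greater = greater

    @property
    def size(self):
        if self.points is not None:
            return len(self.points)
        return self.less.size + self.greater.size


def build(points, leafsize, depth=0):
    """Build a toy kd-tree, cycling the split dimension by depth."""
    if len(points) <= leafsize:
        return Node(points=list(points))
    dim = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    return Node(split_dim=dim, split_val=pts[mid][dim],
                less=build(pts[:mid], leafsize, depth + 1),
                greater=build(pts[mid:], leafsize, depth + 1))


def remove(node, point, leafsize):
    """Remove `point` if present; on the way back up, collapse children
    into their parent whenever the remaining subtree fits in one leaf."""
    if node.points is not None:                 # leaf
        if point in node.points:
            node.points.remove(point)
        return
    if point[node.split_dim] < node.split_val:
        remove(node.less, point, leafsize)
    else:
        remove(node.greater, point, leafsize)
    if node.size <= leafsize:                   # collapse into this node
        pts, stack = [], [node.less, node.greater]
        while stack:
            n = stack.pop()
            if n.points is not None:
                pts.extend(n.points)
            else:
                stack.extend([n.less, n.greater])
        node.points, node.less, node.greater = pts, None, None
```

With leafsize=1 and n=8, removing one point leaves a sibling pair of size 1, so the two leaves are merged into their parent, matching the description above.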


younik commented Nov 8, 2020

Here you can see a very simple example with leafsize=1 and n=8. A point is removed, so two leaves are collapsed into parent.

[screenshot of the example tree before and after removal]

@AtsushiSakai AtsushiSakai added enhancement A new feature or improvement scipy.spatial labels Nov 10, 2020

@peterbell10 peterbell10 left a comment


Mostly minor comments but I also have some serious concern about how you're updating size.

I would also say that not updating the bounding boxes seems like the wrong compromise. Shuffling the index array the way you are doing it is already an O(n) operation, so I'm not convinced that an O(log(n)) update to the bounding boxes would noticeably impact performance.

Review comments (now resolved) on:
  scipy/spatial/ckdtree/src/remove.cxx
  scipy/spatial/ckdtree.pyx
  benchmarks/benchmarks/spatial.py
younik and others added 9 commits November 10, 2020 16:55 (each co-authored by peterbell10 <peterbell10@live.co.uk>):

  • indexing syntax
  • doc correction
  • indexing syntax
  • index syntax
  • c++ ranged loop syntax
  • benchmark correction (remove by index)
"""
Time to remove one point from cKDTree.
"""
self.T.remove(0)
Member


I ran this benchmark and the results were, I would say, "suspiciously" good: on the order of 300 ns, or only a few Python function calls. After investigating, it turns out the setup method isn't re-run between iterations of the timing loop, so the point is only actually removed from the tree once.

I'm not sure if there's a good way to run non-timed code (@pv?) but you really need to build a new tree for each call to time_remove:

tree = cKDTree(self.dataset, boxsize=boxsize, leafsize=leafsize)
tree.remove(0)

But then the timing is dominated by the time to build the tree, to the point where remove is lost in the measurement error. Calling remove in a loop can bring it above the noise floor though.

Member


Using that technique I estimate this is somewhere on the order of 100x faster than building a new tree from scratch. Somewhere in the 10s of microseconds.

Author


It seems this can be done properly with pytest-benchmark (https://stackoverflow.com/questions/51288551/pytest-benchmark-run-setup-on-each-benchmark-iteration)

I've thought of initializing a list of cKDTrees in the setup method. In that case, time_remove has a little overhead from incrementing an index, but I don't know the number of iterations in advance.
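That setup-list idea could look roughly like this as an asv-style benchmark class. This is a sketch only: `N_ITERS` is a guessed upper bound on timing-loop iterations, and a trivial `FakeTree` stand-in replaces cKDTree so the snippet is self-contained.

```python
import copy


class FakeTree:
    """Stand-in for cKDTree so this sketch runs without SciPy."""
    def __init__(self, data):
        self.data = list(data)

    def remove(self, i):
        self.data.pop(i)


N_ITERS = 1000  # guessed upper bound; asv doesn't expose the real count


class RemoveBench:
    def setup(self):
        base = FakeTree(range(100))
        # One pre-built tree per timed call, so every call really removes.
        self.trees = [copy.deepcopy(base) for _ in range(N_ITERS)]
        self.i = 0

    def time_remove(self):
        self.trees[self.i].remove(0)
        self.i += 1
```

The per-call cost is then one list index plus one integer increment, which is small next to the removal itself, but the benchmark breaks if the timing loop exceeds `N_ITERS`.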

@younik younik changed the title Add remove method to scipy.spatial.cKDTree ENH: add remove method to scipy.spatial.cKDTree Nov 24, 2020
@sturlamolden
Contributor

Formatting of remove.cxx should be cleaned up to adhere to NumPy (and SciPy) coding standards. It is a bit messy.

Also only one function should be exported into the global namespace.

#include <math.h>


int
Contributor


static int
avoid global namespace pollution

return -1;
}

int subtree_size(ckdtreenode *root){
Contributor


inline int (or static int)

@sturlamolden
Contributor

The remove method of cKDTree is not threadsafe. It also renders the query methods no longer threadsafe.

It does not even help to retain the GIL for remove, as other threads might be doing a query on cKDTree in nogil context while it is being mutated. This will lead to race conditions or even a segfault. There is also the possibility of two concurrent calls to remove creating a race and possibly segfaulting.

You will therefore need to add a synchronization object to cKDTree that will:

  • Suspend a call to remove while query calls are running.
  • Suspend any query call while a call to remove is running.
  • Suspend a call to remove while another call to remove is running.
  • Suspend pickling of the cKDTree object while a call to remove is running.
  • Allow concurrent and reentrant query calls.
  • Suspend any access to the property cKDTree.tree while remove is running.
  • Prohibit any call to remove after the first access to cKDTree.tree.

(Possibly other things as well, but this is all I could think of now.)

A simple mutex is not sufficient as it would serialize query calls against each other. The synchronization problem is nightmarish, but not undoable.

Also we need test cases to make sure cKDTree remains threadsafe.

We cannot consider merging this PR before the thread-safety issues are solved. It is absolutely mandatory that they be solved.
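One possible shape for such a synchronization object is a readers-writer lock: many concurrent query calls, but exclusive access for remove. A minimal pure-Python sketch (illustrative only — it covers the query/remove requirements above but not pickling or the `tree` property):

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Allow many concurrent readers (queries) OR one exclusive writer (remove)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read(self):
        # Queries wait only while a remove is in progress;
        # they do not serialize against each other.
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write(self):
        # remove waits for all running queries and any other remove.
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()
```

Query methods would wrap their work in `read()` and remove in `write()`; unlike a plain mutex, this keeps concurrent queries running in parallel.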

@peterbell10
Member

We cannot consider to merge this PR before the threadsafety issues are solved.

Shouldn't it be up to the user not to modify the tree concurrently? Or to provide their own reader-writer lock if this is important to them?

@sturlamolden
Contributor

If we were just tampering with Python objects I would concur, but here we are mutating the underlying C++ objects. We do synchronize access to C/C++/Fortran data elsewhere in SciPy, either with the GIL or by some other means (e.g. threading.RLock).

@sturlamolden
Contributor

That is why we have this:

https://github.com/scipy/scipy/blob/5f4c4d802e5a56708d86909af6e5685cd95e6e66/scipy/_lib/_threadsafety.py

(But here we need a different synchronization mechanism.)

@peterbell10
Member

I thought the reentrant locks were for fortran functions that have internal shared state and so independent function calls are not thread safe. Whereas in this case, the user has to consciously mutate state that they've shared between multiple threads.

@sturlamolden
Contributor

sturlamolden commented Feb 15, 2021

I thought the reentrant locks were for fortran functions that have internal shared state and so independent function calls are not thread safe. Whereas in this case, the user has to consciously mutate state that they've shared between multiple threads.

When we do concurrent read-write on C++ structures that contain pointers, then all bets are off as to what might happen.

It is better to be safe than sorry and synchronize the access. It is not like it will be difficult to do the appropriate synchronization, nor will it affect the performance of cKDTree.

@younik
Author

younik commented Feb 15, 2021

The remove method of cKDTree is not threadsafe. It also renders the query methods no longer threadsafe. […]

Thanks for this precise analysis.
I suspect it will be hard for me to do this in a reasonable amount of time, but I will surely give it a shot.

@peterbell10
Member

When we do concurrent read-write on C++ structures that contain pointers, then all bets are off as to what might happen.

If you are querying the data structure concurrently with modifying it, then your results are non-deterministic even with a reader-writer lock. So either way, I would say the user's code is broken and they need to introduce higher-level synchronization into their program.

It is better to be safe than sorry and synchronize the access. It is not like it will be difficult to do the appropriate synchronization, nor will it affect the performance of cKDTree.

Well, locks aren't free. The overhead is probably somewhere in the region of 10-100 ns, so not huge, but still not free.

For me the question is: is it okay for a user's broken multi-threaded code to segfault? And is it worth pessimising everyone's code to avoid that?

@rgommers is there any standard approach to multi-threading in SciPy?
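The rough 10-100 ns figure for lock overhead is easy to sanity-check with timeit. A sketch; exact numbers vary by machine and Python version:

```python
import threading
import timeit

lock = threading.Lock()


def with_lock():
    # One uncontended acquire/release per call.
    with lock:
        pass


def without_lock():
    # Baseline: the cost of the call itself.
    pass


n = 100_000
t_locked = timeit.timeit(with_lock, number=n) / n
t_bare = timeit.timeit(without_lock, number=n) / n
overhead_ns = (t_locked - t_bare) * 1e9
print(f"per-call lock overhead: ~{overhead_ns:.0f} ns")
```

Note this measures the uncontended case only; under contention the cost is higher, which is part of the trade-off being discussed.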

@rgommers
Member

For me the question is: is okay for a user's broken multi-threaded code to segfault? And is it worth pessimising everyone's code to avoid that?

@rgommers is there any standard approach to multi-threading in SciPy?

The tools in scipy/_lib/_threadsafety.py are used in multiple modules; that's probably as standard as it gets. I'm not sure whether it's exactly the same issue or not. E.g. here:

# ARPACK is not threadsafe or reentrant (SAVE variables), so we need a
# lock and a re-entering check.
_ARPACK_LOCK = ReentrancyLock("Nested calls to eigs/eighs not allowed: "
                              "ARPACK is not re-entrant")

the user shouldn't have to know that the underlying Fortran code isn't reentrant; the user expects to be able to make calls to a function like eigs in multi-threaded/processing code.

In this case it's clear that a .remove call is modifying the data, so doing .remove plus a query in separate threads is arguably user error.

I haven't followed the details of the discussion here, hope the comment is somewhat helpful.

@sturlamolden
Contributor

In this case it's clear that a .remove call is modifying the data, so doing .remove plus a query in separate threads is arguably user error.

So we should perhaps just document this in the docstring and state that the user has to provide the synchronization?

@peterbell10
Member

A notes section with a few warnings would be useful:

  • Don't query concurrently; the tree may be in an invalid state.
  • Don't expect references to kdtree.tree obtained before the remove call to be up to date.
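In numpydoc terms, such a Notes section might read something like this (wording is a sketch, not final):

```rst
Notes
-----
This method mutates the tree in place and is not thread-safe:

* Do not call ``remove`` while queries are running in other threads;
  the tree may be observed in an invalid state. Provide your own
  synchronization (e.g. a reader-writer lock) if the tree is shared.
* References obtained from ``cKDTree.tree`` before a call to
  ``remove`` are not updated and must not be used afterwards.
```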

@lucascolley lucascolley added the needs-work Items that are pending response from the author label Dec 23, 2023
6 participants