
KDTree for accelerating NMS #40

Closed · wants to merge 6 commits

Conversation

@GFleishman commented Mar 18, 2020

This pull request solves issue #39.

After polygon candidates have been reformatted into parallel lists of 3D points and distances (sorted by probability), I construct a KDTree from the points using scipy.spatial.cKDTree. This takes negligible time for all of my tested cases (a few seconds at most). I then call kdtree.query_ball_tree, which returns a list of lists that is also parallel to points/dists. The i-th sublist contains the points which are sufficiently close to point i to merit comparison in NMS. This query is somewhat time consuming: for the largest case tested (6.32e6 points) it took 23 minutes.
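For concreteness, here is a minimal sketch of this pre-query (the variable names are illustrative, not the exact ones used in the PR):

```python
from scipy.spatial import cKDTree

def candidate_neighbors(points, radius):
    """points: (N, 3) candidate centers sorted by probability (descending);
    radius: upper bound on the distance at which two candidates could still overlap."""
    tree = cKDTree(points)  # building the tree is cheap compared to the query
    # list of N lists, parallel to points: the i-th sublist holds the indices of
    # all candidates within `radius` of point i (including i itself)
    return tree.query_ball_tree(tree, r=radius)
```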

This list of lists is passed to c_non_max_suppression_inds using the O& option to PyArg_ParseTuple and a custom Python-to-C data type converter function, lists_to_vectors.

The inner loop of _COMMON_non_maximum_suppression_sparse is modified to iterate only over these candidate pairs, capitalizing on the investment of the preceding bounded-neighbors query. For the largest tested case it brought NMS time down from 7.5 hours to 20 minutes, while still finding the same set of polygons.
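A rough Python sketch of the modified loop (illustrative only - the real implementation is in C++, and overlap() and threshold stand in for the actual polygon-overlap test and its cutoff):

```python
def nms_with_neighbors(neighbors, overlap, threshold):
    # Candidates are sorted by probability (descending); `neighbors` is the
    # list of lists returned by the kdtree query above.
    n = len(neighbors)
    suppressed = [False] * n
    kept = []
    for i in range(n):
        if suppressed[i]:
            continue
        kept.append(i)
        for j in neighbors[i]:  # only nearby candidates, instead of all j > i
            if j > i and not suppressed[j] and overlap(i, j) > threshold:
                suppressed[j] = True
    return kept
```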

I evaluated NMS time over a wide range of polygon candidate counts. Here are the results (note the log_10 scale on the vertical axis):
[Figure: NMS runtime vs. number of polygon candidates, baseline and kdtree, log_10 scale on the vertical axis]

In all cases except the largest, the number of detected polygons was exactly the same for the baseline method and the kdtree method. For the largest case, the baseline method detected 90535 polygons and the kdtree method detected 90536 polygons. Here is a side-by-side comparison of a single plane of polygons detected in the largest case for both methods:

[Figure: full image results, side-by-side comparison of baseline and kdtree detections for a single plane]

and an overlay (baseline method in red, with kdtree in green):

[Figure: full image results, overlay of baseline (red) and kdtree (green) detections]

Correspondence is exact through the whole stack - I'm not sure where that extra polygon is; it's too hard to find by eye :)

Some additional notes
The kdtree query time also scales with the number of points, but with a worst-case complexity of O(N log N). For very few points (left side of the plot) the query time is negligible (milliseconds), so the kdtree doesn't hurt in those cases.

The O(N log N) query time for the kdtree could also be reduced to O(C log N), where C is a constant strictly smaller than N, with a better implementation. Currently, I'm building the kdtree in Python, running the bounded-neighbors query for all points, and then passing the result to the C++ code. Some of these points get suppressed before their bounded-neighbors lists are used, so computing their bounded neighbors was wasted work. If the kdtree data structure were directly accessible in the C++ code, you would instead query for the bounded neighbors of each individual point at the start of the inner loop of NMS. That way, points which were already suppressed never need to be queried; I think this would cut the query time at least in half. It's a bit trickier to implement, but definitely possible. I would probably grab the cKDTree implementation from scipy and start from there.

The one extra polygon detected for the largest test case illustrates an interesting aspect of the kdtree.query_ball_tree (bounded-neighbors) search: you must provide an upper bound. You can guarantee that any polygon which could be suppressed by a point is included in its bounded-neighbors list by using 2 * np.max(dists) as the upper bound (any polygons farther than that cannot possibly overlap), but this results in a pretty long query time, though likely still an overall gain compared to NMS without the kdtree. The current choice is 2 * (np.mean(dists) + np.std(dists)), which gives acceptable results for my dataset and is a bit quicker than the max, so I've stuck with it.
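In code, the two bounds look like this (a sketch; dists is the full array of candidate distances):

```python
import numpy as np

# Safe bound: no overlapping pair can be missed, but the query returns many more neighbors.
radius_max = 2 * np.max(dists)

# Heuristic used in this PR: faster, but may occasionally miss a pair.
radius_mean_std = 2 * (np.mean(dists) + np.std(dists))
```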

@GFleishman (Author) commented:

Oh, I also just realized that the approximate times I give for the cKDTree.query_ball_tree queries also include passing the list of lists to the C code and converting it to a vector of vectors. My C++ is really rusty, so I might be doing something quite inefficient in this conversion (the lists_to_vectors function in stardist3d.cpp); e.g. the push_back calls seem suspect to me because the vector may reallocate memory as it grows. There is likely a better way to write this.

@maweigert (Member) commented:

Hi Greg,

thanks a lot for this contribution - it already looks very promising!

> I evaluated NMS time over a wide range of polygon candidate counts. Here are the results (note the log_10 scale on the vertical axis):

That's a nice speedup! The 6Mio candidates are for the whole stack, I assume? So using predict_big doesn't have any effect on the prediction time?

I'm slightly worried about the additional memory footprint that a full kdtree of all candidates might bring - did you try measuring the peak memory usage? Additionally, I suspect there are quite heavy differences between different kdtree implementations, e.g. there are KDTree and BallTree from scikit-learn. Here's a short test I did with random points in 3D and these different implementations:
"ckdtree" -> scipy.spatial.cKDTree
"kdtree" -> sklearn.neighbors.KDTree
"ball" -> sklearn.neighbors.BallTree

[Figure: build time and memory usage for the ckdtree, kdtree, and ball implementations on random 3D points]

It shows that i) there are quite some differences, especially for memory, and ii) the overall memory footprint is pretty high (e.g. 18GB for a mere 30k points for ckdtree).

So having a similar plot as above but for memory consumption in your example would be nice and instructive. Additionally moving to BallTree might make things more efficient...

@GFleishman (Author) commented Mar 19, 2020

Hi Martin,

Yes, the 6Mio candidates are for prediction on the whole stack. I did try predict_big on this volume. What I noticed was that the memory footprint went way down. Using Uwe's recommended parameters I was able to run the job with a single CPU (the number of cores requested and the maximum amount of RAM available are coupled on our cluster; each core comes with 15GB RAM). However, the remaining runtime predicted by tqdm after a few iterations had accumulated was something like 36 hours, so I cancelled it and gave up on it. Uwe reminded me that with one core I likely had a maximum of 2 threads, but I did not go back and try predict_big with more cores. Do you think predict_big with, say, 32 cores (64 threads) could finish in less than an hour? If so I will give it another try.

For different kdtree implementations - I did look through this article written by the scikit-learn kdtree and balltree developer, but I didn't think there were enough differences in his graphs to justify introducing a new dependency (I don't think scikit-learn was a dependency before?), and I also preferred something with an underlying C++ implementation, as there are bigger speedups available if the kdtree data structure is available for query directly within the NMS algorithm.

Unfortunately the only RAM measurements I have are the maximum and average usage (over the full run, prediction and NMS) for prediction on the full volumes both w/ and w/o the kdtree. The command in both cases was:

```python
pred, det = model.predict_instances(image_norm, verbose=True, n_tiles=(4, 8, 2))
```

on a volume with 860x1676x301 voxels and 6.32e6 polygon candidates:

w/o kdtree max: 434.439GB
w/ kdtree max: 430.482GB

w/o kdtree avg: 221.672GB
w/ kdtree avg: 297.733GB

I'm thinking that the kdtree and the query results for this many candidates took something like 70GB RAM. One thing that I haven't looked at in the code: it seems like the biggest expansion of data in the method is from raw data to prediction; e.g. in my case I'm using 128 rays, so I need roughly (129 * 2)x more RAM relative to my input image (assuming 32-bit floats for probs and dists and a 16-bit int image). The n_tiles parameter cuts the image up into overlapping tiles for prediction. Does each tile also get separately prob-thresholded before stitching back up for NMS? Based on the way RAM usage appears to me, it seems like no; but that would save a lot of space, since the dists of most voxels could be thrown out before the dists of new voxels were predicted/stored. I think if this were implemented, then the additional RAM of the kdtree would become an important factor.
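For reference, the rough arithmetic behind that (129 * 2) factor:

```python
# Assumes a 16-bit input image, float32 outputs, and 128 rays + 1 probability channel.
voxels      = 860 * 1676 * 301
image_bytes = voxels * 2               # int16 raw image
pred_bytes  = voxels * 4 * (128 + 1)   # float32 dist (128 rays) + float32 prob
print(pred_bytes / image_bytes)        # 258.0, i.e. 129 * 2
```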

@GFleishman (Author) commented:

Hi Martin,

I think you are right about scikit-learn kdtree outperforming scipy cKDTree. While working on another change to the package, I was led to trying the scikit-learn kdtree. For the same dataset considered above, the query was a few minutes faster than scipy cKDTree, but more importantly - due to the underlying data types used - the query results consumed significantly less RAM.

For both packages I've found that the tree structure itself takes a negligible amount of RAM - I'm not sure exactly how this is implemented under the hood, but I suppose each node in the tree could be as small as a single pointer to where in the array the points are split. If that's the case you really only need something like 3*log_2(N) references (3 axes and N points). However it's implemented, empirically it's quite small.

The query results however can be quite large, in the abstract it's a list of lists. The outer list is of length N (for N points) and the ith inner list is of length equal to the number of points within radius distance of the ith point. So this could be on the order of ~1000 * N long ints - that's something like 25GB without considering any extra metadata overhead.

Scipy represents this object as a Python list of Python lists of Python ints. Python ints carry a cumbersome amount of metadata, so I don't think they're a great data type here. Scikit-learn, on the other hand, represents this as a Numpy array with dtype='O', that is, an array of objects. The objects are 1D Numpy arrays themselves with dtype=np.int64. This seems to be more space efficient overall. I'd love to select np.int32 instead, since that's more than adequate, but it does not seem to be possible without modifying the sklearn.neighbors.KDTree code (maybe still worth doing in the long run).

So - combining the small query speed up and the better space efficiency, I've added code to use scikit-learn kdtree in a new commit.
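A minimal sketch of the kind of query involved (not the exact code in the commit; points and radius are as in the earlier sketches):

```python
from sklearn.neighbors import KDTree

tree = KDTree(points)  # points: (N, 3) candidate centers
# Object array of length N; each entry is an int64 array of neighbor indices.
neighbors = tree.query_radius(points, r=radius)
```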

@GFleishman (Author) commented Mar 30, 2020

I've made some significant improvements by switching to the scikit-learn version of KDTree (good suggestion!) and changing the way the KDTree is used within NMS. To keep things clean, I'm going to delete my existing fork, create a new one, put the final version of the changes there, and submit them as (hopefully) a single commit.

@GFleishman closed this Mar 30, 2020

@maweigert (Member) commented:

Hi @GFleishman,

Thanks for all the work! Two things that I think are still worth pondering:

  1. The max_distance cutoff for the spatial data structure should ideally be chosen such that no potentially intersecting polygon is missed. Some options are
    • (Mean) max_distance = 2*(np.mean(dist)+np.std(dist)), which is what is used in the PR, but might miss some pairs
    • (Max) max_distance = 2*np.max(dist), preferable as it won't miss any pairs, but leading to an order of magnitude more pairs.

There likely won't be any difference between (Mean) and (Max) in practice (as your initial comparison showed), but I am still a bit wary of not having that guarantee.

  2. Memory: I tried prediction on a small crop (with ~250k candidates) of the test data above:
    • without kdtree: 15.4 GB, 97s
    • with kdtree (Mean): 15.5 GB, 98s
    • with kdtree (Max): 39.2 GB, 121s

So using (Max) will clearly take a hit on memory...

@GFleishman (Author) commented:

Hi @maweigert,

Right now I'm working on a different strategy from the software standpoint that will address both the extra time cost of using (Max) and the extra memory from having to store the larger lists of neighbors (potential suppression candidates). My first implementation in the PR was a good demonstration of the kdtree, but definitely suboptimal w.r.t. what can be done with the Python C-API, which I'm learning on the fly, so things have gone through various stages of kludge.

The idea now is that I can call the sklearn kdtree object, and its query_radius method, from within the C++ code itself - enabling a separate query for each polygon candidate just before the inner loop of NMS begins. That way, points which are "suppressed early" will never be queried against the tree, reducing total query time even if the distance threshold is large. The second benefit is that the code now only has to store one neighbor list at a time, thus reducing the total RAM used. This is definitely the best way to build it algorithmically, but I still don't have the Python C-API down perfectly, so implementing this today has been tricky.
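Here is Python pseudocode for the embedded-query strategy (the real version is meant to live in the C++ NMS loop; overlap(), threshold, dists, and dist_max are stand-ins for the actual overlap test and distance arrays):

```python
import numpy as np
from sklearn.neighbors import KDTree

tree = KDTree(points)  # candidates sorted by probability, descending
suppressed = np.zeros(len(points), dtype=bool)
kept = []
for i in range(len(points)):
    if suppressed[i]:
        continue  # already suppressed: never queried, so its query time is saved
    kept.append(i)
    # per-candidate search radius (the "max + local max" bound adopted later in the thread)
    radius_i = dist_max + dists[i].max()
    neighbors, = tree.query_radius(points[i:i + 1], r=radius_i)
    for j in neighbors:
        if j > i and not suppressed[j] and overlap(i, j) > threshold:
            suppressed[j] = True
```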

I also implemented what I mentioned before - applying the probability threshold for each tile separately during prediction. For the whole image, it cut max RAM down from ~400GB to ~50GB, and that was including the neighbors list from kdtree. So, if I can get the tile based threshold combined with the embedded python approach to the kdtree, I think I can get the whole zebrafish image segmented in like 10-15 minutes for something like 30GB RAM. Of course, the tile based prob threshold breaks a bunch of stuff, so I'll leave it to you guys to decide if you want to move in that direction generally or to just have my commit somewhere accessible on the repo so if others run into RAM issues there's some path to cutting it down without changing so many internal dependencies between functions.

This is all WIP but I'm hoping to finish it soon.

Thanks,
Greg

@maweigert (Member) commented:

Hi @GFleishman ,

sounds great!

> The idea now is that I can call the sklearn kdtree object, and its query_radius method, from within the C++ code itself

Yes, that seems to be the preferable way. Having external source code compiled and linked correctly via setup.py is probably the biggest hurdle (it was quite some hassle with qhull).

> I also implemented what I mentioned before - applying the probability threshold for each tile separately during prediction

Yes, I have a local branch that does exactly this. We haven't merged that feature yet, as long term we want to use the whole prob/stardist map for further shape refinement. But it might be worth reconsidering it for the pure stardist shapes (as it leads to huge memory reduction).
For this PR though, I would probably keep these two things (kdtree, prob thresholding) separate and first focus on the kdtree.

Great to see this rolling! :)

@GFleishman
Copy link
Author

Hi @maweigert ,

> Having external source code compiled and linked correctly via setup.py is probably the biggest hurdle (it was quite some hassle with qhull).

Unfortunately I don't think I can do it this way, since the sklearn kdtree is all scientific Python stack under the hood, i.e. not a standalone C library. AFAIK you can't compile and link it the way you would an external C library? My approach was to try to run the sklearn kdtree from within the C code using the Python C-API - something like what's going on here.

But it's been tricky to determine exactly what is needed, since this and the linked situation are different. Here we have a Python extension module (Stardist3d) which handles all the Python interfacing and then calls implementations from another C++ file (Stardist3d_impl), passing only C-style array pointers. In the linked code, they have an embedded Python interpreter which they interact with entirely within C.

Although I see now it kind of mixes up the Python interface vs. implementation separation, I was trying to do things like:

```cpp
#include <Python.h>
#include "numpy/arrayobject.h"
// ...
PyArrayObject* point = (PyArrayObject*) PyArray_SimpleNew(2, dims_point, NPY_FLOAT);
```

in Stardist3d_impl. It compiles with no warnings, but then the program just hangs on this line when executed - I guess it might eventually segfault, but I never give it long enough to find out; after hanging for a few minutes I just kill the process, since that line shouldn't take longer than a few milliseconds(?).

After trying multiple ways to initialize a numpy array in Stardist3d_impl, all showing the same failure mode, I'm trying to understand why this doesn't work, and I'm also considering other designs, like somehow having the kdtree object live in Stardist3d and just passing a callback function to stardist3d_impl, or something along those lines. Do you have any ideas here?

Yes, I have a local branch that does exactly this. We haven't merged that feature yet, as long term we want to use the whole prob/stardist map for further shape refinement. But it might be worth reconsidering it for the pure stardist shapes (as it leads to huge memory reduction).

Yeah, I figured that was the strongest pro for keeping the whole image of predictions. I think having a separate branch in the repo for RAM considerations is good though - then users can just switch and recompile to move into "maximum memory efficient" mode if that suits their application better. It could also eventually be part of the Config. So, I'll submit it on a separate branch/PR.

This has been fun to work on so far :)

@maweigert (Member) commented:

Hi @GFleishman,

> Do you have any ideas here?

Yes, as far as I can see, incorporating sklearn into the C extension itself will become tricky and outright messy. There is another reason why this might not be what we want: currently, the C++ implementation (stardist3d_impl.cpp) and the Python wrapper (stardist3d.cpp) are separated so that all the C++ code can be compiled into a clean native library that can be used e.g. from Java/Fiji, where linking against libpython is not desirable. There is indeed already a Makefile providing exactly that functionality. So I would opt for something C++ only.

But there might be good news! I just had a look into C++ kdtree implementations and there are some that look suitable for us. A good candidate might be nanoflann, which is BSD-licensed and even header-only (meaning including a single hpp would be enough). There is already a radius query example, which is exactly what we would need, and it works just fine for me. Even better, a quick benchmark I did suggests that query times are an order of magnitude faster than sklearn!

[Figure: nanoflann radius query benchmark]

So looks like a perfect match!

> then users can just switch and recompile to move into "maximum memory efficient" mode if that suits their application better.

Yep, that is a sensible strategy.

> This has been fun to work on so far :)

It is! :)

@GFleishman (Author) commented Mar 31, 2020

Hi @maweigert

Awww - I just got the sklearn integration to work a few hours ago! There is a bit of trickiness involved in using the numpy C-API from multiple compilation units (here it would be stardist3d.cpp and stardist3d_impl.cpp); the answer there helped sort it out and got things working.

I agree that keeping the Python module and the C++ implementation separate is best. For now, the integrated sklearn strategy can at least demonstrate what kind of gains we'll get by putting the kdtree properly in the NMS loop. I'm running some tests now and will report back. I did switch back to the "max" approach, but quite nicely I'm now using radius = dist_max + local_dist_max, where dist_max is the largest distance in the whole array of polygon candidate distances, and local_dist_max is the max over dists for just the current polygon candidate (polygon i within the outer NMS loop).
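In numpy terms (illustrative; dists is the full (N, R) array of candidate distances and i is the current candidate in the outer loop):

```python
radius_i = np.max(dists) + np.max(dists[i])  # dist_max + local_dist_max
```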

Great find with nanoflann! I'll have a look and see if I can get it to compile into stardist!

Oh - did you see this note on the nanoflann readme about distances?

> Important note: If L2 norms are used, notice that search radius and all passed and returned distances are actually squared distances.

Is it possible that you compared radius = 0.1**2 for nanoflann to radius=0.1 for sklearn? That would definitely cause a speedup (of course it could also just be really fast!)

@maweigert (Member) commented Mar 31, 2020

> I'm now using radius = dist_max + local_dist_max

Makes sense!

> Great find with nanoflann! I'll have a look and see if I can get it to compile into stardist!

Here's a short demo that shows how to perform a radius search (adapted from the nanoflann example):

https://gist.github.com/maweigert/651355861950c5c310421e434ba3ceae

Should be easy to adapt...

@maweigert (Member) commented Mar 31, 2020

> Is it possible that you compared radius = 0.1**2 for nanoflann to radius=0.1 for sklearn? That would definitely cause a speedup (of course it could also just be really fast!)

Yes, that tripped me up at the beginning, but the other way around: I was using radius^2 = 0.1 for nanoflann, so massively underestimating the speedup :)

@GFleishman (Author) commented:

Great, thanks for the demo!

In the meantime, I can confirm the expected speed up and RAM reduction for embedding the kdtree in the NMS loop. These are quick results from the same image I shared before:

The NMS time was the same between embedded-kdtree and external kdtree, ~20 minutes

However, the ~13 minutes required to query the tree for all candidate points is no longer present for the embedded-kdtree (since queries are done individually within the NMS outer loop).

When combined with prob_thresholding on individual tiles, the maximum RAM over prediction and NMS for the embedded-kdtree was 17.378GB.

@maweigert (Member) commented Apr 1, 2020

Hi,

Following on from this (great!) discussion, there are now two new branches, sparse and kdtree, which I would love to see your benchmarks on:

  1. I added the sparse prediction code that I mentioned earlier into a new branch sparse, which now provides an optional parameter
    model.predict_instances(... sparse=True)
    which will reduce the memory footprint considerably for large volumes.
    Let me know if that works for you!

  2. I did play around with the nanoflann code a bit and added it to another new branch kdtree. It uses a kdtree radius search with the max_distance following your earlier comment (max + local_max). There is now an option
    model.predict_instances(... use_kdtree=True)
    Here's a comparison for crops of your data with and without the kdtree (updated the graph):

[Figure: prediction time and memory for crops of the data, with and without the kdtree]

So while the memory footprint is the same, the kdtree implementation is already 6x faster for larger subvolumes. Pretty good!

Note that this branch kdtree is as of now only a short demo that shows how it could be done with nanoflann - I would wait for your PR to see how to piece it all together with your code.

@GFleishman (Author) commented:

Hi again!
Sorry for the long delay - when I got sklearn KDTree working with tile based thresholding locally I was able to step back and work on some of the datasets that I need results for. I will try out your branches soon, but I expect they're excellent and won't need any modifications from me :) But I will let you know.

@GFleishman (Author) commented:

Hello @maweigert!

I hope the last 3 months have been good for you and that you're staying safe and sane. I've just gotten back into structural segmentation after a long stretch of functional segmentation work. I've used the kdtree branch of the repo with nanoflann and it works very well. I haven't done any rigorous comparison, but the NMS time seems as fast as or faster than my implementation calling the sklearn methods via the Python C-API.

One weird quirk I've noticed, though: the NMS loop seems to run about 2x slower when I run model.predict_instances with a GPU available. The prediction itself is 2-3x faster on the GPU (definitely worthwhile), but the NMS loop is about 2x slower in the GPU case. I have confirmed this with two separate experiments on our gpu/cpu cluster, and both times I ensured that the loop was running with the same number of threads available. I'm working with our system administrators to see if there is some difference between our CPU-exclusive and GPU-connected nodes, but so far we haven't identified anything that could explain the slowdown - the two node types use the same OpenMP library, for example.

So, I'm wondering where the prob/dist maps live after prediction on the GPU. Are these for sure sent back from GPU memory to system memory before the NMS loop starts? It seems like they must be, since they are C++ arrays/pointers, but I don't know what I don't know, so I thought I would ask before ruling this possibility out and looking further into potential differences between our cluster nodes.

@maweigert (Member) commented:

Hi @GFleishman

> I've used the kdtree branch of the repo with nanoflann and it works very well. I haven't done any rigorous comparison, but the NMS time seems as fast as or faster than my implementation calling the sklearn methods via the Python C-API.

Good to hear! :)

> the NMS loop seems to run about 2x slower when I run model.predict_instances

This is strange and I don't see a reason why that would be the case.

> Are these for sure sent back from GPU memory to system memory before the NMS loop starts?

Yes, they are on the CPU (they are returned by keras as numpy arrays). Before the NMS, both prob and dist are additionally converted to contiguous C arrays, so they are definitely already in host memory.

How do you time the NMS loop?

@GFleishman (Author) commented:

Hi @maweigert :),

> Yes, they are on the CPU (they are returned by keras as numpy arrays). Before the NMS, both prob and dist are additionally converted to contiguous C arrays, so they are definitely already in host memory.

This is what I suspected. OK, I think it must be some difference between the GPU-connected nodes and the CPU-exclusive nodes on our cluster. They have different processor models, so maybe I should try compiling on the GPU-connected nodes? Previously I compiled on one of the CPU-exclusive nodes, so maybe the build is more optimized for that processor architecture?

> How do you time the NMS loop?

I'm running with verbose=True, so the end of the NMS loop writes out a total NMS time in seconds. I'm just using that.

@maweigert (Member) commented:

> They have different processor models, so maybe I should try compiling on the GPU-connected nodes?

Yes, that could be. I would suggest comparing CPU vs GPU speed on the very same GPU node, disabling tf-gpu via CUDA_VISIBLE_DEVICES= python ... in one of the two runs.

@uschmidt83 (Member) commented:

@GFleishman FYI, the block-wise prediction in the latest release is now also shown here: https://github.com/mpicbg-csbd/stardist/blob/dev/examples/other2D/predict_big_data.ipynb
While the example notebook is for a 2D example, it should also work for 3D images.
