Parallel Operation #6
The rendering phase is trivial to parallelize. There are a few ways to go about the labeling phase.
This design is a little more tricky than one might think. If you make a separate DisjointSet per thread, it incurs a large memory overhead unless we change the design. Experiments with std::unordered_map have not been promising, but it's likely I was using it wrong (several times slower, large memory usage).
I think I have a good method for doing parallel, but it's so fast already... 70 MVx/sec/core isn't bad. cf. #12
I've written a similar library in C# for voxel game physics (C# because I'm using Unity3d). I've given a bunch of thought to how to parallelize without sharing data between threads, and here are a few approaches I've come up with, in case you're interested:
Of course, rather than xz planes and yDir, you can do xy planes and zDir.
These are great approaches for 6-connected, which, despite being the simplest version, is also often the slowest because you can't exploit connections between the arms of the stencil like you can in 18- and 26-connected. I think a similar thing could be done with blocks instead of planes while retaining generality for the other connectivities. The thing is, though, so far cc3d has been fast enough that parallel hasn't been needed for me. What target time and volume size are you working with?

I find the biggest time suck in most of these 3D algorithms is the Z pass (assuming Fortran order), where the cache misses cause the CPU to spend most of its time waiting for main memory. Parallel probably helps with that.

One trick you may have overlooked, which I discovered by accident, is that the union-find data structure should use full path compression and no balancing by rank. This is because during the equivalence pass we kind of don't care how bulky the subtrees get so long as the union operation is fast. Then, when we apply find during the render phase, the subtrees get collapsed quickly by path compression and repeated evaluation becomes rapid. This division of labor in the union-find data structure is particular to connected components, so this trick is not generally applicable beyond this problem so far as I know.
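For concreteness, a minimal sketch of that arrangement (illustrative Python, not cc3d's actual C++ DisjointSet; the class and field names are made up):

```python
class UnionFind:
    """Union-find tuned for CCL: full path compression, no union by rank.

    During the equivalence pass we only care that union() is cheap; the
    subtrees get flattened later by the find() calls in the render pass.
    """

    def __init__(self, size):
        self.parent = list(range(size))

    def find(self, x):
        # Locate the root, then point every node on the path directly at it
        # (full path compression).
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # No balancing by rank: just attach one root under the other.
        self.parent[self.find(x)] = self.find(y)
```

Here union costs only two finds plus a pointer write, with no rank bookkeeping; the render-phase finds do all the flattening.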
I found that union-by-rank and union-by-size gave me no speedup, but I'm doing path compression each time I mark two labels as equivalent. Right now, the largest game world I'm using is 512x512x64 (LxWxH). Probably in contrast to brain imaging data, most of the action is nearer to the ground; closer to max height it's mostly empty voxels (sky), except for the tops of skyscrapers, etc. Your CCL algorithm (called as a DLL from C#) takes about 50 ms to complete for this size (mine, coded in C#, takes about 115 ms). Both are sufficiently quick when the level is loading (it takes longer to render the voxels, independently of this calculation, for one thing).
Currently, for the dynamic part, I have a second version of the CCL algorithm where, instead of iterating over all voxels in the world, I iterate only over the voxels in the list belonging to the label of the voxel just deleted (minus the deleted voxel itself).
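Something along these lines (a simplified illustration in Python rather than my actual C#; the restricted flood fill is just one way to do the re-labeling):

```python
from collections import deque

def neighbors6(v):
    """6-connected neighbors of a voxel coordinate."""
    x, y, z = v
    return [(x + 1, y, z), (x - 1, y, z), (x, y + 1, z),
            (x, y - 1, z), (x, y, z + 1), (x, y, z - 1)]

def relabel_after_delete(voxels_of_label, deleted, next_label):
    """Re-label only the voxels that shared a label with the deleted voxel.

    voxels_of_label: iterable of (x, y, z) tuples that had the old label
    deleted:         the (x, y, z) voxel just removed from the world
    next_label:      first unused label id
    Returns {voxel: new_label}.
    """
    remaining = set(voxels_of_label)
    remaining.discard(deleted)
    new_labels = {}
    for seed in remaining:
        if seed in new_labels:
            continue
        # BFS flood fill restricted to the old component's voxel list.
        new_labels[seed] = next_label
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            for n in neighbors6(v):
                if n in remaining and n not in new_labels:
                    new_labels[n] = next_label
                    queue.append(n)
        next_label += 1
    return new_labels
```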
This is a really interesting use case. For a single threaded application, a single linear scan of a uint32 512x512x64 volume takes about 26 msec. uint16 takes 18 msec and a uint8 volume takes 8 msec on my MBP, but those types are too small to hold the maximum possible number of labels in the image. This leads me to believe, like you indicated, that parallel is the only way to get 60 FPS using a generic algorithm. However, I suspect that you'll need to share data between threads because you must avoid a full single threaded scan of the volume even to render the final labels. In terms of improved single threaded algorithms, it's possible to do better than mine using Block Based Decision Trees by Grana et al., but that team wrote their algorithm for 2D and the (somewhat hairy) logic would have to be extended to 3D.
I was looking at the code again. I think it should be possible to simultaneously process blocks in a possibly lock-free manner by setting
I'm a bit jazzed about this parallel implementation concept, but I suspect it won't help your use case. What are the minimum requirements for your users? Just for cc3d, it looks like to hit 60 FPS using perfect parallelization with no algorithm overhead you'd need 3 fully occupied cores (at 60 FPS the frame budget is about 16.7 ms, and 50 ms / 16.7 ms ≈ 3). I'm gonna guess that with overhead you'll need 4 cores. The only way to make the problem tractable is to (a) use a GPU implementation or (b) shrink the target area (like you've done with your second implementation). If you can shrink the size of the world to be evaluated by a factor of 4, you can use a single threaded general implementation.
I don't have any users right now; it's a personal project so far =) I tried a compute shader in Unity3d, but I found it too slow transferring the data back from the GPU to the CPU. I'd much prefer a CPU solution anyways. There are two ways I'm computing the labels for the CCL algorithm (independent of static vs dynamic). The alternative is to have a user-defined yFloor so that the ground below it is not considered for labeling during CCL, which results in, for example, every office building in the level being its own component instead of all office buildings being connected to each other through the ground and road. When everything is separated, I can handle it easily. I should probably separate the game world into 8 pieces, one octree level I guess, and keep track of when voxels in components in one piece are connected to a component in another piece.
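To make the yFloor idea concrete, here's a rough sketch (my actual code is C#; this is just Python/numpy shorthand, and the axis layout is an assumption):

```python
import numpy as np
import cc3d

def label_above_floor(world, y_floor, connectivity=6):
    """Blank out the ground slab so buildings no longer connect through the
    ground/road, then label what remains. Assumes the last axis of `world`
    is height (the 64 in 512x512x64)."""
    masked = world.copy()
    masked[:, :, :y_floor] = 0  # everything at or below the floor becomes background
    return cc3d.connected_components(masked, connectivity=connectivity)
```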
I'm a bit time constrained at the moment, but I'm going to try to implement the parallel version of this. It might take a few days or weeks before you see anything useful though. I've gotten parallel requests from people running enormous multi-GB scientific datasets in the past, so I'm sure they'll appreciate it too. Who knows, maybe this will be my excuse for attempting 3D Block Based Decision Trees, which would probably increase single core performance by ~1/3 (though it's hard to extrapolate the 2D numbers to 3D since the mathematical topology and memory latencies are so different).
Dang. I was curious and did some reading and realized that I had been thinking (for some reason in contradiction to what I wrote in the README) that this library is in the same column as He's 2007 algorithm, but it really implements a 3D variation of Wu's 2005 algorithm. There's actually substantially more to be gained by improving the single core performance.
@matthew-walters I'm not sure if you're still experimenting, but the six connected algorithm has gotten significantly faster in recent releases. I'd be interested to hear how it's doing for you. There were two major improvements: (a) the use of the phantom label technique detailed in the README (worth up to 1.5x speed) and (b) an improvement to the relabeling phase which is worth up to 1.3x. That should hopefully reduce 50 msec to around 25 msec without the use of parallel and would give you 40 FPS updates.
Thanks for the update. |
Hi Will! I'm currently running cc3d on a large (~3.7 TB) volume. My approach is to process the volume in chunks of 1000 x 1000 x 1000 voxels and distribute the chunks across multiple cores. Each process roughly does this:
The hand-over period in steps 3-5 slows things down a little but it's not too bad. The chunking also means that I will have to do some post-processing to "knit" labels at the seams between the chunks. Some snags I encountered:
I'm sure you must also have processed large volumes in chunks at some point. I'm curious to hear what your strategy is.
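Just to make the chunked scheme above concrete, here is roughly how I picture it in Python; the file path, shape, and dtype are placeholders, and the seam "knitting" is left as the post-processing step described above:

```python
import numpy as np
import cc3d
from concurrent.futures import ProcessPoolExecutor

# Placeholder layout: the full volume stored on disk as one raw array that
# can be memory-mapped. Path, shape, and dtype are assumptions.
VOL_PATH, VOL_SHAPE, CHUNK = "volume.raw", (15000, 15000, 15000), 1000

def label_chunk(origin):
    """Worker: memory-map the volume and label one 1000^3 chunk with cc3d."""
    vol = np.memmap(VOL_PATH, dtype=np.uint8, mode="r", shape=VOL_SHAPE)
    x, y, z = origin
    img = np.ascontiguousarray(vol[x:x + CHUNK, y:y + CHUNK, z:z + CHUNK])
    labels = cc3d.connected_components(img, connectivity=26)
    # Persist `labels` per chunk here; the seams between chunks still have
    # to be knit together afterwards so labels agree across chunk faces.
    return origin, int(labels.max())

def all_origins():
    xs, ys, zs = (range(0, s, CHUNK) for s in VOL_SHAPE)
    return [(x, y, z) for x in xs for y in ys for z in zs]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        labels_per_chunk = dict(pool.map(label_chunk, all_origins()))
```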
Hi Philipp, I've thought about this a little and wrote my thoughts here: seung-lab/igneous#109 However, no one has yet asked me to actually do this, so I haven't implemented anything. I think Stephan Gerhardt may have taken a stab at it once, but I never heard how far he got. The path I'd take would be similar to this SciDB paper, with maybe my own spin on it: Oloso, A., Kuo, K.-S., Clune, T., Brown, P., Poliakov, A., Yu, H.: Implementing connected component labeling as a user defined operator for SciDB. In: 2016 IEEE International Conference on Big Data (Big Data), pp. 2948-2952. IEEE, Washington, DC, USA (2016). https://doi.org/10.1109/BigData.2016.7840945

With respect to (1), older versions of cc3d did have that feature before I had a method to be very confident about the minimum size necessary. Version 1.14.0 was the last one that supported it. I'll think about reintroducing it (and throwing an error if the provided out_dtype is predicted to be too small).

(2) By too expensive, do you mean in memory or time? If memory, I think the only problem there is incrementing everything except zero. I think you can use numpy
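For example, one way to do that increment in numpy (a sketch; the in-place `where=` form keeps the only temporary down to the boolean mask):

```python
import numpy as np

labels = np.array([0, 3, 0, 1, 2], dtype=np.uint32)
offset = np.uint32(100)  # keep the dtype so the in-place add doesn't upcast

# Shift every nonzero label by `offset`, leaving background (0) untouched.
# The where= mask means zeros keep their existing value in `out`.
np.add(labels, offset, out=labels, where=(labels != 0))
```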
Haha, glad I'm not the only mad person to think about that application. Another thought: not knowing a thing about how the C++ code works under the hood, but would it be possible for

Re 1) That'd be great, thanks!

Re 2) Time: I haven't even checked memory, to be honest. I am effectively using a mask and numpy
It's certainly something that I think will see more use cases once we have an easy way to do it.
This works to effectively reduce the memory usage of cc3d by removing the factor of the size of the input array and possibly the output array. However, the size of the union-find array remains. So the way to look at the strategy is that it would increase the size of arrays that could be processed on a single machine by a constant factor (approximately 2x). Removing the input array from memory can already be done by mmapping a single contiguous array.

In order to make this actually low memory, we'd need to estimate the number of provisional labels needed for the union-find array. This estimation could be accomplished by computing the estimated provisional labels in each chunk and summing them. This will result in a higher estimate than if full rows are used, as chunk boundaries will count towards the estimate.

I can kind of see the outlines of a chunk based strategy where you pass in image chunks in a predetermined order and the algorithm only needs to retain neighboring image rows. cc3d would then return not an output image, but the mapping you need to apply to the larger image. It's an interesting idea. It's not easy to parallelize though.

For your 3.7 TB image, even if cc3d ran at full speed (~120 MVx/sec typically for 26-connected connectomics images) it would take 8.4 hours to process just to get the mappings. The chunking will slow it down some more, so probably more like 12-24 hours (maybe much more if any IO step is not optimized).
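To sketch what I mean by the per-chunk estimate (just the idea, not cc3d's actual estimator): count the positions along the scan axis where a new run of foreground can begin, and sum over chunks. Runs cut by chunk edges get counted twice, hence the overestimate mentioned above.

```python
import numpy as np

def estimate_provisional_labels(chunk):
    """Upper bound on provisional labels in one chunk: a new label can only
    begin where a voxel is foreground and differs from the previous voxel
    along the scan axis (axis 0 here), or starts a new row."""
    foreground = chunk != 0
    starts = foreground.copy()
    starts[1:, ...] &= (chunk[1:, ...] != chunk[:-1, ...])
    return int(starts.sum())

# The per-chunk estimates simply sum to a bound for the whole volume.
chunks = [np.random.randint(0, 4, size=(64, 64, 64), dtype=np.uint8) for _ in range(8)]
total_estimate = sum(estimate_provisional_labels(c) for c in chunks)
```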
You're right that it would be faster because we'd condense two operations into one, but I'd prefer not to do it if I can avoid it because some of the other functions in this library (e.g. I think you may want to take another look at
This addresses Philipp's comments in #6. He wants to write the initial CCL volume to a uint32 to avoid needing to upscale the image for the next step.
* feat: bring back out_dtype

  This addresses Philipp's comments in #6. He wants to write the initial CCL volume to a uint32 to avoid needing to upscale the image for the next step.

* docs: show how to use out_dtype
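For reference, a usage sketch of the reinstated option, assuming the keyword is spelled `out_dtype` as in the commit messages above:

```python
import numpy as np
import cc3d

labels_in = np.zeros((512, 512, 512), dtype=np.uint8)

# Request a uint32 result directly so it doesn't have to be upcast before
# the next processing step. Per the discussion above, a dtype predicted to
# be too small for the number of labels raises an error instead.
labels_out = cc3d.connected_components(labels_in, out_dtype=np.uint32)
```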
Greetings. If I treat the entire volume as connected (every building eventually connects to the ground either directly or indirectly), it takes about 29 ms with either version. If I consider the "ground" plane not to be connected (such that each building, etc. ends up with its own label, since they're no longer connected to each other through the ground), it's slightly faster, around 26 ms with either version. I'm timing specifically the invocation of

Maybe it's the structure of my volume (representing a city rather than medical imaging data) that causes some bottleneck. Here's a screenshot for reference. For illustration, I've colored everything according to its label (but I'm reusing colors for multiple labels).
Hi Matthew, Thanks for following up! Your null result is a little surprising. Your data look similar enough in image statistics to the medical data that you should see a benefit. In my opinion, the main reason you wouldn't see a benefit would be that your data are smaller (16.7 MB) than the ones I typically work with (512x512x512, 32 or 64 bit, 0.5-1 GB), so you might be getting much better cache performance, which nullifies the benefits of the optimizations. In fact, the optimizations could be working against you at smaller sizes, since they perform additional (fast) passes on the image. You can try using the other version of cc3d::connected_components. No guarantee you'll see a significant benefit, but maybe worth a try.
Oh! One other thing you can try. Instead of generating a new
Okay, lastly: I do have a prototype of parallel operation working, but it is clunky code-wise and requires further tuning. It was slower than single threaded in my tests, but I haven't exhausted all options for improving performance.
This has gotten me down to about 20 ms on average! I've also taken the suggestion about pre-allocating the
Yay! I'm glad that helped. Here's one more thing you can try: you can adjust downwards
For large volumes, it would be nice to process them in parallel.