
Use thread_local var's in FCL distanceCallback() #2698

Merged · 15 commits into moveit:master · Jun 9, 2021

Conversation

@AndyZe (Member) commented May 30, 2021

While reviewing #2682 I noticed there are some variables declared inside the FCL distanceCallback. That's pretty bad, performance-wise, for a function that should be fast.

This simply makes these variables thread_local so they don't have to be allocated with every callback.

I see a 52% reduction in distanceCallback duration. The spreadsheet is attached, and you can see how it was tested in the commit history (std::chrono::high_resolution_clock). Mean duration drops from 8.75 µs to 4.23 µs, and worst-case duration drops from 43 µs to 21 µs. I can give PickNik folks guidance on how to reproduce the test setup.
print_of_distance_callback.ods

I think there's lots of opportunity to do this type of improvement in other places within the collision-checking code.
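For illustration, here is a minimal sketch of the pattern (hypothetical names, not the literal MoveIt code): a pair of strings that used to be constructed on every callback invocation becomes a thread_local object that is only assigned to, so its heap buffers are reused across calls.

#include <string>
#include <utility>

// Hypothetical stand-in for the FCL user data that carries an object ID.
struct ObjectData
{
  std::string id;
  const std::string& getID() const { return id; }
};

bool distanceCallbackSketch(const ObjectData* cd1, const ObjectData* cd2)
{
  // Constructed once per thread; later calls only assign the members,
  // reusing the strings' existing capacity instead of allocating.
  thread_local std::pair<std::string, std::string> object_pair;

  if (cd1->getID() < cd2->getID())
  {
    object_pair.first = cd1->getID();
    object_pair.second = cd2->getID();
  }
  else
  {
    object_pair.first = cd2->getID();
    object_pair.second = cd1->getID();
  }

  // ... use object_pair as a lookup key and run the narrow-phase distance check ...
  return false;  // false = keep checking the remaining object pairs
}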

@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch 2 times, most recently from ff6868c to a161d78 Compare May 31, 2021 03:16
@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch from 3cf2fe7 to 0857454 Compare May 31, 2021 03:39
@AndyZe AndyZe changed the title Use thread_local var's in FCL distanceCallback() WIP: Use thread_local var's in FCL distanceCallback() May 31, 2021
codecov bot commented May 31, 2021

Codecov Report

Merging #2698 (fdf3fc1) into master (b75d37e) will increase coverage by 0.02%.
The diff coverage is 71.43%.

❗ Current head fdf3fc1 differs from pull request most recent head 8085c7b. Consider uploading reports for the commit 8085c7b to get more accurate results.

@@            Coverage Diff             @@
##           master    #2698      +/-   ##
==========================================
+ Coverage   60.62%   60.64%   +0.02%     
==========================================
  Files         402      402              
  Lines       29615    29624       +9     
==========================================
+ Hits        17952    17962      +10     
+ Misses      11663    11662       -1     
Impacted Files Coverage Δ
...e/collision_detection_fcl/src/collision_common.cpp 77.98% <71.43%> (+0.06%) ⬆️
...e/src/parameterization/model_based_state_space.cpp 68.31% <0.00%> (-2.81%) ⬇️
moveit_core/robot_model/src/joint_model_group.cpp 63.06% <0.00%> (-1.94%) ⬇️
...ipulation/pick_place/src/manipulation_pipeline.cpp 71.85% <0.00%> (-1.94%) ⬇️
...on/pick_place/src/approach_and_translate_stage.cpp 73.24% <0.00%> (-0.70%) ⬇️
...dl_kinematics_plugin/src/kdl_kinematics_plugin.cpp 81.00% <0.00%> (-0.41%) ⬇️
...nning_scene_monitor/src/planning_scene_monitor.cpp 68.75% <0.00%> (+0.07%) ⬆️
...t_setup_assistant/src/tools/moveit_config_data.cpp 22.05% <0.00%> (+2.09%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b75d37e...8085c7b. Read the comment docs.

@AndyZe (Member, Author) commented May 31, 2021

@v4hn I really appreciate the comments and the level of detail you put into them!

I stopped using the bio_ik control pipeline to collect benchmarking data. It was too flaky -- too many moving pieces. Instead I dropped a std::chrono high-res clock right in the distanceCallback, printed the duration to screen, then copy/pasted to a spreadsheet. You can see the code for that in the commit history.

Results are good! Average duration drops from 8.74 µs to 4.22 µs!

I think one of the main culprits is

const std::pair<std::string, std::string> pc = cd1->getID() < cd2->getID() ?
                                                              std::make_pair(cd1->getID(), cd2->getID()) :  
                                                              std::make_pair(cd2->getID(), cd1->getID());

Allocating a pair of strings is probably expensive!

Spreadsheet attached and I'll update the PR description again.
print_of_distance_callback.ods

@AndyZe AndyZe changed the title WIP: Use thread_local var's in FCL distanceCallback() Use thread_local var's in FCL distanceCallback() May 31, 2021
@xqms (Contributor) left a comment

Hey, saw this one because you referred to my PR ;)

Similar to @v4hn I'd be very cautious regarding this optimization. I'm not sure you kept all the behavior intact, since it's very easy to miss something (see the problems I found below, and there might be more).

All in all, stack variables are very cheap and should nearly always be cheaper than accessing the heap or static variables, since the stack should always be in the cache. Allocating on the stack is also very cheap; it's just incrementing the stack pointer. So in my mind there should be little reason to turn primitive types such as bool or int into thread_local variables. (But who knows, computers are complicated beasts ^^)

For types with costly constructors there may be some argument, but I don't see those here. The std::pair gets completely overwritten on every call anyway, so there shouldn't be any benefit.

Maybe we can find out (e.g. through trial and error) which change exactly makes it faster in your setup?

I can recommend https://github.com/KDAB/hotspot (open source) or Intel VTune (paid, but free for academics and such) to profile the original function in more detail and see which operations are actually costly in this function.

@xqms (Contributor) commented Jun 1, 2021

Additionally, I just looked at your measurements, and in the "master" case the reported timings average 4.13 µs for the first half (the same as or below "this PR"), while in the second half they suddenly go up. That is somewhat suspicious -- maybe something else was running at the same time? Maybe your CPU was throttling? Maybe the collision scene changed?

(Screenshot attached: Screenshot_20210601_143347)

Just for reference again, here is the small benchmark I wrote for the other PR:
https://gist.github.com/xqms/0500ed88ac9836f6bbe0022298635c8e
Maybe it can be used to test this PR as well.

@AndyZe (Member, Author) commented Jun 1, 2021

Thanks @xqms. I get it -- it will be tough to convince you guys these changes are necessary, and maybe I went overboard. Will try to benchmark in more detail using one of those methods you guys have described.

Hopefully we all agree that it's not good practice to declare variables in functions that could be used in realtime control applications. Thus, something needs to be improved here.

@xqms (Contributor) commented Jun 1, 2021

Thanks @xqms. I get it -- it will be tough to convince you guys these changes are necessary, and maybe I went overboard.

No worries! It's an interesting topic. I hope I didn't go overboard with my review ;)

Hopefully we all agree that it's not good practice to declare variables in functions that could be used in realtime control applications.

I'm curious, why? Local variables on the stack have zero allocation time (the stack pointer is usually incremented once on function entry and this needs to be done anyway). So I'd see no problem with local variables in general. But maybe I'm missing something.

If there is something the constructor of the variable is allocating on the heap and there is the possibility to re-use that space in the next call, then I guess there is some potential for improvement. I see that only for the fcl::CollisionResult struct, which contains std::vector and std::set - both of which store data on the heap.
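To illustrate that point, here is a rough sketch (using a stand-in struct, not the real fcl::CollisionResult API) of how a thread_local result object lets the heap storage inside it be reused between calls, provided it is cleared first:

#include <set>
#include <vector>

// Hypothetical stand-in for a result type like fcl::CollisionResult whose
// members (vector/set) own heap storage.
struct ResultStub
{
  std::vector<double> nearest_points;
  std::set<int> colliding_ids;

  void clear()
  {
    nearest_points.clear();  // keeps the vector's capacity for the next call
    colliding_ids.clear();   // (set nodes are freed, so the win is mostly the vector)
  }
};

void narrowPhaseCheckSketch()
{
  // Constructed once per thread and reused; steady-state calls avoid
  // re-allocating the vector's buffer.
  thread_local ResultStub result;
  result.clear();  // stale data from the previous call must be wiped

  // ... fill `result` for the current pair of objects ...
}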

@simonschmeisser (Contributor) commented:

This function is huge! I think you will get more rewarding results by analyzing the whole operation and looking for things that could be handled more efficiently at a higher level. Besides `hotspot`, which I really recommend (I used it to figure out the thread-local cache of FCL collision objects further down that file), you can also try an instrumenting profiler like perfetto (https://perfetto.dev/) to get a time-resolved view of what's happening.

From my memory there are some points worth looking at

  • Allowed Collision Matrix (ACM) lookups were about 1/3 of the time for collision checks (not distance). This could be improved by using std::unordered_map instead of std::map, or even better by building up an ACM keyed on fcl::CollisionObject pointers instead of strings (see the sketch after this comment)
  • geometric transformations and AABB calculations. Maybe something can be cached here as well?
  • maybe even reusing the BVH (Bounding Volume Hierarchy) in FCL between requests somehow?

Specifically in this function, I also see that towards the end
https://github.com/ros-planning/moveit/pull/2698/files#diff-c0ca395481a0bd3a3708f3ef2bd50f6049da0f7c136d0a2bd131f187986f85dbR648
some data structure is expanded dynamically (std::set?).
Maybe you can at least partly reuse it (it being the CollisionData) and generously call reserve() in your code?

Also, you haven't stated your high-level goal. If you want "pre-impact" collision checking (a boolean check whether distance < threshold) in a real-time setup, maybe we need a special variant of that callback function that exits early? Or maybe you can already achieve that with the flags?
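For the ACM bullet above, here is a rough sketch of what a pointer-keyed lookup could look like (hypothetical types, not the real AllowedCollisionMatrix API): an std::unordered_map keyed on a canonically ordered pair of object pointers replaces both the string comparisons and the O(log n) tree walk of std::map.

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical collision-object type; in MoveIt/FCL this would be the
// collision geometry or its user data.
struct CollisionObjectStub
{
};

// Hash for a pair of pointers (golden-ratio hash combine).
struct PointerPairHash
{
  std::size_t operator()(const std::pair<const CollisionObjectStub*, const CollisionObjectStub*>& p) const
  {
    const std::size_t h1 = std::hash<const CollisionObjectStub*>{}(p.first);
    const std::size_t h2 = std::hash<const CollisionObjectStub*>{}(p.second);
    return h1 ^ (h2 + 0x9e3779b97f4a7c15ULL + (h1 << 6) + (h1 >> 2));
  }
};

using PointerAcm =
    std::unordered_map<std::pair<const CollisionObjectStub*, const CollisionObjectStub*>, bool, PointerPairHash>;

// Canonicalize the ordering so (a, b) and (b, a) hit the same entry.
bool isCollisionAllowed(const PointerAcm& acm, const CollisionObjectStub* a, const CollisionObjectStub* b)
{
  const auto key = a < b ? std::make_pair(a, b) : std::make_pair(b, a);
  const auto it = acm.find(key);
  return it != acm.end() && it->second;
}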

@v4hn (Contributor) commented Jun 2, 2021

I hope I didn't go overboard with my review ;)

I'm grateful for your comments! There is no "going overboard" w.r.t. patches on runtime optimization. 👍

Hopefully we all agree that it's not good practice to declare variables in functions that could be used in realtime control applications.

I'm curious, why? Local variables on the stack have zero allocation time (the stack pointer is usually incremented once on function entry and this needs to be done anyway). So I'd see no problem with local variables in general. But maybe I'm missing something.

No, I agree. As I said above, individual values will often just end up in registers and not even on the stack...

If there is something the constructor of the variable is allocating on the heap and there is the possibility to re-use that space in the next call, then I guess there is some potential for improvement. I see that only for the fcl::CollisionResult struct, which contains std::vector and std::set - both of which store data on the heap.

That's not just a potential improvement. Avoiding allocations is probably what @AndyZe means anyway.
Turning any kind of standard container structure into a static/thread_local makes a lot of sense.

@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch from 88060ec to 5d665a1 Compare June 2, 2021 14:34
@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch from 5d665a1 to 4545f63 Compare June 2, 2021 14:40
@AndyZe (Member, Author) commented Jun 2, 2021

Well, I've now benchmarked this repeatedly with code that is very similar to what @xqms shared (attached). I took a quick look at hotspot but had some initial trouble compiling it on 18.04, and there is only a Debian package available for Ubuntu 20+.

So, benchmarking results for 10,000 iterations. This is distance checking for a panda arm that is in collision with a box.

master: 1.81 ms/iteration
This PR: 1.65 ms/iteration

The POD variables (double, bool) did not seem to make any difference, whether they were thread-local or not. So you guys were right in that regard.

BTW @xqms, you may be interested to know that --

req.type = collision_detection::DistanceRequestType::GLOBAL;

runs much faster than

req.type = collision_detection::DistanceRequestType::SINGLE;

I don't know why. Probably due to something in this area:

if (cdata->req->type != DistanceRequestType::GLOBAL)

It is counterintuitive that a GLOBAL check runs faster than a single check.

benchmark_collisions.cpp.txt
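For reference, the timing harness is roughly shaped like the sketch below; runDistanceCheck() here is a placeholder, not the real benchmark code, which would issue a distance query against the planning scene instead.

#include <chrono>
#include <cmath>
#include <iostream>

// Placeholder workload; in the real benchmark this would be one distance
// query against the planning scene (Panda arm vs. box).
void runDistanceCheck()
{
  volatile double d = 0.0;
  for (int i = 1; i <= 100; ++i)
    d = d + std::sqrt(static_cast<double>(i));  // dummy work so the compiler can't remove the loop
}

int main()
{
  constexpr int kIterations = 10000;

  const auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < kIterations; ++i)
    runDistanceCheck();
  const auto end = std::chrono::high_resolution_clock::now();

  const std::chrono::duration<double, std::milli> total = end - start;
  std::cout << "Mean: " << total.count() / kIterations << " ms/iteration\n";
  return 0;
}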

@AndyZe (Member, Author) commented Jun 2, 2021

Perhaps we can get this merged soon as a small, incremental improvement?

@simonschmeisser (Contributor) commented:

Well, I've now benchmarked this repeatedly with code that is very similar to what @xqms shared (attached). I took a quick look at hotspot but had some initial trouble compiling on 18.04 and there is only a debian available for Ubuntu 20

You can simply download the AppImage, chmod +x it, and execute it; you can find it among the assets here: https://github.com/KDAB/hotspot/releases

@AndyZe (Member, Author) commented Jun 3, 2021

Thanks for the tip @simonschmeisser! hotspot is a nice tool.

I started issue #2714 where we can really dive into the analysis. I suspect it's going to be a long discussion and I don't really want this PR to get dragged into the muck.

@xqms (Contributor) commented Jun 7, 2021

BTW @xqms, you may be interested to know that --

req.type = collision_detection::DistanceRequestType::GLOBAL;

runs much faster than

req.type = collision_detection::DistanceRequestType::SINGLE;

I don't know why. Prob. due to something in this area:

if (cdata->req->type != DistanceRequestType::GLOBAL)

It is counterintuitive that a GLOBAL check runs faster than a single check.

benchmark_collisions.cpp.txt

Interesting! I think one reason could be that FCL can apply more early exits in the GLOBAL case: since fcl::distance() is called with the minimum distance computed so far as the distance threshold, it can exit early whenever it is clear that the objects are farther apart than that.
In the SINGLE case, we have to compute pairwise distances for all objects (the distance threshold stays at its initial value), so it cannot exit early.
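A rough sketch of that reasoning with generic helpers (not the FCL API): in the GLOBAL case the best distance found so far can be handed to each subsequent pair as a cutoff, while a SINGLE-style query needs the exact distance for every pair and can never shrink the cutoff.

#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical narrow-phase distance between two 1-D "objects". A real
// implementation could stop refining as soon as it proves the distance
// exceeds `threshold`.
double pairDistance(double a, double b, double threshold)
{
  const double d = std::abs(a - b);
  return d > threshold ? threshold : d;  // early exit: the caller only cares about values below threshold
}

// GLOBAL-style query: only the scene-wide minimum matters, so the best value
// found so far is passed down as a cutoff and later pairs can bail out early.
double globalMinDistance(const std::vector<std::pair<double, double>>& pairs)
{
  double min_dist = std::numeric_limits<double>::max();
  for (const auto& [a, b] : pairs)
    min_dist = std::min(min_dist, pairDistance(a, b, min_dist));
  return min_dist;
}

// A SINGLE-style query would need the exact distance for every pair, so the
// cutoff never shrinks and no pair can be skipped.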

@tylerjw (Member) left a comment

Reading through this change I think it is a nice improvement. We have tested this quite a bit on a customer project and it seems to work as intended.

@tylerjw tylerjw requested a review from v4hn June 8, 2021 17:43
@tylerjw (Member) commented Jun 8, 2021

@v4hn I requested a review from you because you requested changes in your earlier review. I believe your comments have been addressed, and Andy now correctly clears the reused memory before using it.

@rhaschke (Contributor) left a comment

I didn't check the semantic code changes. Formally, there are still some stack-allocatable data types declared as thread_local.

@v4hn v4hn dismissed their stale review June 9, 2021 08:32

addressed

@v4hn (Contributor) commented Jun 9, 2021

I requested a review from you as you requested changes in your review previously. I believe your comments have been addressed and Andy is correctly cleaning up the memory that is reused before using it now.

Sorry, I did not spend more time on this yet and dismissed my stale review (you could have done that too @tylerjw).
Looks like Robert invested some time here though.

@AndyZe (Member, Author) commented Jun 9, 2021

@rhaschke Thanks! Runtime dropped a tiny bit after your changes. I think it's because clearing the fcl::DistanceResultd wasn't necessary anymore.

Runtime change: 0.113 ms/iteration --> 0.110 ms/iteration

@AndyZe (Member, Author) commented Jun 9, 2021

Let's take a minute to recap. For my test case, this PR drops runtime from 1.667 ms/iteration to 0.110 ms/iteration. That's a 93% drop. Wow! It makes me happy to think of all the saved CPU cycles. Thanks everybody.

@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch from 2cd721b to 8848af0 Compare June 9, 2021 14:10
@rhaschke (Contributor) left a comment

Generally, I approve. See AndyZe#2 for some more improvements.
Please address the open issues, i.e., add a comment or a dist_result.clear().

@AndyZe AndyZe merged commit 1eb2af5 into moveit:master Jun 9, 2021
skolk added a commit to moveit/moveit.ros.org that referenced this pull request Jul 8, 2021
AndyZe pushed a commit to moveit/moveit.ros.org that referenced this pull request Jul 8, 2021
rhaschke added a commit to ubi-agni/moveit that referenced this pull request Aug 17, 2021
…cope error

This optimization was originally introduced in moveit#2698 to avoid string copying.
@captain-yoshi (Contributor) commented:

@AndyZe Do you remember your CollisionRequest setup? I guess it would look like the config below, but perhaps you bumped up the max_contacts field.

# collision_detection::CollisionRequest
distance: True              # Compute proximity distance (Slow)
cost: False                 # Compute collision cost
contacts: True              # Compute contacts
max_contacts: 1             # Maximum number of contacts
max_contacts_per_pair: 1    # Maximum number of contacts per pair of bodies
max_cost_sources: 1         # Defines how many of the top cost sources should be returned
verbose: False              # Report information about collision

This is distance checking for a panda arm that is in collision with a box.

How many boxes were in the PlanningScene?

I'm testing regression across MoveIt versions, specifically on collision checks, so this would be a good benchmark to validate the optimisation of this PR.

@AndyZe (Member, Author) commented Mar 25, 2022

I remember it was just one box in the scene. Not that you need to copy what I did. I don't remember the other parameters.


7 participants