
Use thread_local var's in FCL distanceCallback() #2698

Merged · 15 commits into moveit:master · Jun 9, 2021

Conversation

@AndyZe (Member) commented May 30, 2021

While reviewing #2682 I noticed there are some variables declared inside the FCL distanceCallback. That's pretty bad, performance-wise, for a function that should be fast.

This simply makes these variables thread_local so they don't have to be allocated with every callback.

I see a 52% reduction in distanceCallback duration. The spreadsheet is attached, and you can see how it was tested in the commit history (std::chrono::high_resolution_clock). Mean duration drops from 8.75 µs to 4.23 µs, and worst-case duration drops from 43 µs to 21 µs. I can give PickNik folks guidance on how to reproduce the test setup.
print_of_distance_callback.ods

I think there's lots of opportunity to do this type of improvement in other places within the collision-checking code.
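For illustration, here is a minimal sketch of the pattern (hypothetical names, not the literal MoveIt code): a pair of strings that used to be constructed on every callback invocation becomes a thread_local object that is only assigned to, so its heap buffers are reused across calls.

#include <string>
#include <utility>

// Hypothetical stand-in for the FCL user data that carries an object ID.
struct ObjectData
{
  std::string id;
  const std::string& getID() const { return id; }
};

bool distanceCallbackSketch(const ObjectData* cd1, const ObjectData* cd2)
{
  // Constructed once per thread; later calls only assign the members,
  // reusing the strings' existing capacity instead of allocating.
  thread_local std::pair<std::string, std::string> object_pair;

  if (cd1->getID() < cd2->getID())
  {
    object_pair.first = cd1->getID();
    object_pair.second = cd2->getID();
  }
  else
  {
    object_pair.first = cd2->getID();
    object_pair.second = cd1->getID();
  }

  // ... use object_pair as a lookup key and run the narrow-phase distance check ...
  return false;  // false = keep checking the remaining object pairs
}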

@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch 2 times, most recently from ff6868c to a161d78 Compare May 31, 2021 03:16
@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch from 3cf2fe7 to 0857454 Compare May 31, 2021 03:39
@AndyZe AndyZe changed the title Use thread_local var's in FCL distanceCallback() WIP: Use thread_local var's in FCL distanceCallback() May 31, 2021
codecov bot commented May 31, 2021

Codecov Report

Merging #2698 (fdf3fc1) into master (b75d37e) will increase coverage by 0.02%.
The diff coverage is 71.43%.

❗ Current head fdf3fc1 differs from pull request most recent head 8085c7b. Consider uploading reports for the commit 8085c7b to get more accurate results.

@@            Coverage Diff             @@
##           master    #2698      +/-   ##
==========================================
+ Coverage   60.62%   60.64%   +0.02%     
==========================================
  Files         402      402              
  Lines       29615    29624       +9     
==========================================
+ Hits        17952    17962      +10     
+ Misses      11663    11662       -1     
Impacted Files Coverage Δ
...e/collision_detection_fcl/src/collision_common.cpp 77.98% <71.43%> (+0.06%) ⬆️
...e/src/parameterization/model_based_state_space.cpp 68.31% <0.00%> (-2.81%) ⬇️
moveit_core/robot_model/src/joint_model_group.cpp 63.06% <0.00%> (-1.94%) ⬇️
...ipulation/pick_place/src/manipulation_pipeline.cpp 71.85% <0.00%> (-1.94%) ⬇️
...on/pick_place/src/approach_and_translate_stage.cpp 73.24% <0.00%> (-0.70%) ⬇️
...dl_kinematics_plugin/src/kdl_kinematics_plugin.cpp 81.00% <0.00%> (-0.41%) ⬇️
...nning_scene_monitor/src/planning_scene_monitor.cpp 68.75% <0.00%> (+0.07%) ⬆️
...t_setup_assistant/src/tools/moveit_config_data.cpp 22.05% <0.00%> (+2.09%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b75d37e...8085c7b. Read the comment docs.

@AndyZe (Member, Author) commented May 31, 2021

@v4hn I really appreciate the comments and the level of detail you put into them!

I stopped using the bio_ik control pipeline to collect benchmarking data. It was too flaky -- too many moving pieces. Instead I dropped a std::chrono high-res clock right in the distanceCallback, printed the duration to screen, then copy/pasted to a spreadsheet. You can see the code for that in the commit history.

Results are good! Average duration drops from 8.74 µs to 4.22 µs!

I think one of the main culprits is

const std::pair<std::string, std::string> pc = cd1->getID() < cd2->getID() ?
                                                              std::make_pair(cd1->getID(), cd2->getID()) :  
                                                              std::make_pair(cd2->getID(), cd1->getID());

Allocating a pair of strings is probably expensive!

Spreadsheet attached and I'll update the PR description again.
print_of_distance_callback.ods

@AndyZe AndyZe changed the title WIP: Use thread_local var's in FCL distanceCallback() Use thread_local var's in FCL distanceCallback() May 31, 2021
@xqms (Contributor) left a comment

Hey, saw this one because you referred to my PR ;)

Similar to @v4hn I'd be very cautious regarding this optimization. I'm not sure you kept all the behavior intact, since it's very easy to miss something (see the problems I found below, and there might be more).

All in all, stack variables are very cheap and should nearly always be cheaper than accessing the heap or static variables, since the stack should always be in the cache. Allocating on the stack is also very cheap; it's just incrementing the stack pointer. So in my mind there should be little reason to turn primitive types such as bool or int into thread_local variables. (But who knows, computers are complicated beasts ^^)

For types with costly constructors there may be some argument, but I don't see those here. The std::pair gets completely overwritten on every call anyway, so there shouldn't be any benefit.

Maybe we can find out (e.g. through trial and error) which change exactly makes it faster in your setup?

I can recommend https://github.com/KDAB/hotspot (open source) or Intel VTune (paid, but free for academics and such) to profile the original function in more detail and see which operations are actually costly in this function.

@xqms (Contributor) commented Jun 1, 2021

Additionally, I just looked at your measurements, and in the "master" case the reported timings average 4.13 µs for the first half (the same as or below "this PR"), while in the second half they suddenly go up. That is somewhat suspicious -- maybe something else was running at the same time? Maybe your CPU was throttling? Maybe the collision scene changed?

(Screenshot attached: Screenshot_20210601_143347)

Just for reference again, here is the small benchmark I wrote for the other PR:
https://gist.github.com/xqms/0500ed88ac9836f6bbe0022298635c8e
Maybe it can be used to test this PR as well.

@AndyZe (Member, Author) commented Jun 1, 2021

Thanks @xqms. I get it -- it will be tough to convince you guys these changes are necessary, and maybe I went overboard. Will try to benchmark in more detail using one of those methods you guys have described.

Hopefully we all agree that it's not good practice to declare variables in functions that could be used in realtime control applications. Thus, something needs to be improved here.

@xqms (Contributor) commented Jun 1, 2021

Thanks @xqms. I get it -- it will be tough to convince you guys these changes are necessary, and maybe I went overboard.

No worries! It's an interesting topic. I hope I didn't go overboard with my review ;)

Hopefully we all agree that it's not good practice to declare variables in functions that could be used in realtime control applications.

I'm curious, why? Local variables on the stack have zero allocation time (the stack pointer is usually incremented once on function entry and this needs to be done anyway). So I'd see no problem with local variables in general. But maybe I'm missing something.

If there is something the constructor of the variable is allocating on the heap and there is the possibility to re-use that space in the next call, then I guess there is some potential for improvement. I see that only for the fcl::CollisionResult struct, which contains std::vector and std::set - both of which store data on the heap.
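To illustrate that point, here is a rough sketch (using a stand-in struct, not the real fcl::CollisionResult API) of how a thread_local result object lets the heap storage inside it be reused between calls, provided it is cleared first:

#include <set>
#include <vector>

// Hypothetical stand-in for a result type like fcl::CollisionResult whose
// members (vector/set) own heap storage.
struct ResultStub
{
  std::vector<double> nearest_points;
  std::set<int> colliding_ids;

  void clear()
  {
    nearest_points.clear();  // keeps the vector's capacity for the next call
    colliding_ids.clear();   // (set nodes are freed, so the win is mostly the vector)
  }
};

void narrowPhaseCheckSketch()
{
  // Constructed once per thread and reused; steady-state calls avoid
  // re-allocating the vector's buffer.
  thread_local ResultStub result;
  result.clear();  // stale data from the previous call must be wiped

  // ... fill `result` for the current pair of objects ...
}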

@simonschmeisser (Contributor) commented:

This function is huge! I think you will get more rewarding results by analyzing the whole operation and looking for things that could be handled more efficiently at a higher level. Besides `hotspot`, which I really recommend (I used it to figure out the thread-local cache of FCL collision objects further down that file), you can also try an instrumenting profiler like perfetto (https://perfetto.dev/) to get a time-resolved view of what's happening.

From my memory there are some points worth looking at

  • Allowed Collision Matrix (ACM) lookups were about 1/3 of the time for collision checks (not distance). This could be improved by using std::unordered_map instead of std::map, or even better by building up an ACM keyed on fcl::CollisionObject pointers instead of strings (see the sketch after this comment)
  • geometric transformations and AABB calculations. Maybe something can be cached here as well?
  • maybe even reusing the BVH (Bounding Volume Hierarchy) in FCL between requests somehow?

Specifically in this function, I also see that towards the end
https://github.com/ros-planning/moveit/pull/2698/files#diff-c0ca395481a0bd3a3708f3ef2bd50f6049da0f7c136d0a2bd131f187986f85dbR648
some data structure is expanded dynamically (std::set?).
Maybe you can at least partly reuse it (it being the CollisionData) and generously call reserve() in your code?

Also, you haven't stated your high-level goal. If you want "pre-impact" collision checking (a boolean check whether distance < threshold) in a real-time setup, maybe we need a special variant of that callback function that exits early? Or maybe you can already achieve that with the flags?
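For the ACM bullet above, here is a rough sketch of what a pointer-keyed lookup could look like (hypothetical types, not the real AllowedCollisionMatrix API): an std::unordered_map keyed on a canonically ordered pair of object pointers replaces both the string comparisons and the O(log n) tree walk of std::map.

#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical collision-object type; in MoveIt/FCL this would be the
// collision geometry or its user data.
struct CollisionObjectStub
{
};

// Hash for a pair of pointers (golden-ratio hash combine).
struct PointerPairHash
{
  std::size_t operator()(const std::pair<const CollisionObjectStub*, const CollisionObjectStub*>& p) const
  {
    const std::size_t h1 = std::hash<const CollisionObjectStub*>{}(p.first);
    const std::size_t h2 = std::hash<const CollisionObjectStub*>{}(p.second);
    return h1 ^ (h2 + 0x9e3779b97f4a7c15ULL + (h1 << 6) + (h1 >> 2));
  }
};

using PointerAcm =
    std::unordered_map<std::pair<const CollisionObjectStub*, const CollisionObjectStub*>, bool, PointerPairHash>;

// Canonicalize the ordering so (a, b) and (b, a) hit the same entry.
bool isCollisionAllowed(const PointerAcm& acm, const CollisionObjectStub* a, const CollisionObjectStub* b)
{
  const auto key = a < b ? std::make_pair(a, b) : std::make_pair(b, a);
  const auto it = acm.find(key);
  return it != acm.end() && it->second;
}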

@v4hn (Contributor) commented Jun 2, 2021

I hope I didn't go overboard with my review ;)

I'm grateful for your comments! There is no "going overboard" w.r.t. patches on runtime optimization. 👍

Hopefully we all agree that it's not good practice to declare variables in functions that could be used in realtime control applications.

I'm curious, why? Local variables on the stack have zero allocation time (the stack pointer is usually incremented once on function entry and this needs to be done anyway). So I'd see no problem with local variables in general. But maybe I'm missing something.

No, I agree. As I said above, individual values will often just end up in registers and not even on the stack...

If there is something the constructor of the variable is allocating on the heap and there is the possibility to re-use that space in the next call, then I guess there is some potential for improvement. I see that only for the fcl::CollisionResult struct, which contains std::vector and std::set - both of which store data on the heap.

That's not just a potential improvement. Avoiding allocations is probably what @AndyZe means anyway.
Turning any kind of standard container structure into a static/thread_local makes a lot of sense.

@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch from 88060ec to 5d665a1 Compare June 2, 2021 14:34
@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch from 5d665a1 to 4545f63 Compare June 2, 2021 14:40
@AndyZe (Member, Author) commented Jun 2, 2021

Well, I've now benchmarked this repeatedly with code that is very similar to what @xqms shared (attached). I took a quick look at hotspot but had some initial trouble compiling it on 18.04, and there is only a Debian package available for Ubuntu 20+.

So, benchmarking results for 10,000 iterations. This is distance checking for a panda arm that is in collision with a box.

master: 1.81 ms/iteration
This PR: 1.65 ms/iteration

The POD variables (double, bool) did not seem to make any difference, whether they were thread-local or not. So you guys were right in that regard.

BTW @xqms, you may be interested to know that --

req.type = collision_detection::DistanceRequestType::GLOBAL;

runs much faster than

req.type = collision_detection::DistanceRequestType::SINGLE;

I don't know why. Probably due to something in this area:

if (cdata->req->type != DistanceRequestType::GLOBAL)

It is counterintuitive that a GLOBAL check runs faster than a single check.

benchmark_collisions.cpp.txt
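For reference, the timing harness is roughly shaped like the sketch below; runDistanceCheck() here is a placeholder, not the real benchmark code, which would issue a distance query against the planning scene instead.

#include <chrono>
#include <cmath>
#include <iostream>

// Placeholder workload; in the real benchmark this would be one distance
// query against the planning scene (Panda arm vs. box).
void runDistanceCheck()
{
  volatile double d = 0.0;
  for (int i = 1; i <= 100; ++i)
    d = d + std::sqrt(static_cast<double>(i));  // dummy work so the compiler can't remove the loop
}

int main()
{
  constexpr int kIterations = 10000;

  const auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < kIterations; ++i)
    runDistanceCheck();
  const auto end = std::chrono::high_resolution_clock::now();

  const std::chrono::duration<double, std::milli> total = end - start;
  std::cout << "Mean: " << total.count() / kIterations << " ms/iteration\n";
  return 0;
}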

@AndyZe (Member, Author) commented Jun 2, 2021

Perhaps we can get this merged soon as a small, incremental improvement?

@simonschmeisser (Contributor) commented:

Well, I've now benchmarked this repeatedly with code that is very similar to what @xqms shared (attached). I took a quick look at hotspot but had some initial trouble compiling on 18.04 and there is only a debian available for Ubuntu 20

You can simply download the AppImage, chmod +x it, and execute it; you can find it among the assets here: https://github.com/KDAB/hotspot/releases

@AndyZe (Member, Author) commented Jun 3, 2021

Thanks for the tip @simonschmeisser! hotspot is a nice tool.

I started issue #2714 where we can really dive into the analysis. I suspect it's going to be a long discussion and I don't really want this PR to get dragged into the muck.

@xqms (Contributor) commented Jun 7, 2021

BTW @xqms, you may be interested to know that --

req.type = collision_detection::DistanceRequestType::GLOBAL;

runs much faster than

req.type = collision_detection::DistanceRequestType::SINGLE;

I don't know why. Prob. due to something in this area:

if (cdata->req->type != DistanceRequestType::GLOBAL)

It is counterintuitive that a GLOBAL check runs faster than a single check.

benchmark_collisions.cpp.txt

Interesting! I think one reason could be that FCL can apply more early exits in the GLOBAL case: since fcl::distance() is called with the minimum distance computed so far as the distance threshold, it can exit early whenever it is clear that the objects are farther apart than that.
In the SINGLE case, we have to compute pairwise distances for all objects (the distance threshold stays at its initial value), so it cannot exit early.
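A rough sketch of that reasoning with generic helpers (not the FCL API): in the GLOBAL case the best distance found so far can be handed to each subsequent pair as a cutoff, while a SINGLE-style query needs the exact distance for every pair and can never shrink the cutoff.

#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical narrow-phase distance between two 1-D "objects". A real
// implementation could stop refining as soon as it proves the distance
// exceeds `threshold`.
double pairDistance(double a, double b, double threshold)
{
  const double d = std::abs(a - b);
  return d > threshold ? threshold : d;  // early exit: the caller only cares about values below threshold
}

// GLOBAL-style query: only the scene-wide minimum matters, so the best value
// found so far is passed down as a cutoff and later pairs can bail out early.
double globalMinDistance(const std::vector<std::pair<double, double>>& pairs)
{
  double min_dist = std::numeric_limits<double>::max();
  for (const auto& [a, b] : pairs)
    min_dist = std::min(min_dist, pairDistance(a, b, min_dist));
  return min_dist;
}

// A SINGLE-style query would need the exact distance for every pair, so the
// cutoff never shrinks and no pair can be skipped.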

@tylerjw (Member) left a comment

Reading through this change I think it is a nice improvement. We have tested this quite a bit on a customer project and it seems to work as intended.

@tylerjw tylerjw requested a review from v4hn June 8, 2021 17:43
@tylerjw (Member) commented Jun 8, 2021

@v4hn I requested a review from you because you requested changes in your earlier review. I believe your comments have been addressed, and Andy now correctly clears the reused memory before using it.

@rhaschke (Contributor) left a comment

I didn't check the semantic code changes. Formally, there are still some stack-allocatable data types declared as thread_local.

@v4hn v4hn dismissed their stale review June 9, 2021 08:32

addressed

@v4hn (Contributor) commented Jun 9, 2021

I requested a review from you as you requested changes in your review previously. I believe your comments have been addressed and Andy is correctly cleaning up the memory that is reused before using it now.

Sorry, I did not spend more time on this yet and dismissed my stale review (you could have done that too @tylerjw).
Looks like Robert invested some time here though.

@AndyZe (Member, Author) commented Jun 9, 2021

@rhaschke Thanks! Runtime dropped a tiny bit after your changes. I think it's because clearing the fcl::DistanceResultd wasn't necessary anymore.

Runtime change: 0.113 ms/iteration --> 0.110 ms/iteration

@AndyZe (Member, Author) commented Jun 9, 2021

Let's take a minute to recap. For my test case, this PR drops runtime from 1.667 ms/iteration to 0.110 ms/iteration. That's a 93% drop. Wow! It makes me happy to think of all the saved CPU cycles. Thanks everybody.

@AndyZe AndyZe force-pushed the andyz/distance_collision_allocation branch from 2cd721b to 8848af0 Compare June 9, 2021 14:10
@rhaschke (Contributor) left a comment

Generally, I approve. See AndyZe#2 for some more improvements.
Please address the open issues, i.e., add a comment or a dist_result.clear().

@AndyZe AndyZe merged commit 1eb2af5 into moveit:master Jun 9, 2021
skolk added a commit to moveit/moveit.ros.org that referenced this pull request Jul 8, 2021
AndyZe pushed a commit to moveit/moveit.ros.org that referenced this pull request Jul 8, 2021
rhaschke added a commit to ubi-agni/moveit that referenced this pull request Aug 17, 2021
…cope error

This optimization was originally introduced in moveit#2698 to avoid string copying.
@captain-yoshi (Contributor) commented:

@AndyZe Do you remember your CollisionRequest setup? I guess it would look like the config below, but perhaps you bumped up the max_contacts field.

# collision_detection::CollisionRequest
distance: True              # Compute proximity distance (Slow)
cost: False                 # Compute collision cost
contacts: True              # Compute contacts
max_contacts: 1             # Maximum number of contacts
max_contacts_per_pair: 1    # Maximum number of contacts per pair of bodies
max_cost_sources: 1         # Defines how many of the top cost sources should be returned
verbose: False              # Report information about collision

This is distance checking for a panda arm that is in collision with a box.

How many boxes were in the PlanningScene?

I'm testing regression across MoveIt versions, specifically on collision checks, so this would be a good benchmark to validate the optimisation of this PR.

@AndyZe (Member, Author) commented Mar 25, 2022

I remember it was just one box in the scene. Not that you need to copy what I did. I don't remember the other parameters.


7 participants