Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time wasted in _graph.emplace #3167

Closed
mangoschorle opened this issue Sep 2, 2022 · 24 comments · Fixed by #3201
Closed

time wasted in _graph.emplace #3167

mangoschorle opened this issue Sep 2, 2022 · 24 comments · Fixed by #3201
Assignees

Comments

@mangoschorle
Copy link

https://github.com/ros-planning/navigation2/blob/634d2e3d9b6bde5558c90fe6583d52b3ed66cf55/nav2_smac_planner/src/a_star.cpp#L122
can be improved.
emplace_back will always construct the value even if it is already in the graph. So if it is in the graph, it is created and immediately afterwards destroyed. (see https://en.cppreference.com/w/cpp/container/unordered_map/emplace)

We could spare unnecessary constructions doing sth like this

  auto iter = graph_.find(index);
  if (iter == graph_.end()) {
    return &(std::get<0>(graph_.emplace(std::piecewise_construct, std::forward_as_tuple(index),
                                        std::forward_as_tuple(index /*initializer list of node t or NodeT(index)*/)))
                 ->second);
  }
  return &(iter->second);

Another thing I don't understand is why AStar reserves _graph.reserve(100000);
Seems a bit random to me. Why not reserving max_iterations?

@SteveMacenski
Copy link
Member

Please submit a PR, sounds good to me!

Why not reserving max_iterations?

Frankly, the max_iterations didn't exist at the time that was written to optimize performance. But also max_iterations isn't the right thing because the graph will contain far more than just the number of iterations of the planner. The iterations of the planner are the number of visited nodes in the graph, but theoretically up to num_prims number of nodes are queued into the graph at each visitation.

@mangoschorle
Copy link
Author

Will do.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 6, 2022

I’d also make sure in a test that it actually is faster. It wouldn’t surprise me if the immediate destruction due to a pre-existing node was faster than the find implementation- even though either way the hash table’s bucket would need to be found. I suppose it depends on how find is implemented and how the compiler optimizes it. I wouldn’t necessarily assume this is faster so it should be an empirical decision based on benchmarking. But definitely a great point worth that investigation!

On the max iterations- something to mention is that when the graph is cleared, the memory block is still allocated. So if there’s a much larger graph size after the first plan, it’ll retain that capacity. This initial reserve is mostly just for the 1st planning instance to give it a reasonable starting size rather than incurring multiple copying while the container is resized. It will likely need to be resized anyway for longer planning requests, but at least with that default we're not resizing a ton of other times getting up to 100k nodes - since it just doubles each time the capacity is expanded starting from 0. That stops 17 resizings and copies as it grows to a reasonable size (100k)

@mangoschorle
Copy link
Author

On a side node before digging into this. Each of the nodes (hybrid, lattice, ..) has a reset() method. This isn't used currently or is it?

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 8, 2022

It is used in the unit tests - though I added that for practical needs in the planners which have apparently disappeared over time. Still useful for test purposes though. I think that was back before I was doing dynamically generated graphs and used that as a mechanism to not decallocate and reallocate the graph between runs, but I'm speculating at design decisions from about 2 years ago.

@mangoschorle
Copy link
Author

I did some analysis using gperftools. I had around 6% improvement. Quite interesting to see that the unordered_map business takes up 30%+ of the planning times.

Old version emplace
Screenshot from 2022-09-09 10-22-16

Find and emplace
Screenshot from 2022-09-09 09-41-23

@mangoschorle
Copy link
Author

mangoschorle commented Sep 9, 2022

Now this is super interesting. After seeing how many time was taken up by the unordered_map I went ahead and just for fun replaced it with the first google hit I could find that has some stars on github (https://github.com/martinus/robin-hood-hashing). I didn't look into licensing & so on ... Just for the sake of getting an idea whether one could improve here.

34 % runtime drop.

Time units planning:

std::unordered_map:
find&emplace 473.945 time units
emplace (current version) 504.593 time units

robin_hood::unordered_map:
find&emplace 313.878 time units
emplace (current version) 339.624 time units

Btw. I also tried to just replace the stl hash with the has used in robin_hood::unordered_map. This alone didn't do the trick.
std::unordered_map: + different hash
find&emplace 425.556 time units

Screenshot from 2022-09-09 11-09-14

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 9, 2022

I did some analysis using gperftools. I had around 6% improvement. Quite interesting to see that the unordered_map business takes up 30%+ of the planning times.

Indeed, this is why I designed the A* implementation the way I did to absolutely remove any lookups in the graph beyond insertion, by using the returned pointers and storing those in NodeBasic so that later access on node visitation was lookup-free.

That looks like an impressive speed up, definitely worth considering adding!

I didn't look into licensing & so on

I just did, this would be totally compatible to include the headers in this project. Though, the headers are kill its test coverage metrics :/ I found https://packages.debian.org/buster/robin-map-dev which are binaries for https://github.com/Tessil/robin-map which claims to implement the same robin hood hashing. Maybe worth a try? If that works, I'd be happy to submit the rosdep keys to the central index so we can move Smac over to it if it gets us a good boost! There's also this library: https://github.com/martinus/unordered_dense same guy, just newer. No binaries, but I figure "new" means "better".

@mangoschorle
Copy link
Author

mangoschorle commented Sep 12, 2022 via email

@SteveMacenski
Copy link
Member

Sounds good - can you open the PR?

Concerning the hash-map, I presonally would also opt for a better map such as round-robin.

I totally agree, this is useful. If you could throw me a branch or a draft PR with that, I can see how I can integrate it as cleanly as possible w.r.t. the dependencies and test coverage metrics.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 15, 2022

#3195 for the checking if existing before emplacing.

I wasn't sure why you did the piecewise_construct and forward_as_tuple stuff here, can you explain your thinking or if my solution is equally as good?

For the robin hood hashing - do you have a branch you can link here (or open a pull request?) for that? I'd like to get that in before ROSCon in about a month and update run-time metrics based on the results. I should have some time over the next few days to work on this if you can provide me your launching point.

@SteveMacenski SteveMacenski self-assigned this Sep 15, 2022
@mangoschorle
Copy link
Author

mangoschorle commented Sep 16, 2022

I just timed both solutions: mine 688031 units yours 689687. That's 0.25% difference. Given noise I would say they are just equally good.

Now for the round robin hashing it really is dead simple. I just copied the header robin_hood.h into my code base and replaced
typedef std::unordered_map<unsigned int, NodeT> Graph;

with

typedef robin_hood::unordered_map<unsigned int, NodeT> Graph;

other than that its all completely compatible.

I am really sorry to contribute like this and not just hand in a pull request. We do have the possibility to contribute to OS in the form of actual code, but it requires a non-trivial clearing process, which I am currently familiarizing with. We are looking into this at our end.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 16, 2022

Understood, thanks for as much as you've done, its been very helpful! Smac Planners are already pretty fast and I'm very happy now that they straight up blow the others out of the water with these changes 😉

Let me know if I can help / answer any questions on the open source contribution front. Obviously I'd love / appreciate any help you and your organization (whoever that may be 😉) can provide. In this project, we do allow all developers/companies to retain their own intellectual property of contributions (e.g. there is no signing away copyrights or any developer agreements) and everything is in Apache 2.0 licensing (unless there's a strong reason otherwise) so its available for any commercial work freely with virtually no restrictions. And I suppose Apache 2.0 allows people to still patent things, though me and my organization have no particular interest in that. Lots of good reasons and I've tried to make it as easy of a sell as possible for organizations to contribute relatively freely here.

I'll play around with the robin_hood stuff this afternoon

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 16, 2022

I included the header in the repo robin_hood.h and it fails to plan (just hangs) and the server eventually crashes using that header file and replacing the std with the robin_hood namespace in the graph type. Sure you didn't do anything else?

Branch: https://github.com/ros-planning/navigation2/tree/robin_hood

Edit: I found that it works (?) on smaller maps/paths, but when I use a larger confined map (willow_garage.yaml) it just hangs infinitely. That's not even a particularly large map, its a medium sized office building. Shorter goals work fine on that map. It seems like something is not quite right with robin_hood if we're getting different behaviors. With the same settings and std works fine. I tested on a map I've used for path planning development for some time, so these paths should be achievable.

@SteveMacenski
Copy link
Member

https://github.com/martinus/unordered_dense/blob/main/include/ankerl/unordered_dense.h

However the other one I pointed to seems to work fine if I include that instead and replace the line with

  typedef ankerl::unordered_dense::map<unsigned int, NodeT> Graph;

I don't see a significant difference in compute times -- but I only did some extremely quick tests, I'd need to do more to make a real call if that helped or not (the variance was high, but certainly didn't seem like a 30% drop consistently).

@mangoschorle
Copy link
Author

mangoschorle commented Sep 18, 2022

robin_hood works fine on this end. Our largest environments are 600m*700m and all unit tests succeed. My code base in general has diverged quite a bit form the nav2 version (due to some restriction and practices) but in the end, I am emplacing nodes just as you do.

We are compiling g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0.

You are absolutely right this is not good. But it sort of surprises me. In the end it was a naive pick on my side. As said I took the first thing with 1000+ stars, that's it.

EDIT: Nah, sorry I messed up. Should have read the manual.

Data is either stored in a flat array, or with node indirection. Access for unordered_flat_map is extremely fast due to no indirection, but references to elements are not stable. It also causes allocation spikes when the map resizes, and will need plenty of memory for large objects. Node based map has stable references & pointers (NOT iterators! Similar to std::unordered_map) and uses const Key in the pair. It is a bit slower due to indirection. The choice is yours; you can either use robin_hood::unordered_flat_map or robin_hood::unordered_node_map directly. If you use robin_hood::unordered_map It tries to choose the layout that seems appropriate for your data.

I assume on your side it chooses unordered_flat_map. Can you replace robin_hood::unordered_map with robin_hood::unordered_node_map? unordered_flat_map unsurprisingly causes the behavior you mention also on my side.

I found that it works (?) on smaller maps/paths

Yeah I guess that's before robin_hood starts rehashing

Concerning the hash-map, I presonally would also opt for a better map such as round-robin. Although I think that the benefits of it would only really kick in on non-trivial planning requests.

I am experiencing these 30% drops consistently in all of my trickier planning requests. I don't know if there is much gain before the first rehashing.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 19, 2022

Got it -- well I would be curious as to the changes you've made if its diverged that much if there are improvements there to the algorithm that can be made here as well! That's part of the beauty of open source 😄

unordered_node_map does indeed do it - thanks. I'll do some more appropriate benchmarking this afternoon, but I do now see a 15-35% reduction from 2 particular quick test cases averaged over 5 runs. This is worth investigating for real benchmark improvement estimates, but I think ultimately this will be merged even if the low-end estimates are true.

Thanks for bringing this up!

@SteveMacenski
Copy link
Member

On a set of 100 random plans over a 100x100m map, I see a 25% total decrease in planning time without any change in other metrics, so I think this is definitely both good to be merged and a shockly large total planning improvement for a trivial change!

@SteveMacenski
Copy link
Member

From my deeper experiments, I see between 10-15% on a trivial-ish sandbox environment as part of our benchmarking suite https://github.com/ros-planning/navigation2/tree/main/tools/planner_benchmarking. Far from the 30% reported, but also may very well be that high for longer, more complex paths. Either way 10-15% improvement for a change in data structure is well worth it on its own right (or even half of that!).

Only issue I'm trying to deal with now is how to exclude the robin hood header file from our testing and linting suite which is reporting a ton or errors and lack of coverage.

@SteveMacenski
Copy link
Member

Ended up just actually linting the robin hood file by hand -- waiting on CI to turn over successfully and will merge. Closing!

Thanks @mangoschorle for the great work and mentioning this, huge impact for such a small change 😄

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 22, 2022

I tested these changes on some extremely large maps and see the same 30% numbers you’re seeing! Just giving external confirmation from another application.

For modest environments (~4000m^2 on average) I see the 10-15% numbers. Either way, great easy improvement! Thanks! I’ll take 10% boosts any day. Another 6% would actually make it faster, on average, than NavFn across my last benchmark for both the feasible planners.

@mangoschorle
Copy link
Author

What are the required steps to get the metrics.py up and running on my individual map?
I am using the dockerfile provided and have built everything. but a simple python3 metrics.py won't do the trick.
Is there a startup instruction. Would be handy especially regarding the zigzag issue

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 22, 2022

I uploaded some instructions in the comment at the top of the file for setting up the config files so that things are setup the way the script expects for calling the planners. If you want to do a subset of those, you'll have to remove the planners your not interested in from the planners list. For maps, we also in the script specify a map in the variable map_path. If you want to use something else, replace that line.

For launching things, just run Nav2 as you normally would and then run this script in another terminal. Obviously you don't need gazebo/the other servers/etc but that's the lowest effort way. I have some launch files I use for just launching the planner server / map server / faking out TF frames / and the lifecycle manager to active everything, but I haven't found a good place to dump that yet for public consumption. It has some hardcoded content that are not fit for general use (maps, mostly). Also if I put a launch file outside of a package in tools/, I'm not sure how other people will launch / use. I suppose ros2 launch ./test_planner_benchmark_launch.py would work. Maybe I should clean that up and upload it. Either way, here you go. Just replace the map in the launch file with some other real-valid path, it'll get overrided later. #3174 is addressing the annoying need for this step in short order.

Demo planner launch file
# Copyright (c) 2022 Samsung Research America
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from launch import LaunchDescription
from launch.actions import IncludeLaunchDescription
from launch.launch_description_sources import PythonLaunchDescriptionSource
from launch_ros.actions import Node
from ament_index_python.packages import get_package_share_directory

def generate_launch_description():
    nav2_bringup_dir = get_package_share_directory('nav2_bringup')
    config = os.path.join(get_package_share_directory('nav2_bringup'), 'params', 'nav2_params.yaml')
    map_file = os.path.join('/home/steve/Documents/bags_n_maps', 'map.yaml') # dummy file for now, changes in script
    lifecycle_nodes = ['map_server', 'planner_server']

    return LaunchDescription([
        Node(
            package='nav2_map_server',
            executable='map_server',
            name='map_server',
            output='screen',
            parameters=[{'use_sim_time': True},
                        {'yaml_filename': map_file},
                        {'topic_name': "map"}]),

        Node(
            package='nav2_planner',
            executable='planner_server',
            name='planner_server',
            output='screen',
            parameters=[config]),

        Node(
            package = 'tf2_ros',
            executable = 'static_transform_publisher',
            output = 'screen',
            arguments = ["0", "0", "0", "0", "0", "0", "base_link", "map"]),

        Node(
            package = 'tf2_ros',
            executable = 'static_transform_publisher',
            output = 'screen',
            arguments = ["0", "0", "0", "0", "0", "0", "base_link", "odom"]),

        Node(
            package='nav2_lifecycle_manager',
            executable='lifecycle_manager',
            name='lifecycle_manager',
            output='screen',
            parameters=[{'use_sim_time': True},
                        {'autostart': True},
                        {'node_names': lifecycle_nodes}]),

        IncludeLaunchDescription(
            PythonLaunchDescriptionSource(
                os.path.join(nav2_bringup_dir, 'launch', 'rviz_launch.py')),
            launch_arguments={'namespace': '',
                              'use_namespace': 'False'}.items())

    ])

I'll say for testing for specific issues, I find writing a small script to hand-code a set of starting/goal poses you know are problematic or non-trivially long is the way to go over random generation. But that's super trivial using the BasicNavigator.getPlan() API I provide in the simple navigator package. For each goal / planner of interest run result = navigator.getPath(start, goal, planner, use_start=True)

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 22, 2022

On another topic, last night something popped in my head around dinner time, I remembered looking at the partial GDB result you showed above and glanced past the part where the 2D distance heuristic is taking up a bunch of time. Not sure why 2 weeks later my brain decided to bring that to the forefront of my mind, but we do use hypotf in that heuristic which is known to be heavier than just doing $\sqrt{xx+yy}$ by quite a bit if repeated in the ~tens of thousands of times. I'll test that later this week and probably get a few percentage points back.

Edit: #3217 didn't help as much as I would have thought, but definitely doesn't hurt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants