
Controller/planner server segfault on clear local/global costmap #2373

Closed · charlielito opened this issue May 28, 2021 · 21 comments
Labels: bug (Something isn't working), in progress

@charlielito (Contributor)

Bug report

  • Operating System:
    • Docker + Ubuntu 20.04, 10th gen i7
  • ROS2 Version:
    • Galactic binaries
  • Version or commit hash:
  • DDS implementation:
    • cyclone

Steps to reproduce issue

I am running Gazebo with a custom map and a custom robot. I am able to navigate, but while doing so the controller server sometimes dies at random when requested to clear the local costmap, and the planner server sometimes dies when requested to clear the global costmap. I have the same setup as in Foxy, with only small changes to nav2_params.yaml to match the new API, and I am using the same default BT file as on the Foxy branch, navigate_w_replanning_and_recovery.xml. On Foxy everything runs well and nothing ever dies.

nav2_params.yaml
amcl:
  ros__parameters:
    use_sim_time: True
    alpha1: 0.25
    alpha2: 0.2
    alpha3: 0.15
    alpha4: 0.2 
    alpha5: 0.2 # for omnidirectional robot_model_type only
    base_frame_id: "chassis"
    beam_skip_distance: 0.5
    beam_skip_error_threshold: 0.9
    beam_skip_threshold: 0.3
    do_beamskip: false
    global_frame_id: "map"
    lambda_short: 0.1
    laser_likelihood_max_dist: 2.0
    laser_max_range: 40.0
    laser_min_range: -1.0 # -1 laser’s reported minimum
    laser_model_type: "likelihood_field"
    max_beams: 60
    max_particles: 2000
    min_particles: 500
    odom_frame_id: "odom"
    pf_err: 0.05
    pf_z: 0.99
    recovery_alpha_fast: 0.0
    recovery_alpha_slow: 0.0
    resample_interval: 1
    robot_model_type: "differential"
    save_pose_rate: 0.5
    sigma_hit: 0.2
    tf_broadcast: true
    transform_tolerance: 1.0
    update_min_a: 0.2
    update_min_d: 0.25
    z_hit: 0.5
    z_max: 0.05
    z_rand: 0.5
    z_short: 0.05
    scan_topic: scan

amcl_map_client:
  ros__parameters:
    use_sim_time: True

amcl_rclcpp_node:
  ros__parameters:
    use_sim_time: True

bt_navigator:
  ros__parameters:
    use_sim_time: True
    global_frame: map
    robot_base_frame: chassis
    odom_topic: /odom
    bt_loop_duration: 10
    default_server_timeout: 20
    default_nav_to_pose_bt_xml: "navigate_w_replanning_and_recovery.xml"
    enable_groot_monitoring: false
    plugin_lib_names:
    - nav2_compute_path_to_pose_action_bt_node
    - nav2_compute_path_through_poses_action_bt_node
    - nav2_follow_path_action_bt_node
    - nav2_back_up_action_bt_node
    - nav2_spin_action_bt_node
    - nav2_wait_action_bt_node
    - nav2_clear_costmap_service_bt_node
    - nav2_is_stuck_condition_bt_node
    - nav2_goal_reached_condition_bt_node
    - nav2_goal_updated_condition_bt_node
    - nav2_initial_pose_received_condition_bt_node
    - nav2_reinitialize_global_localization_service_bt_node
    - nav2_rate_controller_bt_node
    - nav2_distance_controller_bt_node
    - nav2_speed_controller_bt_node
    - nav2_truncate_path_action_bt_node
    - nav2_goal_updater_node_bt_node
    - nav2_recovery_node_bt_node
    - nav2_pipeline_sequence_bt_node
    - nav2_round_robin_node_bt_node
    - nav2_transform_available_condition_bt_node
    - nav2_time_expired_condition_bt_node
    - nav2_distance_traveled_condition_bt_node
    - nav2_single_trigger_bt_node
    - nav2_is_battery_low_condition_bt_node
    - nav2_navigate_through_poses_action_bt_node
    - nav2_navigate_to_pose_action_bt_node
    - nav2_remove_passed_goals_action_bt_node
    - nav2_planner_selector_bt_node
    - nav2_controller_selector_bt_node
    - nav2_goal_checker_selector_bt_node

bt_navigator_rclcpp_node:
  ros__parameters:
    use_sim_time: True

controller_server:
  ros__parameters:
    use_sim_time: True
    controller_frequency: 20.0
    min_x_velocity_threshold: 0.001
    min_y_velocity_threshold: 0.5
    min_theta_velocity_threshold: 0.001
    progress_checker_plugin: "progress_checker"
    goal_checker_plugins: ["general_goal_checker"] # "precise_goal_checker"
    controller_plugins: ["FollowPath"]

    # Progress checker parameters
    progress_checker:
      plugin: "nav2_controller::SimpleProgressChecker"
      required_movement_radius: 0.5
      movement_time_allowance: 10.0
    # Goal checker parameters
    #precise_goal_checker:
    #  plugin: "nav2_controller::SimpleGoalChecker"
    #  xy_goal_tolerance: 0.25
    #  yaw_goal_tolerance: 0.25
    #  stateful: True
    general_goal_checker:
      stateful: True
      plugin: "nav2_controller::SimpleGoalChecker"
      xy_goal_tolerance: 0.25
      yaw_goal_tolerance: 0.25
    # DWB parameters
    FollowPath:
      plugin: "dwb_core::DWBLocalPlanner"
      debug_trajectory_details: True
      min_vel_x: 0.0
      min_vel_y: 0.0
      max_vel_x: 1.0
      max_vel_y: 0.0
      max_vel_theta: 1.0
      min_speed_xy: 0.0
      max_speed_xy: 1.0
      min_speed_theta: 0.0
      # Add high threshold velocity for turtlebot 3 issue.
      # https://github.com/ROBOTIS-GIT/turtlebot3_simulations/issues/75
      acc_lim_x: 2.5
      acc_lim_y: 0.0
      acc_lim_theta: 3.2
      decel_lim_x: -2.5
      decel_lim_y: 0.0
      decel_lim_theta: -3.2
      vx_samples: 20
      vy_samples: 5
      vtheta_samples: 20
      sim_time: 1.7
      linear_granularity: 0.05
      angular_granularity: 0.025
      transform_tolerance: 0.2
      xy_goal_tolerance: 0.25
      trans_stopped_velocity: 0.25
      short_circuit_trajectory_evaluation: True
      stateful: True
      critics: ["RotateToGoal", "Oscillation", "BaseObstacle", "GoalAlign", "PathAlign", "PathDist", "GoalDist"]
      BaseObstacle.scale: 0.2
      PathAlign.scale: 32.0
      PathAlign.forward_point_distance: 0.1
      GoalAlign.scale: 24.0
      GoalAlign.forward_point_distance: 0.1
      PathDist.scale: 32.0
      GoalDist.scale: 24.0
      RotateToGoal.scale: 32.0
      RotateToGoal.slowing_factor: 5.0
      RotateToGoal.lookahead_time: -1.0

controller_server_rclcpp_node:
  ros__parameters:
    use_sim_time: True

local_costmap:
  local_costmap:
    ros__parameters:
      update_frequency: 5.0
      publish_frequency: 2.0
      global_frame: odom
      robot_base_frame: chassis
      use_sim_time: True
      rolling_window: true
      width: 3
      height: 3
      resolution: 0.05
      # robot_radius: 0.5
      footprint: "[ [0.25, 0.2], [0.25, -0.2], [-0.25, -0.2], [-0.25, 0.2] ]"
      footprint_padding: 0.1
      plugins: ["voxel_layer", "inflation_layer"]
      # filters: ["keepout_filter"]
      keepout_filter:
        plugin: "nav2_costmap_2d::KeepoutFilter"
        enabled: True
        filter_info_topic: "/costmap_filter_info"
      inflation_layer:
        plugin: "nav2_costmap_2d::InflationLayer"
        cost_scaling_factor: 3.0
        inflation_radius: 0.4
      voxel_layer:
        plugin: "nav2_costmap_2d::VoxelLayer"
        enabled: True
        publish_voxel_map: True
        origin_z: 0.0
        z_resolution: 0.1
        z_voxels: 16
        max_obstacle_height: 4.0
        mark_threshold: 0
        observation_sources: "scan pointcloud"
        scan:
          topic: /scan
          max_obstacle_height: 4.0
          clearing: True
          marking: True
          data_type: "LaserScan"
        pointcloud:
          topic: /camera/depth/color/points
          max_obstacle_height: 4.0
          min_obstacle_height: 0.3
          obstacle_max_range: 10.0
          obstacle_min_range: 0.3
          raytrace_max_range: 10.0
          raytrace_min_range: 0.3
          clearing: True
          marking: True
          data_type: "PointCloud2"
      static_layer:
        map_subscribe_transient_local: True
      always_send_full_costmap: True
  local_costmap_client:
    ros__parameters:
      use_sim_time: True
  local_costmap_rclcpp_node:
    ros__parameters:
      use_sim_time: True

global_costmap:
  global_costmap:
    ros__parameters:
      update_frequency: 1.0
      publish_frequency: 1.0
      global_frame: map
      robot_base_frame: chassis
      use_sim_time: True
      # robot_radius: 0.5
      footprint: "[ [0.25, 0.2], [0.25, -0.2], [-0.25, -0.2], [-0.25, 0.2] ]"
      footprint_padding: 0.1
      resolution: 0.05
      track_unknown_space: true
      plugins: ["static_layer", "obstacle_layer", "voxel_layer", "inflation_layer", ]
      # filters: ["keepout_filter"]
      keepout_filter:
        plugin: "nav2_costmap_2d::KeepoutFilter"
        enabled: True
        filter_info_topic: "/costmap_filter_info"
      obstacle_layer:
        plugin: "nav2_costmap_2d::ObstacleLayer"
        enabled: True
        observation_sources: scan
        footprint_clearing_enabled: true
        scan:
          topic: /scan
          max_obstacle_height: 4.0
          clearing: True
          marking: True
          data_type: "LaserScan"
      voxel_layer:
        plugin: "nav2_costmap_2d::VoxelLayer"
        enabled: True
        footprint_clearing_enabled: true
        max_obstacle_height: 2.0
        publish_voxel_map: True
        origin_z: 0.0
        z_resolution: 0.1
        z_voxels: 16
        unknown_threshold: 15
        mark_threshold: 0
        observation_sources: pointcloud
        combination_method: 1
        pointcloud:
          topic: /camera/depth/color/points
          max_obstacle_height: 4.0
          min_obstacle_height: 0.3
          obstacle_max_range: 10.0
          obstacle_min_range: 0.3
          raytrace_max_range: 10.0
          raytrace_min_range: 0.3
          clearing: True
          marking: True
          data_type: "PointCloud2"
      static_layer:
        plugin: "nav2_costmap_2d::StaticLayer"
        map_subscribe_transient_local: True
      inflation_layer:
        plugin: "nav2_costmap_2d::InflationLayer"
        cost_scaling_factor: 3.0
        inflation_radius: 0.4
      always_send_full_costmap: True
  global_costmap_client:
    ros__parameters:
      use_sim_time: True
  global_costmap_rclcpp_node:
    ros__parameters:
      use_sim_time: True

map_server:
  ros__parameters:
    use_sim_time: True
    yaml_filename: "turtlebot3_world.yaml"

map_saver:
  ros__parameters:
    use_sim_time: True
    save_map_timeout: 5000
    free_thresh_default: 0.15
    occupied_thresh_default: 0.65
    map_subscribe_transient_local: true

planner_server:
  ros__parameters:
    expected_planner_frequency: 20.0
    use_sim_time: True
    planner_plugins: ["GridBased"]
    GridBased:
      plugin: "nav2_navfn_planner/NavfnPlanner"
      tolerance: 0.5
      use_astar: false
      allow_unknown: true

planner_server_rclcpp_node:
  ros__parameters:
    use_sim_time: True

recoveries_server:
  ros__parameters:
    costmap_topic: local_costmap/costmap_raw
    footprint_topic: local_costmap/published_footprint
    cycle_frequency: 10.0
    recovery_plugins: ["spin", "backup", "wait"]
    spin:
      plugin: "nav2_recoveries/Spin"
    backup:
      plugin: "nav2_recoveries/BackUp"
    wait:
      plugin: "nav2_recoveries/Wait"
    global_frame: odom
    robot_base_frame: chassis
    transform_timeout: 0.1
    use_sim_time: true
    simulate_ahead_time: 2.0
    max_rotational_vel: 1.0
    min_rotational_vel: 0.4
    rotational_acc_lim: 3.2

robot_state_publisher:
  ros__parameters:
    use_sim_time: True


waypoint_follower:
  ros__parameters:
    loop_rate: 20
    stop_on_failure: false
    waypoint_task_executor_plugin: "wait_at_waypoint"   
    wait_at_waypoint:
      plugin: "nav2_waypoint_follower::WaitAtWaypoint"
      enabled: True
      waypoint_pause_duration: 200

Expected behavior

No node should crash.

Actual behavior

Sometimes the controller server crashes, sometimes the planner server.

Additional information

Here are some backtraces from when the nodes die:

controller_server_segfault.log
planner_server_segfaul.log

@SteveMacenski (Member) commented May 28, 2021

I'm not sure; it would definitely be worth more investigation. Nothing has changed recently that I'm aware of, and the only thing that's really changed since the initial port from ROS1 was 13f518a#diff-15e7b290bebe57765609a46208aa6f20ba2b06ac10a3237f06259265a086cacc -- so that's where I'd start investigating. I don't have the resources to address this immediately, so I'd appreciate a PR once you have a good understanding of the problem. The traceback makes it pretty clear to me that it's related to the voxel layer or voxel grid.

What's your compute platform?

@SteveMacenski added the bug label May 28, 2021
@charlielito (Contributor, Author) commented May 28, 2021

My testing compute platform is a 10th gen i7 laptop.

@AlexeyMerzlyakov (Collaborator) commented Jun 15, 2021

In the nav2_params.yaml file presented above, the main changes are the PointCloud2 observation source for the voxel_layer and the custom robot parameters. From the backtraces, the problem is most likely related to the VoxelLayer plugin. So I added a pointcloud source to the voxel_layer observation sources, enabled this layer for both the local and global costmaps, and then tried to reproduce the problem with:

  1. The configuration given above, using the OSRF Galactic Docker container while running the standard TB3 waffle Gazebo simulation.
  2. The latest ROS2 Rolling and nav2 stack versions (both built from source) while running the standard and AWS small warehouse TB3 waffle Gazebo simulations.

In parallel, in an infinite loop, I sent ClearEntireCostmap service calls to the local and global costmaps. Unfortunately, after 15-20 minutes of continuous calls in each configuration, the crash never appeared.
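For reference, the stress loop was essentially the following (a minimal sketch only; the service names assume nav2's default costmap naming and may differ in a custom bringup):

#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "nav2_msgs/srv/clear_entire_costmap.hpp"

// Minimal stress test: repeatedly ask both costmaps to clear themselves.
int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("costmap_clear_stress");

  using ClearSrv = nav2_msgs::srv::ClearEntireCostmap;
  auto local_client =
    node->create_client<ClearSrv>("/local_costmap/clear_entirely_local_costmap");
  auto global_client =
    node->create_client<ClearSrv>("/global_costmap/clear_entirely_global_costmap");
  local_client->wait_for_service();
  global_client->wait_for_service();

  while (rclcpp::ok()) {
    // Send a clear request to each costmap and wait for the reply before looping again.
    auto local_future = local_client->async_send_request(std::make_shared<ClearSrv::Request>());
    rclcpp::spin_until_future_complete(node, local_future);
    auto global_future = global_client->async_send_request(std::make_shared<ClearSrv::Request>());
    rclcpp::spin_until_future_complete(node, global_future);
  }

  rclcpp::shutdown();
  return 0;
}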

@charlielito, have you ever observed the crash in any Gazebo/TB3 simulation?
Since the voxel grid is populated from a pointcloud coming from a specific source, it would be helpful if you could share some details of your Gazebo simulation or record a small reproduction ROS2 bag with which we could obtain the same pointcloud and reproduce the crash (more or less reliably).

@charlielito (Contributor, Author)

Hey, thanks for taking the time to look into this! Indeed, the crash happens in our Gazebo environment with a custom robot, and it also happened on our real hardware/robot. I'll try the TB3 waffle demo to see if it crashes there as well, and I'll come back with a ros2 bag for you.

@AlexeyMerzlyakov (Collaborator)

Yes, it would be really appreciated if you could provide some context that lets us catch the problem.

@SteveMacenski (Member) commented Jun 18, 2021

In thread 14, it looks like the usual update map cycle: updateMap() -> updateBounds() -> raytraceFreespace() -> clearVoxelLineInMap() -> raytraceLine() -> bresenham3D() -> clearing. Same with the planner seg fault.

Are you running SLAM at the same time? That shouldn't make anything break, just looking for clues about why it is that some data location wouldn't be available. None of this has changed since ROS1.

Do you see this happen with the 2D obstacle layer at all (if you were willing to try that for us)? The voxel layer uses a ton of the same mechanics as the obstacle layer, so if you saw it there, that would help narrow it down. Are you going "off the map" at all? If you set the max range of your local costmap sensor to below the costmap size, do you still see it? What sensor are you using when you get this error?

@charlielito (Contributor, Author) commented Jun 21, 2021

@AlexeyMerzlyakov I am encountering this error in Gazebo, but with a custom robot. In the end I didn't try the TB3 waffle, because the problem seems to appear when using a pointcloud sensor. I tested my environment without pointcloud sources and never saw any crashes; only when I add a pointcloud source to the VoxelLayer do the problems begin.

I also tried your strategy of calling the clearing services on both the local and global costmaps, but couldn't make it crash 😞. I am only seeing this crash while navigating to a goal. Here is a zipped rosbag with all topics; in that case the robot got stuck and I was setting a new goal when the planner server died.

https://drive.google.com/file/d/1j54RIVWt83MbNRpKmendjb7bOVCf7LzN/view?usp=sharing

@SteveMacenski I am not running SLAM; I am running the stack with AMCL and a static map. I also tried using only the 2D obstacle layer with a pointcloud source, and it didn't crash at all. The sensor that generates the pointcloud is a Gazebo stereo camera.

@SteveMacenski (Member)

Huh, did you build a different PCL version or something? I don't think there's any action we can take then. Should we close this ticket?

@charlielito (Contributor, Author)

What? I just updated my comment. By PCL I meant pointcloud data, nothing to do with the PCL library.

@SteveMacenski (Member)

I wonder if this is related to nans or infs or something similar.

@AlexeyMerzlyakov (Collaborator)

@charlielito, thank you very much for the carefully prepared dataset. It really helped: I've reproduced the same issue in the planner_server executable. Debugging shows that an incorrect offset parameter is being passed inside the bresenham3D() inline routine, while outside bresenham3D() (in the raytraceLine() routine) it is set correctly. A simple calculation from the full backtrace you attached (planner_server_segfaul.log) confirms the situation: offset is equal to 4294966748, but from x0, x1, y0, y1, min_len and offset_dy it should be 11278.
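To illustrate what that value implies: 4294966748 is exactly 2^32 - 548, i.e. what a small negative index looks like once stored in a 32-bit unsigned int. The snippet below is purely illustrative arithmetic (it is not code from the stack), consistent with the offset having underflowed below zero:

#include <cstdint>
#include <iostream>

int main()
{
  // 4294966748 == 2^32 - 548, i.e. the bit pattern of -548 reinterpreted
  // as a 32-bit unsigned value.
  int32_t signed_offset = -548;
  uint32_t wrapped = static_cast<uint32_t>(signed_offset);
  std::cout << wrapped << std::endl;  // prints 4294966748

  // Indexing data_[wrapped] lands far outside an array of size_x_ * size_y_
  // elements, which matches the observed segfault.
  return 0;
}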

This might be stack corruption, a GCC bug, or something else I can't yet explain (e.g. a C++ subtlety when passing by reference into an inline function). Anyway, the following patch appears to cure the problem for me:

diff --git a/nav2_voxel_grid/include/nav2_voxel_grid/voxel_grid.hpp b/nav2_voxel_grid/include/nav2_voxel_grid/voxel_grid.hpp
index de3e3071..42efda7b 100644
--- a/nav2_voxel_grid/include/nav2_voxel_grid/voxel_grid.hpp
+++ b/nav2_voxel_grid/include/nav2_voxel_grid/voxel_grid.hpp
@@ -300,7 +318,7 @@ private:
   inline void bresenham3D(
     ActionType at, OffA off_a, OffB off_b, OffC off_c,
     unsigned int abs_da, unsigned int abs_db, unsigned int abs_dc,
-    int error_b, int error_c, int offset_a, int offset_b, int offset_c, unsigned int & offset,
+    int error_b, int error_c, int offset_a, int offset_b, int offset_c, unsigned int offset,
     unsigned int & z_mask, unsigned int max_length = UINT_MAX)
   {
     unsigned int end = std::min(max_length, abs_da);

After a one-week break (I'll be on vacation) I will return to this problem and try to work out a more detailed explanation of what is happening here.

@SteveMacenski (Member)

That was a brand new feature added recently to support minimum/maximum raytracing and marking ranges, so I don't think it's out of the question that it's a legit bug. I'd make sure to check unsigned versus signed ints; if you don't explicitly cast between them, I've seen that be a problem before. Given that the signature of that function is all ints and only that one is an unsigned int, I'm immediately suspicious.

@AlexeyMerzlyakov (Collaborator) commented Jul 10, 2021

It looks like I've finally found what is happening in bresenham3D(). My initial workaround of making offset a non-reference type was wrong: it breaks the bresenham3D() logic. The reference for offset in the bresenham3D() signature is intentional, because the offset value changes dynamically as the ray is traced. The key is the OffA off_a, OffB off_b, OffC off_c objects, which are of GridOffset type and hold a reference to that same offset; calling operator()(offset_val) on a GridOffset object shifts the offset by the given value.
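The helper functor looks roughly like this (a paraphrase for illustration, not an exact copy of the header), which is why dropping the reference in the bresenham3D() signature silently decouples the loop's offset from the one the functors keep mutating:

// Paraphrased sketch of the offset functor used by bresenham3D().
class GridOffset
{
public:
  explicit GridOffset(unsigned int & offset)
  : offset_(offset) {}

  // Each Bresenham step calls this to advance the shared offset by one
  // row/column/plane stride.
  inline void operator()(int offset_val) {offset_ += offset_val;}

private:
  unsigned int & offset_;  // refers to the same offset bresenham3D() receives by reference
};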

More debugging showed that we are running outside the VoxelGrid data_ array (whose size is size_x_ * size_y_). The array boundary can be exceeded in the positive direction, or in the negative direction by falling below 0: as an unsigned int, -1 is treated as a value of roughly four billion. The direction depends on the sign of the step (dx/dy/dz), and both cases were observed while debugging.

So, regarding the root of the problem, it seems related to this code in the raytraceLine() routine:

    double min_x0 = x0 + dx / dist * min_length;
    double min_y0 = y0 + dy / dist * min_length;
    double min_z0 = z0 + dz / dist * min_length;

When the minimum x0/y0 are set not at the VoxelGrid array boundaries but somewhere in the middle of the grid, we eventually run past its boundary at the end of the ray, as shown in the picture below (sorry, I am not a good artist, but I hope you get the situation 😄):
(attached image P10710-141623: hand-drawn sketch of the situation)

The bug was introduced with 13f518a, as @SteveMacenski initially noted, which is why it was not observed on Foxy. This needs more technical analysis to determine how it should be fixed correctly. One option is to add boundary checks to bresenham3D() when changing the offset (a rough sketch of this idea follows below); another is to fix the array boundary iterators based on the min_length value. I'll try to figure out the right solution in a few days.
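For the first option, a guard around the offset update could look something like this. This is only an illustration of the idea with placeholder names (grid_size standing for size_x_ * size_y_), not the fix that was eventually merged:

// Illustrative only: bounds-checked offset update for a single Bresenham step,
// where delta is the signed stride the step wants to apply.
inline bool applyOffsetChecked(unsigned int & offset, int delta, unsigned int grid_size)
{
  if (delta < 0 && offset < static_cast<unsigned int>(-delta)) {
    return false;  // would underflow below cell 0
  }
  const unsigned int next = offset + static_cast<unsigned int>(delta);
  if (next >= grid_size) {
    return false;  // would run past the end of data_
  }
  offset = next;
  return true;
}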

@AlexeyMerzlyakov (Collaborator) commented Jul 19, 2021

Update: the out-of-bounds array accesses were caused by incorrectly calculated min_x0/min_y0/min_z0 values, which in turn come from precision loss in double->int->double conversions. In some cases this led to an incorrectly calculated initial offset that then ran out of range. The fix is to change the dx/dy/dz values from int to double, which forces all the other dependent parameters to be calculated correctly.
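A tiny illustration of the mechanism (made-up numbers, not the actual nav2 code): truncating the per-axis delta to int changes the unit direction slightly, which can put the computed minimum start point into a different cell than the one the rest of the raytrace bookkeeping assumes:

#include <cmath>
#include <iostream>

int main()
{
  // Hypothetical, nearly axis-aligned ray in grid coordinates.
  double x0 = 0.0, y0 = 0.0, x1 = 50.9, y1 = 0.9;
  unsigned int min_length = 40;

  // Deltas kept as double (the fix) vs. truncated to int (the old behavior).
  double dx_d = x1 - x0, dy_d = y1 - y0;                                     // 50.9, 0.9
  int dx_i = static_cast<int>(x1 - x0), dy_i = static_cast<int>(y1 - y0);    // 50, 0

  double dist_d = std::hypot(dx_d, dy_d);
  double dist_i = std::hypot(static_cast<double>(dx_i), static_cast<double>(dy_i));

  double min_x0_d = x0 + dx_d / dist_d * min_length;  // ~39.99 -> cell 39
  double min_x0_i = x0 + dx_i / dist_i * min_length;  // 40.0  -> cell 40

  // The two variants start the ray in different cells, so the initial offset
  // and the remaining step count can disagree.
  std::cout << static_cast<int>(min_x0_d) << " vs " << static_cast<int>(min_x0_i) << std::endl;
  return 0;
}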

Additionally, the shifts for min_length were already compensated, so that was not a cause of this problem. However, during the code analysis I found that the per-component shifts were all compensated by the absolute min_length value instead of by its OX, OY, OZ components. This does not cause the original segfault, but it is a source of wrong behavior that produces smaller-than-necessary raytraced areas. I'm correcting it within the same PR.

@SteveMacenski (Member)

@charlielito Can you verify this goes away using #2460?

@AlexeyMerzlyakov can you verify that you tested this went away and is working properly?

@AlexeyMerzlyakov (Collaborator) commented Jul 20, 2021

> @AlexeyMerzlyakov can you verify that you tested this went away and is working properly?

I think I need to add some regression test cases here for the cases fixed in this issue.

@charlielito (Contributor, Author) commented Jul 20, 2021

I applied the changes proposed in #2460 to my Galactic branch and tested it. It seems this solved the planner server segfault, since I ran the stack several times and never encountered that segfault again. Nonetheless, the controller server still crashes.

Here is the traceback of the controller server crashing:
traceback.log

I also recorded another rosbag when the controller crashed:
https://drive.google.com/file/d/1Xjj5YTf9GGkWISiIFbqcyO6ZDHFU0dyJ/view?usp=sharing

Moreover, I also tested this with TB3 in simulation, and it also happens there after adding the realsense pointcloud source to the yaml file, so you should be able to reproduce it more easily on your end. Tested with this yaml file (it has a .txt extension because GitHub does not support .yaml attachments):
nav2_params.yaml.txt

To reproduce the error, send goals near the walls.

@AlexeyMerzlyakov (Collaborator)

According to the traceback, the cause of the crash is still an out-of-bounds access to the voxel data array in the bresenham3D() routine. From the calculations in the given traceback, all offsets and lengths should be set correctly. However, since the problem still appears, either I've missed something or we are dealing with another hidden problem. This needs more analysis.

@SteveMacenski (Member) commented Aug 24, 2021

@charlielito, Alexey has come up with a new solution, can you test that out and verify if it fixed the underlying issue? #2460

@charlielito (Contributor, Author)

Just tested #2460 locally and it seems to be working; it hasn't crashed at all. Thanks!

@SteveMacenski (Member)

Sweet! Anything further you'd want to see before we merge / close the ticket?
