Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MPPI crashing on loading plug-ings #3767

Closed
AmmarAlbakri opened this issue Aug 21, 2023 · 47 comments
Closed

MPPI crashing on loading plug-ings #3767

AmmarAlbakri opened this issue Aug 21, 2023 · 47 comments

Comments

@AmmarAlbakri
Copy link

Bug report

  • Operating System:
    • Ubuntu 22.04
  • ROS2 Version:
  • Humble Source

After the binary packages sync on 18.08.2023 the nav2_mppi_controller started crashing on when launching;
image

ps: same non-updated setup still working.

By the way, the sync was supposed to fix this issue but it hasn't been solved yet: #3762 (comment)

@SteveMacenski
Copy link
Member

This is another issue. Please compile with debug flags according to the docs here (https://navigation.ros.org/tutorials/docs/get_backtrace.html) and give me the traceback to know where the crash is occurring. I looked over the recent changes and I can't find anything obvious that would cause this. Make sure that you clean build (e.g. delete install/build spaces) in case there's some caching issue causing this from a major sync update

@SteveMacenski SteveMacenski added the bug Something isn't working label Aug 21, 2023
@AmmarAlbakri
Copy link
Author

AmmarAlbakri commented Aug 22, 2023

I was able to get this backtrace when running mppi from the binary packages:

[INFO] [1692688226.327340653] [controller_server]: Created controller : FollowPath of type nav2_rotation_shim_controller::RotationShimController
[INFO] [1692688226.349984032] [controller_server]: Created internal controller for rotation shimming: FollowPath of type nav2_mppi_controller::MPPIController
[WARN] [1692688226.351220055] [controller_server]: Controller period is less then model dt, consider setting it equal
[INFO] [1692688226.357645621] [controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[INFO] [1692688226.357812764] [controller_server]: Critic loaded : mppi::critics::ConstraintCritic

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x00007fffcb5b9e15 in mppi::critics::ObstaclesCritic::findCircumscribedCost(std::shared_ptr<nav2_costmap_2d::Costmap2DROS>)

I hope this helps.

When I build from source on humble branch there is no problem everything works like a charm.

@SteveMacenski
Copy link
Member

What is your computer architecture? That is also not a traceback, that doesn't show me where the issue is in particular. I still don't see anything obviously wrong to address.

If you compile it from source from the 1.1.9 tag (humble which is the exact thing that should be in binaries) does that work for you?

@AmmarAlbakri
Copy link
Author

My computer architecutre is:

6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I am sorry I don't have experience with gdp or backtrace, I followed the tutorial

I compiled from source 1.1.9 tag, and it is working, but the binaries still give the error I mentioned above. I also tried it on many different computer architectures but still, binaries throw an error and crash.

@StetroF
Copy link
Contributor

StetroF commented Aug 25, 2023

I met this problem as well. I found that this controller can't work in the edition that downloaded by 'sudo apt install ros-humble-nav2-mppi-controller'. And solve it by build the nav2_mppi_controller’s source code rather than download it by sudo. Perhaps it is because the sudo edition didn't update?

@SteveMacenski
Copy link
Member

This is really odd. What happens if you sudo apt purge ros-humble-nav2-mppi-controller and reinstall it? I'm wondering if something is broken in the update process more than the binaries being broken.

I looked into this in detail and I don't see any changes in the function that @AmmarAlbakri got a trace pointing to. Nothing changed with respect to that function or its inputs -- but things around it did change. The rest of Humble and the sync look identical to me, so if its working on local builds but not in binary format I can only seeing it be the cause of 3 things

  • Binaries are busted
  • Binary install is busted
  • The build on the build farm makes binaries that don't run on related but non-identical computers for some reason (but worked before was a fluke?)

The second item is the one I can have you guys easily test. The first I've asked Intrinsic if we can re-trigger to test. The last is more complex and I don't want to go down that rabbit hole if unnecessary.

Please let me know!

@StetroF
Copy link
Contributor

StetroF commented Aug 29, 2023

Once i purge nav2-mppi-controller.The whole navigation will be remove so i reinstall them by sudo apt install ros-humble-navigation2. Then I relaunch my nav2 package. The same problem occur. After i add the builded nav2-mppi-controller's code in my install package. It work again.
Last, Here is the relative log when loading the mppi controller installed by sudo rather than source code build:

[controller_server-15] [INFO] [1693269845.941573472] [controller_server]: Created controller : FollowPath of type nav2_mppi_controller::MPPIController
[controller_server-15] [WARN] [1693269845.950492480] [controller_server]: Controller period is less then model dt, consider setting it equal
[controller_server-15] [INFO] [1693269845.953610926] [controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[controller_server-15] [INFO] [1693269845.953682884] [controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[controller_server-15] [INFO] [1693269845.955963772] [controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[controller_server-15] [INFO] [1693269845.956013774] [controller_server]: Critic loaded : mppi::critics::GoalCritic
[controller_server-15] [INFO] [1693269845.957338102] [controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.500000 angular threshold.
[controller_server-15] [INFO] [1693269845.957417906] [controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[controller_server-15] [INFO] [1693269845.960066387] [controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[controller_server-15] [INFO] [1693269845.960157092] [controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[controller_server-15] [INFO] [1693269845.962222725] [controller_server]: Critic loaded : mppi::critics::PathFollowCritic
[robot_pose_publisher-8] Warning: Invalid frame ID "map" passed to canTransform argument target_frame - frame does not exist
[robot_pose_publisher-8] at line 93 in ./src/buffer_core.cpp
[controller_server-15] [INFO] [1693269845.966025416] [controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[controller_server-15] [INFO] [1693269845.966077558] [controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[controller_server-15] [INFO] [1693269845.966705911] [controller_server]: PreferForwardCritic instantiated with 1 power and 2.000000 weight.
[controller_server-15] [INFO] [1693269845.966739201] [controller_server]: Critic loaded : mppi::critics::PreferForwardCritic

Then after the preferforwardcritic loaded .The error occur:
[ERROR] [controller_server-15]: process has died [pid 13909, exit code -4, cmd '/opt/ros/humble/lib/nav2_controller/controller_server --ros-args --log-level info --ros-args -r _ --params-file /tmp/tmpsaf6nhx7 -r /tf:=tf -r /tf_static:=tf_static -r cmd_vel:=cmd_vel_nav'].

As you see . Controller server crashed after loading the critic.

@SteveMacenski
Copy link
Member

Can you re-order / remove some of the critics? Is there one in particular that makes it crash? Someone else got a backtrace that pointed to the Obstacle critic -- but maybe its not that simple. Try with 0 critics and work up from there one by one.

If purging and reinstalling doesn't work, I don't think its the debian update process that's the issue. The binary bin job looks fine to me in build.ros2.org so I'm not sure what could cause the issue there. @StetroF what's your computer architecture?

@StetroF
Copy link
Contributor

StetroF commented Aug 30, 2023

I tried. First time , I delete the constraincritic and obstaclecritic, then i find that it stil crashed. But i found it there are some logs relative with PathAngleCritic's intialize. So I thought this critic must be normal.Then second time i delete all critic except PathAngleCritic. But this time after loading the critic, the logs were:

[controller_server-15] [INFO] [1693354275.581728814] [sto_robot2.controller_server]: Created controller : FollowPath of type nav2_mppi_controller::MPPIController
[controller_server-15] [WARN] [1693354275.583152778] [sto_robot2.controller_server]: Controller period is less then model dt, consider setting it equal
[controller_server-15] [INFO] [1693354275.585134635] [sto_robot2.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[controller_server-15] [INFO] [1693354275.585179252] [sto_robot2.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[controller_server-15] [INFO] [1693354275.587878907] [sto_robot2.controller_server]: Optimizer reset

Then it died again.I've tried other options that launch with only 1 critic like GoalAngleCritic ,GoalCritic .etc. The logs are similar with the logs upon that after the 'critic loaded ...... ' and 'Optimizer reset' then it crashed.
Last, I delete all critic, the log were :

[controller_server-15] [INFO] [1693354834.866720886] [sto_robot2.controller_server]: Created controller : FollowPath of type nav2_mppi_controller::MPPIController
[controller_server-15] [WARN] [1693354834.873562877] [sto_robot2.controller_server]: Controller period is less then model dt, consider setting it equal
[controller_server-15] [INFO] [1693354834.878223025] [sto_robot2.controller_server]: Optimizer reset

Then crashed again.Last, my compute architecture is :

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 30, 2023

Fun.

[sto_robot2.controller_server]: Optimizer reset

That's the last line of the optimizer's initialization in the bringup https://github.com/ros-planning/navigation2/blob/210977790bfb6e809f2c2ef3b7e1660662d630cc/nav2_mppi_controller/src/optimizer.cpp#L54

The very next thing is the path handler initialization. The following lines were add - but all are in the 1.1.9 sync that was backported to humble

  getParam(enforce_path_inversion_, "enforce_path_inversion", false);
  if (enforce_path_inversion_) {
    getParam(inversion_xy_tolerance_, "inversion_xy_tolerance", 0.2);
    getParam(inversion_yaw_tolerance, "inversion_yaw_tolerance", 0.4);
    inversion_locale_ = 0u;
  }

I'm morbidly curious if you add those parameters to your parameter file for MPPI if that crash goes away (e.g. something going on with missing parameters). Keep all the critics out of the configuration. I doubt it'll do anything, but if it does that would be extremely relevant to know.

Also, getting a backtrace for the crash that occurs when you're not using any critics would be another valuable data point https://navigation.ros.org/tutorials/docs/get_backtrace.html. Especially if you're not seeing the exact same behavior / issue as @AmmarAlbakri

What seems odd to me is that both of these areas of code were changed recently. But neither have been changed since 1.1.9 was released that you've mentioned you've cloned that tag which is the exact same software and seems to work. It makes me think something's up in buildfarm-land

Edit:

Another thing to test is https://github.com/ros-planning/navigation2/blob/main/nav2_mppi_controller/CMakeLists.txt#L38-L47

It seems that the illegal instruction complaint might be that those flags are illegal for your CPUs but legal for the build farm. If you change

add_compile_options(-O3 -finline-limit=1000000 -ffp-contract=fast -ffast-math)

to

add_compile_options(-O3 -finline-limit=1000000 -ffp-contract=fast -ffast-math -mavx2 -mfma)

And compile locally, do you see any change in behavior with the locally compiled version (e.g. compile fail, or the same binary related issue)? I want both @AmmarAlbakri and @StetroF to try this since you may not actually be experiencing the same issue

@MichalMalachowski
Copy link

MichalMalachowski commented Aug 30, 2023

@SteveMacenski I have the same problem, I use ROS Humble Nav2 stack inside Docker container. MPPI controller is proven to launch correctly in the container created the second-to-last Friday (18th August) and it doesn't launch in the container created yesterday (29th August).

I tried some of advises you gave in the last reply:

  • empty critics list
  • added parameters enforce_path_inversion: false, inversion_xy_tolerance: 0.2, inversion_yaw_tolerance: 0.4 for the controller in config yaml
  • tested with both my configuration and an example configuration from Nav2 website

None combination of above worked.

I didn't try to compile the package locally. I could play with it when I find some free time, but let's hope someone else will check it before.

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 30, 2023

None combination of above worked.

Any traceback to what method specifically crashed? This would be helpful, especially if it changed between different runs / things.

I didn't try to compile the package locally.

Please do. I'm playing with this right now but I only have about 2 hours today to spend on it. Especially if the types of crashes are different for different people's computers, I'll need that data to know if I'm barking up the right tree from another source. I see this now that I'm testing it locally, which is good that I can reproduce and experiment myself. Though difficult since it only appears in binaries :(

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 30, 2023

Notes:

  • Compiled locally OK
  • Compiled locally with forcing on -mavx2 -mfma OK
  • Compiled locally with forcing off -mavx2 -mfma OK

  • Binaries with the Configuration Docs configs NOT OK
> const&, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>, mppi::ParametersHandler*) ()
   from /opt/ros/humble/lib/libmppi_controller.so
#4  0x00007ffff40fbde4 in mppi::Optimizer::initialize(std::weak_ptr<rclcpp_lifecycle::LifecycleNode>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>, mppi::ParametersHandler*) ()
   from /opt/ros/humble/lib/libmppi_controller.so
#5  0x00007ffff40da0c1 in nav2_mppi_controller::MPPIController::configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tf2_ros::Buffer>, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) ()
   from /opt/ros/humble/lib/libmppi_controller.so
#6  0x00007ffff7cade9d in nav2_controller::ControllerServer::on_configure(rclcpp_lifecycle::State const&) ()
   from /home/steve/Documents/hi_ws/install/nav2_controller/lib/libcontroller_server_core.so
#7  0x00007ffff7dacc6b in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#8  0x00007ffff7db66d2 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#9  0x00007ffff7da4730 in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::on_change_state(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >) () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#10 0x00007ffff7da6936 in std::_Function_handler<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >), std::_Bind<void (rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::*(rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >&&) () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#11 0x00007ffff7db4819 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#12 0x00007ffff7eae046 in ?? () from /opt/ros/humble/lib/librclcpp.so
#13 0x00007ffff7eaba2a in rclcpp::Executor::execute_service(std::shared_ptr<rclcpp::ServiceBase>) () from /opt/ros/humble/lib/librclcpp.so
#14 0x00007ffff7eabd96 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) () from /opt/ros/humble/lib/librclcpp.so
#15 0x00007ffff7eb3620 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/humble/lib/librclcpp.so
#16 0x00007ffff7eb3835 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) () from /opt/ros/humble/lib/librclcpp.so

  • Binaries with Arbitrary critics NOT OK

w/ Obstacle critic its always

(gdb) backtrace
#0  0x00007ffff4061d56 in mppi::critics::ObstaclesCritic::findCircumscribedCost(std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) ()
   from /opt/ros/humble/lib/libmppi_critics.so
#1  0x00007ffff4062501 in mppi::critics::ObstaclesCritic::initialize() ()
   from /opt/ros/humble/lib/libmppi_critics.so
#2  0x00007ffff4122dc9 in mppi::CriticManager::loadCritics() ()
   from /opt/ros/humble/lib/libmppi_controller.so
#3  0x00007ffff411d2b1 in mppi::CriticManager::on_configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>, mppi::ParametersHandler*) () from /opt/ros/humble/lib/libmppi_controller.so
#4  0x00007ffff40fbde4 in mppi::Optimizer::initialize(std::weak_ptr<rclcpp_lifecycle::LifecycleNode>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>, mppi::ParametersHandler*) () from /opt/ros/humble/lib/libmppi_controller.so
#5  0x00007ffff40da0c1 in nav2_mppi_controller::MPPIController::configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tf2_ros::Buffer>, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) ()
   from /opt/ros/humble/lib/libmppi_controller.so
#6  0x00007ffff7cade9d in nav2_controller::ControllerServer::on_configure(rclcpp_lifecycle::State const&) ()
   from /home/steve/Documents/hi_ws/install/nav2_controller/lib/libcontroller_se

W/ goal angle, or Prefer forward critic only, Constraint critic only, or path follow only, or path angle only, or path align only, or goal angle only, or twirling only: it switches between one of these two on different runs (which seems reasonable b/c the NoiseGenerator is in another thread, so its basically "who triggers the failure first)

/opt/ros/humble/lib/libmppi_controller.so
(gdb) backtrace
#0  0x00007ffff412fc82 in mppi::NoiseGenerator::generateNoisedControls() () from /opt/ros/humble/lib/libmppi_controller.so
#1  0x00007ffff41308d3 in mppi::NoiseGenerator::noiseThread() () from /opt/ros/humble/lib/libmppi_controller.so
#2  0x00007ffff78dc253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff7494b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007ffff7526a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
#0  0x00007ffff4129e3b in mppi::PathHandler::getMaxCostmapDist() ()
   from /opt/ros/humble/lib/libmppi_controller.so
#1  0x00007ffff4129f4b in mppi::PathHandler::initialize(std::weak_ptr<rclcpp_lifecycle::LifecycleNode>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>, std::shared_ptr<tf2_ros::Buffer>, mppi::ParametersHandler*) ()
   from /opt/ros/humble/lib/libmppi_controller.so
#2  0x00007ffff40da157 in nav2_mppi_controller::MPPIController::configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tf2_ros::Buffer>, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) ()
   from /opt/ros/humble/lib/libmppi_controller.so
#3  0x00007ffff7cade9d in nav2_controller::ControllerServer::on_configure(rclcpp_lifecycle::State const&) ()
   from /home/steve/Documents/hi_ws/install/nav2_controller/lib/libcontroller_server_core.so
#4  0x00007ffff7dacc6b in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#5  0x00007ffff7db66d2 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#6  0x00007ffff7da4730 in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::on_change_state(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >) ()
   from /opt/ros/humble/lib/librclcpp_lifecycle.so
--Type <RET> for more, q to quit, c to continue without paging--
#7  0x00007ffff7da6936 in std::_Function_handler<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >), std::_Bind<void (rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::*(rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >&&) () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#8  0x00007ffff7db4819 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#9  0x00007ffff7eae046 in ?? () from /opt/ros/humble/lib/librclcpp.so
#10 0x00007ffff7eaba2a in rclcpp::Executor::execute_service(std::shared_ptr<rclcpp::ServiceBase>) ()
   from /opt/ros/humble/lib/librclcpp.so
#11 0x00007ffff7eabd96 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) ()
   from /opt/ros/humble/lib/librclcpp.so
#12 0x00007ffff7eb3620 in rclcpp::executors::SingleThreadedExecutor::spin() ()
   from /opt/ros/humble/lib/librclcpp.so
#13 0x00007ffff7eb3835 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) ()
   from /opt/ros/humble/lib/librclcpp.so
#14 0x00005555555563c0 in main ()

Both of which complain of an illegal instruction where the obstacle layer's seg faults outright.

That seems to point to the ordering:

  • Obstacle, when present, will trigger first
  • If not, then either the noise generator or the path handler, depending on which thread can get there first

So lets look at these functions

double PathHandler::getMaxCostmapDist()
{
  const auto & costmap = costmap_->getCostmap();
  return std::max(costmap->getSizeInCellsX(), costmap->getSizeInCellsY()) *
         costmap->getResolution() / 2.0;
}
void NoiseGenerator::generateNoisedControls()
{
  auto & s = settings_;

  xt::noalias(noises_vx_) = xt::random::randn<float>(
    {s.batch_size, s.time_steps}, 0.0,
    s.sampling_std.vx);
  xt::noalias(noises_wz_) = xt::random::randn<float>(
    {s.batch_size, s.time_steps}, 0.0,
    s.sampling_std.wz);
  if (is_holonomic_) {
    xt::noalias(noises_vy_) = xt::random::randn<float>(
      {s.batch_size, s.time_steps}, 0.0,
      s.sampling_std.vy);
  }
}

double ObstaclesCritic::findCircumscribedCost(
  std::shared_ptr<nav2_costmap_2d::Costmap2DROS> costmap)
{
  double result = -1.0;
  bool inflation_layer_found = false;
  // check if the costmap has an inflation layer
  for (auto layer = costmap->getLayeredCostmap()->getPlugins()->begin();
    layer != costmap->getLayeredCostmap()->getPlugins()->end();
    ++layer)
  {
    auto inflation_layer = std::dynamic_pointer_cast<nav2_costmap_2d::InflationLayer>(*layer);
    if (!inflation_layer) {
      continue;
    }

    inflation_layer_found = true;
    const double circum_radius = costmap->getLayeredCostmap()->getCircumscribedRadius();
    const double resolution = costmap->getCostmap()->getResolution();
    result = inflation_layer->computeCost(circum_radius / resolution);
    auto getParam = parameters_handler_->getParamGetter(name_);
    getParam(inflation_scale_factor_, "cost_scaling_factor", 10.0);
    getParam(inflation_radius_, "inflation_radius", 0.55);
  }

  if (!inflation_layer_found) {
    RCLCPP_WARN(
      rclcpp::get_logger("computeCircumscribedCost"),
      "No inflation layer found in costmap configuration. "
      "If this is an SE2-collision checking plugin, it cannot use costmap potential "
      "field to speed up collision checking by only checking the full footprint "
      "when robot is within possibly-inscribed radius of an obstacle. This may "
      "significantly slow down planning times and not avoid anything but absolute collisions!");
  }

  return result;
}

What didn't change anything

  • Messing with the controller speed
  • Messing with key params for data sizing (in case running out of memory) like batch size, time steps etc
  • removing all params and removing all critic params

--> So I'm pretty sure it has nothing to do with configurations being broken.

My local xtensor version is 0.23.10-15 which is years old, so that wasn't recently updated. I also checked the build farm and its using the same version. I also scanned over all the logs and nothing looks out of the ordinary.

I just tested and the iron binaries are broken too. Beyond the fact that these are orbiting pieces of code that have changed recently, I can't find any throughline or why this would be happening only in the binaries but not in source code -- nor why anything has illegal instructions.

@AmmarAlbakri
Copy link
Author

And compile locally, do you see any change in behavior with the locally compiled version (e.g. compile fail, or the same binary related issue)? I want both @AmmarAlbakri and @StetroF to try this since you may not actually be experiencing the same issue

I have tested compiling with -mavx2 -mfma flag and it works as long as I am building locally. Binaries still give the same error.

@MichalMalachowski
Copy link

Any traceback to what method specifically crashed?

Just running the nav stack gave no specific error, it was just about controller_server dying:

[lifecycle_manager-6] [INFO] [1693468058.747407360] [lifecycle_manager_navigation]: Configuring controller_server
[controller_server-3] [INFO] [1693468059.576063454] [controller_server]: Configuring controller interface
[controller_server-3] [INFO] [1693468059.576473075] [controller_server]: getting goal checker plugins..
[controller_server-3] [INFO] [1693468059.576515560] [controller_server]: Controller frequency set to 20.0000Hz
[controller_server-3] [INFO] [1693468059.576553513] [local_costmap.local_costmap]: Configuring
[controller_server-3] [INFO] [1693468059.582493940] [local_costmap.local_costmap]: Using plugin "static_layer"
[controller_server-3] [INFO] [1693468059.585687865] [local_costmap.local_costmap]: Subscribing to the map topic (/map) with transient local durability
[controller_server-3] [INFO] [1693468059.587032856] [local_costmap.local_costmap]: Initialized plugin "static_layer"
[controller_server-3] [INFO] [1693468059.587061169] [local_costmap.local_costmap]: Using plugin "obstacle_layer"
[controller_server-3] [INFO] [1693468059.588327585] [local_costmap.local_costmap]: Subscribed to Topics: 
[controller_server-3] [INFO] [1693468059.588434046] [local_costmap.local_costmap]: Initialized plugin "obstacle_layer"
[controller_server-3] [INFO] [1693468059.588450733] [local_costmap.local_costmap]: Using plugin "inflation_layer"
[controller_server-3] [INFO] [1693468059.589782200] [local_costmap.local_costmap]: Initialized plugin "inflation_layer"
[controller_server-3] [INFO] [1693468059.599204450] [controller_server]: Created progress_checker : progress_checker of type nav2_controller::SimpleProgressChecker
[controller_server-3] [INFO] [1693468059.600091182] [controller_server]: Created goal checker : GoalChecker of type nav2_controller::SimpleGoalChecker
[controller_server-3] [INFO] [1693468059.600529509] [controller_server]: Controller Server has GoalChecker  goal checkers available.
[controller_server-3] [INFO] [1693468059.601569700] [controller_server]: Created controller : DefaultController of type nav2_mppi_controller::MPPIController
[controller_server-3] [INFO] [1693468059.604710106] [controller_server]: Controller period is equal to model dt. Control sequence shifting is ON
[controller_server-3] [INFO] [1693468059.606289757] [controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[controller_server-3] [INFO] [1693468059.606340764] [controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[controller_server-3] [INFO] [1693468059.606928068] [controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[controller_server-3] [INFO] [1693468059.606956307] [controller_server]: Critic loaded : mppi::critics::GoalCritic
[controller_server-3] [INFO] [1693468059.607652251] [controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 5.000000 angular threshold.
[controller_server-3] [INFO] [1693468059.607685011] [controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[controller_server-3] [INFO] [1693468059.609034109] [controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 10.000000 weight
[controller_server-3] [INFO] [1693468059.609078291] [controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[controller_server-3] [INFO] [1693468059.609874952] [controller_server]: Critic loaded : mppi::critics::PathFollowCritic
[controller_server-3] [INFO] [1693468059.611249104] [controller_server]: PreferForwardCritic instantiated with 1 power and 5.000000 weight.
[controller_server-3] [INFO] [1693468059.611292541] [controller_server]: Critic loaded : mppi::critics::PreferForwardCritic
[controller_server-3] [INFO] [1693468059.612857101] [controller_server]: Optimizer reset
[ERROR] [controller_server-3]: process has died [pid 154526, exit code -4, cmd '/opt/ros/humble/lib/nav2_controller/controller_server --ros-args --log-level info --ros-args -p use_sim_time:=False --params-file /tmp/tmplp_r7y7o -r /tf:=tf -r /tf_static:=tf_static'].

However, after adding prefix=['xterm -e gdb -ex run --args'] to controller_server in the launch file, I got an additional information:

...
[controller_server-3] [INFO] [1693468059.611292541] [controller_server]: Critic loaded : mppi::critics::PreferForwardCritic
[New Thread 0x7fffe1efe640 (LWP 317595)]
[controller_server-3] [INFO] [1693468059.612857101] [controller_server]: Optimizer reset

Thread 17 "controller_serv" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7fffe1efe640 (LWP 317595)]
0x00007ffff0191c82 in mppi::NoiseGenerator::generateNoisedControls() () from /opt/ros/humble/lib/libmppi_controller.so

So it's the method generateNoisedControls() you @SteveMacenski also found.

I'm sorry, but I rather won't have time this week to compile the package locally, so if no-one will do this before me, I will try next week.

@SteveMacenski
Copy link
Member

I'm running a re-release on Iron and Humble to see if that resolves the issue due to a transient build farm issue. I'm not overly hopeful, but its worth a shot.

@jankolkmeier
Copy link

I was facing this issue today and was able to fix it by compiling from source.

@Imaniac230
Copy link

We have recently encountered this issue as well when deploying on new systems.

I can confirm that all of our robots that have the older nav2_mppi_controller binary version 1.1.8 work just fine. They are using:

  • 11th gen Intel NUCs with i7-1165G7 CPUs
  • kernel versions: 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64
  • ubuntu: 22.04.2 LTS

I even have the older 1.1.8 version working correctly in an LXD container with Ubuntu 22.04.3 LTS (host ubuntu 20.04.6 LTS, kernel 5.15.0-58-generic #64~20.04.1-Ubuntu SMP Fri Jan 6 16:42:31 UTC 2023 x86_64.

A new container or updated device using version 1.1.9 has the same issue with the controller dying prematurely:

[INFO] [1695154993.888202638] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695154993.889628400] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON
[INFO] [1695154993.896635169] [loki_sim.controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[INFO] [1695154993.896670707] [loki_sim.controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[INFO] [1695154993.897008256] [loki_sim.controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.400000 angular threshold.
[INFO] [1695154993.897028625] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[INFO] [1695154993.897407984] [loki_sim.controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695154993.897428914] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalCritic
[INFO] [1695154993.898048167] [loki_sim.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[INFO] [1695154993.898069528] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[INFO] [1695154993.898725862] [loki_sim.controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[INFO] [1695154993.898747613] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[INFO] [1695154993.899148422] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathFollowCritic

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x00007fffb0d10e15 in mppi::critics::ObstaclesCritic::findCircumscribedCost(std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) () from /opt/ros/humble/lib/libmppi_critics.so

In all cases the order or ctirics in cofing is always: critics: ["ConstraintCritic", "GoalAngleCritic", "GoalCritic", "PathAngleCritic", "PathAlignCritic", "PathFollowCritic", "ObstaclesCritic", "PreferForwardCritic", "TwirlingCritic"]. We are also always loading all of the available controllers at launch and choosing a desird one through the behavior tree (it was dwb in this case).

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 19, 2023

I can't explain this and would need some help from OSRF to make an attempt via some custom jobs to see if we can find the error. @nuclearsandwich @clalancette @wjwwood I know you're all busy, but this is above my knowledge level and all the best debugging I've been able to accomplish point back to the build farm. See #3767 (comment)

  • Binaries for a package that was working for its first release on any distribution are now broken on a sync for every distribution
  • The same version if we clone it locally on 10+ machines work fine, only binaries fail from the same version
  • The crashes happen in different places, but all seem to correspond to places recently touched. This makes me think there's something going on with caching (or coincidental). That would seem to grokk given it worked for all distributions the first binary build before there was previously a build. Or something else in build-farm land changed?

Without help, the best I can do is just re-release and pray, or try to release under a different name to create a new binary of a new job to see if that works. But those are both not real solutions to make sure it is resolved (and testing in production). I'm at a loss for direction since I don't know that there's any dials I can play with on the build farm side myself

I re-released the same version as Chris suggested to Iron. I just tested on ros2-testing that it is still indeed crashing after a new build with the new version, so it wasn't a one-off fluke. I've been through changes with a fine-toothed comb. Unless the issue is with some implicit casting or comparisons of different types (which I'm now fixing just in case), I'm not seeing whatever this issue is, if its in the code. The next best guess is it has to do with compiler flags (and the cpu or environment in the build farm changed?), though our flags haven't changed since 1.1.8

If @Imaniac230 @jankolkmeier @AmmarAlbakri or any others in the thread had a few minutes, I'd be curious if you could package up your install space and send it to me (x86 only). I'll try running it on my computer to see if we can replicate this without the build farm.

@Imaniac230
Copy link

Imaniac230 commented Sep 20, 2023

If @Imaniac230 @jankolkmeier @AmmarAlbakri or any others in the thread had a few minutes, I'd be curious if you could package up your install space and send it to me (x86 only). I'll try running it on my computer to see if we can replicate this without the build farm.

I've compiled a list of installed packages using apt-clone. I used the updated lxd container, so from my understanding this should mirror a clean base ubuntu 22.04 with all of the additional (ros2, nav2, and other) packages we use for deployment.

Now, this produced only a list of packages and did not archive any of the the binaries locally, so it will still download all from the remote repositories. Also any locally compiled packages were not included in this. I'm not sure if this is the desired output.

If it's something else you had in mind I'll see if I can get that working (like dpkg-repack on all of them, so that all are archived in local .deb files?).

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 20, 2023

I was suggesting instead to package up the install space of the manual build to see if I could get those libraries to run on another computer, in case some build flags were obviously causing problems on a different CPU of the same architecture. I’m not convinced it’ll show us anything, but its definitely worth a try if the issue is portability. If it does cause a problem, then that could lead to a workflow for testing solutions!

@Imaniac230
Copy link

Imaniac230 commented Sep 21, 2023

Oh, sorry, I was going somewhere else with my thoughts.

So I've done some additional experiments in the meantime. The manually compiled from source (tag 1.1.9) version works fine on my testing system as well, just as everyone else is experiencing.

I also tried using different generators unix makefiles and ninja with and without the -mfma -mavx2 flags, neither of which made any difference (all combinations still OK).

The system I'm testing this on now is:

  • AMD Ryzen 7 3700X CPU with support for both avx2 and fma instruction sets.
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 113
model name	: AMD Ryzen 7 3700X 8-Core Processor
stepping	: 0
microcode	: 0x8701021
cpu MHz		: 3593.146
cache size	: 512 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
  • LXD container with ubuntu 22.04.3 LTS (host ubuntu 20.04.6 LTS, host kernel 5.15.0-58-generic #64~20.04.1-Ubuntu SMP Fri Jan 6 16:42:31 UTC 2023 x86_64)

I just quickly verified that with the default cmake setup the support does get detected and the compile options would get added on my system if left as is:

check_cxx_compiler_flag("-mavx2" COMPILER_SUPPORTS_AVX2)
check_cxx_compiler_flag("-mfma" COMPILER_SUPPORTS_FMA)

message(AVX2 support detected:_"${COMPILER_SUPPORTS_AVX2}")
message(FMA support detected:_"${COMPILER_SUPPORTS_FMA}")

if(COMPILER_SUPPORTS_AVX2)
  message(adding_avx2_compile_options)
  add_compile_options(-mavx2)
endif()

if(COMPILER_SUPPORTS_FMA)
  message(adding_fma_compile_options)
  add_compile_options(-mfma)
endif()

output:

--- stderr: nav2_mppi_controller                                          
AVX2supportdetected:_"1"
FMAsupportdetected:_"1"
adding_avx2_compile_options
adding_fma_compile_options
---

Then I've tried a dumb hack by replacing my manually compiled lib files from local_worskapce/install/nav2_mppi_controller/lib/ to /opt/ros/humble/lib/ and running with the native ros install sourced (/opt/ros/humble/).
The first interesting thing is that the manually compiled ones do have different sizes:

ll -h /opt/ros/humble/lib/libmppi_c*
-rw-r--r-- 1 root root 1.2M Sep 21 15:03 /opt/ros/humble/lib/libmppi_controller.so
-rw-r--r-- 1 root root 796K Sep 21 15:03 /opt/ros/humble/lib/libmppi_controller_original_from_apt.so
-rw-r--r-- 1 root root 615K Sep 21 15:05 /opt/ros/humble/lib/libmppi_critics.so
-rw-r--r-- 1 root root 300K Sep 21 15:05 /opt/ros/humble/lib/libmppi_critics_original_from_apt.so
  1. If both libs are replaced by manually compiled ones, all works fine - OK.
  2. If libmppi_controller.so is left original and only libmppi_critics.so is replaced I crash with the following ouput:
[INFO] [1695309388.908256577] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695309388.909553515] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON
[INFO] [1695309388.918605212] [loki_sim.controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[INFO] [1695309388.918645829] [loki_sim.controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[INFO] [1695309388.918914659] [loki_sim.controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.400000 angular threshold.
[INFO] [1695309388.918940848] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[INFO] [1695309388.919209397] [loki_sim.controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695309388.919236218] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalCritic
[INFO] [1695309388.919763327] [loki_sim.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[INFO] [1695309388.919784187] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[INFO] [1695309388.920276029] [loki_sim.controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[INFO] [1695309388.920296408] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[INFO] [1695309388.920626574] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathFollowCritic
[INFO] [1695309388.921333443] [loki_sim.controller_server]: ObstaclesCritic instantiated with 1 power and 20.000000 / 1.500000 weights. Critic will collision check based on footprint cost.
[INFO] [1695309388.921354153] [loki_sim.controller_server]: Critic loaded : mppi::critics::ObstaclesCritic
[INFO] [1695309388.921643571] [loki_sim.controller_server]: PreferForwardCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695309388.921669190] [loki_sim.controller_server]: Critic loaded : mppi::critics::PreferForwardCritic
[INFO] [1695309388.921868237] [loki_sim.controller_server]: TwirlingCritic instantiated with 1 power and 10.000000 weight.
[INFO] [1695309388.921896931] [loki_sim.controller_server]: Critic loaded : mppi::critics::TwirlingCritic
[New Thread 0x7fffb1e6e640 (LWP 104572)]
[INFO] [1695309388.924296550] [loki_sim.controller_server]: Optimizer reset

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x00007fffb2098e3b in mppi::PathHandler::getMaxCostmapDist() () from /opt/ros/humble/lib/libmppi_controller.so

All of the critic functions pass through ok and it fail on a controller function.
3. If libmppi_critics.so is left original and only libmppi_controller.so is replaced I crash with the following ouput:

[INFO] [1695309578.237223524] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695309578.238515142] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON
[INFO] [1695309578.245155308] [loki_sim.controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[INFO] [1695309578.245189653] [loki_sim.controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[INFO] [1695309578.245482769] [loki_sim.controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.400000 angular threshold.
[INFO] [1695309578.245504740] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[INFO] [1695309578.245798106] [loki_sim.controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695309578.245826671] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalCritic
[INFO] [1695309578.246308714] [loki_sim.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[INFO] [1695309578.246335926] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[INFO] [1695309578.246910876] [loki_sim.controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[INFO] [1695309578.246939941] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[INFO] [1695309578.247330752] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathFollowCritic

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x00007fffb1e81e15 in mppi::critics::ObstaclesCritic::findCircumscribedCost(std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) () from /opt/ros/humble/lib/libmppi_critics.so

This time it crashes already on a critics function.

So it seems that there is something going on with both libraries. Next I can make a simple archive from the manually compiled and working nav2_mppi_controller install as you suggest @SteveMacenski to see if they also work on different machines. I whink this should be enough since it's just these libs that have problems.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 21, 2023

I spent a good chunk of last night also looking into MPPI and I found that the xtensor::optimize may actually be setting -march=native to make processor specific optimizations that may or may not work on other processors. That sounds suspicious to me, however note that that has never changed since the original release so it was working before. Still something to test, but conveniently that is the kind of error that would appear in sharing install spaces across computers, so if that's the problem, that's debuggable without the build farm.

@Imaniac230 thanks for the debugging effort and that's a good idea to testing out the specific libraries (I didn't think of that)! The two errors you bring up look alot like the tracebacks I got in my (long) comment trying out different combinations of settings. It looks like the difference between the 2 categories I reported is which library crashes - those logs you sent align perfectly with them.

Glad to have validation as well on mfma and mavx2 flags; that's what I also see! Can you send me that install space?

@Imaniac230
Copy link

Glad to have validation as well on mfma and mavx2 flags; that's what I also see! Can you send me that install space?

Sure, I guess I shouldn't be posting it here? So expect an email from tomas.horelican@vut.cz, that's going to be from me.

In the meantime I'll also try to play with the march=native flag and port the libs into one of our NUCs with Intel CPUs and see if there's any difference.

@Imaniac230
Copy link

The mail server seems to be rejecting the attachment. Is there a channel where I should send it instead, or should i place it here?

@SteveMacenski
Copy link
Member

The mail server seems to be rejecting the attachment.

... I guess it is just libs, that does look suspect as heck.

You can either put it in google drive / drop box as a tar file and email me that link or join the nav2 slack (https://join.slack.com/t/navigation2/shared_invite/zt-uj428p0x-jKx8U7OzK1IOWp5TnDS2rA) and hope that it doesn't have the same constraint

... Please don't virus me, that would be unchill

@Imaniac230
Copy link

Well, I zipped the whole contents of the install/nav2_mppi_controller folder, which contains scripts as well. I guess that was raising flags. I'll send just the two lib files, that will hopefully be enough.

@Imaniac230
Copy link

Imaniac230 commented Sep 21, 2023

This time it passed through with just the libs. Please let me know if it's all good.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 21, 2023

@Imaniac230 tried your file and it fails for me in one of the expected ways! Can you try again without the xtensor::optimize in target_link_libraries?

I'm pleasantly hopeful that this will resolve the problem, but then equally concerned as to what that'll mean in terms of the binary's performance vs building from source. And really confused as to why this is only an issue now

@Imaniac230
Copy link

I'm not sure I can create a difference. Both libs compiled locally on my machine work without errors with or without the xtensor::optimize argument.

However, when replacing just one of them and leaving the other one original (just as in my comment #3767 (comment)), it did crash at a different function:

  1. libmppi_controller.so is left original and only libmppi_critics.so is replaced by locally compiled one without xtensor::optimize:
[INFO] [1695324179.470023594] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695324179.470914202] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON
[INFO] [1695324179.479942239] [loki_sim.controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[INFO] [1695324179.479974681] [loki_sim.controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[INFO] [1695324179.480193886] [loki_sim.controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.400000 angular threshold.
[INFO] [1695324179.480214545] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[INFO] [1695324179.480425625] [loki_sim.controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695324179.480446565] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalCritic
[INFO] [1695324179.480823911] [loki_sim.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[INFO] [1695324179.480844610] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[INFO] [1695324179.481246632] [loki_sim.controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[INFO] [1695324179.481266329] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[INFO] [1695324179.481516263] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathFollowCritic
[INFO] [1695324179.482039806] [loki_sim.controller_server]: ObstaclesCritic instantiated with 1 power and 20.000000 / 1.500000 weights. Critic will collision check based on footprint cost.
[INFO] [1695324179.482075634] [loki_sim.controller_server]: Critic loaded : mppi::critics::ObstaclesCritic
[INFO] [1695324179.482292966] [loki_sim.controller_server]: PreferForwardCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695324179.482311501] [loki_sim.controller_server]: Critic loaded : mppi::critics::PreferForwardCritic
[INFO] [1695324179.482487084] [loki_sim.controller_server]: TwirlingCritic instantiated with 1 power and 10.000000 weight.
[INFO] [1695324179.482503515] [loki_sim.controller_server]: Critic loaded : mppi::critics::TwirlingCritic
[New Thread 0x7fffb1e6c640 (LWP 129622)]
[INFO] [1695324179.484902593] [loki_sim.controller_server]: Optimizer reset

Thread 16 "controller_serv" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7fffb1e6c640 (LWP 129622)]
0x00007fffb209cc82 in mppi::NoiseGenerator::generateNoisedControls() () from /opt/ros/humble/lib/libmppi_controller.so
  1. libmppi_critics.so is left original and only libmppi_controller.so is replaced by locally compiled one without xtensor::optimize: same error as MPPI crashing on loading plug-ings  #3767 (comment)

Meanwhile, I did some experiments on a NUC machine:

  • with an Intel i7-1165G7 CPU with support for avx2and fma
  • native ubuntu 22.04.3 LTS and kernel 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64

and got some different behavior. Any possible combination I tried worked without errors on the Intel:

  1. downloaded libs taken from the AMD PC -> OK (this was the latest one 1.1.9-1jammy.20230822.201753 amd64)
  2. downloaded libs on the Intel PC -> OK (this one is older 1.1.9-1jammy.20230807.181110 amd64, I assume this was is before the sync?)
  3. locally compiled libs on the AMD PC (with or without fma and avx2) -> OK
  4. locally compiled libs on the Intel PC (with or without fma and avx2) -> OK

Now going back to the AMD machine:

  1. downloaded libs on the AMD PC -> same errors (1.1.9-1jammy.20230822.201753 amd64)
  2. downloaded libs taken from the Intel PC -> same errors (1.1.9-1jammy.20230807.181110 amd64)
  3. locally compiled libs on the Intel PC -> new error:
[INFO] [1695323372.759161328] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695323372.759980261] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x00007fffb202b2f3 in std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::vector(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /opt/ros/humble/lib/libmppi_controller.so

I have haven't removed the xtensor::optimize in these tests yet. This will be what I can do for today, I'll get back to it tomorrow.

@SteveMacenski
Copy link
Member

I'm not sure I can create a difference. Both libs compiled locally on my machine work without errors with or without the xtensor::optimize argument.

Its not about compiler failures, its about portability. If you send me ones without that argument, I can see if that's what's causing portability problems. Under the hood that enables a flag which could potentially be triggering this behavior!

@Imaniac230
Copy link

Imaniac230 commented Sep 22, 2023

So I'm starting to believe this is actually narrowing down to a compatibility problem with AMD CPUs. I've double checked that I haven't made any mistakes when porting the libs to the Intel machine, but it still does actually all work there. I've downloaded the latest 1.1.9 update from apt (1.1.9-1jammy.20230822.201753 amd64) to the Intel NUC as well and it also works fine. So currently there is no possible way I can reproduce the errors on the Intel NUC, everything is OK there.

I've verified that all of the machines that we had these problems with do have AMC CPUs in them.

Then I built the libs locally on the Intel but without the tensor::optimize argument and sure enough, when I ported that to the PC with an AMD CPU it worked without errors. So it seems as though the march=native flag is triggering some instructions that are supported on the Intels but not on AMDs. However, the other way around: compiling with tensor::optimize on AMD and porting that to Intel is OK.

This might also explain why it seemed to work on some systems and not on others.

Lastly, I added the march=native flag to add_compile_options explicitly and also without xtensor::optimize in target_link_libraries, so:

  • add_compile_options(-O3 -finline-limit=1000000 -ffp-contract=fast -ffast-math -march=native)
  • target_link_libraries(${lib} xtensor xtensor::use_xsimd)

and did the same cross-checking. It does produce exactly the same bevior as only removing or adding the xtensor::optimize argument itself so there does seem to be a correlation:

  1. compiling with march=native on AMD and porting to Intel -> OK
  2. compiling with march=native on Intel and porting to AMD -> same error:
 [INFO] [1695323372.759161328] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695323372.759980261] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x00007fffb202b2f3 in std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::vector(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /opt/ros/humble/lib/libmppi_controller.so

It is interesting that it fails at a completely different point than any of the previous reported ones. I'm not sure if the release builds are being compiled for different CPUs differently and this is actually normal behavior, or there is something going on here.

For additional info, I've taken a look at the specific instructions being called in each function when it crashes(with layout asm in the gdb window after the crash) and this is what I see repeatedly being called:

  • in ObstaclesCritic::findCircumscribedCost it's trying to call vcvtusi2sd with %eax, %xmm5, %xmm0
  • in PathHandler::getMaxCostmapDist it's trying to call vcvtusi2sd with %ebx, %xmm0, %xmm0
  • in NoiseGenerator::generateNoisedControls it's trying to call vptestnmq with %ymm0, %ymm0, %k1
  • in the std::allocator its trying vpbroadcastq with %r14, %xmm0

From the Intel documentation (https://www.intel.com/content/dam/develop/external/us/en/documents/319433-024-697869.pdf), vcvtusi2sdis supposed to convert a signle unsigned integer to a single double float. This is an avx512 instruction, which does not have support on AMD, ergo -> illegal instruction. The closest thing to this is a signed version vcvtsi2sd, which is only avx (https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf).

The vptestnmq is again an avx512 instruction, which performs bitwise logical operations on integers. A similar avx variant in AMD docs is vptest.

The last vpbroadcastq should perform a broadcast of a single memory element into multiple places. This one is listed as avx2 for both Intel and AMD, however Intel does specify an avx512 variant, which is probably what is being called here.

Now, my knowledge of the amd64 and x86 instruction set doesn't extend beyond reading the documentation so I don't think I'll be pushing this any further. But from this I would humbly conclude that the march=native flag, when compiled on Intel, invokes avx512 extension instructions, therefore making it non-portable to other systems.

It would also explain why the remote binaries were not working for some people, but if they compiled locally from source it would know to use only those instructions that are actually available on their hardware.

The one thing that is bugging me, however, is that all of the libs that I sent you @SteveMacenski were compiled on an AMD PC without avx512 so they shouldn't be failing. I'm curious in what ways they were failing in your tests?

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 25, 2023

I've downloaded the latest 1.1.9 update from apt (1.1.9-1jammy.20230822.201753 amd64) to the Intel NUC as well and it also works fine.

I have an intel i7-8565U and 1.1.9 is broken for me. I also see this problem on my Intel CPU using the libs you provided me with optimize not included (and obviously included). Which did you build those libs on, Intel or AMD? It also may not be simply as easy as AMD->Intel but also the relative age. Have something older than an 11th gen to test on?

However, the other way around: compiling with tensor::optimize on AMD and porting that to Intel is OK. compiling with march=native on AMD and porting to Intel -> OK

If you compiled on AMD, then my comment above invalidates that theory. See my long comment above on the exact failures I saw from the build-farm generated binaries.

Supposedly also the build farm is using AMD for building the binaries https://build.ros2.org/job/Hbin_uJ64__nav2_mppi_controller__ubuntu_jammy_amd64__binary/11/consoleFull

vcvtusi2sdis supposed to convert a single unsigned integer to a single double float, vptestnmq is again an avx512 instruction, which performs bitwise logical operations on integers.

Interesting. a hail mary I threw was 91b688d last week, which explicitly does the type conversions that I noticed where implicit in the diff between 1.1.8 and 1.1.9.

  • I explicitly casted the unsigned int costmap size to a double in getMaxCostmapDist
  • I casted and changed the comparison of possibly_inscribed_cost_ in findCircumscribedCost to all use floats
  • I set the mean for the STD in generateNoisedControls to 0.0f from 0.0 and converted the STD type to a float.

@Imaniac230 if you were to compile from main (or these two commits b0a68bb and 91b688d), do you still see any of these portability issues after I made these changes (or have new ones pop up or some disappear)? It seems like I may have resolved a few (or all?) from my fine-tooth-comb code analysis changes noticing the same things that you found explicitly - except maybe generateNoisedControls where that instruction doesn't line up with my potential change, I don't think at least.

Still yet, none of this explains why this started now, unless I was literally perfect in all my other casts everywhere else, which seems awfully unlikely. Plus, generateNoisedControls has never changed since the codebase started. Even if we pretended the codebase was perfect before and I introduced errors causing instruction incompatabilities in the other two places, the generateNoisedControls instruction error isn't explainable with that thesis. Unless its some threading interaction I don't understand that might disappear with the fixes to the other locales.

But its a good start. I can't explain that Intel - AMD relationship you mention though also portability issues across Intel? Mind running that test for me using the newest changes to the MPPI controller (these two commits b0a68bb and 91b688d)? Maybe we can resolve this through removing the illegal instruction-generating actions themselves. I'm not sure how I'll prevent this from happening in the future, but I suppose one step at a time.

It would also explain why the remote binaries were not working for some people

Does it work for you on your NUC if you take the binaries from the build farm and from the AMD processor with optimize on? That's an interesting datapoint. But with optimize on or off, your libs did not work for me on Intel i7 8th gen. So perhaps its not so cut and dry?

@SteveMacenski
Copy link
Member

Actually, just pull in Humble again after #3836 is merged. This includes those changes I made to MPPI. You can just use that instead to test if my changes fixed some or all of these issues

@Imaniac230
Copy link

Imaniac230 commented Sep 27, 2023

I have an intel i7-8565U and 1.1.9 is broken for me. I also see this problem on my Intel CPU using the libs you provided me with optimize not included (and obviously included). Which did you build those libs on, Intel or AMD? It also may not be simply as easy as AMD->Intel but also the relative age. Have something older than an 11th gen to test on?

I was hoping a bit that it could be that simple. But all the libs I sent you were built with AMD Ryzen 7 3700X and they are clearly still failing for you. It would be interesting to see which exact instructions it is failing at in this case, since my AMD certainly does not show any avx512 variant in the supported flags (see #3767 (comment)). And if both cases (with or without optimize) still fail in the same way, there is probably still something else going on. I do have an older i5 7600K that I can test this with once I get to that PC.

Supposedly also the build farm is using AMD for building the binaries https://build.ros2.org/job/Hbin_uJ64__nav2_mppi_controller__ubuntu_jammy_amd64__binary/11/consoleFull

I couldn't find any reference to the specific HW being used. The amd64 tag should only specify the 64-bit architecture name (regardless of Intel or AMD). What I also find strange is that the release libs that I have downloaded from the build farm still fail at the same avx512 instructions, which would be strange if they were compiled on AMD without avx512. However I may be overreaching my knowledge here, and it might be possible that the instruction set is in fact supported on some AMD CPUs as well.

Does it work for you on your NUC if you take the binaries from the build farm and from the AMD processor with optimize on? That's an interesting datapoint.

Yes, they all work. I was not able to get it NOT working with any test.

Actually, just pull in Humble again after #3836 is merged. This includes those changes I made to MPPI. You can just use that instead to test if my changes fixed some or all of these issues

So here is what I have.
Pulled humble (latest commit b3474ca) and built with optimize enabled on AMD Ryzen 7 3700X (same one as #3767 (comment)):

  • Launched on the Intel NUC -> segfault:
[INFO] [1695827814.050579297] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695827814.053224030] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON
[INFO] [1695827814.064664339] [loki_sim.controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[INFO] [1695827814.064721051] [loki_sim.controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[INFO] [1695827814.065277043] [loki_sim.controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.400000 angular threshold.
[INFO] [1695827814.065312321] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[INFO] [1695827814.065865356] [loki_sim.controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695827814.065893939] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalCritic
[INFO] [1695827814.066769422] [loki_sim.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[INFO] [1695827814.066800916] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[INFO] [1695827814.067920996] [loki_sim.controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[INFO] [1695827814.067957812] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[INFO] [1695827814.068764985] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathFollowCritic

Thread 1 "controller_serv" received signal SIGSEGV, Segmentation fault.
0x00007ffff7a59c97 in __dynamic_cast () from /lib/x86_64-linux-gnu/libstdc++.so.6
  • Launched on the AMD PC -> the same segfault:
[INFO] [1695828405.596082485] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695828405.598232041] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON
[INFO] [1695828405.607964476] [loki_sim.controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[INFO] [1695828405.608007728] [loki_sim.controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[INFO] [1695828405.608446179] [loki_sim.controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.400000 angular threshold.
[INFO] [1695828405.608469604] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[INFO] [1695828405.609305679] [loki_sim.controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695828405.609324986] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalCritic
[INFO] [1695828405.610255861] [loki_sim.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[INFO] [1695828405.610283964] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[INFO] [1695828405.611136671] [loki_sim.controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[INFO] [1695828405.611161989] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[INFO] [1695828405.611714006] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathFollowCritic

Thread 1 "controller_serv" received signal SIGSEGV, Segmentation fault.
0x00007ffff7a5dc97 in __dynamic_cast () from /lib/x86_64-linux-gnu/libstdc++.so.6

Pulled humble (latest commit b3474ca) and built with optimize enabled on Intel i7-1165G7 (NUC):

  • Launched on the Intel NUC -> segfault:
[INFO] [1695828497.678056294] [loki_sim.controller_server]: Created controller : FollowPathMPPI of type nav2_mppi_controller::MPPIController
[INFO] [1695828497.680619236] [loki_sim.controller_server]: Controller period is equal to model dt. Control sequence shifting is ON
[INFO] [1695828497.693196659] [loki_sim.controller_server]: ConstraintCritic instantiated with 1 power and 4.000000 weight.
[INFO] [1695828497.693249468] [loki_sim.controller_server]: Critic loaded : mppi::critics::ConstraintCritic
[INFO] [1695828497.693813437] [loki_sim.controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.400000 angular threshold.
[INFO] [1695828497.693837594] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[INFO] [1695828497.694371774] [loki_sim.controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1695828497.694392927] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalCritic
[INFO] [1695828497.695323145] [loki_sim.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[INFO] [1695828497.695347562] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[INFO] [1695828497.696395866] [loki_sim.controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[INFO] [1695828497.696421021] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[INFO] [1695828497.697094395] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathFollowCritic

Thread 1 "controller_serv" received signal SIGSEGV, Segmentation fault.
0x00007ffff7a59c97 in __dynamic_cast () from /lib/x86_64-linux-gnu/libstdc++.so.6
  • Launched on the AMD PC -> illegal instruction now right in constructor:
[WARN] [1695828118.200269456] [loki_sim.controller_server]: Disabling reversing. Both use_rotate_to_heading and allow_reversing parameter cannot be set to true. By default setting use_rotate_to_heading true

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x00007fffb203a907 in class_loader::impl::MetaObject<nav2_mppi_controller::MPPIController, nav2_core::Controller>::create() const () from /opt/ros/humble/lib/libmppi_controller.so

calling vmovdqu64 %zmm0,0x58(%r12)

I also downloaded the release 1.1.9-1jammy.20230920.000044 (which I think should also have the fixes incorporated?):

  • Launched on the Intel NUC -> no change, all is still OK
  • Launched on the AMD PC -> this now works OK

So now the latest release libs of 1.1.9 work fine on both on my AMD and the Intel NUC. But any of the compiled libs are now throwing segfaults always at the same place. I'll try disabling the optimizer next and check if I haven't somehow messed up the build environment. But I think that's not likely since the exact same crash is happening from libs compiled on two different machines. Do you want me to send any of the new compiled libs again?

And after compiling without xtensor::optimize I still get the same segfaults.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 27, 2023

Wait, are your tests showing that compiling even on the same CPU makes it crash now? Make sure I didn't mess up part of my update... Edit: I just tested building humble source and it works fine for me on the same CPU... that was never an issue ?

0x00007ffff7a59c97 in __dynamic_cast () from /lib/x86_64-linux-gnu/libstdc++.so.6

That error is notably different from the others! Can you check where that happens. The 2 dynamic casts I see are

# in obstacle 
auto inflation_layer = std::dynamic_pointer_cast<nav2_costmap_2d::InflationLayer>(*layer);

# in constraint crtiic
auto acker = dynamic_cast<AckermannMotionModel *>(data.motion_model.get());

If you disinclude the obstacle critic, are there any other traceback points that show up now? This seems like progress. If we update the layer cast to

auto inflation_layer = std::dynamic_cast<nav2_costmap_2d::InflationLayer *>(*layer); // may require some fanagling with the iterator type

Does that work if that's the last issue? Or I'm wondering if either of those are even where that's being called - those are both line that haven't changed.

Also, I'm curious if you tried 1.1.8 across CPUs, the last where things worked? If you do that, do you not run into any issues either way? Just to close the loop that this test actually shows the issue we think it does (or if there's actually something else going on here, like the CPU that the build farm was using changed). If may be updated CPUs or changes that caused this to some degree - but it would be good to validate if our issues are fully internal or also external.

So now the latest release libs of 1.1.9 work fine on both on my AMD and the Intel NU

Huh? 1.1.9 is what broke all the things we're discussing, no?

It looks like a new build of it come out though 1.1.9-1jammy.20230920.000044 over 1.1.9-1jammy.20230822.201753 https://build.ros2.org/job/Hbin_uJ64__nav2_mppi_controller__ubuntu_jammy_amd64__binary/ That contains no changes, its the exact same code rebuilt. Are you saying that this fixed the issue?

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 27, 2023

... I just did some testing and it appears that 1.1.9-1jammy.20230920.000044 does work whereas prior did not, with no changes to the code. I'm reaching out to Intrinsic for some help on the build farm. It appears that Iron is still broken then, so it was a fluke job that worked. Also 1.1.10 in shadow-fixed is broken again, so once that release comes out, we're screwed again, that one binary release is a fluke. I asked if anything changed between the date ranges of "broke" and "not broke" and received the reply from @nuclearsandwich

The build farm uses a mix of AMD and Intel machines based on what is available and affordable within AWS. AMD machines were in use by default for a long time because they were generally cheaper. claraberendsen has done a lot of work to make the build agent pools more robust which could have brought Intel CPUs back.

Ahhhhhh. This may be it. I don't know the path forward

On dates, the only job to make "working" binaries recently has been 1.1.9-1jammy.20230920.000044 whereas 1.1.9-1jammy.20230822.201753 and 1.1.9-1jammy.20230807.181110 were noted broken. I can't point you to what specific build of 1.1.8 was working (all?), but I didn't start receiving issue tickets from users until 1.1.9 was released, so I think its safe to put the range of dates at

  • June 23 working for exactly 2 releases, one of which rebuild several times
  • August 7th or 22nd (perhaps 22nd since that's when the ticket was filed?) noted brokenness in all binaries in all distributions, but to be fair, we backport heavily, so it could be code as much as build farm for all I knew at that point.
  • Sept 19th's re build of the same version from August 7, 22 build randomly works and is what users can actually see via apt (until the next humble / iron sync). This points more towards build farmy items potentially
  • Iron continues to be broken
  • So really, we just have this one fluke job 1.1.9-1jammy.20230920.000044 that worked

The 1.1.10 since are also broken (but I wouldn't totally rule out I broke them, but I need to go in with gdb and see if the locations look similar). The nice thing about that data is that those are all the same software versions, so there's definitely something build-farm-y that changed between those jobs or the agents building them. It is unfortunate I had to run a release before this was resolved to add changes (e.g. 1.1.10) into the mix for what experiments re run now

@SteveMacenski
Copy link
Member

Note we found the difference. AMD processors in the build farm build good binaries. Intel do not. I'll need to work with Intrinsic & Co about what to do next about this particular weirdo package

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 27, 2023

The attached debian is what I was able to test working with the great work of @claraberendsen. @Imaniac230 please verify this works for you across the board. I promise, its just an output from the build farm. I left my viruses with my sick-self at home 😉

ros-humble-nav2-mppi-controller_1.1.10-1jammy.20230927.201948_amd64.zip

Here's my speculation after seeing the CPUs being used: https://gist.github.com/claraberendsen/998caedc6175a3b3b6da94c2d3453979

Intel has some instructions for optimizing that AMD doesn't support, which would explain the AMD failures but not the Intel failures (which I see). I think Intel CPUs that are being used in the cloud are more advanced with newer versions of instructions that older machines don't. The tests have been run on 7th, 8th, and 11th gen core i-series processors.

The instructions that were illegal were usually from avx512 . While I'm no expert, Googlefo tells me that this is only available on Xeon and Skylake-X CPUs (https://en.wikipedia.org/wiki/AVX-512) which are not the standard robotics CPUs but the build farm uses Xeon. We want avx enabled for SIMD optimizations, and both "normal" intel + AMD machines provide that. We can disable that under the hood the use of specifically avx512 and use what is portable back to the more broad set of computers (https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2). But avx2 is available on Xeons, so if we could tell it to use that, then I think we're potentially solved.

The "why now" is half to do with the deployment of Intel machines in the build farm's pool and a bit of russian roulette if you got one assigned to you that would generate the AVX512 instructions per-package build. That's why that random one works, but the predecessors did not

https://stackoverflow.com/questions/60815316/disabling-all-avx512-extensions

#3843

Going to run a new release with this and see how it goes!

@Imaniac230
Copy link

Imaniac230 commented Sep 28, 2023

I can't point you to what specific build of 1.1.8 was working (all?), but I didn't start receiving issue tickets from users until 1.1.9 was released, so I think its safe to put the range of dates at

The only 1.1.8 version that I have is 1.1.8-2jammy.20230624.062741 and this specific version works across all machines that I have now tested (AMD Ryzen 7 3700X, Intel i7-1165G7, Intel i5-7600K, and also AMD Ryzen 5 PRO 4650U). I do not have any prior or newer releases of 1.1.8, so I can only confirm this one.

Here's my speculation after seeing the CPUs being used: https://gist.github.com/claraberendsen/998caedc6175a3b3b6da94c2d3453979

I think it shows support for the theory nicely. If I search for avx512 in the link you mention (https://gist.github.com/claraberendsen/998caedc6175a3b3b6da94c2d3453979) it does show up only for the Intel and not in the AMD info output. So if the Intel is building with optimizations it would probably use the instructions.

I've also tried porting all of the lib versions tested so far to the older Intel i5-7600K

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 158
model name	: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
stepping	: 9
microcode	: 0xf4
cpu MHz		: 4782.804
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

and an AMD Ryzen 5 PRO 4650U

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 96
model name	: AMD Ryzen 5 PRO 4650U with Radeon Graphics
stepping	: 1
microcode	: 0x8600106
cpu MHz		: 1393.328
cache size	: 512 KB
physical id	: 0
siblings	: 12
core id		: 0
cpu cores	: 6
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca

What is notable here is that neither of them show support for avx512 if you look at the flags sections in the outputs from cat /proc/cpuinfo. And when I try out the different lib versions, they do in fact both show the same exact failing behavior as I had on the AMD Ryzen 7 3700X:

  • The two broken versions 1.1.9-1jammy.20230822.201753 and 1.1.9-1jammy.20230807.181110 fail with illegal avx512 instructions
  • The re-built 1.1.9-1jammy.20230920.000044 works
  • The tag 1.1.9 compiled from source locally on AMD Ryzen 7 3700X with or witout xtensor::optimize or march-native all works
  • The tag 1.1.9 compiled from source locally on Intel i7-1165G7 with xtensor::optimize (or march=native) enabled fails with the same illegal avx512 instructions
  • The tag 1.1.9 compiled from source locally on Intel i7-1165G7 without xtensor::optimize works

The 11th gen Intel i7-1165G7 that we have on the NUCs does show avx512 support, so that would be why it always worked on the NUCs:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 140
model name	: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
stepping	: 1
microcode	: 0xac
cpu MHz		: 2800.000
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 27
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple pml ept_mode_based_exec tsc_scaling

I left my viruses with my sick-self at home 😉

You trusted me with my mail attachments, so I think I can trust you as well 😀. I've tried your attached 1.1.10 deb on all of the devices: AMD Ryzen 7 3700X, Intel i7-1165G7, Intel i5-7600K, and also AMD Ryzen 5 PRO 4650U and they all crash for me with the exact same segfault as in #3767 (comment).

All of the new segfaults that I'm getting from local builds of humble at b3474ca or the attached .deb are happening in the same place, here's the backtrace I got:

#0  0x00007ffff7a5fc97 in __dynamic_cast () at /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007fffb1e52bd8 in mppi::critics::ObstaclesCritic::findCircumscribedCost(std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) () at /opt/ros/humble/lib/libmppi_critics.so
#2  0x00007fffb1e570e4 in mppi::critics::ObstaclesCritic::initialize() () at /opt/ros/humble/lib/libmppi_critics.so
#3  0x00007fffb2089419 in mppi::CriticManager::loadCritics() () at /opt/ros/humble/lib/libmppi_controller.so
#4  0x00007fffb2087476 in mppi::CriticManager::on_configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>, mppi::ParametersHandler*) ()
    at /opt/ros/humble/lib/libmppi_controller.so
#5  0x00007fffb204d2a9 in mppi::Optimizer::initialize(std::weak_ptr<rclcpp_lifecycle::LifecycleNode>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>, mppi::ParametersHandler*) () at /opt/ros/humble/lib/libmppi_controller.so
#6  0x00007fffb201f0bc in nav2_mppi_controller::MPPIController::configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tf2_ros::Buffer>, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) ()
    at /opt/ros/humble/lib/libmppi_controller.so
#7  0x00007ffff7c8c3cc in nav2_controller::ControllerServer::on_configure(rclcpp_lifecycle::State const&) () at /opt/ros/humble/lib/libcontroller_server_core.so
#8  0x00007ffff7d66b8d in  () at /opt/ros/humble/lib/librclcpp_lifecycle.so
#9  0x00007ffff7d71a8a in  () at /opt/ros/humble/lib/librclcpp_lifecycle.so
#10 0x00007ffff7d5edc0 in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::on_change_state(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >) ()
    at /opt/ros/humble/lib/librclcpp_lifecycle.so
#11 0x00007ffff7d60986 in std::_Function_handler<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >), std::_Bind<void (rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::*(rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >&&) ()
    at /opt/ros/humble/lib/librclcpp_lifecycle.so
#12 0x00007ffff7d6f736 in  () at /opt/ros/humble/lib/librclcpp_lifecycle.so
#13 0x00007ffff7e6d296 in  () at /opt/ros/humble/lib/librclcpp.so
#14 0x00007ffff7e6ac7a in rclcpp::Executor::execute_service(std::shared_ptr<rclcpp::ServiceBase>) () at /opt/ros/humble/lib/librclcpp.so
#15 0x00007ffff7e6afe6 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) () at /opt/ros/humble/lib/librclcpp.so
#16 0x00007ffff7e728a0 in rclcpp::executors::SingleThreadedExecutor::spin() () at /opt/ros/humble/lib/librclcpp.so
#17 0x00007ffff7e72ab5 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) () at /opt/ros/humble/lib/librclcpp.so
#18 0x00005555555563ff in  ()
#19 0x00007ffff7794d90 in __libc_start_call_main (main=main@entry=0x5555555562d0, argc=argc@entry=56, argv=argv@entry=0x7fffffffcec8) at ../sysdeps/nptl/libc_start_call_main.h:58
#20 0x00007ffff7794e40 in __libc_start_main_impl (main=0x5555555562d0, argc=56, argv=0x7fffffffcec8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffceb8) at ../csu/libc-start.c:392
#21 0x0000555555556625 in  ()

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 28, 2023

Huh, what about the binaries in http://repo.ros2.org/ubuntu/building/pool/main/r/ros-humble-nav2-mppi-controller/ros-humble-nav2-mppi-controller_1.1.11-1jammy.20230928.130430_amd64.deb?

That's the same, but build on Intel explicitly with avx512 off. The previous binary was AMD where that isn't even an option. This also works for me.

Can you be specific when you say that

All of the new segfaults that I'm getting from local builds

You should never be crashing from source builds on the same CPU as it was compiled on. If that's what's happening, then that's totally unrelated to any of this and lets discuss that separately. If its from moving libs around, that dynamic cast error is new from the errors we had before and oddly its called before the previous crash location so we should have seen it before?

Can you tell what instruction that is - also a AVX thing? If you change that to a dynamic_cast and not a pointer cast like I suggested, does that change anything? Also perhaps removing the mtune flag I set, though I think that should increase portability in 1.1.10. I note that both AMD and Intel without avx512 in the build farm both work for me, so I think portability is largely resolved there - but always worth a try since I did add that in 1.1.10 and isn't critical if we can remove it. I added that for added portability with march (or so I think?).

PS: if you turn off the obstacle critic, does the rest work?

The binaries that you mention doesn't work on anything is the only binary that works for me on my Intel i7 (minus that other random build that works for us all):

The re-built 1.1.9-1jammy.20230920.000044 works

The only variable that I've been made aware of in the build farm was the Intel vs AMD pools, so I tried both; AMD worked Intel did not due to the AVX512 issue. If you're saying that that binary worked for you as I, then I think we've shown that must have been an AMD machine. The first binary I linked above in the zip was on AMD. I'm shocked that it doesn't work for you -- are you sure you installed it and verified that was the version in use with apt show? Are you sure you don't have a dirty state where you have sourced workspaces with lib/*.sos that are built on other machines taking priority over the binaries? Make sure to test anything w.r.t. binaries in clean terminals that have never touched a workspace & doesn't source a workspace in bash.rc file.

I'm hoping that's your issue... Otherwise that's just issue one of potentially others. May be worth at that time on your local builds using the no-avx512 flag so that we have that part of the problem squared away and start looking at this last remaining one. I can say though that I'm not sure I've ever seen that error you mention

@Imaniac230
Copy link

Regarding the deb install, it should be installed correctly. I've just ran sudo apt install ./ros-humble-nav2-mppi-controller_1.1.10-1jammy.20230927.201948_amd64.deb --reinstall without any other changes. And it does show installed for the correct version. With apt list --installed ros-humble-nav2-mppi-controller -a:

ros-humble-nav2-mppi-controller/now 1.1.10-1jammy.20230927.201948 amd64 [installed,local]
ros-humble-nav2-mppi-controller/jammy 1.1.9-1jammy.20230920.000044 amd64

And with apt show ros-humble-nav2-mppi-controller -a:

Package: ros-humble-nav2-mppi-controller
Version: 1.1.10-1jammy.20230927.201948
Status: install ok installed
Priority: optional
Section: misc
Maintainer: Aleksei Budyakov <budyakov.aleksei@gmail.com>
Installed-Size: 1238 kB
Depends: libc6 (>= 2.35), libconsole-bridge1.0 (>= 1.0.1+dfsg2), libgcc-s1 (>= 3.3.1), liborocos-kdl1.5 (>= 1.5.1), libstdc++6 (>= 12), libtinyxml2-9 (>= 8.0.0), libbenchmark-dev, libomp-dev, libxsimd-dev, libxtensor-dev, ros-humble-geometry-msgs, ros-humble-nav2-common, ros-humble-nav2-core, ros-humble-nav2-costmap-2d, ros-humble-nav2-msgs, ros-humble-nav2-util, ros-humble-pluginlib, ros-humble-rclcpp, ros-humble-std-msgs, ros-humble-tf2, ros-humble-tf2-eigen, ros-humble-tf2-geometry-msgs, ros-humble-tf2-ros, ros-humble-visualization-msgs, ros-humble-ros-workspace
Download-Size: unknown
APT-Manual-Installed: yes
APT-Sources: /var/lib/dpkg/status
Description: nav2_mppi_controller

Package: ros-humble-nav2-mppi-controller
Version: 1.1.9-1jammy.20230920.000044
Priority: optional
Section: misc
Maintainer: Aleksei Budyakov <budyakov.aleksei@gmail.com>
Installed-Size: 1245 kB
Depends: libc6 (>= 2.35), libconsole-bridge1.0 (>= 1.0.1+dfsg2), libgcc-s1 (>= 3.3.1), liborocos-kdl1.5 (>= 1.5.1), libstdc++6 (>= 12), libtinyxml2-9 (>= 8.0.0), libbenchmark-dev, libomp-dev, libxsimd-dev, libxtensor-dev, ros-humble-geometry-msgs, ros-humble-nav2-common, ros-humble-nav2-core, ros-humble-nav2-costmap-2d, ros-humble-nav2-msgs, ros-humble-nav2-util, ros-humble-pluginlib, ros-humble-rclcpp, ros-humble-std-msgs, ros-humble-tf2, ros-humble-tf2-eigen, ros-humble-tf2-geometry-msgs, ros-humble-tf2-ros, ros-humble-visualization-msgs, ros-humble-ros-workspace
Download-Size: 387 kB
APT-Sources: http://packages.ros.org/ros2/ubuntu jammy/main amd64 Packages
Description: nav2_mppi_controller

The dates on the raw binaries also match -> ll -h /opt/ros/humble/lib/libmppi_c*

-rw-r--r-- 1 root root 780K Sep 27 20:19 /opt/ros/humble/lib/libmppi_controller.so
-rw-r--r-- 1 root root 260K Sep 27 20:19 /opt/ros/humble/lib/libmppi_critics.so

It may certainly be that I have messed up something on my side (and managed to do so on four different machines). But, I'm following the exact same testing steps as I have been from the start, using the same environments. ros2 pkg prefix nav2_mppi_controller does point only to /opt/ros/humble. And I'm not getting any crashes when I re-install the 1.1.9-1jammy.20230920.000044 and run that using the same environment.

Can you tell what instruction that is - also a AVX thing?

The crashing instruction is just a normal memory manipulation mov (which makes sense, since the crash is related to bad memory manipulation), nothing related to avx.

PS: if you turn off the obstacle critic, does the rest work?

If I remove the Obstacle critic, I don't the segfault, but I do get a different crash:

[INFO] [1696005038.897474880] [loki_sim.controller_server]: Created controller :                                                                                                                                                                                                                                         FollowPathMPPI of type nav2_mppi_controller::MPPIController                                                                                                                                                                                                                                                            
[INFO] [1696005038.902849830] [loki_sim.controller_server]: Controller period is                                                                                                                                                                                                                                         equal to model dt. Control sequence shifting is ON                                                                                                                                                                                                                                                                     
[INFO] [1696005038.917012197] [loki_sim.controller_server]: ConstraintCritic ins                                                                                                                                                                                                                                        tantiated with 1 power and 4.000000 weight.                                                                                                                                                                                                                                                                             
[INFO] [1696005038.917118217] [loki_sim.controller_server]: Critic loaded : mppi                                                                                                                                                                                                                                        ::critics::ConstraintCritic                                                                                                                                                                                                                                             
[INFO] [1696005038.917946190] [loki_sim.controller_server]: GoalAngleCritic instantiated with 1 power, 3.000000 weight, and 0.400000 angular threshold.
[INFO] [1696005038.918051651] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalAngleCritic
[INFO] [1696005038.918755726] [loki_sim.controller_server]: GoalCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1696005038.918846660] [loki_sim.controller_server]: Critic loaded : mppi::critics::GoalCritic
[INFO] [1696005038.920787492] [loki_sim.controller_server]: PathAngleCritic instantiated with 1 power and 2.000000 weight. Reversing allowed.
[INFO] [1696005038.920893581] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAngleCritic
[INFO] [1696005038.922722597] [loki_sim.controller_server]: ReferenceTrajectoryCritic instantiated with 1 power and 14.000000 weight
[INFO] [1696005038.922927233] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathAlignCritic
[INFO] [1696005038.923987011] [loki_sim.controller_server]: Critic loaded : mppi::critics::PathFollowCritic
[INFO] [1696005038.924777270] [loki_sim.controller_server]: PreferForwardCritic instantiated with 1 power and 5.000000 weight.
[INFO] [1696005038.924867296] [loki_sim.controller_server]: Critic loaded : mppi::critics::PreferForwardCritic
[INFO] [1696005038.925592463] [loki_sim.controller_server]: TwirlingCritic instantiated with 1 power and 10.000000 weight.
[INFO] [1696005038.925682629] [loki_sim.controller_server]: Critic loaded : mppi::critics::TwirlingCritic
[New Thread 0x7fffb1eab640 (LWP 43249)]
[INFO] [1696005038.930997934] [loki_sim.controller_server]: Optimizer reset
[Thread 0x7fffb1eab640 (LWP 43249) exited]
[ERROR] [1696005038.942448416] [loki_sim.controller_server]: Caught exception in callback for transition 10
[ERROR] [1696005038.942551782] [loki_sim.controller_server]: Original error: basic_string::_M_create
[WARN] [1696005038.942729878] [loki_sim.controller_server]: Error occurred while doing error handling.
[FATAL] [1696005038.942828285] [loki_sim.controller_server]: Lifecycle node controller_server does not have error state implemented

And it's the same thing when I install form the 1.1.11 deb file. Are the deb packages iron or humble?

Regarding the local humble builds, I am transferring the locally compiled lib files to /op/ros/humble/lib/ and running with /opt/ros/humble as the source path. And It's the same behavior as with the deb installs.

However, if I run with the local build workspace being sourced instead and with the obstacles critic enabled, I do not get the segfault crashes, but I do get the same error as above.

I'll try building again from the 1.1.9 tag prior to the fixes, where I didn't have these new issues (omitting the obvious avx512 thing that started this whole conversation) and see if there's anything different on my side there. I'll also try messing with the mtune flag.

May be worth at that time on your local builds using the no-avx512 flag so that we have that part of the problem squared away and start looking at this last remaining one.

I'll do that when I get to the NUC.

I can't currently explain why these new errors started happening now, they weren't a problem previously. But a small pattern I see is that they started only happening from versions related to 1.1.10 and 1.1.11.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 29, 2023

sudo apt install ./ros-humble-nav2-mppi-controller_1.1.10-1jammy.20230927.201948_amd64.deb --reinstall

Standard is more dpkg -i for local binaries but my guess is that's what this is doing anyway. But in the off case its different, worth a try. I don't know if apt install --reinstall tries to do something fancy. I doubt it, but good to point out just in case.

And it's the same thing when I install form the 1.1.11 deb file. Are the deb packages iron or humble?

Humble

However, if I run with the local build workspace being sourced instead and with the obstacles critic enabled, I do not get the segfault crashes, but I do get the same error as above.

I think that points to a problem in your workflow if the built workspace transferring libs and taking debians of the same thing don't result in the same outcomes.

I'll try building again from the 1.1.9 tag prior to the fixes, where I didn't have these new issues (omitting the obvious avx512 thing that started this whole conversation) and see if there's anything different on my side there. I'll also try messing with the mtune flag.

OK! The different from .9 to .10 are these three commits b0a68bb 91b688d 1b13476

.10 to .11 adds 2d6e9a9 only

Hopefully those should be easy to test anything meaningful that's changed. You may also benefit from enacting a clean restart of your previous changes to make sure you haven't landed up in an odd state -- unless it is really just -mtune=generic. But I'm surprised it works fine for me if it doesn't work on your Intel machines given that mine is substantially older.

Edit: I did some research on mtune and generic is what is set when march=native. As long as march isn't set to an exact processor name, mtune=generic shouldn't actually change anything. Still worth a shot, but I'm thinking increasingly it might be stale / odd state on your end. Can someone else (@MichalMalachowski @AmmarAlbakri ) also test this on their AMD or newer Intel computers to get a 3rd opinion? I don't want to poo-poo you if this is actually just a "my computer is the odd man out" situation and everyone else is still having issues.

@Imaniac230
Copy link

Imaniac230 commented Oct 4, 2023

Standard is more dpkg -i for local binaries but my guess is that's what this is doing anyway.

Yes, at least from my understanding, the apt install should just invoke dpkg -i behind the scenes either way (if you give it a local path to deb it will use that instead of searching on remote) and the --reinstall option is there to make sure it didn't skip the installation if it found that the package is already installed. I tried with dpkg -i as well and there's no difference.

I think that points to a problem in your workflow if the built workspace transferring libs and taking debians of the same thing don't result in the same outcomes.

If I'm manually copying the locally compiled libs to /opt/ros/humble/lib/ it is producing exactly the same behavior as installing from the debians. The different behavior is only when I source the nav2 workspace instead of having /op/ros/humble/ sourced.

I checked-out at the 1.1.9 tag and first tried re-building only the nav2_mppi_controller package (using --packages-select) with all else still left at b3474ca (this means it should use the already compiled dependencies during the compilation of MPPI). That did not make a difference - I was still getting the same errors.

Then I completely cleaned the build and install space and re-built the whole nav2 from 1.1.9. And here it did work without any problems. Either copying libs, or sourcing the nav2 workspace, all of it worked without errors. At this point I tried setting and unsetting the mtune=generic flag and that does not make a difference, so that is not a problem.

While at this working state, I checked the mno-avx512f flag by setting it together with march=native explicitly, and building on the Intel i7-1165G7 CPU:

add_compile_options(-O3 -finline-limit=10000000 -ffp-contract=fast -ffast-math -march=native -mno-avx512f)
target_link_libraries(${lib} xtensor xtensor::optimize xtensor::use_xsimd)

When ported to an AMD PC, the libs worked without problems. If the mno-avx512f flag was removed it crashed at the illegal instruction. If it was added, it worked OK. So, from my end, I can also confirm that this problem should be resolved.

Regarding the new errors I've been getting, it seems they don't relate to changes in nav2_mppi_controller but changes in the other dependency packages instead, when the libs are moved around individually. When I re-build only nav2_mppi_controller from the newest 6e35cc7 commit (--packages-select) and left all else still at 1.1.9, it did not break anything and all still worked fine. So rebuilding the MPPI package individually either way does not produce any changes unless it is rebuilt together with the other dependencies.

Then, I went commit by commit. When all is built from 0cf0462 (1.1.9), there are no problems when copying individual libs. When all is re-built from commit 0ca14fe, and then I copy just the MPPI libs to /opt/ros/humble/lib/ and run, I get the dynamic_cast segfault with the MPPI obstacle critic, and the basic_string::_M_create error without the obstacle critic. So at this point, I think the MPPI libs are probably expecting to use something that was added in 0ca14fe to a dependency package, which is not in my install since I copied only the MPPI files and nothing else.

Going from the MPPI dependencies listed in CMakeLists.txt, when I start porting those libs one by one (all compiled from 0ca14fe) some of them now expect to find the added pre-shutdown callback methods.

  • When nav2_planner or nav2_lifecycle_manager are ported to /op/ros/humble/, it doesn't make any difference.
  • When nav2_costmap_2d lib is ported to /opt/ros/humble/:
/opt/ros/humble/lib/nav2_controller/controller_server: symbol lookup error: /opt/ros/humble/lib/libnav2_costmap_2d_core.so: undefined symbol: _ZN9nav2_util13LifecycleNode18on_rcl_preshutdownEv
  • When nav2_controller lib is ported to /op/ros/humble/:
/opt/ros/humble/lib/nav2_controller/controller_server: symbol lookup error: /opt/ros/humble/lib/libcontroller_server_core.so: undefined symbol: _ZN9nav2_util13LifecycleNode18on_rcl_preshutdownEv
  • When I then also port nav2_util, which defines the added pre-shutdown methods errors start popping up from planners:
[INFO] [1696413455.157573423] [loki_sim.planner_server]: Created global planner plugin GridBasedNavFn of type nav2_navfn_planner/NavfnPlanner
[INFO] [1696413455.157816263] [NavfnPlanner]: Destroying plugin GridBasedNavFn of type NavfnPlanner
[ERROR] [1696413455.157859635] [loki_sim.planner_server]: Caught exception in callback for transition 10
[ERROR] [1696413455.157900562] [loki_sim.planner_server]: Original error: std::bad_alloc
[WARN] [1696413455.157958950] [loki_sim.planner_server]: Error occurred while doing error handling.
[FATAL] [1696413455.157998690] [loki_sim.planner_server]: Lifecycle node planner_server does not have error state implemented

All of this is resulting from just copying individual lib files around. The error I was getting when running from an environment that has the nav2 workspace sourced were in fact because I have the config set so that all of the available controller plugins get loaded at startup (controller_plugins: ["FollowPathDWB", "FollowPathTEB", "FollowPathRPP", "FollowPathMPPI"]). And the error was coming from TEB not MPPI as I initially thought:

[INFO] [1696414607.072932515] [loki_sim.controller_server]: Created controller : FollowPathTEB of type teb_local_planner::TebLocalPlannerROS
[INFO] [1696414607.093805826] [TEBLocalPlanner]: Footprint model 'circular' (radius: 0.220000m) loaded for trajectory optimization.
[INFO] [1696414607.093885236] [loki_sim.controller_server]: Parallel planning in distinctive topologies enabled.
[ERROR] [1696414607.282713391] [loki_sim.controller_server]: Caught exception in callback for transition 10
[ERROR] [1696414607.282806769] [loki_sim.controller_server]: Original error: std::bad_alloc
[WARN] [1696414607.282867601] [loki_sim.controller_server]: Error occurred while doing error handling.
[FATAL] [1696414607.282911811] [loki_sim.controller_server]: Lifecycle node controller_server does not have error state implemented

When I remove TEB from the config and run from the environment that has the nav2 workspace sourced, I have no errors and all works. So here, all of the re-compiled packages from 0ca14fe are within the sourced environment path and it all works. If I make similar experiments with this environment - first building all of nav2 from 1.1.9 and then re-building only individual packages from 0ca14fe instead of rebuilding all (--packages-select nav2_mppi_controller nav2_util) - I get the same behavior. I also went through the individual changes, and the errors start once the rcl_preshutdown_cb_handle_ and the accompanying methods are added in nav2_util -> lifecycle_node.hpp and nav2_util -> lifecycle_node.cpp.

A pattern I would deduce from this would be that the added pre-shutdown in 0ca14fe made a lot of the older pre- 0ca14fe binaries incompatible with the new ones? So after this change, compatibility with TEB is also broken? That would explain why I started getting the errors when I install the newer MPPI libs from the debians or port the newer libs compiled locally. I'm not sure if this behavior is even an issue since it is expected that when a dependency to a package changes and the package is ported individually without it's changed dependency it is going to cause problems. However, shouldn't minor revisions still be cross-portable independently? So this would mean that if I update my nav2_mppi_controller from 1.1.9 to say 1.1.10, but leave all of the other packages still at 1.1.9 it would break (which is what happens when I do an install from the debians).

This was always my nav2 building environment:

ROS_VERSION=2
ROS_PYTHON_VERSION=3
AMENT_PREFIX_PATH=/opt/ros/humble
PYTHONPATH=/opt/ros/humble/lib/python3.10/site-packages:/opt/ros/humble/local/lib/python3.10/dist-packages
LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/gazebo-11/plugins:/opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/x86_64-linux-gnu:/opt/ros/humble/lib
ROS_LOCALHOST_ONLY=0
PATH=/home/ubuntu//.local/bin:/opt/ros/humble/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
ROS_DISTRO=humble

So it should be just the native ros install with no additional workspaces.

@SteveMacenski
Copy link
Member

SteveMacenski commented Oct 4, 2023

There's alot of good stuff here, so I want to parrot back to you the highlights to make sure we're on the same page

  • You completely clean rebuilt from 1.1.9 and things worked as expected -- so something in your workflow in moving the libs around it compiling was causing some of the issues
  • Adding the mno-avx512f at this point solved the problems with portability
  • The process of moving libs around with older stuff is causing the issue that you're currently facing. While I looked at 0ca14fe and see nothing obvious to why that could crash, that could be an ABI compability thing that us mere mortals looking at code cannot figure out without going much deeper into analysis. But I don't think that's called for in this case since its a test workflow issue, not an issue intrinsic to the problem we're solving
  • And the error was coming from TEB LOL, classic TEB

I'm not sure if this behavior is even an issue since it is expected that when a dependency to a package changes and the package is ported individually without it's changed dependency it is going to cause problem

Agreed.

However, shouldn't minor revisions still be cross-portable independently?

No, only within a distribution compatibility is assured. Rolling / main is bleeding edge so anything can change anytime. You shouldn't generally expect to be able to take a package from version X and apply it to the stack at version Y off mainline. In a released distro (e.g. humble, iron, etc) you should be able to do that.


So in summary here: it sounds like its all fixed and if I bloom up a new release, we should be good to go on this issue in entirety. I think you mention some other TEB problem, but its unclear to me if that's an issue from the version X package and version Y framework or something else that should be addressed

@SteveMacenski
Copy link
Member

Releases cut! The issue should be resolved next humble / iron sync!

still follow up on TEB if there’s anything actionable!

@Imaniac230
Copy link

You completely clean rebuilt from 1.1.9 and things worked as expected

Yes.

Adding the mno-avx512f at this point solved the problems with portability

Yes.

The process of moving libs around with older stuff is causing the issue that you're currently facing.

Yes, If I start mixing old 1.1.9 packages with the newer versions then there are issues. And when I completely re-build 1.1.11 and don't copy any individual lib files but instead source the whole nav2 workspace and use that, it also worked. But only if TEB was excluded at that point.

that could be an ABI compability thing

Yes, it does seem like something of that nature to me. It might be due to the way some of the libraries are linked to each other during compilation and a change in one core part manifests across multiple other ones. But I'm just speculating here. It basically required most of the nav2 packages to be on the same updated versions and not just some to make it work.

So in summary here: it sounds like its all fixed and if I bloom up a new release, we should be good to go on this issue in entirety.

Yes, I believe so. When all (or almost all) packages are at the same release version, and there are no mixtures of old 1.1.9 and newer like 1.1.11, I wouldn't (hopefully) expect any issues.

still follow up on TEB if there’s anything actionable!

The error from TEB looks to be of the same nature as mixing of old and new lib versions. I re-compiled the whole nav2 workspace from 1.1.11. And I quickly tried re-compiling all of TEB with the 1.1.11 nav2 workspace being sourced (so it would use the new libraries during compilation and not the old ones) and it seems to have solved the problem. So from what I see, it might be necessary for people to re-compile TEB after the newer nav2 release is downloaded and that should solve the issues if they appear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants