Fix exclusive hardware control mode switching on controller failed activation #1522

saikishor · 2024-04-29T22:12:51Z

If the hardware supports only one active control mode at a time and when there is a failing controller activation, ros2_control is not propagating this information to the hardware, and the hardware might have the resource locked internally. This PR allows to solve such situations as explained in #1487 and #1486.

Fixes #1487
Closes #1492

…faces

saikishor · 2024-04-29T22:15:05Z

@firesurfer This PR has the fix that should solve your issue. If you can test it on your hardware and let us know, it would be great.
I could reproduce the same issue with a test and I added a fix, so that the test passes, so I'm confident that this might work for you.

codecov · 2024-04-29T22:25:51Z

Codecov Report

Attention: Patch coverage is 84.92063% with 19 lines in your changes are missing coverage. Please review.

Project coverage is 87.65%. Comparing base (e260049) to head (7755df0).

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #1522    +/-   ##
========================================
  Coverage   87.65%   87.65%            
========================================
  Files         102      105     +3     
  Lines        8619     8741   +122     
  Branches      774      782     +8     
========================================
+ Hits         7555     7662   +107     
- Misses        788      802    +14     
- Partials      276      277     +1

Flag	Coverage Δ
unittests	`87.65% <84.92%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
...ailed_activate/test_controller_failed_activate.hpp	`100.00% <100.00%> (ø)`
controller_manager/src/controller_manager.cpp	`74.92% <81.81%> (+0.36%)`	⬆️
...ontroller_manager/test/test_release_interfaces.cpp	`89.28% <92.85%> (+1.78%)`	⬆️
...ailed_activate/test_controller_failed_activate.cpp	`66.66% <66.66%> (ø)`
..._components/test_actuator_exclusive_interfaces.cpp	`83.82% <83.82%> (ø)`

firesurfer · 2024-04-30T06:22:51Z

@saikishor Thanks a lot. Just for my understanding: If a controller fails activation now, a command mode switch is issued so that the hardware interface can release internally locked hardware?

I try to test this patch with an iron based setup.

firesurfer · 2024-04-30T06:41:39Z

@saikishor I just tested your PR and for me it didn't work so far.

It could be that I did something wrong. What I did was:

Clone your repository into by workspace
Rebase your branch onto iron with: git rebase fix/exclusive_hw_interface_switching --onto=iron
Build it with: --allow-overriding controller_interface controller_manager hardware_interface ros2_control_test_assets hardware_interface_testing
Change my controller to fail activation

  on_activate()....{
      release_interfaces();
      return controller_interface::CallbackReturn::FAILURE;

Source it and run it

When I activate the controller it fails as it should. But when I then try to load another controller I run into the same issue as before - the interfaces haven't been stopped.

saikishor · 2024-04-30T06:43:20Z

@saikishor Thanks a lot. Just for my understanding: If a controller fails activation now, a command mode switch is issued so that the hardware interface can release internally locked hardware?

I try to test this patch with an iron based setup.

@firesurfer yes you are right! The hardware will be told to stop the interfaces that are started for the hardware that the controller needed. This should be good for the hardware component to internally release the locked control modes

saikishor · 2024-04-30T06:45:58Z

@firesurfer can you share the logs here?. Can you print something in your hardware interfaces to see if you receive the stop interfaces?. Basically, if your controller fails activation, it should immediately receive to release interfaces even before activating a new one. If you go over tests, I'm testing exactly the same

saikishor · 2024-04-30T06:58:19Z

@firesurfer for initial testing, maybe you can remove the installed dependencies and this might help. For me, after removing the installed dependencies, it worked

firesurfer · 2024-04-30T07:33:30Z

So I removed the installed packages:
sudo dpkg --force-all --remove ros-iron-controller-interface ros-iron-controller-manager ros-iron-hardware-interface just to make sure the ones from the workspace are used.

Log while failing activation:

[ros2_control_node-14] [ERROR] [1714461945.456711669] [controller_manager]: After activation, controller 'my_controller' is in state 'inactive' (2), expected 'active' (3).

Log during activation of a ForwardController for the same joints after the other one has failed:

[ros2_control_node-14] [ERROR] [1714461952.579709743] [my_hardware]: Joint already in used: my_joint [ros2_control_node-14] [ERROR] [1714461952.579871675] [resource_manager]: Component 'my_hardware' did not accept command interfaces combination:

I added a log message in the perform_command_mode_switch and did not get any log message.

I am not sure if perhaps rebasing on Iron broke something.

saikishor · 2024-04-30T07:37:37Z

So I removed the installed packages: sudo dpkg --force-all --remove ros-iron-controller-interface ros-iron-controller-manager ros-iron-hardware-interface just to make sure the ones from the workspace are used.

Log while failing activation:

[ros2_control_node-14] [ERROR] [1714461945.456711669] [controller_manager]: After activation, controller 'my_controller' is in state 'inactive' (2), expected 'active' (3).

Log during activation of a ForwardController for the same joints after the other one has failed:

[ros2_control_node-14] [ERROR] [1714461952.579709743] [my_hardware]: Joint already in used: my_joint [ros2_control_node-14] [ERROR] [1714461952.579871675] [resource_manager]: Component 'my_hardware' did not accept command interfaces combination:

I added a log message in the perform_command_mode_switch and did not get any log message.

I am not sure if perhaps rebasing on Iron broke something.

Hello @firesurfer!

It seems like you are not using the changes, because If you use the changes of this branch, your error should be like the following

[ros2_control_node-14] [ERROR] [1714461945.456711669] [controller_manager]: After activation, controller 'my_controller' is in state 'inactive' (2), expected 'active' (3). Releasing Interfaces!

ros2_control/controller_manager/src/controller_manager.cpp

Lines 1509 to 1515 in 4a4d780

    
           RCLCPP_ERROR( 
        
             get_logger(), 
        
             "After activation, controller '%s' is in state '%s' (%d), expected '%s' (%d). Releasing " 
        
             "interfaces!", 
        
             controller->get_node()->get_name(), new_state.label().c_str(), new_state.id(), 
        
             hardware_interface::lifecycle_state_names::ACTIVE, 
        
             lifecycle_msgs::msg::State::PRIMARY_STATE_ACTIVE);

In your error log, it doesn't mention about the Releasing interfaces, so please crosscheck your setup.

Thank you!

firesurfer · 2024-04-30T08:04:46Z

@saikishor I made a mistake while rebasing onto iron. When I do it properly I run into a lots of conflicts. When I try to resolve them it doesn't compile anymore.

Do you have an recommendations how to easily test this with an iron setup ?

christophfroehlich · 2024-04-30T08:23:44Z

You should be able to compile the rolling stack (also this PR) on your iron distro directly. override should also work, no need to uninstall your binary install. (check with ros2 pkg xml ros2_control -t version)
edit: can be that you have to patch your hardware component to the new API

saikishor · 2024-04-30T08:27:25Z

You should be able to compile the rolling stack (also this PR) on your iron distro directly. override should also work, no need to uninstall your binary install. (check with ros2 pkg xml ros2_control -t version) edit: can be that you have to patch your hardware component to the new API

Thank you @christophfroehlich. Yes, you can do this. In case you continue to have issues, then try to just cherry-pick the last 2 commits of this PR (77e932e and 4a4d780) onto your iron branch

firesurfer · 2024-04-30T09:27:17Z

@saikishor Thanks. Cherry-picking worked :)

When I now have the controller failing I run into an endless loop of ros2control trying to activating the controller - failing - stopping the interfaces - starting them again - failing again...

[ros2_control_node-14] [ERROR] [1714469097.477339381] [controller_manager]: After activation, controller 'my_controller' is in state 'inactive' (2), expected 'active' (3). Releasing interfaces!

But I can confirm. The interfaces are stopped!.

saikishor · 2024-04-30T09:29:08Z

@saikishor Thanks. Cherry-picking worked :)

When I now have the controller failing I run into an endless loop of ros2control trying to activating the controller - failing - stopping the interfaces - starting them again - failing again...

[ros2_control_node-14] [ERROR] [1714469097.477339381] [controller_manager]: After activation, controller 'my_controller' is in state 'inactive' (2), expected 'active' (3). Releasing interfaces!

But I can confirm. The interfaces are stopped!.

@firesurfer I'm glad that the fix worked for you. If that's the case, do you mind reviewing this PR and approving it?

firesurfer · 2024-04-30T09:32:30Z

@saikishor shouldn't be the issue of running into an endless loop of activation -> failing... be addressed first?

saikishor · 2024-04-30T09:34:21Z

@saikishor shouldn't be the issue of running into an endless loop of activation -> failing... be addressed first?

@firesurfer I don't really see how there is an endless loop. CM never reactivates or tries if it is failing. Maybe the print you have added is printing many times, because you have this component per joint and it is trying to iterate over the HW components to do the prepare and perform switch. This is the behavior inside the Resource Manager and I'm not touching that part.

EDIT: How many times does it print the ....Releasing Interfaces! error?

saikishor · 2024-04-30T19:39:47Z

@firesurfer I can confirm after testing it again on our setup that the fix works and there is no looping activation from the CM side.

Thank you

firesurfer · 2024-05-02T07:28:16Z

@saikishor

EDIT: How many times does it print the ....Releasing Interfaces! error?

As I am seeing the activation failure message again and again (of the controller) I would say this is not just because of the printout in the hardware_interface. I had it running for like 10-20s before killing the process.

I am wondering if that could be because I cherry-picked it on the iron branch?

saikishor · 2024-05-02T07:33:42Z

As I am seeing the activation failure message again and again (of the controller) I would say this is not just because of the printout in the hardware_interface. I had it running for like 10-20s before killing the process.

I am wondering if that could be because I cherry-picked it on the iron branch?

That would be strange to have this in the iron branch. Can you share some logs?

EDIT: Can you also share how you are trying to activate this controller?

firesurfer · 2024-05-02T07:53:00Z

@saikishor I just tested in a clean dockerized environment instead of my host installation and it seems to work fine.

I will review and approve the PR then.

controller_manager/src/controller_manager.cpp

firesurfer

Looks fine to me. Maybe you want to address the comment about the additional scope in the controller_manager.

saikishor · 2024-05-02T08:15:49Z

@saikishor I just tested in a clean dockerized environment instead of my host installation and it seems to work fine.

I will review and approve the PR then.

Awesome. I'm very glad that this fixed your issue! 👍

mergify · 2024-05-03T20:16:05Z

This pull request is in conflict. Could you fix it @saikishor?

destogl

Generally I am OK with this, but this calls for even more complex handling of things. And what I rather see as a variable and would put priority to is handling of error that happen when HW mode switching happens as those might have external uncontrollable factors. And controller should not fail on activation, or at least I don't thing that we should handle the both cases as it will be a hell to deal with all the variates of issues and all of that in the real-time sensitive part!

@firesurfer when thinking now more about your issue, I am interested, why your controller fails during activation? How it can be that is configures successfully, but fails on activation? This doesn't sound right.

@bmagyar and @saikishor I even vote that we fobid failing of controllers on activation because of keeping up with the real-time capabilities. This is the reason why we have configure in the first place. Allowing this, I forsee, as elaborated earlier, opening pandora box of funky management of stuff in the update loop.

destogl · 2024-05-08T19:17:46Z

controller_manager/src/controller_manager.cpp

+        // interfaces
+        if (
+          !command_interface_names.empty() &&
+          (!resource_manager_->prepare_command_mode_switch({}, command_interface_names) ||


We should not call this method here, as we are in the RT (update) loop, and this method is declared ad non-realtime relevant in the hardware description, as it is usually called from the non-rt loop (service callback).

I agree with you! I completely forgot that this is a non-RT method. We can do it at the end as you mentioned in the other comment

destogl · 2024-05-08T19:26:48Z

controller_manager/src/controller_manager.cpp

+        if (
+          !command_interface_names.empty() &&
+          (!resource_manager_->prepare_command_mode_switch({}, command_interface_names) ||
+           !resource_manager_->perform_command_mode_switch({}, command_interface_names)))


We have here a bit of a chicken-egg problem. Currently we don't switch controller if this command fails (see line 2112), and we react on an error in general and not an error of specific hardware. We will have to change this in the future, but not for now as it seems not be a problem.

Back to the topic...

I thinks that a cleaner soluation would be to gather all the command interfaces of controller that failed to start and then do only once call on this method, to poke arround HW as little as possible. In current implementation we are doing N-time check in M HW regaridng the interface, if we add this to the line 2136 then we will do only 1xM checks.

I would do as you said, to trigger for all failed controllers at the end!

firesurfer · 2024-05-10T06:50:00Z

@destogl As far as I am aware of the only place during the chain of "loading -> configuration -> activation" the activation step is the only one where a controller can access the state of the hardware and can fail before entering the update loop.

In my case I need to make sure that multiple axes of the system are in a certain configuration otherwise the controller won't work / might even damage the system.

I therefore strongly oppose the the suggestion to forbid a controller to fail during activation or do you have a better suggestion how to handle a controller that needs to check if the system is in the correct state? I think the workaround I wrote about in the entry post of #1487 is rather ugly.

…t once

saikishor · 2024-05-13T14:03:50Z

@bmagyar and @saikishor I even vote that we fobid failing of controllers on activation because of keeping up with the real-time capabilities. This is the reason why we have configure in the first place. Allowing this, I forsee, as elaborated earlier, opening pandora box of funky management of stuff in the update loop.

I see your point. However, the thing of failing or not is not in our hand with user controllers, moreover, the LifeCycleNode allows this kind of behavior. So, I think it is reasonable to have this fix. We can warn the users that if this situation arises, the realtimeness is not assured anymore.

Lennart Nachtigall and others added 11 commits April 29, 2024 21:59

added an actuator interface with exclusive interface handling

7440aea

ran precommit

d181630

renamed into test_actuator_exclusive_interfaces

a3814c0

formatting fix

a8a1a68

added a test controller that fails at activation

7b07afc

forgot to run pre-commit

daa9ce7

Fix issues in the setup of the test controller that fails on activate

6a29b34

Added test to reproduce the failure of switching with exclusive inter…

848c8ae

…faces

Throw an exception when requesting already claimed exclusive interface

17bb7a2

If the controller activation fails then release the claimed interfaces

77e932e

add the perform switching logic to fix the issue with exclusive inter…

4a4d780

…faces

github-actions bot requested review from bmagyar, DasRoteSkelett, destogl, mcbed, moriarty, progtologist and VX792 April 29, 2024 22:13

saikishor added bug backport-humble This label should be used by maintaines only! Label triggers PR backport to ROS2 humble. backport-iron This label should be used by maintaines only! Label triggers PR backport to ROS2 Iron. labels Apr 29, 2024

Merge branch 'master' into fix/exclusive_hw_interface_switching

1236e5c

firesurfer reviewed May 2, 2024

View reviewed changes

controller_manager/src/controller_manager.cpp Outdated Show resolved Hide resolved

firesurfer approved these changes May 2, 2024

View reviewed changes

saikishor added 2 commits May 2, 2024 20:50

remove the scoping

de9dc03

Merge branch 'master' into fix/exclusive_hw_interface_switching

96b5a8c

firesurfer approved these changes May 3, 2024

View reviewed changes

saikishor and others added 2 commits May 3, 2024 22:25

Merge branch 'master' into fix/exclusive_hw_interface_switching

8b17d8e

Merge branch 'master' into fix/exclusive_hw_interface_switching

6df3c72

destogl requested changes May 8, 2024

View reviewed changes

saikishor added 2 commits May 13, 2024 15:58

Collect all failed activated controllers and perform command switch a…

fba2abd

…t once

Merge branch 'master' into fix/exclusive_hw_interface_switching

a670590

added some documentation about the determinism

7755df0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix exclusive hardware control mode switching on controller failed activation #1522

Fix exclusive hardware control mode switching on controller failed activation #1522

saikishor commented Apr 29, 2024

saikishor commented Apr 29, 2024

codecov bot commented Apr 29, 2024 •

edited

firesurfer commented Apr 30, 2024

firesurfer commented Apr 30, 2024

saikishor commented Apr 30, 2024

saikishor commented Apr 30, 2024

saikishor commented Apr 30, 2024

firesurfer commented Apr 30, 2024

saikishor commented Apr 30, 2024

firesurfer commented Apr 30, 2024

christophfroehlich commented Apr 30, 2024 •

edited

saikishor commented Apr 30, 2024 •

edited by christophfroehlich

firesurfer commented Apr 30, 2024

saikishor commented Apr 30, 2024

firesurfer commented Apr 30, 2024

saikishor commented Apr 30, 2024 •

edited

saikishor commented Apr 30, 2024

firesurfer commented May 2, 2024

saikishor commented May 2, 2024 •

edited

firesurfer commented May 2, 2024

firesurfer left a comment

saikishor commented May 2, 2024

mergify bot commented May 3, 2024

destogl left a comment

destogl May 8, 2024

saikishor May 13, 2024

destogl May 8, 2024

saikishor May 13, 2024

firesurfer commented May 10, 2024

saikishor commented May 13, 2024

Fix exclusive hardware control mode switching on controller failed activation #1522

Are you sure you want to change the base?

Fix exclusive hardware control mode switching on controller failed activation #1522

Conversation

saikishor commented Apr 29, 2024

saikishor commented Apr 29, 2024

codecov bot commented Apr 29, 2024 • edited

Codecov Report

firesurfer commented Apr 30, 2024

firesurfer commented Apr 30, 2024

saikishor commented Apr 30, 2024

saikishor commented Apr 30, 2024

saikishor commented Apr 30, 2024

firesurfer commented Apr 30, 2024

saikishor commented Apr 30, 2024

firesurfer commented Apr 30, 2024

christophfroehlich commented Apr 30, 2024 • edited

saikishor commented Apr 30, 2024 • edited by christophfroehlich

firesurfer commented Apr 30, 2024

saikishor commented Apr 30, 2024

firesurfer commented Apr 30, 2024

saikishor commented Apr 30, 2024 • edited

saikishor commented Apr 30, 2024

firesurfer commented May 2, 2024

saikishor commented May 2, 2024 • edited

firesurfer commented May 2, 2024

firesurfer left a comment

Choose a reason for hiding this comment

saikishor commented May 2, 2024

mergify bot commented May 3, 2024

destogl left a comment

Choose a reason for hiding this comment

destogl May 8, 2024

Choose a reason for hiding this comment

saikishor May 13, 2024

Choose a reason for hiding this comment

destogl May 8, 2024

Choose a reason for hiding this comment

saikishor May 13, 2024

Choose a reason for hiding this comment

firesurfer commented May 10, 2024

saikishor commented May 13, 2024

codecov bot commented Apr 29, 2024 •

edited

christophfroehlich commented Apr 30, 2024 •

edited

saikishor commented Apr 30, 2024 •

edited by christophfroehlich

saikishor commented Apr 30, 2024 •

edited

saikishor commented May 2, 2024 •

edited