fix race conditions when updating PlanningScene #232

rhaschke · 2016-09-20T22:11:51Z

After having merged #63 we can finally attempt to merge https://github.com/rhaschke/moveit/pull/1 against ros-planning/moveit. This PR attempts to fix moveit/moveit_ros#442 and the flaky test mentioned in #221 (comment). This PR draws on the reverted PRs moveit/moveit_ros#716, moveit/moveit_ros#724, and moveit/moveit_ros#728.

I removed AsyncSpinner and CallBackQueue in PlanningSceneMonitor for now. With "fix trajectory service blocking callback queue" (moveit_ros#717, cherry-picked in #59 "cherry-picking #713 and #717"), they are no longer required to update the robot state. However, syncSceneUpdates() relies on them, which is why I had to remove them for now too. Whether this function is useful is under discussion anyway. I will file a separate PR for this.
I renamed CurrentStateMonitor::waitForCurrentState() and added a deprecated fallback function calling the new method waitForCompleteState(). Deprecation should be OK in Kinetic, but not Indigo and Jade. This should be considered when back-porting.

rhaschke · 2016-09-20T22:13:30Z

moveit_ros/move_group/src/default_capabilities/execute_trajectory_service_capability.cpp

-            res.error_code.val = moveit_msgs::MoveItErrorCodes::TIMED_OUT;
-          else
-            res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
+        res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;


This is simply reformatting nested if's into a linear structure.

changes like this should be in a separate PR

rhaschke · 2016-09-20T22:14:11Z

moveit_ros/manipulation/move_group_pick_place_capability/src/pick_place_action_capability.cpp

@@ -322,6 +322,7 @@ void move_group::MoveGroupPickPlaceAction::executePickupCallback(const moveit_ms
 {
  setPickupState(PLANNING);

+  context_->planning_scene_monitor_->waitForCurrentRobotState(ros::Time::now());


Before we start planning, ensure that we have the latest robot state received...

Your Github comment should be an inline comment instead :)

rhaschke · 2016-09-20T22:15:29Z

...lanning/planning_scene_monitor/include/moveit/planning_scene_monitor/current_state_monitor.h

@@ -199,6 +208,7 @@ class CurrentStateMonitor
  ros::Time                                    last_tf_update_;

  mutable boost::mutex                         state_update_lock_;
+  mutable boost::condition_variable            state_update_condition_;


This introduces an ABI incompatibility.

rhaschke · 2016-09-20T22:19:39Z

moveit_ros/planning/trajectory_execution_manager/src/trajectory_execution_manager.cpp

@@ -878,7 +878,7 @@ bool TrajectoryExecutionManager::validate(const TrajectoryExecutionContext &cont
  ROS_DEBUG_NAMED("traj_execution", "Validating trajectory with allowed_start_tolerance %g", allowed_start_tolerance_);

  robot_state::RobotStatePtr current_state;
-  if (!csm_->waitForCurrentState(1.0) || !(current_state = csm_->getCurrentState()))
+  if (!csm_->waitForCurrentState(ros::Time::now()) || !(current_state = csm_->getCurrentState()))


This now calls the new function waitForCurrentState() instead of the old one, which was renamed to waitForCompleteRobotState().

davetcoleman

skimming looks good to me but @v4hn is the one who needs to really review this

davetcoleman · 2016-09-20T23:42:05Z

moveit_ros/move_group/src/default_capabilities/execute_trajectory_service_capability.cpp

-            res.error_code.val = moveit_msgs::MoveItErrorCodes::TIMED_OUT;
-          else
-            res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
+        res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;


davetcoleman · 2016-09-20T23:43:29Z

...lanning/planning_scene_monitor/include/moveit/planning_scene_monitor/current_state_monitor.h

+
+  /** @brief Wait for at most \e wait_time seconds until the complete robot state is known. Return true if the full state is known */
+  bool waitForCompleteState(double wait_time) const;
+  /// replaced by waitForCompleteState, will be removed in L-turtle


comment should be /** */ for consistency

davetcoleman · 2016-09-20T23:45:21Z

moveit_ros/planning/planning_scene_monitor/src/planning_scene_monitor.cpp

+  {
+    ROS_DEBUG_STREAM_NAMED("planning_scene_monitor", "last robot motion: " << (t-last_robot_motion_time_).toSec() << " ago");
+    new_scene_update_condition_.wait_for(lock, boost::chrono::nanoseconds(timeout.toNSec()));
+    timeout -= ros::WallTime::now()-start; // compute remaining wait_time


davetcoleman · 2016-09-29T15:33:07Z

ping @v4hn - this PR is the last blocker for the kinetic release!

also, @rhaschke do you mind rebasing for the merge conflicts?

…ue_) Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads. Filed PR: ros/ros_comm#867 The spinner is now only needed to serve our own callback_queue_ for scene updates, which is only required for syncSceneUpdates() that syncs all kind of scene updates, not only the robot state.

rhaschke · 2016-10-06T12:39:07Z

Rebased.

davetcoleman

This is a very tricky PR to review - you have changed too many things here in a very complex way. Please simplify. I also take issue with how you've renamed a function and then created a new function with the same name - not good!

davetcoleman · 2016-10-06T11:46:23Z

moveit_ros/manipulation/move_group_pick_place_capability/src/pick_place_action_capability.cpp

@@ -322,6 +322,7 @@ void move_group::MoveGroupPickPlaceAction::executePickupCallback(const moveit_ms
 {
  setPickupState(PLANNING);

+  context_->planning_scene_monitor_->waitForCurrentRobotState(ros::Time::now());


Your Github comment should be an inline comment instead :)

davetcoleman · 2016-10-06T11:49:03Z

...lanning/planning_scene_monitor/include/moveit/planning_scene_monitor/current_state_monitor.h

-  /** @brief Wait for at most \e wait_time seconds until the complete current state is known. Return true if the full state is known */
-  bool waitForCurrentState(double wait_time) const;
+  /** @brief Wait for at most \e wait_time seconds for a robot state more recent than t */
+  bool waitForCurrentState(const ros::Time t=ros::Time::now(), double wait_time=1) const;


spaces needed around =

1 should be 1.0 or 1.

in comment add "(defaults to 1 sec)"

in comment explain what the return value means - e.g. @return false indicates the current robot state at time \e t was not found within the \e wait_time

davetcoleman · 2016-10-06T11:50:58Z

...lanning/planning_scene_monitor/include/moveit/planning_scene_monitor/current_state_monitor.h

+  /** @brief Wait for at most \e wait_time seconds until the complete robot state is known. Return true if the full state is known */
+  bool waitForCompleteState(double wait_time) const;
+  /** replaced by waitForCompleteState, will be removed in L-turtle */
+  MOVEIT_DEPRECATED bool waitForCurrentState(double wait_time) const;


can you clarify in the comment if this new function behaves differently than this deprecated waitForCurrentState?

davetcoleman · 2016-10-06T11:54:59Z

...anning/planning_scene_monitor/include/moveit/planning_scene_monitor/planning_scene_monitor.h

@@ -344,6 +344,13 @@ class PlanningSceneMonitor : private boost::noncopyable
  /** @brief This function is called every time there is a change to the planning scene */
  void triggerSceneUpdateEvent(SceneUpdateType update_type);

+  /** \brief Wait for robot state to become more recent than t.


"than time t"

davetcoleman · 2016-10-06T11:57:01Z

...anning/planning_scene_monitor/include/moveit/planning_scene_monitor/planning_scene_monitor.h

+   * If there is no state monitor active, there will be no scene updates.
+   * Hence, you can specify a timeout to wait for those updates. Default is 1s.
+   */
+  bool waitForCurrentRobotState(const ros::Time &t, double wait_time = 1.);


Why did you deprecate the function named waitForCurrentState in the current_state_monitor but here add one named that? Should it also be waitForCompleteState here? I find your varying use of these two functions in current_state_monitor and planning_scene_monitor very confusing

As explained by @v4hn, we agreed to correctly rename the old function waitForCurrentRobotState -> waitForCompleteRobotState to reflect its actual semantics.
The new function waitForCurrentRobotState actually does what its name states: waiting for an update of the current robot state.

ok that makes sense, it's just hard to review
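
For illustration, the semantics discussed in this thread (wait until a robot state more recent than time t is available, but give up after wait_time seconds) can be sketched with a mutex and a condition variable. This is a minimal sketch using only the C++ standard library, not the code of this PR: the class StateMonitorSketch, its methods update() and waitForStateNewerThan(), and the use of plain doubles for time stamps are all hypothetical.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a state monitor; this is NOT the MoveIt class.
class StateMonitorSketch
{
public:
  // Record that a state with the given stamp (in seconds) arrived and wake up all waiters.
  void update(double stamp)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      last_stamp_ = stamp;
    }
    condition_.notify_all();
  }

  // Block until a state strictly more recent than t is known,
  // or until wait_time seconds have elapsed. Returns true on success.
  bool waitForStateNewerThan(double t, double wait_time = 1.0)
  {
    // Compute a fixed deadline once, so spurious wake-ups cannot extend the total wait.
    const auto deadline =
        std::chrono::steady_clock::now() +
        std::chrono::duration_cast<std::chrono::steady_clock::duration>(std::chrono::duration<double>(wait_time));
    std::unique_lock<std::mutex> lock(mutex_);
    // wait_until re-checks the predicate after every wake-up and returns its final value.
    return condition_.wait_until(lock, deadline, [&] { return last_stamp_ > t; });
  }

private:
  std::mutex mutex_;
  std::condition_variable condition_;
  double last_stamp_ = -1.0;  // no state received yet
};
```

Waiting against a fixed deadline is one way to avoid recomputing the remaining timeout by hand after each wake-up; whether that fits the monitor classes touched by this PR is not something the sketch can confirm.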

davetcoleman · 2016-10-06T12:27:59Z

moveit_ros/planning/planning_scene_monitor/src/planning_scene_monitor.cpp

+
+    /* If the robot doesn't move, we will never receive an update from CSM in planning scene.
+       As we ensured that an update, if it is triggered by CSM, is directly passed to the scene,
+       we can early return true here (if we successfully received a CSM update). Otherwise return false. */


nice comment!

davetcoleman · 2016-10-06T12:32:59Z

moveit_ros/planning/planning_scene_monitor/src/planning_scene_monitor.cpp

+  // However, scene updates are only published if the robot actually moves. Hence we need a timeout!
+  // As publishing planning scene updates is throttled (2Hz by default), a 1s timeout is a suitable default.
+  boost::shared_lock<boost::shared_mutex> lock(scene_update_mutex_);
+  ros::Time prev_robot_motion_time_ = last_robot_motion_time_;


there should not be an underscore at the end of this local variable

Only member variables should have an underscore in the end, right?

right, and prev_robot_motion_time_ is a local variable

Oh. I missed the not in your original comment.

davetcoleman · 2016-10-06T12:43:30Z

moveit_ros/planning/planning_scene_monitor/src/planning_scene_monitor.cpp

  if (octomap_monitor_)
    octomap_monitor_->getOcTreePtr()->unlockRead();
+  scene_update_mutex_.unlock_shared();


changes like this should be in a separate PR

davetcoleman · 2016-10-06T12:44:10Z

moveit_ros/planning_interface/test/python_move_group.py

+            if self.group.execute(self.plan(target)):
+                 actual = np.asarray(self.group.get_current_joint_values())
+                 self.assertTrue(np.allclose(target, actual, atol=1e-4, rtol=0.0))
+


i think this could be in a separate PR also

Why this? This is the unittest which tests the behavior fixed by this PR.

davetcoleman · 2016-10-06T12:44:25Z

moveit_ros/move_group/src/default_capabilities/execute_trajectory_service_capability.cpp

-            res.error_code.val = moveit_msgs::MoveItErrorCodes::TIMED_OUT;
-          else
-            res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
+        res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;


changes like this should be in a separate PR

davetcoleman · 2016-10-06T12:47:00Z

I really want to get this merged in so we can release kinetic, so have attempted to review, above. Would greatly appreciate anyone who can test this PR on their physical robot setup. @Jntzko?

v4hn · 2016-10-06T14:31:57Z

Thanks dave!

waitForCurrentState

Yes this is meddling with a core function, but I discussed that with @rhaschke in the old request already.
The old function is plainly broken beyond repair! While it is called waitForCurrentState it does not wait for a current state at all! @rhaschke proposed to change this (for the better) and provide an old fallback and I fully agree with him here.
A number of people rely on that function to do what it says, and it should do that.

Jntzko · 2016-10-07T15:23:01Z

I've tested this PR in indigo with the 9 commits from @rhaschke in the simulation.

Therefore I made a small test case where the ur5 moves in a loop to four positions.
After a few iterations I get this error:

Invalid Trajectory: start point deviates from current robot state more than 0.01 joint 'ur5_wrist_3_joint': expected: -2.52336, current: -2.53886

As @v4hn told me, this error should not appear with the fixes from the PR.

To get this test running you need to check out the following repositories:

The branch testFix#442 from:
https://github.com/Jntzko/moveit.git

The test case:
https://github.com/Jntzko/fix_haschke.git

And also these repositories for our lab setup:
https://github.com/TAMS-Group/tams_ur5.git
https://github.com/TAMS-Group/tams_ur5_setup.git

Start the test with:
roslaunch tams_ur5_setup_moveit_config demo.launch
and
rosrun fix_haschke move_loop

rhaschke · 2016-10-10T13:02:32Z

I guess, to reproduce the test, I also would need to have a UR arm, wouldn't I?
Is it possible that those deviations originate from sensor noise (or breaks)?
Actually, I added a similar test in moveit_ros/planning_interface/test/python_move_group.py doing some (only 3) random motions in simulations. @Jntzko, could you please run this test too, possibly increasing the number of loops? It runs like this:
rostest moveit_ros_planning_interface python_move_group.test

This test runs without problems for me with a simulation-only robot.

v4hn · 2016-10-10T13:08:02Z

On Mon, Oct 10, 2016 at 06:02:32AM -0700, Robert Haschke wrote:

> I guess, to reproduce the test, I also would need to have a UR arm, wouldn't I?

No, he ran this in demo mode. That's why I'm pretty sure the problem lies in the synchronization.

Jntzko · 2016-10-11T14:49:42Z

@rhaschke I've tested your code, also just in the simulation and I get the same "start point deviates from current robot state" error.

davetcoleman · 2016-10-11T18:47:02Z

This discussion on what tests you are using to check this PR inspired me to write a stub tutorial on the topic: moveit/moveit_tutorials#24

rhaschke · 2016-10-14T07:07:37Z

> I've tested your code, also just in the simulation and I get the same "start point deviates from current robot state" error.

@Jntzko Sorry, but I cannot reproduce the error in Kinetic with my proposed unit test. Could you, please, send me the whole output of running the unit test?

rostest moveit_ros_planning_interface python_move_group.test

rhaschke · 2016-10-14T07:12:26Z

I addressed most of the additional comments from @davetcoleman. Those that are not addressed, I commented on inline. I didn't separate stuff into individual PRs. They are nicely separated by individual commits.

v4hn · 2016-10-16T11:53:30Z

We will try to look at this tomorrow.
There is no sense in merging this if it actually fails to achieve what it was written for.
And it does not make too much sense to me that it works fine for @rhaschke on kinetic but fails for @Jntzko on indigo.

rhaschke · 2016-10-17T15:44:28Z

> There is no sense in merging this if it actually fails to achieve what it was written for.

Agreed. However, I don't see a reason yet, why it should fail. Do you?

> And it does not make too much sense to me that it works fine for @rhaschke on kinetic but fails for @Jntzko on indigo.

That's why I asked for the full log.

This is a poor-man's-replacement for rhaschke's work that waits for the current robot state. This can be removed after his work is merged. (See #232) Without the additional sleep the automatic update of the start state will pick a point near the end of the executed trajectory instead of the current state. Let's give the monitor a moment to catch up.

rhaschke · 2016-10-17T21:21:47Z

@Jntzko Please send the full log of the unit test to allow me to further look into this.

Jntzko · 2016-10-18T17:07:29Z

@rhaschke I have cherry picked your latest commits in my indigo branch too and ran the test multiple times.

The following error does not appear in every run within the 30 movements, but it still appears every now and then (it appeared in 2 of 3 runs):

[ERROR] [1476803281.592726674]:
Invalid Trajectory: start point deviates from current robot state more than 0.01
joint 'joint_6': expected: -0.359403, current: -0.37355

I work with the testFix#442 branch of this repo:
https://github.com/Jntzko/moveit.git

Here's one of the rostest-log files where the error appeared:
rostest-tams121-26002.log.txt

Jntzko · 2016-10-18T22:15:33Z

@rhaschke I ran your test on kinetic now and increased the number of movements to 100.
When I ran the test the third time, the error appeared again:

[ERROR] [1476828218.417227953]:
Invalid Trajectory: start point deviates from current robot state more than 0.01
joint 'joint_1': expected: -0.345288, current: -0.145362

Here's the rostest-log file:
rostest-thinkpad-28045.log.txt

rhaschke · 2016-10-18T23:49:32Z

I could indeed reproduce the problem in Kinetic too now. The update is missed only rarely, but it does happen. I think the core issue is that the TrajectoryExecutionManager signals the end of trajectory execution as soon as the last controller values have been sent out. However, the controller needs some time to process them, and only after some delay will the final joint state be published by the hardware or controller.

We have the same situation with fake controllers too: They publish to /move_group/fake_controller_joint_states and a joint_state_publisher re-publishes with a fixed rate and its own timestamp.

Hence, it might happen, that waitForCurrentRobotState(ros::Time::now()) is called with a threshold time earlier than the finally published joint_states. In that case we will miss these final updates. Modifying joint_state_publisher to re-publish with the received timestamp (instead of a new one), solved the problem on my side, which supports the explanation.
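
The effect can be illustrated with a small, self-contained model. It is only a sketch under simplifying assumptions: JointSample, republish() and firstNewerThan() are hypothetical names, time stamps are plain doubles, and the numbers are made up. The re-publisher forwards the latest sample it has received at fixed ticks, either with its own stamp or with the received one, and the waiter accepts the first sample stamped later than its threshold.

```cpp
#include <vector>

// Hypothetical single-joint state message: time stamp (seconds) and position.
struct JointSample
{
  double stamp;
  double position;
};

// Model of a fixed-rate re-publisher: at each tick, forward the latest source
// sample received so far, stamped either with the tick time (restamping)
// or with the original source stamp.
std::vector<JointSample> republish(const std::vector<JointSample>& source, const std::vector<double>& ticks,
                                   bool keep_source_stamp)
{
  std::vector<JointSample> out;
  for (double tick : ticks)
  {
    const JointSample* latest = nullptr;
    for (const JointSample& s : source)
      if (s.stamp <= tick)
        latest = &s;
    if (latest)
      out.push_back({ keep_source_stamp ? latest->stamp : tick, latest->position });
  }
  return out;
}

// Return the first sample stamped strictly later than threshold, or nullptr if there is none.
const JointSample* firstNewerThan(const std::vector<JointSample>& received, double threshold)
{
  for (const JointSample& s : received)
    if (s.stamp > threshold)
      return &s;
  return nullptr;
}
```

In this model, with source samples at 0.98 s (position 0.9) and 1.03 s (final position 1.0), re-publisher ticks at 1.01 s and 1.11 s, and a wait threshold of 1.00 s, restamping lets the stale 0.9 sample pass as "newer than the threshold", whereas keeping the received stamp makes the waiter skip it and accept the final sample.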

However, this is of course not a feasible solution. The core issue is that we cannot know the delay between the last controller command sent by TrajectoryExecutionManager and the final joint_states received. A potential solution could be to wait in TrajectoryExecutionManager until the robot "settles" before sending the finished signal. As I integrated a CurrentStateMonitor recently, that should be feasible, using the same noise threshold as for validate().
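
One possible reading of "settles" is that consecutive joint-state samples stop changing by more than a tolerance. The following is a rough sketch of such a check, not an implementation in TrajectoryExecutionManager: withinTolerance() and firstSettledIndex() are hypothetical helpers, and requiring several consecutive quiet samples is an added assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using JointPositions = std::vector<double>;

// True if every joint differs by at most tolerance between the two samples.
bool withinTolerance(const JointPositions& a, const JointPositions& b, double tolerance)
{
  if (a.size() != b.size())
    return false;
  for (std::size_t i = 0; i < a.size(); ++i)
    if (std::fabs(a[i] - b[i]) > tolerance)
      return false;
  return true;
}

// Index of the first sample at which `required` consecutive sample-to-sample
// changes stayed within tolerance; -1 if the sequence never settles.
int firstSettledIndex(const std::vector<JointPositions>& samples, double tolerance, int required)
{
  int stable = 0;
  for (std::size_t i = 1; i < samples.size(); ++i)
  {
    stable = withinTolerance(samples[i - 1], samples[i], tolerance) ? stable + 1 : 0;
    if (stable >= required)
      return static_cast<int>(i);
  }
  return -1;
}
```

The tolerance would play the role of the noise threshold mentioned above (the error messages in this thread report a threshold of 0.01); how many quiet samples to require, and how long to wait at most, would still need to be decided.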

Hence, I suggest to drop this PR for Kinetic. However, I will file some parts of this PR as separate PRs - to be included in the Kinetic release.

rhaschke mentioned this pull request Sep 20, 2016

MSA: improve author information gui #221

Merged

davetcoleman reviewed Sep 20, 2016

davetcoleman assigned v4hn Sep 20, 2016

v4hn mentioned this pull request Sep 21, 2016

Initial Indigo release from ros-planning/moveit repo #100

Closed

15 tasks

This was referenced Sep 22, 2016

Kinetic Release #18

Closed

Maintainer Meeting TODOs (Sep 2016) #256

Closed

rhaschke added 9 commits October 6, 2016 14:33

fix order of unlocking

f57d33f

unlocking needs to be performed in reverse order of locking otherwise deadlocks are risked

reformatting

aa0b9e2

PSM::waitForCurrentRobotState() + PSM::syncSceneUpdates()

2b89bd1

renamed CurrentStateMonitor::waitForCurrentState() to waitForComplete…

9c9d8bd

…State() deprecated old functions, which should be removed in L-turtle

rviz: execute state update in background

d094efe

... because we might wait up to 1s for a robot state update

renamed ROS_DEPRECATED to MOVEIT_DEPRECATED

710f536

add robot_state update test

aaaf070

fixed formatting

0d47df1

davetcoleman requested changes Oct 6, 2016

davetcoleman unassigned v4hn Oct 6, 2016

v4hn mentioned this pull request Oct 12, 2016

Invalid Trajectory: start point deviates #283

Closed

rhaschke added 3 commits October 14, 2016 14:52

improved comments, fixed formatting

e88baea

renamed wall_last_state_update_ to last_robot_state_update_wall_time_

2bd734c

increase number of test iterations

fe8acb3

removed trailing underscore

025ecab

davetcoleman approved these changes Oct 14, 2016

v4hn mentioned this pull request Oct 17, 2016

wait a second before updating "current" in RViz #291

Merged

rhaschke added 2 commits October 19, 2016 00:35

moved robot_state_update test into own rostest

fd39d56

ROS_WARN_NAMED

5b5cb3d

rhaschke closed this Oct 18, 2016

This was referenced Oct 18, 2016

some cleanup #297

Merged

renamed CSM::waitForCurrentState() to CSM::waitForCompleteState() #298

Merged

rhaschke mentioned this pull request Nov 13, 2016

fix race conditions when updating PlanningScene #350

Merged

dcconner mentioned this pull request May 10, 2017

Timing issues with joint state updates in PlanningSceneMonitor #501

Closed

rkeatin3 mentioned this pull request Jan 23, 2018

Invalid Trajectory: start point deviates from current robot state #749

Closed

JafarAbdi added a commit to JafarAbdi/moveit that referenced this pull request Mar 24, 2022

kinematics_base: remove deprecated initialize function (moveit#232)

247b667
