Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix race conditions when updating PlanningScene #232

Closed
wants to merge 15 commits into from
Closed

fix race conditions when updating PlanningScene #232

wants to merge 15 commits into from

Conversation

rhaschke
Copy link
Contributor

After having merged #63 we can finally attempt to merge https://github.com/rhaschke/moveit/pull/1 against ros-planning/moveit. This PR attempts to fix moveit/moveit_ros#442 and the flaky test mentioned in #221 (comment). This PR draws on reverted PRs moveit/moveit_ros#716 moveit/moveit_ros#724, and moveit/moveit_ros#728.

  • I removed AsyncSpinner and CallBackQueue in PlanningSceneMonitor for now. With fix trajectory service blocking callback queue moveit_ros#717 (cherry-picking #713 and #717 #59) this is not required anymore to update the robot state. However syncSceneUpdates() relies on them, which is why I had to remove them for now too. It's anyway in discussion, whether this function is useful. I will file a separate PR for this.
  • I renamed CurrentStateMonitor::waitForCurrentState() and added a deprecated fallback function calling the new method waitForCompleteState(). Deprecation should be OK in Kinetic, but not Indigo and Jade. This should be considered when back-porting.

res.error_code.val = moveit_msgs::MoveItErrorCodes::TIMED_OUT;
else
res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is simply reformatting nested if's into a linear structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes like this should be in a separate PR

@@ -322,6 +322,7 @@ void move_group::MoveGroupPickPlaceAction::executePickupCallback(const moveit_ms
{
setPickupState(PLANNING);

context_->planning_scene_monitor_->waitForCurrentRobotState(ros::Time::now());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we start planning, ensure that we have the latest robot state received...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your Github comment should be an inline comment instead :)

@@ -199,6 +208,7 @@ class CurrentStateMonitor
ros::Time last_tf_update_;

mutable boost::mutex state_update_lock_;
mutable boost::condition_variable state_update_condition_;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces an ABI incompatibility.

@@ -878,7 +878,7 @@ bool TrajectoryExecutionManager::validate(const TrajectoryExecutionContext &cont
ROS_DEBUG_NAMED("traj_execution", "Validating trajectory with allowed_start_tolerance %g", allowed_start_tolerance_);

robot_state::RobotStatePtr current_state;
if (!csm_->waitForCurrentState(1.0) || !(current_state = csm_->getCurrentState()))
if (!csm_->waitForCurrentState(ros::Time::now()) || !(current_state = csm_->getCurrentState()))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, calling the new function waitForCurrentState() instead of the old one, which was renamed to waitForCompleteRobotState().

Copy link
Member

@davetcoleman davetcoleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skimming looks good to me but @v4hn is the one who needs to really review this

res.error_code.val = moveit_msgs::MoveItErrorCodes::TIMED_OUT;
else
res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


/** @brief Wait for at most \e wait_time seconds until the complete robot state is known. Return true if the full state is known */
bool waitForCompleteState(double wait_time) const;
/// replaced by waitForCompleteState, will be removed in L-turtle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment should be /** */ for consistency

{
ROS_DEBUG_STREAM_NAMED("planning_scene_monitor", "last robot motion: " << (t-last_robot_motion_time_).toSec() << " ago");
new_scene_update_condition_.wait_for(lock, boost::chrono::nanoseconds(timeout.toNSec()));
timeout -= ros::WallTime::now()-start; // compute remaining wait_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spacing

@davetcoleman
Copy link
Member

ping @v4hn - this PR is the last blocker for the kinetic release!

also, @rhaschke do you mind rebasing for the merge conflicts?

unlocking needs to be performed in reverse order of locking
otherwise deadlocks are risked
…ue_)

Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads.
Filed PR: ros/ros_comm#867

The spinner is now only needed to serve our own callback_queue_ for
scene updates, which is only required for syncSceneUpdates() that
syncs all kind of scene updates, not only the robot state.
…State()

deprecated old functions, which should be removed in L-turtle
... because we might wait up to 1s for a robot state update
@rhaschke
Copy link
Contributor Author

rhaschke commented Oct 6, 2016

Rebased.

Copy link
Member

@davetcoleman davetcoleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very tricky PR to review - you have too many things changes here in a very complex way. please simplify. i also take issue with how you've renamed a function then created a new function with the same name - not good!

@@ -322,6 +322,7 @@ void move_group::MoveGroupPickPlaceAction::executePickupCallback(const moveit_ms
{
setPickupState(PLANNING);

context_->planning_scene_monitor_->waitForCurrentRobotState(ros::Time::now());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your Github comment should be an inline comment instead :)

/** @brief Wait for at most \e wait_time seconds until the complete current state is known. Return true if the full state is known */
bool waitForCurrentState(double wait_time) const;
/** @brief Wait for at most \e wait_time seconds for a robot state more recent than t */
bool waitForCurrentState(const ros::Time t=ros::Time::now(), double wait_time=1) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces needed around =

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 should be 1.0 or 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in comment add "(defaults to 1 sec)"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in comment explain what the return value means - e.g. @return false indicates the current robot state at time \e t was not found within the \e wait_time

/** @brief Wait for at most \e wait_time seconds until the complete robot state is known. Return true if the full state is known */
bool waitForCompleteState(double wait_time) const;
/** replaced by waitForCompleteState, will be removed in L-turtle */
MOVEIT_DEPRECATED bool waitForCurrentState(double wait_time) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you clarify in the comment if this new function behaves differently than this deprecated waitForCurrentState?

@@ -344,6 +344,13 @@ class PlanningSceneMonitor : private boost::noncopyable
/** @brief This function is called every time there is a change to the planning scene */
void triggerSceneUpdateEvent(SceneUpdateType update_type);

/** \brief Wait for robot state to become more recent than t.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"than time t"

* If there is no state monitor active, there will be no scene updates.
* Hence, you can specify a timeout to wait for those updates. Default is 1s.
*/
bool waitForCurrentRobotState(const ros::Time &t, double wait_time = 1.);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you deprecate the function named waitForCurrentState in the current_state_monitor but here add one named that? Should it also be waitForCompleteState here? I find your varying use of these two functions in current_state_monitor and planning_scene_monitor very confusing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As explained by @v4hn, we agreed to correctly rename the old function waitForCurrentRobotState -> waitForCompleteRobotState to reflect its actual semantics.
The new function waitForCurrentRobotState actually does what its name states: waiting for an update of the current robot state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that makes sense, its just hard to review


/* If the robot doesn't move, we will never receive an update from CSM in planning scene.
As we ensured that an update, if it is triggered by CSM, is directly passed to the scene,
we can early return true here (if we successfully received a CSM update). Otherwise return false. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice comment!

// However, scene updates are only published if the robot actually moves. Hence we need a timeout!
// As publishing planning scene updates is throttled (2Hz by default), a 1s timeout is a suitable default.
boost::shared_lock<boost::shared_mutex> lock(scene_update_mutex_);
ros::Time prev_robot_motion_time_ = last_robot_motion_time_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should not be an underscore at the end of this local variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only member variables should have an underscore in the end, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, and prev_robot_motion_time_ is a local variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I missed the not in your original comment.

if (octomap_monitor_)
octomap_monitor_->getOcTreePtr()->unlockRead();
scene_update_mutex_.unlock_shared();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes like this should be in a separate PR

if self.group.execute(self.plan(target)):
actual = np.asarray(self.group.get_current_joint_values())
self.assertTrue(np.allclose(target, actual, atol=1e-4, rtol=0.0))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this could be in a separate PR also

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this? This is the unittest which tests the behavior fixed by this PR.

res.error_code.val = moveit_msgs::MoveItErrorCodes::TIMED_OUT;
else
res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
res.error_code.val = moveit_msgs::MoveItErrorCodes::CONTROL_FAILED;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes like this should be in a separate PR

@davetcoleman
Copy link
Member

I really want to get this merged in so we can release kinetic, so have attempted to review, above. Would greatly appreciate anyone who can test this PR on their physical robot setup. @Jntzko?

@v4hn
Copy link
Contributor

v4hn commented Oct 6, 2016

Thanks dave!

waitForCurrentState

Yes this is meddling with a core function, but I discussed that with @rhaschke in the old request already.
The old function is plainly broken beyond repair! While it is called waitForCurrentState it does not wait for a current state at all! @rhaschke proposed to change this (for the better) and provide an old fallback and I fully agree with him here.
A number of people rely on that function to do what it says, and it should do that.

@Jntzko
Copy link
Contributor

Jntzko commented Oct 7, 2016

I've tested this PR in indigo with the 9 commits from @rhaschke in the simulation.

Therefore I made a small test case where the ur5 moves in a loop to four positions.
After a few iterations I get this error:

Invalid Trajectory: start point deviates from current robot state more than 0.01 joint 'ur5_wrist_3_joint': expected: -2.52336, current: -2.53886

As @v4hn told me, this error should not appear with the fixes from the PR.

To get this test running you need to check out the following repositories:

The branch testFix#442 from:
https://github.com/Jntzko/moveit.git

The test case:
https://github.com/Jntzko/fix_haschke.git

And also this repositories for our lab setup:
https://github.com/TAMS-Group/tams_ur5.git
https://github.com/TAMS-Group/tams_ur5_setup.git

Start the test with:
roslaunch tams_ur5_setup_moveit_config demo.launch
and
rosrun fix_haschke move_loop

@rhaschke
Copy link
Contributor Author

I guess, to reproduce the test, I also would need to have a UR arm, wouldn't I?
Is it possible that those deviations originate from sensor noise (or breaks)?
Actually, I added a similar test in moveit_ros/planning_interface/test/python_move_group.py doing some (only 3) random motions in simulations. @Jntzko, could you please run this test too, possibly increasing the number of loops? It runs like this:
rostest moveit_ros_planning_interface python_move_group.test

This test runs without problems for me with a simulation-only robot.

@v4hn
Copy link
Contributor

v4hn commented Oct 10, 2016

On Mon, Oct 10, 2016 at 06:02:32AM -0700, Robert Haschke wrote:

I guess, to reproduce the test, I also would need to have a UR arm, wouldn't I?

No, he ran this in demo mode. That's why I'm pretty sure the problem lies in the synchronization.

@Jntzko
Copy link
Contributor

Jntzko commented Oct 11, 2016

@rhaschke I've tested your code, also just in the simulation and I get the same "start point deviates from current robot state" error.

@davetcoleman
Copy link
Member

This discussion on what tests you are using to check this PR inspired me to write a stub tutorial on the topic: moveit/moveit_tutorials#24

@rhaschke
Copy link
Contributor Author

I've tested your code, also just in the simulation and I get the same "start point deviates from current robot state" error.

@Jntzko Sorry, but I cannot reproduce the error in Kinetic with my proposed unit test. Could you, please, send me the whole output of running the unit test?

rostest moveit_ros_planning_interface python_move_group.test

@rhaschke
Copy link
Contributor Author

rhaschke commented Oct 14, 2016

I addressed most of the additional comments from @davetcoleman. Those, which are not addressed, I commented inline. I didn't separated stuff into individual PRs. They are nicely separated by individual commits.

@v4hn
Copy link
Contributor

v4hn commented Oct 16, 2016

We will try to look at this tomorrow.
There is no sense in merging this if it actually fails to achieve what it was written for.
And it does not make too much sense to me that it works fine for @rhaschke on kinetic but fails for @Jntzko on indigo.

@rhaschke
Copy link
Contributor Author

There is no sense in merging this if it actually fails to achieve what it was written for.

Agreed. However, I don't see a reason yet, why it should fail. Do you?

And it does not make too much sense to me that it works fine for @rhaschke on kinetic but fails for @Jntzko on indigo.

That's why I asked for the full log.

v4hn pushed a commit that referenced this pull request Oct 17, 2016
This is a poor-man's-replacement for rhaschke's work that
waits for the current robot state. This can be removed after
his work is merged. (See #232)

Without the additional sleep the automatic update of the start state
will pick a point near the end of the executed trajectory instead of
the current state. Let's give the monitor a moment to catch up.
@rhaschke
Copy link
Contributor Author

@Jntzko Please send the full log of the unit test to allow me to further look into this.

@Jntzko
Copy link
Contributor

Jntzko commented Oct 18, 2016

@rhaschke I have cherry picked your latest commits in my indigo branch too and ran the test multiple times.

The following error does not appear in every run within the 30 movements, but it still appears every now and then(appeared in 2 of 3 runs):

[ERROR] [1476803281.592726674]:
Invalid Trajectory: start point deviates from current robot state more than 0.01
joint 'joint_6': expected: -0.359403, current: -0.37355

I work with the testFix#442 branch of this repo:
https://github.com/Jntzko/moveit.git

Here's one of the rostest-log files where the error appeared:
rostest-tams121-26002.log.txt

@Jntzko
Copy link
Contributor

Jntzko commented Oct 18, 2016

@rhaschke I ran your test on kinetic now and increased the number of movements to 100.
When I ran the test the third time, the error appeared again:

[ERROR] [1476828218.417227953]:
Invalid Trajectory: start point deviates from current robot state more than 0.01
joint 'joint_1': expected: -0.345288, current: -0.145362

Here's the rostest-log file:
rostest-thinkpad-28045.log.txt

@rhaschke
Copy link
Contributor Author

rhaschke commented Oct 18, 2016

I could indeed reproduce the problem in Kinetic too now. The update is missed rarely but sometimes. I think, the core issue is that the TrajectoryExecutionManager signals finishing of trajectory execution as soon as the last controller values were send out. However, the controller will need some time to process them and only after some delay the final joint state will be published by the hardware or controller.

We have the same situation with fake controllers too: They publish to /move_group/fake_controller_joint_states and a joint_state_publisher re-publishes with a fixed rate and its own timestamp.

Hence, it might happen, that waitForCurrentRobotState(ros::Time::now()) is called with a threshold time earlier than the finally published joint_states. In that case we will miss these final updates. Modifying joint_state_publisher to re-publish with the received timestamp (instead of a new one), solved the problem on my side, which supports the explanation.

However, this is of course not a feasible solution. The core issue is that we cannot know the delay between the last controller command send by TrajectoryExecutionManager and final joint_states received. A potential solution could be to wait in TrajectoryExecutionManager until the robot "settles" before sending the finished signal. As I integrated a CurrentStateMonitor recently, that should be feasible - using the same noise threshold as for validate().

Hence, I suggest to drop this PR for Kinetic. However, I will file some parts of this PR as separate PRs - to be included in the Kinetic release.

@rhaschke rhaschke closed this Oct 18, 2016
v4hn added a commit that referenced this pull request Oct 19, 2016
This is a poor-man's-replacement for rhaschke's work that
waits for the current robot state. This can be removed after
his work is merged. (See #232)

Without the additional sleep the automatic update of the start state
will pick a point near the end of the executed trajectory instead of
the current state. Let's give the monitor a moment to catch up.
v4hn added a commit that referenced this pull request Oct 19, 2016
This is a poor-man's-replacement for rhaschke's work that
waits for the current robot state. This can be removed after
his work is merged. (See #232)

Without the additional sleep the automatic update of the start state
will pick a point near the end of the executed trajectory instead of
the current state. Let's give the monitor a moment to catch up.
v4hn added a commit that referenced this pull request Oct 19, 2016
This is a poor-man's-replacement for rhaschke's work that
waits for the current robot state. This can be removed after
his work is merged. (See #232)

Without the additional sleep the automatic update of the start state
will pick a point near the end of the executed trajectory instead of
the current state. Let's give the monitor a moment to catch up.
130s pushed a commit to 130s/moveit-2 that referenced this pull request Oct 21, 2016
This is a poor-man's-replacement for rhaschke's work that
waits for the current robot state. This can be removed after
his work is merged. (See moveit#232)

Without the additional sleep the automatic update of the start state
will pick a point near the end of the executed trajectory instead of
the current state. Let's give the monitor a moment to catch up.
130s pushed a commit to 130s/moveit-2 that referenced this pull request Oct 21, 2016
This is a poor-man's-replacement for rhaschke's work that
waits for the current robot state. This can be removed after
his work is merged. (See moveit#232)

Without the additional sleep the automatic update of the start state
will pick a point near the end of the executed trajectory instead of
the current state. Let's give the monitor a moment to catch up.
JafarAbdi added a commit to JafarAbdi/moveit that referenced this pull request Mar 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants