Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix JTC segfault #164

Merged
merged 2 commits into from
Apr 8, 2021
Merged

Conversation

matthew-reynolds
Copy link
Member

Purpose

Addresses #132. Fixes a segfault that could occur in regular use of the joint_trajectory_controller, as exposed by the unit tests.

Summary

There was a race condition in the JTC, where rt_active_goal_ could be reset() in one thread (joint_trajectory_controller.cpp:568) but then dereferenced elsewhere. This would cause a segfault.

For example, in JointTrajectoryController::update(), we check that rt_active_goal_ is non-null, and then dereference it a couple lines later. But if we get unlucky with timing, we can reset the shared_ptr in another thread in-between the non-null check and the dereference. We don't currently have the appropriate thread safety mecanisms in place.

By taking a copy of the rt_active_goal_ shared ptr before checking and using it, we ensure the local copy will never becoming invalid while we're holding it. The RealtimeBuffer is required since we need to read and write to the shared ptr concurrently from multiple threads.

Testing done

✔️ colcon test --packages-select joint_trajectory_controller --retest-until-fail 100

Comment on lines +590 to 600
// Update the active goal
RealtimeGoalHandlePtr rt_goal = std::make_shared<RealtimeGoalHandle>(goal_handle);
rt_goal->preallocated_feedback_->joint_names = joint_names_;
rt_goal->execute();
rt_active_goal_.writeFromNonRT(rt_goal);

// Setup goal status checking timer
goal_handle_timer_ = node_->create_wall_timer(
action_monitor_period_.to_chrono<std::chrono::seconds>(),
std::bind(&RealtimeGoalHandle::runNonRealtime, rt_active_goal_));
std::bind(&RealtimeGoalHandle::runNonRealtime, rt_goal));
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs further work, probably in a follow-up PR.

In both the old and new code, I worry that we might end up in a condition where we're overwriting the goal_handle_timer_ while the previous rt_active_goal_ has pending operations, and thus dropping those operations. I think we should call goal_handle_timer_::execute_callback() or rt_active_goal_.readFromNonRT()->runNonRealtime() before creating the new timer.

Similarly, I think we could probably end up in a race condition where something is resetting the rt_active_goal_ right after this function writes the new rt_goal, and we could end up losing the goal handle. This would involve a little more work to solve, for example only resetting the rt_active_goal_ if new_data_available_ == false. That variable is private, but something along those lines.

Would appreciate your thoughts so we can consider appropriate follow-up PRs. This PR leaves the behaviour unchanged from before, and I haven't yet run into either of those cases, so I think we're ok to leave it to a future task.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This write-up above is a perfect start for the description of a follow-up issue ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@destogl beat me to it! See #166

@matthew-reynolds
Copy link
Member Author

matthew-reynolds commented Apr 5, 2021

The CI failure looks unrelated: colcon: error: Mixin 'coverage-gcc' is not available for 'test'.

ros-tooling/setup-ros v0.1.3 was just released a few days ago, maybe something changed?

Edit: Indeed, looks like a bump in colcon-mixin changed a warning to an error. See ros-tooling/action-ros-ci#525. I will open a PR to update our CI to use action-ros-ci v0.2.0

Edit 2: PR opened, see #165

Copy link
Member

@bmagyar bmagyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, just 2 notes about creating follow-up issues

@bmagyar
Copy link
Member

bmagyar commented Apr 8, 2021

Thanks a bunch for this fix, a big step toward a green pipeline! :D

@bmagyar bmagyar merged commit 0a6fe52 into ros-controls:master Apr 8, 2021
@matthew-reynolds matthew-reynolds deleted the fix_jtc_segfault branch April 8, 2021 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants