Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protect rcl_node_init()/fini() with rcl logging mutex. #1124

Closed

Conversation

fujitatomoya
Copy link
Collaborator

fix candidate for #1042

Copy link
Member

@ivanpauno ivanpauno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described here #1042 (comment), this "partially" fixes the issue but does not fix it completely.

We first need some consensus on how we want to fix the issue before going ahead with a solution.
If the consensus is doing something like this, we can follow-up later.

@ivanpauno ivanpauno added the bug Something isn't working label May 21, 2020
/// Return the global logging configure mutex.
RCLCPP_PUBLIC
std::shared_ptr<std::mutex>
get_rcl_logging_configure_mutex();
Copy link
Contributor

@rotu rotu May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shared_ptr in the return signature seems unnecessary. Returning by (const) reference would express intention better since:

  • The return value should never be null
  • The caller should not determine the lifetime of the returned object

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should be marked const since it shouldn't mutate the Context

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I choose to hold a shared_ptr to the mutex in #998 because of the following reason:

  • We have the default context, with static lifetime.
  • The global logging mutex also have static lifetime.
  • According to the standard, objects with static lifetime created in function scope in a single translation unit are created on first use, and destructed in the opposite order they well created. There's no ordering ensured for objects in different translation units.

Then imagine the following situation:

  • An user has a program that uses two contexts: the default one and another one with static lifetime.
  • One of both contexts get initialized first (doesn't matter which one), get_global_logging_configure_mutex is called and then the mutex is initialized (on first use).
  • The other context is created afterwards.
  • As destruction ordering is not ensured between different translation units, a possible destruction order is:
    • context1
    • mutex
    • context2

In case shutdown for both contexts wasn't explicitly called the program, the Context destructor will call it and that takes the logging mutex, which can segfault if the mutex was previously destroyed.

That's the long story of why using the shared pointer ...

Having say that, I should've added a comment in the code. The rationale is not obvious at all.

node_name.c_str(), namespace_.c_str(),
context_->get_rcl_context().get(), &rcl_node_options);
{
std::lock_guard<std::mutex> guard(*(context_->get_rcl_logging_configure_mutex()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two subtle bugs here due to using a shared_ptr:

  1. What if it's null?
  2. What if the object is deleted after the shared_ptr is deterrence?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if it's null?

We should add a check and handle that case correctly.

What if the object is deleted after the shared_ptr is deterrence?

If the Node holds the shared pointer, that's avoided.

@ivanpauno
Copy link
Member

We decided to finally add mutual exclusion protection in rclcpp/rclpy instead of rcl, ros2/rcl#660 is a start on that path.

We're near to release date and we are running out of time, for that reason we will continue the fix proposal within the ros2 team to streamline the implementation and iterate on reviews faster.

I will probably borrow some of your code @fujitatomoya, thanks for starting a fix!

@fujitatomoya
Copy link
Collaborator Author

@ivanpauno @rotu

thanks for catching up!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants