Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

defer signal handling to a singleton thread #605

Merged
merged 11 commits into from
Dec 12, 2018

Conversation

wjwwood
Copy link
Member

@wjwwood wjwwood commented Dec 8, 2018

Fixes #604

hidmic and others added 3 commits December 7, 2018 18:21
Signed-off-by: William Woodall <william@osrfoundation.org>
Signed-off-by: William Woodall <william@osrfoundation.org>
@wjwwood wjwwood added bug Something isn't working in review Waiting for review (Kanban column) labels Dec 8, 2018
@wjwwood wjwwood added this to the crystal milestone Dec 8, 2018
@wjwwood wjwwood self-assigned this Dec 8, 2018
@nuclearsandwich
Copy link
Member

@ros-pull-request-builder retest this please

@nuclearsandwich
Copy link
Member

The Cpr CI failures look like they might actually useful. The errors relate to contents of the PR and dependency packages are all available.

@nuclearsandwich
Copy link
Member

nuclearsandwich commented Dec 8, 2018

Optimistically started CI builds

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@mjcarroll
Copy link
Member

Looks like MacOS may have failed due to network issues, then failed again on test_get_node_names: https://ci.ros2.org/job/ci_osx/4837/testReport/

@hidmic
Copy link
Contributor

hidmic commented Dec 10, 2018

Another network failure. I don't see any errors though.

@nuclearsandwich
Copy link
Member

curl'd the build log locally and grep'd through it and it does appear that every test which ran passed so I'm suggesting we treat that mac build as passing.

@dirk-thomas
Copy link
Member

it does appear that every test which ran passed

The summary line explicitly says: 24522 tests, 0 errors, 0 failures, 50 skipped 👍

Karsten1987
Karsten1987 previously approved these changes Dec 10, 2018
Copy link
Contributor

@Karsten1987 Karsten1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this patch makes CI happy, then this looks good to me. I believe I don't have too much context to see to review as deeply as I wanted to, but I had a few minor comments generally to it.

rclcpp/include/rclcpp/context.hpp Outdated Show resolved Hide resolved
void
deferred_signal_handler()
{
while (true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while (is_installed()) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this would work because the location of the if (!is_installed()) { is important, in that it comes after the loop over the context's. So I'd pass on changing this logic unless @hidmic thinks it's a good idea to do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @wjwwood, location of the statement is important. Changing it may cause the process to lock forever if signal handlers are uninstalled as a result of context shutdown. Keeping both may prevent a signal to be handled if it comes while the handler is about to be uninstalled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine for me. I was just wondering if we could somehow change the while(true), but I guess in this case it's all right.

rclcpp/src/rclcpp/utilities.cpp Outdated Show resolved Hide resolved
rclcpp/src/rclcpp/context.cpp Outdated Show resolved Hide resolved
rclcpp/include/rclcpp/context.hpp Outdated Show resolved Hide resolved
rclcpp/include/rclcpp/context.hpp Outdated Show resolved Hide resolved
rclcpp/src/rclcpp/context.cpp Show resolved Hide resolved
#endif
signal_received.exchange(true);
events_condition_variable.notify_one();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @hidmic mentioned this before, calling notify_one() from a signal handler is undefined behavior. Is there anything else that can be used to wake the thread?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked him about this, but I don't remember his answer. Do you know of an alternative? Only thing I can think of is a busy wait in the signal handler loop and an atomic, but that doesn't sound very nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Searching only turns up things that are platform specific.

sem_post() for POSIX platforms http://pubs.opengroup.org/onlinepubs/009695399/functions/sem_post.html

I'm not sure about windows. I can't tell from the documentation if windows semaphores are signal safe, https://docs.microsoft.com/en-us/windows/desktop/sync/using-semaphore-objects

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like notfiy_one() in libcxx uses pthread:

https://github.com/llvm-mirror/libcxx/blob/35a0c2ce5e18ce57d8f867e1f57deffc8428d76b/include/__threading_support#L274-L278

Which I gather isn't safe enough? Is that your conclusion as well @sloretz?

On Windows it uses WakeConditionVariable():

https://github.com/llvm-mirror/libcxx/blob/b52e084ca81da9f2d03ab21a8625aeada7e64a67/src/support/win32/thread_win32.cpp#L95-L99

We're not using libcxx on Windows, but I imagine the std::condition_variable is also implemented with that function in MSVC.


What do you (or others) think we should do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sloretz Yeah, that's the thing with signal handlers 😅

AFAIK there's no completely safe cross-platform synchronization mechanism that can be used from a signal handler (if only Windows was POSIX compliant...). Busy waits, with or without a timeout, are a portable option. Not a very neat one though. We can also do some more research on how to do this right for each platform.

I'd expect the current implementation to work in almost all cases though. Unless a signal comes right when the thread is about to do something with the condition variable. and that only takes place during a negligible fraction of the process lifetime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which I gather isn't safe enough? Is that your conclusion as well @sloretz?

It uses pthread_cond_signal, which this page says

It is not safe to use the pthread_cond_signal() function in a signal handler that is invoked 
asynchronously. Even if it were safe, there would still be a race between the test of the Boolean 
pthread_cond_wait() that could not be efficiently eliminated. 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I was wrong, this is not safe for pthread, see:

It is not safe to use the pthread_cond_signal() function in a signal handler that is invoked asynchronously. Even if it were safe, there would still be a race between the test of the Boolean pthread_cond_wait() that could not be efficiently eliminated.

Mutexes and condition variables are thus not suitable for releasing a waiting thread by signaling from code running in a signal handler.
-- https://linux.die.net/man/3/pthread_cond_signal

rclcpp/src/rclcpp/utilities.cpp Show resolved Hide resolved
Signed-off-by: William Woodall <william@osrfoundation.org>
Signed-off-by: William Woodall <william@osrfoundation.org>
Signed-off-by: William Woodall <william@osrfoundation.org>
Signed-off-by: William Woodall <william@osrfoundation.org>
Signed-off-by: William Woodall <william@osrfoundation.org>
Signed-off-by: William Woodall <william@osrfoundation.org>
@wjwwood
Copy link
Member Author

wjwwood commented Dec 11, 2018

I addressed some of the feedback, still need to address the locking of the install/uninstall functions.

Also the safety of notifying from a signal.

Signed-off-by: William Woodall <william@osrfoundation.org>
@wjwwood
Copy link
Member Author

wjwwood commented Dec 11, 2018

Ok, please re-review this. I'm working on a solution for the unsafe notification issue, but I think we should merge it as-is under the idea that's it is an incremental improvement (we were doing much more than using cond var's in the signal handler before).

Can someone do CI and stuff for this while I work on the improved signal notification solution?

We can do that in a follow up pr, since it should be API and ABI neutral.

@sloretz
Copy link
Contributor

sloretz commented Dec 11, 2018

CI (testing packages above rclcpp)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Contributor

@Karsten1987 Karsten1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with green CI. Just to make sure, the flaky test is addressed with the CI configuration?

@sloretz
Copy link
Contributor

sloretz commented Dec 11, 2018

I think this is a step in the right direction, though it still doesn't solve ros2/build_farmer#152. I can reproduce the same timed-out test with this PR using

colcon test --retest-until-fail 100 --event-handlers console_direct+ --packages-select test_communication --ctest-args -R test_action_client_server__Fibonacci__rclcpp__rmw_fastrtps_cpp\$

Edit: oops ros2/build_farmer#152 (comment)

@hidmic
Copy link
Contributor

hidmic commented Dec 11, 2018

It used to happen roughly once every 10 times or so for me, so I agree here.

@wjwwood
Copy link
Member Author

wjwwood commented Dec 12, 2018

So is this good to merge then @sloretz? @hidmic have you had a chance to try and reproduce it too with this branch? @sloretz and I were not able to do so, despite the fact that it appeared this way at first (it was failing, not hanging, for a different reason).

@wjwwood wjwwood merged commit a54a329 into master Dec 12, 2018
@wjwwood wjwwood removed the in review Waiting for review (Kanban column) label Dec 12, 2018
@wjwwood wjwwood deleted the hidmic/defer-signal-handling branch December 12, 2018 03:07
cho3 pushed a commit to cho3/rclcpp that referenced this pull request Jun 3, 2019
* [WIP] Refactor signal handling.

* fix deadlock

Signed-off-by: William Woodall <william@osrfoundation.org>

* finished fixing signal handling and removing more global state

Signed-off-by: William Woodall <william@osrfoundation.org>

* add missing include of <condition_variable>

* use unordered map in signal handling class

Signed-off-by: William Woodall <william@osrfoundation.org>

* use consistent terminology

Signed-off-by: William Woodall <william@osrfoundation.org>

* use emplace in map

Signed-off-by: William Woodall <william@osrfoundation.org>

* avoid throwing in destructor

Signed-off-by: William Woodall <william@osrfoundation.org>

* words

Signed-off-by: William Woodall <william@osrfoundation.org>

* avoid throwing from destructors in a few places

Signed-off-by: William Woodall <william@osrfoundation.org>

* make install/uninstall thread-safe

Signed-off-by: William Woodall <william@osrfoundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants