potential fix for issue 192 #193

Merged
merged 1 commit into from
Feb 16, 2016

jacquelinekay
Contributor

Connects to #192

I'm not sure this is a good strategy. Do we need to manage the state of the client somehow?

@esteve what are your thoughts?

@jacquelinekay jacquelinekay added the in progress Actively being worked on (Kanban column) label Feb 3, 2016
@esteve
Member

esteve commented Feb 3, 2016

I think this should raise an exception. I haven't taken a deeper look at #192, but it seems to be a race condition. Should we guard the map with a lock and also raise the exception?

@jacquelinekay
Contributor Author

Alright, I'm throwing an exception and locking pending_requests.
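
For readers following along, here is a minimal sketch of the approach described above: a map of pending requests guarded by a mutex, with an exception thrown when a response arrives for an unknown sequence number. The class name `PendingRequests`, its members, and the string payload are hypothetical stand-ins for illustration, not the rclcpp implementation; only the error text is taken from this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a client's request bookkeeping (not rclcpp code).
class PendingRequests
{
public:
  using Callback = std::function<void(const std::string &)>;

  // Register a callback to run when the response for sequence_number arrives.
  void add(std::int64_t sequence_number, Callback callback)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_[sequence_number] = std::move(callback);
  }

  // Deliver a response; throws if no request with that sequence number is pending.
  void handle_response(std::int64_t sequence_number, const std::string & response)
  {
    Callback callback;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto it = pending_.find(sequence_number);
      if (it == pending_.end()) {
        throw std::runtime_error(
                "Received invalid sequence number. Request no longer exists.");
      }
      callback = std::move(it->second);
      pending_.erase(it);
    }
    // Run the callback after releasing the lock so it may safely call add().
    callback(response);
  }

  // Number of requests still waiting for a response.
  std::size_t size() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return pending_.size();
  }

private:
  mutable std::mutex mutex_;
  std::map<std::int64_t, Callback> pending_;
};
```

The lock only protects the map itself; as the later comments in this thread show, it does not address object-lifetime problems elsewhere.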

@jacquelinekay
Contributor Author

http://ci.ros2.org/job/ci_linux/909/
http://ci.ros2.org/job/ci_osx/768/

Not sure why the Windows job didn't launch, perhaps something is wrong with the build machine.

@esteve
Member

esteve commented Feb 8, 2016

@jacquelinekay I had to launch one manually this morning, not from the launcher.

@wjwwood
Member

wjwwood commented Feb 8, 2016

Seems to work for me? http://ci.ros2.org/job/ci_windows/999/ Or are you saying the launcher job didn't work?

@esteve
Member

esteve commented Feb 8, 2016

@wjwwood the launcher didn't launch a Windows job.

@dirk-thomas
Member

Each build captures the console output: http://ci.ros2.org/job/ci_launcher/296/console

@dirk-thomas
Member

It looks like the empty CMAKE_BUILD_TYPE is dropped when the CI job is saved through the web UI.

@dirk-thomas
Member

Should be fixed by ros2/ros2#187

@jacquelinekay
Contributor Author

Thanks for addressing the issue with CI.

The mutex solution for this issue doesn't work as I initially thought it would (fails for Opensplice on Jenkins and locally).

So, to summarize: the new test case includes a client that sends a request and goes out of scope after getting the response, then another client (same topic name) that sends another request.

I think that the new test fails because some of the state associated with the first client remains in Node after it goes out of scope. I haven't pinpointed exactly where this happens though.

@esteve
Member

esteve commented Feb 9, 2016

@jacquelinekay do you think it could be the references to these shared pointers?

https://github.com/ros2/rclcpp/blob/issue_192/rclcpp/include/rclcpp/client.hpp#L63

I think that once the client goes out of scope, the references are invalid, but still accessed in the callback.

@esteve
Member

esteve commented Feb 9, 2016

@jacquelinekay changing them to shared pointers without reference might fix the issue, but I don't know for sure.
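
As a general illustration of the lifetime concern raised here (a sketch of the C++ rule only, not a confirmed diagnosis of the linked code): a callback that holds a `std::shared_ptr` by value keeps the pointee alive, whereas one that holds only a reference to a `shared_ptr` dangles once the referenced pointer object is destroyed. The names `Response` and `make_callback` are hypothetical.

```cpp
#include <functional>
#include <memory>

// Hypothetical payload type, used only for this illustration.
struct Response
{
  int value;
};

// Taking the shared_ptr by value and capturing it by value gives the callback
// its own owning copy, so the Response stays alive as long as the callback does.
// Capturing by reference instead ([&response]) would leave the callback holding
// a reference to a pointer object that is destroyed when this function returns.
std::function<int()> make_callback(std::shared_ptr<Response> response)
{
  return [response]() {return response->value;};
}
```

In the by-value form the callback can safely run after the caller's own pointer has gone out of scope, at the cost of extending the object's lifetime until the callback itself is destroyed.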

@jacquelinekay
Contributor Author

Alright, I am totally mystified now.

I fixed the segfault by changing the storage of Clients in CallbackGroup to weak_ptr instead of shared_ptr, so that the Client shared_ptr doesn't persist after the shared_ptr owned by the user goes out of scope. After this change, the test passed.

However, when I changed the storage of Servers in CallbackGroup to weak_ptr as well, the test hung with a deadlock.

I have no idea why that is.
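
A minimal sketch of the weak_ptr storage idea described above, under the assumption that the group only needs to observe clients rather than own them. `CallbackGroupSketch` and this `ClientBase` are simplified, hypothetical stand-ins, not the rclcpp classes, and the sketch does not attempt to explain the deadlock seen with services.

```cpp
#include <memory>
#include <vector>

// Simplified, hypothetical client type for illustration.
struct ClientBase
{
  int id;
};

// Stores non-owning weak_ptrs so a client is destroyed as soon as the
// user's last shared_ptr to it goes out of scope.
class CallbackGroupSketch
{
public:
  void add_client(const std::shared_ptr<ClientBase> & client)
  {
    clients_.push_back(client);
  }

  // Returns owning pointers to the clients that are still alive and
  // prunes entries whose client has already been destroyed.
  std::vector<std::shared_ptr<ClientBase>> live_clients()
  {
    std::vector<std::shared_ptr<ClientBase>> alive;
    auto it = clients_.begin();
    while (it != clients_.end()) {
      if (auto client = it->lock()) {
        alive.push_back(client);
        ++it;
      } else {
        it = clients_.erase(it);
      }
    }
    return alive;
  }

private:
  std::vector<std::weak_ptr<ClientBase>> clients_;
};
```

Anything iterating the group must call `lock()` and handle the expired case, since a client can disappear between two iterations.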

@jacquelinekay
Contributor Author

Well, according to Jenkins, that solution still isn't good enough...

@dirk-thomas dirk-thomas added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Feb 13, 2016
@dirk-thomas
Member

Squashing in preparation for the merge.

@firesurfer

I tested this fix. It mostly works. I no longer get segfaults, but I do get an exception:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Received invalid sequence number. Request no longer exists.

I don't think this should happen, should it?

@dirk-thomas
Member

@firesurfer Can you please clarify what "this" means? Did you test with the patches from all three related PRs (see #192):

@gerkey
Member

gerkey commented Feb 13, 2016

+1 to merge (after we resolve @firesurfer's exception).

@firesurfer

I used this PR for testing. (I assume it's the issue_192 branch in rclcpp?)
My build uses rmw_opensplice, so I didn't use the patch for rmw_connext.

By "this" I mean the exception I got.
The full backtrace from gdb:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Received invalid sequence number. Request no longer exists.

Program received signal SIGABRT, Aborted.
0x00007fffed938507 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
55  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace 
#0  0x00007fffed938507 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fffed9398da in __GI_abort () at abort.c:89
#2  0x00007fffee47033d in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffee46e3b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffee46e401 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fffee46e619 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff0f07d9b in rclcpp::client::Client<rcl_interfaces::srv::ListParameters>::handle_response(std::shared_ptr<void>, std::shared_ptr<void>) ()
   from /home/firesurfer/workspace/ros2_ws/install/lib/librclcpp.so
#7  0x00007ffff0eb746f in rclcpp::executor::Executor::execute_client(std::shared_ptr<rclcpp::client::ClientBase>) () from /home/firesurfer/workspace/ros2_ws/install/lib/librclcpp.so
#8  0x00007ffff0eb6ddb in rclcpp::executor::Executor::execute_any_executable(std::shared_ptr<rclcpp::executor::AnyExecutable>) ()
   from /home/firesurfer/workspace/ros2_ws/install/lib/librclcpp.so
#9  0x00007ffff0eb6a41 in rclcpp::executor::Executor::spin_once(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /home/firesurfer/workspace/ros2_ws/install/lib/librclcpp.so
#10 0x00000000006e3d81 in rclcpp::executor::Executor::spin_until_future_complete<std::vector<rclcpp::parameter::ParameterVariant, std::allocator<rclcpp::parameter::ParameterVariant> >, std::ratio<1l, 1000l> > (this=0x7fffffffbcc0, future=..., timeout=...)
    at /home/firesurfer/workspace/ros2_ws/install/include/rclcpp/executor.hpp:184
#11 0x00000000006d2636 in rclcpp::executors::spin_node_until_future_complete<std::vector<rclcpp::parameter::ParameterVariant, std::allocator<rclcpp::parameter::ParameterVariant> >, std::ratio<1l, 1000l> > (executor=..., node_ptr=std::shared_ptr (count 80, weak 1) 0x9f3e30, future=..., 

The exception is raised after calling spin_node_until_future_complete several times.

@gerkey
Member

gerkey commented Feb 14, 2016

@firesurfer How can I reproduce the problem? I tried your client/server demo (https://github.com/firesurfer/Segfault_demo) on OSX and Linux, in Debug and Release, but can't get an exception to throw. To test, I modified your client to return from main instead of spinning at the end (https://github.com/firesurfer/Segfault_demo/blob/master/src/test.cpp#L43). If I run one instance of the server and then run the client in a bash loop, it seems to work fine.

For reference, I'm on master for everything except rclcpp, where I'm on issue_192.

@firesurfer

After several hours of trying, I didn't manage to reproduce this problem in an example.
The exception is thrown because this statement is true:

this->pending_requests_.count(sequence_number) == 0

What could cause the handle_response function to be called even though there aren't any pending requests?

@gerkey
Member

gerkey commented Feb 15, 2016

@firesurfer Are you still seeing the exception in your application, but not in a simple example?

@esteve suggested that the exception be introduced because it's an error condition. Perhaps he can comment on how it might happen.

@dirk-thomas
Member

When we debugged the client disconnect, I think the server received a final not-alive instance with a very high sequence number. Maybe that is triggering the exception: https://github.com/ros2/rclcpp/blob/issue_192/rclcpp/include/rclcpp/client.hpp#L121
I don't think throwing an exception in this case is the right solution. If the case should not simply be ignored silently, could it print a warning instead?

@firesurfer

@gerkey Yes I get the exception in my application but cannot reproduce it in an example.

@dirk-thomas For testing, I replaced the exception with a return. Everything works fine then. (But I don't know whether there are side effects such as higher CPU load; that's hard to debug.)

@dirk-thomas
Member

I replaced the exception with an error message and return in 5ac02fd.
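
A rough sketch of what "error message and return" can look like for a response whose sequence number has no pending request. This is not the code from the commit above: the function name `handle_response_or_warn`, the type aliases, the string payload, and the exact message wording are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical aliases; the real client stores different types.
using ResponseCallback = std::function<void(const std::string &)>;
using PendingMap = std::map<std::int64_t, ResponseCallback>;

// Delivers the response if a matching request is pending and returns true.
// For an unknown sequence number it reports the problem on stderr and
// returns false instead of throwing, so the caller keeps running.
bool handle_response_or_warn(
  PendingMap & pending_requests,
  std::int64_t sequence_number,
  const std::string & response)
{
  auto it = pending_requests.find(sequence_number);
  if (it == pending_requests.end()) {
    // Message wording is an assumption, not copied from the commit.
    std::fprintf(stderr, "Received invalid sequence number. Ignoring...\n");
    return false;
  }
  ResponseCallback callback = std::move(it->second);
  pending_requests.erase(it);
  callback(response);
  return true;
}
```

Returning a flag (rather than `void`) is a choice made here so callers and tests can tell whether the response was delivered.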

New CI jobs:

dirk-thomas added a commit that referenced this pull request Feb 16, 2016
potential fix for issue 192
@dirk-thomas dirk-thomas merged commit 69f7bca into master Feb 16, 2016
@dirk-thomas dirk-thomas deleted the issue_192 branch February 16, 2016 22:19
@dirk-thomas dirk-thomas removed the in review Waiting for review (Kanban column) label Feb 16, 2016
nnmm pushed a commit to ApexAI/rclcpp that referenced this pull request Jul 9, 2022
* adapt to NULL removal from rmw result valiation string

Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com>

* adapt to rmw validation change and update rcl topic validation

Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com>

* tweak the default error string returned

Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com>

* tweak error output format

Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com>

* fix build error

Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com>