-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
potential fix for issue 192 #193
Conversation
I think this should raise an exception. I haven't taken a deeper look at #192, but it seems to be a race condition. Should we guard the map with a lock and also raise the exception? |
Alright, I'm throwing an exception and locking |
http://ci.ros2.org/job/ci_linux/909/ Not sure why the Windows job didn't launch, perhaps something is wrong with the build machine. |
@jacquelinekay I had to launch one manually this morning, not from launcher. |
Seems to work for me? http://ci.ros2.org/job/ci_windows/999/ Or are you saying the launcher job didn't work? |
@wjwwood the launcher didn't launch a Windows job. |
Each build captures the console output: http://ci.ros2.org/job/ci_launcher/296/console |
It looks like when the CI job is being saved through the web UI the empty CMAKE_BUILD_TYPE is being dropped. |
Should be fixed by ros2/ros2#187 |
Thanks for addressing the issue with CI. The mutex solution for this issue doesn't work as I initially thought it would (fails for Opensplice on Jenkins and locally). So, the summarize, the new test case includes a client that sends a request and then goes out of scope after getting the response, then another client (same topic name) that sends another request. I think that the new test fails because some of the state associated with the first client remains in Node after it goes out of scope. I haven't pinpointed exactly where this happens though. |
@jacquelinekay do you think it could be the references to these shared pointers? https://github.com/ros2/rclcpp/blob/issue_192/rclcpp/include/rclcpp/client.hpp#L63 I think that once the client goes out of scope, the references are invalid, but still accessed in the callback. |
@jacquelinekay changing them to shared pointers without reference might fix the issue, but I don't know for sure. |
Alright, I am totally mystified now. I fixed the segfault by changing the storage of Clients in CallbackGroup to weak_ptr instead of shared_ptr, so that the Client shared_ptr doesn't persist after the shared_ptr owned by the user goes out of scope. After this change, the test passed. However, when I changed the storage of Servers in CallbackGroup to weak_ptr as well, the test hangs with deadlock. I have no idea why that is. |
CI for this solution: http://ci.ros2.org/job/ci_linux/926/ |
well, according to Jenkins that solution still isn't good enough... |
Squashing in preparation of the merge. |
e027b1c
to
d23af5b
Compare
I tested this fix. It works mostly. I get no segfaults anymore but an exception:
I don't think that this should happen or? |
@firesurfer Can you please clarify what "this" means? Did you test with the patches from all three related PRs (see #192): |
+1 to merge (after we resolve @firesurfer's exception). |
I used this PR for testing. (I assume it's the issue_192 branch in rclcpp ?) By this I mean the exception I got.
The exception is raised after using spin_node_until_future_complete for several times. |
@firesurfer How can I reproduce the problem? I tried your client/server demo (https://github.com/firesurfer/Segfault_demo) on OSX and Linux, in Debug and Release, but can't get an exception to throw. To test, I modified your client to return from main instead of spinning at the end (https://github.com/firesurfer/Segfault_demo/blob/master/src/test.cpp#L43). If I run one instance of the server and then run the client in a bash loop, it seems to work fine. For reference, I'm on master for everything except rclcpp, where I'm on issue_192. |
After several hours of trying to reproduce this problem in an example I didn't manage to do so. this->pending_requests_.count(sequence_number) == 0 What could cause that the handle response_function is called even though there aren't any pending requests ? |
@firesurfer Are you still seeing the exception in your application, but not in a simple example? @esteve suggested that the exception be introduced because it's an error condition. Perhaps he can comment on how it might happen. |
When we debugged the client disconnect I think the server received a last a not alive instance which had a very high sequence number. Maybe that is triggering the exception https://github.com/ros2/rclcpp/blob/issue_192/rclcpp/include/rclcpp/client.hpp#L121 |
@gerkey Yes I get the exception in my application but cannot reproduce it in an example. @dirk-thomas I replaced the exception for testing with a return. Everything works fine then. (But I don't know if there are any side effects like higher cpu load - that's hard to debug) |
I replaced the exception with an error message and return in 5ac02fd. New CI jobs: |
c8268ec
to
4a04fe8
Compare
potential fix for issue 192
* adapt to NULL removal from rmw result valiation string Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com> * adapt to rmw validation change and update rcl topic validation Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com> * tweak the default error string returned Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com> * tweak error output format Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com> * fix build error Signed-off-by: Ethan Gao <ethan.gao@linux.intel.com>
Connects to #192
I'm not sure if this is a good strategy, do we need to manage the state of the client somehow?
@esteve what are your thoughts?