Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use wait_for_service after creating parameters_client #219

Merged
merged 8 commits into from
Aug 23, 2017

Conversation

dhood
Copy link
Member

@dhood dhood commented Aug 12, 2017

connects to ros2/rclcpp#356

I don't think it's necessary to do any waiting when calling node->set_parameters because FWIU there are no service calls involved (correct me if I'm wrong).
Also, now that we wait in so many places, the timeout has to be increased significantly (15 waits of 20s)

CI including other branches that add the implementation of wait_for_service and use it in demos:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

this should fix flaky tests e.g. http://ci.ros2.org/view/nightly/job/nightly_linux_repeated/744/testReport/junit/(root)/projectroot/gtest_local_parameters__rmw_fastrtps_cpp but it doesn't fix them completely yet (the test can hang while waiting for the service instead of the assertion being raised after max 20s Build Status)

@dhood dhood added the in progress Actively being worked on (Kanban column) label Aug 12, 2017
@dhood dhood self-assigned this Aug 12, 2017
@dhood
Copy link
Member Author

dhood commented Aug 17, 2017

I don't think that wait_for_service is the issue here anymore. I'll provide a more detailed description tomorrow but will merge the connected PRs tonight before the nightlies start to get fewer test failures in the demo_nodes_cpp package in the meantime.

@dhood
Copy link
Member Author

dhood commented Aug 18, 2017

OK, following up. This only happens with fastrtps (here's connext passing 100 times without any other modifications: Build Status). It usually gets stuck waiting for the service response. The reason why I didn't think that it was related to the wait_for_service method of the parameters client is because sometimes this happens even after a successful call to the service, e.g.:

[ RUN      ] test_local_parameters__rmw_fastrtps_cpp.local_synchronous
Setting parameters
Listing parameters with recursive depth
Listing parameters with depth of 1
Listing parameters with depth of 2
Getting parameters
Getting nonexistent parameters
Listing parameters with recursive depth
Listing parameters with depth 100
Listing parameters with depth 1

<hangs>

It seems to wait for the Nth service call indefinitely and never receive it. This is fixed by ros2/rmw_fastrtps#143

Other times the service call is complete and the executor is trying to return success, but then it hangs. This is fixed by ros2/rmw_fastrtps#142

While investigating, I've fixed a potential memory error in ros2/rcl_interfaces#18 and also prevented another crash that occurs at shutdown with eProsima/Fast-DDS#134, but neither of those fix the hanging by themselves.

On my machine, which at this point has lots of zombie nodes lying around, to run the tests repeatedly eProsima/Fast-DDS#134 is needed to prevent crashing on teardown. On the buildfarm it didn't seem to be necessary though:

These jobs include this PR and both of the rmw_fastrtps PRs. It does NOT include the fastrtps PR or the rcl_interfaces PR. The local_parameters__rmw_fastrtps_cpp test was run 100 times checking for flakiness:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

(For the record, here are jobs that also include the fastrtps PR)
This PR, the fastrtps PR, and both of the rmw_fastrtps PRs:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@dhood dhood force-pushed the sync_param_wait_for_service branch from 2e3da85 to ce72ada Compare August 18, 2017 18:24
@dhood
Copy link
Member Author

dhood commented Aug 18, 2017

Redid the OSX job. This is rerunning local_parameters__rmw_fastrtps_cpp tests 100 times on OS X again, but this time including ros2/rcl_interfaces#18: Build Status
Since we're dealing with flaky tests it's hard to know for sure if that PR had anything to do with it, but regardless, the tests in local_paramters__rmw_fastrtps_cpp are much much less flaky now if they can pass 100 times.

This is repeating all of the test_rclcpp tests for fastrtps 100 times Build Status. The only failures are of type "service not available after waiting" but I'd consider that a separate issue that I won't investigate at this time.

In summary, with the other PRs allowing this PR to be tested thoroughly I'm confident in it and am putting this PR back in review.

@dhood dhood added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Aug 18, 2017
@@ -379,6 +404,7 @@ TEST(CLASSNAME(test_local_parameters, RMW_IMPLEMENTATION), set_parameter_if_not_
int main(int argc, char ** argv)
{
// NOTE: use custom main to ensure that rclcpp::init is called only once
setvbuf(stdout, NULL, _IONBF, BUFSIZ);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this line one line higher. Otherwise it separates the comment line from the line it is commenting.

@@ -140,15 +140,15 @@ if(BUILD_TESTING)
TIMEOUT 70)
custom_gtest(gtest_local_parameters
"test/test_local_parameters.cpp"
TIMEOUT 30)
TIMEOUT 90)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that we can reduce it to 90!

@dhood
Copy link
Member Author

dhood commented Aug 22, 2017

CI in ros2/rmw_fastrtps#147 (comment) showed that local parameter tests don't need wait_for_service, it must just be the remote ones. Indeed, https://github.com/ros2/system_tests/pull/159/files only ever added the workaround for the remote tests.

I didn't realise that when the service constructor returns, it must be good to go (I figured there was middleware work going on asynchronously). I'll remove the wait_for_service calls from the local_paramters tests.

@dhood
Copy link
Member Author

dhood commented Aug 23, 2017

Nevermind, there may/may not still be middleware work going on asynchronously. Regardless, we decided that the wait_for_service calls do belong conceptually, so I restored them.

Potentially, adding these wait_for_service calls might now make these tests flaky if whatever causes the service not available after waiting error in the remote client-server tests is, for example, a race condition that we're now bringing into these tests. I don't think that should impact our action for now, just being verbose.

adding so many wait_for_service calls means the timeout has to be large to avoid timeouts. I opened ros2/rclcpp#360 to track that we should put timeouts on the parameter calls when available so that we don't waste 5min if one service call fails.

standard CI

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status (unrelated flaky tests)

repeating parameter tests 100 times:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@dhood dhood merged commit b778786 into master Aug 23, 2017
@dhood dhood deleted the sync_param_wait_for_service branch August 23, 2017 17:29
@dhood dhood removed the in review Waiting for review (Kanban column) label Aug 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants