
test_communication tests take a long time and provide unclear feedback on error #404

Open
rotu opened this issue Mar 10, 2020 · 3 comments
Labels
bug Something isn't working

Comments

@rotu
Contributor

rotu commented Mar 10, 2020

Bug report

Required Info:

  • Operating System:
    • All
  • Installation type:
    • All
  • Version or commit hash:
  • DDS implementation:
    • All
  • Client library (if applicable):
    • All

Steps to reproduce issue

colcon test --packages-select test_communication

Expected behavior

If failures exist, each test case terminates in a short (<1s) time and reports a relevant failure message (something like "10 messages were sent but 0 were received").

Actual behavior

On failure, the test takes a long time (10s) and the message only reports "timed out waiting for ... to finish". This makes it sound like the receiving process deadlocked. Additionally, the assertion uses a confusing string representation of the launch action object, where the subscriber executable name and arguments would be more appropriate.

https://ci.ros2.org/user/rotu/my-views/view/Extra%20RMW/job/nightly_linux-aarch64_extra_rmw_release/670/testReport/test_communication/TestPublisherSubscriber/test_subscriber_terminates_in_a_finite_amount_of_time_Arrays_/

Traceback (most recent call last):
  File "/home/jenkins-agent/workspace/nightly_linux-aarch64_extra_rmw_release/ws/build/test_communication/test_publisher_subscriber__rclpy__rclcpp__rmw_fastrtps_dynamic_cpp__rmw_cyclonedds_cpp_Release.py", line 66, in test_subscriber_terminates_in_a_finite_amount_of_time
    proc_info.assertWaitForShutdown(process=subscriber_process, timeout=10)
  File "/home/jenkins-agent/workspace/nightly_linux-aarch64_extra_rmw_release/ws/install/launch_testing/lib/python3.6/site-packages/launch_testing/proc_info_handler.py", line 144, in assertWaitForShutdown
    assert success, "Timed out waiting for process '{}' to finish".format(process)
AssertionError: Timed out waiting for process '<launch.actions.execute_process.ExecuteProcess object at 0xffff7e693e80>' to finish
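For what it's worth, a clearer message could name the executable and its arguments instead of the default object repr. A minimal sketch (the `cmd` list and helper name are hypothetical, not part of launch_testing's API):

```python
# Sketch only: format a timeout failure from the process's command line
# instead of the launch action's object repr. The `cmd` list below is
# hypothetical; launch_testing would need to expose the executable and
# arguments of the launch action for this to work.
def format_timeout_message(cmd, timeout):
    """Return a failure message naming the executable and its arguments."""
    return "Timed out after {}s waiting for process '{}' to finish".format(
        timeout, ' '.join(cmd))

msg = format_timeout_message(['test_subscriber__rclpy', 'Arrays'], 10)
# -> "Timed out after 10s waiting for process
#     'test_subscriber__rclpy Arrays' to finish"
```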
@hidmic
Contributor

hidmic commented Mar 20, 2020

I fully agree that assertion messages can (and should) be made clearer, if only by not obscuring the subscriber's output, which is where the actual test lives. But I don't think:

If failures exist, each test case terminates in a short (<1s) time and reports a relevant failure message (something like "10 messages were sent but 0 were received").

is desirable, let alone achievable in the short to mid term, particularly the part about making tests faster.


Cross-vendor pub/sub tests consist of a publisher process sending a fixed sequence of known messages in cycles, and a subscriber process asserting on those messages. In that context, the launch testing infrastructure plays the same role as rostest did in ROS 1, and nothing more: a process launcher with timeouts and return-code checks, plus test output aggregation (which, IMHO, launch_testing is particularly bad at).

Asserting on the number of sent vs. received messages implies a synchronization mechanism between the launcher, publisher, and subscriber processes; otherwise latencies in process creation, process scheduling, DDS participant creation, and DDS participant discovery, to name a few, will render the test flaky and unpredictable. No such mechanism currently exists in the framework. We can discuss later whether one should be introduced, but it's certainly unlikely to land in the short term.

We could achieve something like what you describe by simplifying the test down to whether a message was received or not, but that would then be a different test.

@rotu
Contributor Author

rotu commented Mar 20, 2020

  1. There is a synchronization mechanism implied; all messages must be received within a 10s window from launch for the test to be considered passing. That window seems unreasonably long and could be made much shorter. Considering all test cases I can see pass in < 2 seconds, it seems that window can be reduced.

  2. Yes, it's totally reasonable to have some communication from the listener back to the test. Right now, that communication is an exit status. It could be a file, shared memory, a stream, or any sort of IPC channel.
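As a sketch of the kind of back-channel meant here, the subscriber could emit a machine-readable summary line that the test parses from captured output. Names and format are purely illustrative, not an existing ROS 2 mechanism:

```python
# Sketch of one possible back-channel: the subscriber prints a
# machine-readable summary line, and the launcher parses it from captured
# output. The "RESULT:" format is made up for illustration.
import re

def report(received, expected):
    # Emitted by the subscriber just before exiting.
    return 'RESULT: received {} of {} messages'.format(received, expected)

def parse_report(output):
    # Run by the test to turn captured output into a useful assertion.
    m = re.search(r'RESULT: received (\d+) of (\d+) messages', output)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

line = report(0, 10)
assert parse_report(line) == (0, 10)  # "10 sent, 0 received" style feedback
```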

@hidmic
Contributor

hidmic commented Mar 21, 2020

There is a synchronization mechanism implied; all messages must be received within a 10s window from launch for the test to be considered passing.

The processes' start of execution is synchronized, but that's far from enough. That window is a worst-case allowance, though AFAIK it's true that no exhaustive search has been conducted to find a lower bound for it.

That window seems unreasonably long and could be made much shorter. Considering all test cases I can see pass in < 2 seconds, it seems that window can be reduced.

Which tests? With which RMW implementation? On which platform? Under what CPU load?

Yes, it's totally reasonable to have some communication from the listener back to the test.

The listener is the test. So I'd refrain from rolling out an IPC mechanism just so the launcher can do the assertion in this test.


Thinking about this again, we could explore giving the listener more timeouts, e.g. a timeout to first message arrival and a timeout to final message arrival, instead of a single global one, though I wouldn't dare guess how small these can be made while keeping the test passing for all (RMW, OS) combinations. You're more than welcome to contribute an attempt.
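The two-timeout idea could be sketched roughly like this, with `queue.Queue` standing in for the subscriber's incoming message stream and purely illustrative timeout values (not tuned for any RMW/OS combination):

```python
# Sketch of the two-timeout idea: a longer timeout to the *first* message
# (covering process startup and DDS discovery latency) and a separate,
# tighter timeout between subsequent messages. queue.Queue stands in for
# the subscriber's incoming message stream; the values are illustrative.
import queue

def wait_for_messages(msgs, expected, first_timeout=5.0, rest_timeout=1.0):
    received = []
    try:
        received.append(msgs.get(timeout=first_timeout))
        while len(received) < expected:
            received.append(msgs.get(timeout=rest_timeout))
    except queue.Empty:
        pass  # fall through; the caller asserts on what actually arrived
    return received

q = queue.Queue()
for i in range(3):
    q.put('msg%d' % i)
got = wait_for_messages(q, expected=3)
assert len(got) == 3
```

This way a subscriber that never discovers the publisher fails at the first timeout with a distinct message, rather than burning the full global window.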
