This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Flaky test #29

Closed
dirk-thomas opened this issue Apr 30, 2015 · 13 comments
Labels
duplicate This issue or pull request already exists

Comments

@dirk-thomas
Member

The following test failed: http://54.183.26.131:8080/view/ros2/job/ros2_batch_ci_osx/97/testReport/(root)/test_publish_subscribe__rmw_opensplice_cpp/test_publish_subscribe/

But the same state passed in the other build: http://54.183.26.131:8080/view/ros2/job/ros2_batch_ci_osx/98/testReport/

@wjwwood wjwwood added the in progress Actively being worked on (Kanban column) label Apr 30, 2015
@wjwwood wjwwood self-assigned this Apr 30, 2015
@wjwwood
Member

wjwwood commented Apr 30, 2015

Unfortunately, I'm unable to reproduce this. I did this (after getting the latest ros2.repos, cloning the sources fresh, and building everything):

% while ./src/ament/ament_tools/scripts/ament.py test ./src --skip-build --skip-install --only userland; do :; done
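(The while cmd; do :; done idiom reruns the test command until it exits with a nonzero code, i.e. until the first failure; the : in the loop body is a no-op.)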

After over an hour it had not failed. So I guess I'll have to run the farm until it fails and then hope to look at the OpenSplice logs.

@wjwwood
Member

wjwwood commented May 5, 2015

I tried more this weekend and I was unable to reproduce this test failure on the VM, on the farm, or on my machine. I guess I'll keep pushing jobs to the farm looking for a failure, but right now I don't know how to reproduce it reliably.

@wjwwood
Member

wjwwood commented May 6, 2015

OK, I've run the test in a loop hundreds of times without failure and done 10+ full builds on the farm (builds 107-117, plus earlier builds):

http://54.183.26.131:8080/view/ros2/job/ros2_batch_ci_osx/

I can't seem to reproduce this. I'd say we should close it and reopen if it pops up again. If we want to leave it open, then I'm fine with that as well, but I don't think it makes sense for me to spend more time trying to reproduce it.

@dirk-thomas
Member Author

@tfoote
Contributor

tfoote commented May 6, 2015 via email

@wjwwood
Member

wjwwood commented May 6, 2015

I'll just store the ospl log here so it doesn't get lost:

http://54.183.26.131:8080/view/ros2/job/ros2_batch_ci_linux/ws/workspace/src/ros2/examples/userland/ospl-info.log

========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : u_domainNew:Database allocated on heap
Node        : 3695bb279581
Process     : test_publisher__rmw_opensplice_cpp <14318>
Thread      : spliced 2b460af6e700
Internals   : V6.4.140407OSS///user::u_domain::u_domainNew/u_domain.c/706/0/1430847013.274171622
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : u_domainNew:Database allocated on heap
Node        : 3695bb279581
Process     : test_subscriber__rmw_opensplice_cpp <14319>
Thread      : spliced 2b8ea9716700
Internals   : V6.4.140407OSS///user::u_domain::u_domainNew/u_domain.c/706/0/1430847013.274171611
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : ---------------------------------------------------------------
              -- The service is using the following configuration settings --
              ---------------------------------------------------------------
              Domain (id)          : ospl_sp_ddsi (0)
              Storage              : 10240 Kbytes
              Storage threshold    : 1024 Kbytes
              Storage address      : Not available
              Locking              : default
              Memory mode          : Heap memory
              Builtin topics       : true
              Priority inheritance : false
Node        : 3695bb279581
Process     : test_subscriber__rmw_opensplice_cpp <14319>
Thread      : spliced 2b8ea9716700
Internals   : V6.4.140407OSS///The OpenSplice domain service/u_domain.c/827/0/1430847013.282640899
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : ---------------------------------------------------------------
              -- The service is using the following configuration settings --
              ---------------------------------------------------------------
              Domain (id)          : ospl_sp_ddsi (0)
              Storage              : 10240 Kbytes
              Storage threshold    : 1024 Kbytes
              Storage address      : Not available
              Locking              : default
              Memory mode          : Heap memory
              Builtin topics       : true
              Priority inheritance : false
Node        : 3695bb279581
Process     : test_publisher__rmw_opensplice_cpp <14318>
Thread      : spliced 2b460af6e700
Internals   : V6.4.140407OSS///The OpenSplice domain service/u_domain.c/827/0/1430847013.282656243
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : Service cmsoap disabled
Node        : 3695bb279581
Process     : test_publisher__rmw_opensplice_cpp <14318>
Thread      : spliced 2b460af6e700
Internals   : V6.4.140407OSS///OpenSplice domain service/sr_componentInfo.c/439/0/1430847013.282884905
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : Service cmsoap disabled
Node        : 3695bb279581
Process     : test_subscriber__rmw_opensplice_cpp <14319>
Thread      : spliced 2b8ea9716700
Internals   : V6.4.140407OSS///OpenSplice domain service/sr_componentInfo.c/439/0/1430847013.283150527
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'splicedaemon' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_subscriber__rmw_opensplice_cpp <14319>
Thread      : spliced 2b8ea9716700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847013.283824133
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'splicedaemon' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_publisher__rmw_opensplice_cpp <14318>
Thread      : spliced 2b460af6e700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847013.284118283
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'ddsi2' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_subscriber__rmw_opensplice_cpp <14319>
Thread      : ddsi2 2b8eaca13700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847013.290753066
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'ddsi2' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_publisher__rmw_opensplice_cpp <14318>
Thread      : ddsi2 2b460e26b700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847013.301862006
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:13 UTC 2015
Description : The durability service can by identified by ID: 155390938
Node        : 3695bb279581
Process     : test_subscriber__rmw_opensplice_cpp <14319>
Thread      : durability 2b8eac6a2700
Internals   : V6.4.140407OSS///DurabilityService/d_durability.c/476/0/1430847013.305562558
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:14 UTC 2015
Description : The durability service can by identified by ID: 155390938
Node        : 3695bb279581
Process     : test_publisher__rmw_opensplice_cpp <14318>
Thread      : durability 2b460defa700
Internals   : V6.4.140407OSS///DurabilityService/d_durability.c/476/0/1430847014.004733349
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:20 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'durability' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_subscriber__rmw_opensplice_cpp <14319>
Thread      : durability 2b8eac6a2700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847020.313174654
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:21 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'durability' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_publisher__rmw_opensplice_cpp <14318>
Thread      : durability 2b460defa700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847021.012343457
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:24 UTC 2015
Description : Termination request (signal 2) received from PID 14317, UID 1001
Node        : 3695bb279581
Process     : test_subscriber__rmw_opensplice_cpp <14319>
Thread      : signalHandler 2b8eac0b5700
Internals   : V6.4.140407OSS///signalHandlerThread/os_signalHandler.c/503/0/1430847024.476668533
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : u_domainNew:Database allocated on heap
Node        : 3695bb279581
Process     : test_server__rmw_opensplice_cpp <14416>
Thread      : spliced 2b175c626700
Internals   : V6.4.140407OSS///user::u_domain::u_domainNew/u_domain.c/706/0/1430847025.137084301
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : u_domainNew:Database allocated on heap
Node        : 3695bb279581
Process     : test_client__rmw_opensplice_cpp <14417>
Thread      : spliced 2aebac79c700
Internals   : V6.4.140407OSS///user::u_domain::u_domainNew/u_domain.c/706/0/1430847025.142758022
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : ---------------------------------------------------------------
              -- The service is using the following configuration settings --
              ---------------------------------------------------------------
              Domain (id)          : ospl_sp_ddsi (0)
              Storage              : 10240 Kbytes
              Storage threshold    : 1024 Kbytes
              Storage address      : Not available
              Locking              : default
              Memory mode          : Heap memory
              Builtin topics       : true
              Priority inheritance : false
Node        : 3695bb279581
Process     : test_server__rmw_opensplice_cpp <14416>
Thread      : spliced 2b175c626700
Internals   : V6.4.140407OSS///The OpenSplice domain service/u_domain.c/827/0/1430847025.144236791
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : Service cmsoap disabled
Node        : 3695bb279581
Process     : test_server__rmw_opensplice_cpp <14416>
Thread      : spliced 2b175c626700
Internals   : V6.4.140407OSS///OpenSplice domain service/sr_componentInfo.c/439/0/1430847025.144455416
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'splicedaemon' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_server__rmw_opensplice_cpp <14416>
Thread      : spliced 2b175c626700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847025.145118095
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : ---------------------------------------------------------------
              -- The service is using the following configuration settings --
              ---------------------------------------------------------------
              Domain (id)          : ospl_sp_ddsi (0)
              Storage              : 10240 Kbytes
              Storage threshold    : 1024 Kbytes
              Storage address      : Not available
              Locking              : default
              Memory mode          : Heap memory
              Builtin topics       : true
              Priority inheritance : false
Node        : 3695bb279581
Process     : test_client__rmw_opensplice_cpp <14417>
Thread      : spliced 2aebac79c700
Internals   : V6.4.140407OSS///The OpenSplice domain service/u_domain.c/827/0/1430847025.149906919
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : Service cmsoap disabled
Node        : 3695bb279581
Process     : test_client__rmw_opensplice_cpp <14417>
Thread      : spliced 2aebac79c700
Internals   : V6.4.140407OSS///OpenSplice domain service/sr_componentInfo.c/439/0/1430847025.150115538
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'splicedaemon' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_client__rmw_opensplice_cpp <14417>
Thread      : spliced 2aebac79c700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847025.151134895
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'ddsi2' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_server__rmw_opensplice_cpp <14416>
Thread      : ddsi2 2b175f4fa700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847025.158122729
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : ++++++++++++++++++++++++++++++++++++++++++++++++
              ++ The service 'ddsi2' is now operational. 
              ++++++++++++++++++++++++++++++++++++++++++++++++
Node        : 3695bb279581
Process     : test_client__rmw_opensplice_cpp <14417>
Thread      : ddsi2 2aebaf670700
Internals   : V6.4.140407OSS///v_serviceChangeState/v_service.c/329/0/1430847025.163757112
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : The durability service can by identified by ID: 2069236959
Node        : 3695bb279581
Process     : test_server__rmw_opensplice_cpp <14416>
Thread      : durability 2b175f189700
Internals   : V6.4.140407OSS///DurabilityService/d_durability.c/476/0/1430847025.464427966
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:25 UTC 2015
Description : The durability service can by identified by ID: 2060337673
Node        : 3695bb279581
Process     : test_client__rmw_opensplice_cpp <14417>
Thread      : durability 2aebaf2ff700
Internals   : V6.4.140407OSS///DurabilityService/d_durability.c/476/0/1430847025.768200304
========================================================================================
Report      : INFO
Date        : Tue May 05 17:30:26 UTC 2015
Description : Termination request (signal 2) received from PID 14415, UID 1001
Node        : 3695bb279581
Process     : test_server__rmw_opensplice_cpp <14416>
Thread      : signalHandler 2b175eb9c700
Internals   : V6.4.140407OSS///signalHandlerThread/os_signalHandler.c/503/0/1430847026.450742454

@wjwwood
Member

wjwwood commented May 6, 2015

I don't see anything problematic in the ospl log. I suppose it could be a setup race condition. There's no way of knowing when the ospl subscriber is ready to receive messages, or whether the (I think) 10s of publishing before giving up is sufficient. If 10s is not enough, there may be some other issue, but currently it's pretty open-loop, and that'd be my best guess as to why this occurs; a sketch of that pattern is below. I'm open to ideas.
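For illustration, a minimal sketch of that open-loop pattern, assuming a 10s deadline and a 100ms publish period (both figures illustrative; publish_once and callback_fired are stand-ins for the real publisher and subscriber callback, not the actual test code):

// Minimal sketch (not the actual test code) of the open-loop pattern:
// publish repeatedly until a callback is observed or a fixed deadline passes.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> callback_fired{false};  // would be set by the subscriber callback

void publish_once()
{
  // stand-in for the real publish call
}

int main()
{
  const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(10);
  while (!callback_fired && std::chrono::steady_clock::now() < deadline) {
    publish_once();  // no feedback on whether the subscriber has matched yet
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  if (!callback_fired) {
    // If discovery happens to take longer than the deadline, the test fails
    // even though nothing is actually broken - one plausible source of flakiness.
    std::fprintf(stderr, "no callback within 10s, giving up\n");
    return 1;
  }
  return 0;
}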

@dirk-thomas
Copy link
Member Author

-1 on closing this. We see it happening on at least two platforms. We might want to reach out to the OpenSplice people and ask for their support.

@wjwwood
Member

wjwwood commented May 6, 2015

This one also seems to have failed: http://54.183.26.131:8080/job/ros2_batch_ci_osx/118/

@dirk-thomas dirk-thomas changed the title from Flaky test on OS X to Flaky test May 22, 2015
@wjwwood
Member

wjwwood commented Jun 10, 2015

Yet another instance: http://54.183.26.131:8080/job/ros2_batch_ci_linux/142/testReport/junit/(root)/test_publisher_subscriber_cpp__builtins__rmw_opensplice_cpp/test_publisher_subscriber/

This is after we added the environment variables that direct warnings to stdout.

@wjwwood wjwwood added ready Work is about to start (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Jun 11, 2015
@wjwwood wjwwood removed their assignment Jun 16, 2015
@dirk-thomas
Member Author

Since none of the builds referenced in this ticket are available anymore, I will add links to a recent (and hopefully similar) incident:

One flaky test (the same test passes in other builds): http://ci.ros2.org/view/nightly/job/ros2_batch_ci_osx_nightly/14/
The console shows a CTest timeout: http://ci.ros2.org/view/nightly/job/ros2_batch_ci_osx_nightly/14/consoleFull#console-section-141

But the gtest output shows a successful and complete run (no longer available: http://ci.ros2.org/view/nightly/job/ros2_batch_ci_osx_nightly/ws/workspace/build/test_rclcpp/test_results/test_rclcpp/gtest_subscription__rmw_connext_cpp.xml/*view*/):

<?xml version="1.0" encoding="UTF-8"?>
<testsuites tests="1" failures="0" disabled="0" errors="0" timestamp="2015-08-20T10:23:01" time="3.467" name="AllTests">
  <testsuite name="test_subscription__rmw_connext_cpp" tests="1" failures="0" disabled="0" errors="0" time="3.467">
    <testcase name="subscription_and_spinning" status="run" time="3.467" classname="test_subscription__rmw_connext_cpp" />
  </testsuite>
</testsuites>

Same for the run_test.py output file (http://ci.ros2.org/view/nightly/job/ros2_batch_ci_osx_nightly/ws/workspace/build/test_rclcpp/ament_cmake_gtest/gtest_subscription__rmw_connext_cpp.txt/*view*/):

Running main() from gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from test_subscription__rmw_connext_cpp
[ RUN      ] test_subscription__rmw_connext_cpp.subscription_and_spinning
RTI Data Distribution Service Evaluation License issued to OSRF (OSRF01) dthomas@osrfoundation.org For non-production use only.
Expires on 08-dec-2015 See www.rti.com for more information.
spin_node_once(nonblocking) - no callback expected
spin_node_some() - no callback expected
spin_node_once() - callback (1) expected
  callback() 1 with message data 1
spin_node_once(nonblocking) - no callback expected
spin_node_some() - no callback expected
spin_node_once(nonblocking) - callback (2) expected
  callback() 2 with message data 2
spin_node_once(nonblocking) - callback (3) expected
  callback() 3 with message data 3
spin_node_some() - callbacks (4 and 5) expected
  callback() 4 with message data 4
  callback() 5 with message data 5
spin_node_some() - no callbacks expected
[       OK ] test_subscription__rmw_connext_cpp.subscription_and_spinning (3467 ms)
[----------] 1 test from test_subscription__rmw_connext_cpp (3467 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3467 ms total)
[  PASSED  ] 1 test.

The conclusion seems to be that the test runner script executes the gtest binary successfully but then fails to exit itself. CTest then kills the test runner script when the timeout is reached and marks the test as failed. Two parts need fixing:

  • the test runner script must obviously terminate
  • the output should be visible up to the same point as in the separate output files
    • if the test itself flushes output frequently, the output remains visible in case it gets killed from the outside - even on Windows (see the sketch below)
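A minimal sketch of that frequent-flushing idea (illustrative only, not the actual test code):

// Illustrative only: push each progress line out immediately so that output
// written before an external kill (e.g. CTest killing the runner on timeout)
// is not lost in a stdio buffer.
#include <cstdio>

void progress(const char * line)
{
  std::fputs(line, stdout);
  std::fflush(stdout);  // flush after every line
}

int main()
{
  // Alternatively, disable stdout buffering entirely up front:
  // std::setvbuf(stdout, nullptr, _IONBF, 0);
  progress("[ RUN      ] example_test\n");
  progress("[       OK ] example_test\n");
  return 0;
}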

@dirk-thomas
Member Author

ament/ament_cmake#24 is an attempt to fix the reading from proc. It also writes all printed messages to the output file to ensure that the exact execution sequence can be traced, independent of CTest's behavior of caching the output (which is subject to losing buffered output in case of segfaults).
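As an illustration of the "write everything to the output file as well" part, a hypothetical sketch (not the actual ament_cmake code; the file name is made up):

// Hypothetical sketch: duplicate every line to stdout and an output file,
// flushing both, so the execution sequence survives regardless of how CTest
// caches output or whether buffered output is lost in a segfault.
#include <cstdio>

void tee_line(std::FILE * log, const char * line)
{
  std::fputs(line, stdout);
  std::fflush(stdout);  // keep the console current
  std::fputs(line, log);
  std::fflush(log);     // keep the file current as well
}

int main()
{
  std::FILE * log = std::fopen("test_output.txt", "w");
  if (log == nullptr) {
    return 1;
  }
  tee_line(log, "[ RUN      ] example_test\n");
  tee_line(log, "[       OK ] example_test\n");
  std::fclose(log);
  return 0;
}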

@jacquelinekay jacquelinekay added the duplicate This issue or pull request already exists label Oct 15, 2015
@jacquelinekay
Contributor

Duplicate of ros2/ros2#127
