Skip to content
This repository was archived by the owner on Jul 8, 2022. It is now read-only.

Conversation

@mliszcz
Copy link
Collaborator

@mliszcz mliszcz commented Jun 15, 2019

This PR resets the database and device servers for each tests.

There is a wrapper script:

run_with_fixture.sh <test-binary> [param1] [param2] ...

The script:

  1. starts docker with mysql and database ds
  2. starts DevTest and FwdTest
  3. runs <test-binary> (forwards all parameters)
  4. stops DevTest and FwdTest
  5. stops dockers

Note: resetting environment for each CxxTest's test_something() function is not in scope of this PR.

Changes

  • Fix tests that were depending on configuration set in previous tests:
    • 24ece26 Correct event::att_conf_event_buffer test
    • 4688e24 Correct CXX::cxx_mem_attr test
    • 8b69890 Correct event::event_lock test
  • Docker client is needed to start mysql/database ds from wrapper script
    • 840d759 Expose docker daemon in CI containers
  • tango_admin is needed to ping database / admin device
    • ebdd4ed Install tango_admin in CI containers
  • The script itself
    • dfb49e0 Add new test runner and server start/stop scripts
    • ae2dd81 Enable new test runner script in CI
  • We can now run tests in parallel
    • d86ffeb Run tests in parallel in CI
  • Update README
    • 666d719 Update instructions how to run tests
  • Remove old scripts, they will no longer work
    • b3db05b Remove old scripts for running tests

Testcase output is collected in test_results directory:

cpp_test_suite/test_results/
└── cxx_stateless_subscription_20190616.193616.465333747
    ├── conf_devtest.out
    ├── cxx_stateless_subscription_exit_code.out
    ├── cxx_stateless_subscription.out
    ├── DevTest_test.0.out
    ├── DevTest_test2.0.out
    ├── DevTest_test.out
    ├── FwdTest_test.out
    └── server_pids
        ├── 3122
        └── 3152

Results

I've compared ctest output with and without changes in this PR. It is the same (timestamps, pids, etc. ... differ of course), except CXX::cxx_old_poll, which seems to produce different output on every run.

There is ~10s overhead for each test (start/stop dockers/servers). This gives ~15m increase in sequential test execution of ~90 tests.

If we run multiple tests in parallel, some may fail (due to timer expiration under high load).
If there are any failures in CI, I re-run failed tests sequentially and require them to pass two times in a row.

8 parallel tests is a reasonable value for Travis (0 - 4 tests usually fail). I tried with 16 parallel tests but on first run 21 out of 93 tests failed and there was no improvement in execution time.

Remarks

  • Codacy is failing due to changes in Dockerfiles. The bot suggest to specify version in apt like: apt install package=x.y.z. I don't think it will do any good.
  • I'm not sure if/how code coverage calculation is impacted. I needed to adjust CMAKE_CTEST_COMMAND for coveralls plugin, but coveralls seems to be no longer in use. @Ingvord could you comment on? Is it possible to see coverage report for this PR on sonar?

mliszcz added 3 commits June 13, 2019 12:39
Explicitly set attribute's min_value to 0. Until now, min_value = 0
was carried from previous testcase (event::att_conf_event).
Write Short_attr_w value so that device is in ALARM state.
Until now, the value = 10 was carried from previous testcase
(CXX::cxx_attr_write, test_write_some_scalar_attributes).
Set abs_change property for attribute event_change_tst. Until now,
property was carried from previous testcase (event::multi_event).
@mliszcz mliszcz force-pushed the tests-reset-environment branch from 3601eb5 to c8313c7 Compare June 15, 2019 19:55
@tango-controls tango-controls deleted a comment Jun 15, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from c8313c7 to 8318481 Compare June 15, 2019 20:06
@tango-controls tango-controls deleted a comment Jun 15, 2019
@tango-controls tango-controls deleted a comment Jun 15, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from 8318481 to 840d759 Compare June 15, 2019 20:13
@tango-controls tango-controls deleted a comment Jun 15, 2019
@tango-controls tango-controls deleted a comment Jun 15, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from d2f2838 to ebdd4ed Compare June 15, 2019 20:32
@tango-controls tango-controls deleted a comment Jun 15, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from 92c3ac4 to c8f0da6 Compare June 15, 2019 21:17
@tango-controls tango-controls deleted a comment Jun 15, 2019
@tango-controls tango-controls deleted a comment Jun 15, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from c8f0da6 to 8bc2ba3 Compare June 15, 2019 21:29
@tango-controls tango-controls deleted a comment Jun 15, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from 8bc2ba3 to dfb49e0 Compare June 15, 2019 21:45
@tango-controls tango-controls deleted a comment Jun 15, 2019
@tango-controls tango-controls deleted a comment Jun 15, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from 1697660 to a51a7a1 Compare June 16, 2019 05:29
@tango-controls tango-controls deleted a comment Jun 16, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from a51a7a1 to ae2dd81 Compare June 16, 2019 06:12
@tango-controls tango-controls deleted a comment Jun 16, 2019
@tango-controls tango-controls deleted a comment Jun 16, 2019
@mliszcz mliszcz force-pushed the tests-reset-environment branch from 24410b2 to 41bde04 Compare June 16, 2019 09:40
@mliszcz mliszcz force-pushed the tests-reset-environment branch from b3db05b to 809fb3a Compare June 16, 2019 17:39
@bourtemb
Copy link
Member

When looking at the first log of Travis, it seems like one test failed (37 - CXX::cxx_reconnection_zmq (Failed)) but Travis check status is green as if everything was fine.
So there must be a problem with the final value returned by the script ran by Travis.

@mliszcz
Copy link
Collaborator Author

mliszcz commented Jul 18, 2019

So there must be a problem with the final value returned by the script ran by Travis.

This is by design.

We have problems with our testcases as they rely on timers and sleeps.
When we run multiple tests in parallel, the system experiences a higher load and tests fail more often.
That's why I added a second CTest run with -rerun-failed.

if ! docker exec -w "${build_dir}" cpp_tango ctest -output-on-failure -j8
then
  if ! docker exec -w "${build_dir}" cpp_tango ctest --output-on-failure --rerun-failed --repeat-until-fail 2
  then
    exit 1
  fi
fi

The test failed in the first run and then passed two times in a row on the second run, so I assumed that the test is unstable and the commit should be accepted. Is this behavior ok from your pov?

99% tests passed, 1 tests failed out of 93
Total Test time (real) = 287.51 sec
The following tests FAILED:
	 37 - CXX::cxx_reconnection_zmq (Failed)
Errors while running CTest
Test project /home/tango/src/build
    Start 37: CXX::cxx_reconnection_zmq
    Test #37: CXX::cxx_reconnection_zmq ........   Passed  103.31 sec
    Start 37: CXX::cxx_reconnection_zmq
1/1 Test #37: CXX::cxx_reconnection_zmq ........   Passed  103.49 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 206.81 sec

@bourtemb
Copy link
Member

Indeed, you already explained that in the long description of the Pull Request...
Sorry for making you repeat yourself... I was so used that when a test fails, Travis check is marked as red...
Your proposal is a clever approach.
I think the best would be that each test never fails, but with your approach, if a test is failing, it should pass again at least twice successfully to be considered as OK.
I think this could be acceptable in the current situation.
Still the best would be that each test never fails.

Could we infer the number of available cores on the Travis machine and use that number (or a relevant number for this number of available cores) in the -j option?

@tango-controls tango-controls deleted a comment Jul 19, 2019
@tango-controls tango-controls deleted a comment Jul 19, 2019
@mliszcz
Copy link
Collaborator Author

mliszcz commented Jul 19, 2019

Could we infer the number of available cores on the Travis machine and use that number (or a relevant number for this number of available cores) in the -j option?

Do you think we need to determine the number of available cores, i.e. can it change? Per travis docs, it is fixed to 2: https://docs.travis-ci.com/user/reference/overview/

Also, there is $TRAVIS_NUMCORES: travis-ci/travis-build#1079

Anyway, I did a dummy commit. According to lscpu output and /proc/cpuinfo, there are two CPUs in the machine used by travis. Judgind by the clock rate and L3 cache those may be Xeons E5-2699 v3. There is some virtualization involved (KVM) and we have just one hyper-threaded core in each CPU. This means that we have four cores in total.

The problem is to determine the relation between the cpu cores and the number of parallel tests.

Some tests are sleeping for a long time, some are waiting for I/O, etc. ... and in such cases another process may have a chance to run on the cpu.

From original post:
sequential: 1856.24 sec
8 in parallel: 287.84 sec (so speedup is almost linear, 1856.24 sec/8 = 232)
16 is too much

If we really want flexibility here, I propose to use: (num of jobs) = 2 * (num of cores).
Update: (num of cores) above includes HT sibling cores.

@mliszcz
Copy link
Collaborator Author

mliszcz commented Nov 4, 2019

@bourtemb, @t-b, @Ingvord is there any interest in having this PR merged? If yes, could you please do a review? I can then do a rework, apply your comments and resolve any confllicts. Otherwise I can close this PR without merging.

@t-b
Copy link
Collaborator

t-b commented Nov 4, 2019

@mliszcz I would be in favour of having this here.

Copy link
Member

@Ingvord Ingvord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me. Go ahead and merge!

@mliszcz
Copy link
Collaborator Author

mliszcz commented Nov 4, 2019

@t-b thanks!. Have you had a chance to read through the changes?

@t-b, @Ingvord, @bourtemb
I think we need to discuss at least whether this feature should be

  • optional (cmake flag, enabled by default?)
  • or mandatory (with old behavior removed, including start-/stop-tango targets and pre_test / post_test scripts)?

Currently I've made it mandatory, so that you can't start the database or device servers manually. All you can (and need to) do is to run ctest. This makes running tests simpler but maybe some more complex scenarios (or manual testing) will be imposible to execute. What is your opinion?

@Ingvord
Copy link
Member

Ingvord commented Nov 4, 2019

I believe you can always run DevTest and co manually using e.g. CLion. Since these are just executables. Obviously one will need to run conf_devtest on a fresh database before running any test server. Would be nice to have some convenience script for that matter.

Anyway for me this seems to be enough to let's say debug server/client. I can not envision any other complex case right away. Once anyone bumps into that we will handled it.

@bourtemb
Copy link
Member

bourtemb commented Nov 4, 2019

Currently I've made it mandatory, so that you can't start the database or device servers manually. All you can (and need to) do is to run ctest. This makes running tests simpler but maybe some more complex scenarios (or manual testing) will be imposible to execute. What is your opinion?

I agree with Igor. We can change the behaviour if needed in the future.

@t-b
Copy link
Collaborator

t-b commented Nov 5, 2019

@mliszcz I'll make the review after the rebase.

@mliszcz
Copy link
Collaborator Author

mliszcz commented Nov 5, 2019

@t-b thanks! I won't be able to rebase the PR at least during the next couple of days. Also, let's merge first #608 which will bring more conflicts and will require some extra changes to start the second database just for group test.

If you have any general comments regarding the proposed solution, please tell. Otherwise I'll ask you for a review once I have the PR rebased.

@t-b
Copy link
Collaborator

t-b commented Nov 5, 2019

@mliszcz The general direction is very nice!

Copy link
Collaborator

@t-b t-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mliszcz This is a really cool PR!!

The title says only something about databaser server resetting but in fact running the tests locally now just works! It took here 240s with 12 parallel jobs. I had to cherry-pick 4b5128f (zmqeventconsumer/zmqeventsupplier: Make it compatible with all stock cppzmq versions, 2019-06-05) though. So we need to get that merged :))

@mliszcz
Copy link
Collaborator Author

mliszcz commented Dec 11, 2019

Thanks @t-b

It will be hard to bring this whole PR up to date.
I already descoped some parts to separate PRs (which can be merged without conflicts).
Before rebasing this one on master, I'd like to have these merged: #632, #633, #634.
I see you already approved the other PRs. @Ingvord or @bourtemb could you please check them as well?

@t-b
Copy link
Collaborator

t-b commented Dec 11, 2019

@mliszcz I've merged the mentioned PRs.

@mliszcz
Copy link
Collaborator Author

mliszcz commented Dec 15, 2019

Obsoleted by #640. Let's continue the review in the other PR.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants