
Conversation

@mliszcz (Collaborator) commented Dec 15, 2019

Another approach to #568. Changes (compared to #568):

  • more robust cleanup (servers and containers are stopped in a detached background process) - 217.26 sec for 93 passed tests with 12 parallel jobs on my laptop
  • run_with_fixture.sh script simplified (it now re-uses start_server.sh and kill_server.sh)
  • the first DevTest process logs to the DevTest_test.0.log file (instead of DevTest_test.log)
  • possibility to run a test case without setting up servers and containers (export TANGO_TEST_CASE_SKIP_FIXTURE=1; see the sketch below)
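
A minimal sketch of how this flag could be used to debug a single test against a fixture that was set up by hand (the environment variable comes from this PR; the setup-script invocation and test name are only illustrative assumptions):

```bash
#!/usr/bin/env bash
# Bring up the test environment once (database container and device
# servers), then run one test case against it without re-creating the
# fixture for every run.

# 1. manual setup, e.g. via the extracted setup scripts mentioned in this PR
./start_server.sh

# 2. tell the test runner not to set up servers/containers itself
export TANGO_TEST_CASE_SKIP_FIXTURE=1
ctest -R cxx_fwd_att --output-on-failure

# 3. tear everything down when finished
./kill_server.sh
```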

@mliszcz mliszcz requested review from bourtemb and t-b as code owners December 15, 2019 10:13
@mliszcz mliszcz requested a review from Ingvord December 15, 2019 13:27
@mliszcz mliszcz force-pushed the reset-database-and-ds-for-each-test branch 2 times, most recently from 05ebc53 to 5889306 Compare December 15, 2019 21:54
@bourtemb (Member) left a comment

@mliszcz , I love this PR.
It seems to really speed up the tests and should help to make them more reproducible, especially when we run them independently.
Since this PR is also about parallelizing things, please consider compiling with the -j 2 option to speed up the Travis builds even more (Travis machines have 2 cores), like in the following commit:
dbcef00

I did the test with make -j 2 in Travis and got some good results in terms of speed:
https://travis-ci.org/bourtemb/cppTango/builds/625802785
compared to:
https://travis-ci.org/tango-controls/cppTango/builds/625414079
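
For reference, a minimal sketch of the corresponding build invocation (illustrative only, not the actual cppTango Travis configuration):

```bash
# Compile with two parallel jobs, matching the two cores of a Travis machine.
make -j 2

# Equivalent when driving the build through CMake:
cmake --build . -- -j 2
```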

I don't much like the retry at the end for the tests which failed, because Travis will report green if the tests which previously failed pass the second time. This might hide some issues, and I'm afraid people won't look at the logs when Travis is green. I would prefer people to re-trigger the build manually in this case; it would encourage us to fix the tests which fail most often. I can understand such a situation could annoy people, so I can accept the proposed solution for the moment, especially because more tests might fail if we run them in parallel, depending on the load of the Travis machine at the time of the build.

I didn't find the time to test it on my machine. That's why I don't approve it for the moment.
But please feel free to merge if you feel confident enough.
This PR should not break anything in the Tango library.

@andygotz (Collaborator) commented:

Nice work! If I understand correctly there is roughly a 20% decrease (from 25 to 20 minutes) in the time needed to run the tests? This will help cool the planet and make developers happy! Well done!

@mliszcz (Collaborator, Author) commented Dec 19, 2019

Hi Andy,

With the current configuration (8 parallel tests), there is a 75% (3/4) decrease in test-suite time, e.g. on Debian 10:

  • before: Total Test time (real) = 1016.31 sec
  • this PR: Total Test time (real) = 247.26 sec

The overall build time decreases by ~10 minutes.

Previously:
https://travis-ci.org/tango-controls/cppTango/builds/626270998?utm_source=github_status&utm_medium=notification

Now:
https://travis-ci.org/tango-controls/cppTango/builds/625414079?utm_source=github_status&utm_medium=notification

@mliszcz (Collaborator, Author) commented Dec 19, 2019

I did some more tests:

-j12 https://travis-ci.org/tango-controls/cppTango/builds/627128708?utm_source=github_status&utm_medium=notification
-j8 https://travis-ci.org/tango-controls/cppTango/builds/625414079?utm_source=github_status&utm_medium=notification
-j4 https://travis-ci.org/tango-controls/cppTango/builds/627111183?utm_source=github_status&utm_medium=notification

If there are two numbers in the 'Time' column, it means: (all tests run in parallel) + (re-run of failed tests).

|                 | Deb 10         | Deb 10 (rel) | Deb 9          | Deb 8          | Deb 7           |
|-----------------|----------------|--------------|----------------|----------------|-----------------|
| Time [s] (-j12) | 187.45 + 34.45 | 183.12       | 192.53 + 67.32 | 194.51 + 11.65 | 200.33 + 98.96  |
| Failures (-j12) | 1              | 0            | 1              | 1              | 2               |
| Time [s] (-j8)  | 247.26         | 230.34       | 247.92         | 273.85         | 232.56 + 215.52 |
| Failures (-j8)  | 0              | 0            | 0              | 0              | 2               |
| Time [s] (-j4)  | 390.47         | 393.20       | 404.85         | 399.44         | 402.79 + 31.79  |
| Failures (-j4)  | 0              | 0            | 0              | 0              | 1               |

Manually re-triggering the build in case of a failure will be extremely inconvenient, because most of the time some tests are failing. I'm afraid that CI will be red all the time (if you restart the build, some other test may fail in the next run).

Reducing the number of parallel jobs helps a bit, but we need at least two parallel jobs, otherwise execution time will be longer than without this PR (database setup overhead).

@bourtemb what do you think? How many tests should we run in parallel? Should we re-run failures?

#!/usr/bin/env bash

if ! ctest \
-output-on-failure \
Member commented:

I think a dash is missing here: -output-on-failure => --output-on-failure

Collaborator Author replied:
Resolved now in c17e364

@mliszcz mliszcz force-pushed the reset-database-and-ds-for-each-test branch from 491ece4 to f5d0bf3 Compare December 20, 2019 14:46
@bourtemb (Member) commented Dec 20, 2019

(if you restart the build, some other test may fail in the next run).

Good point!

Reducing the number of parallel jobs helps a bit, but we need at least two parallel jobs, otherwise execution time will be longer than without this PR (database setup overhead).

I think it is reasonable to think that it should not fail with 2 parallel jobs on a machine with 2 cores.
Did you try with 2 parallel jobs already? Did you experience failures in this case? Is it really much slower?

I've tried on one of our machines (not a VM), a beast with 24 cores, and I got about the same duration when running 24 parallel tests (~95 s, all tests passed) as when running 22, 30 or 36 parallel tests (~96 s, all tests passed). Lightning fast compared to what we had before! It's even slightly longer when going over 24 parallel tests in my use case: ~100 s with 48 parallel tests on a real 24-core machine.
But in the case of Travis, the behaviour might be different. Maybe it can benefit from more computing power if the hypervisor is not too heavily loaded, depending on the configuration?
The only failure I experienced was with the pipe_event test (when running with 22 parallel jobs).
In this case, it is known that this test sometimes fails because there is a problem with the current Tango and zmq code that needs to be fixed at some point.

@bourtemb what do you think? How much tests we should run in parallel? Should we re-run failures?

Now that the --output-on-failure option has been repaired, I think it could be useful to see what kind of failures we get when some tests fail.
If we go for $(nproc) tests in parallel, I'm confident we should be able to remove the rerun-failed option. The build might be slower most of the time than when running 8 or 12 parallel tests, but I guess the behaviour should be more reproducible with $(nproc) tests in parallel, and we should depend less on the load of the Travis machine (I might be wrong of course).
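
A minimal sketch of what such a CI test step could look like under these assumptions (illustrative script, not the actual cppTango Travis configuration):

```bash
#!/usr/bin/env bash
# Run the test suite once, with as many parallel tests as the machine
# has cores, and do not re-run failures.
set -euo pipefail

cd build

# --output-on-failure prints the log of every failing test, so the
# Travis log alone should be enough to diagnose a red build.
ctest -j "$(nproc)" --output-on-failure
```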

@mliszcz (Collaborator, Author) commented Dec 20, 2019

I think it is reasonable to think that it should not fail with 2 parallel jobs on a machine with 2 cores.
Did you try with 2 parallel jobs already? Did you experience failures in this case? Is it really much slower?

Please remember that this PR adds ~5 s of overhead to each test (for starting the MySQL container). This is an extra ~500 s on top of the current test-suite run time with a single database (~1000 s). So running with -j2, the test run should take roughly 1500 s / 2 ≈ 750 s.

I've just tested with two tests in parallel (ad161d9). On Debian 10 the test suite took 725.86 sec, as expected. No failures in any job.

Now it gets complicated if you add more cores and more tests. On a 2-core VM on my laptop (and on a 2-core Travis machine) I can usually run 8 parallel tests with no failures. This is because our tests are IO-bound, sleeping most of the time - the client waits for the server's response, the server waits for a command, etc. On the other hand, if you try to run too many parallel tests, the processes fight for CPU time, context switches are frequent, and maybe there are some caching issues if a process is re-scheduled on another core, etc.

I don't think that we can come up with a simple formula for calculating the optimal number of parallel tests. Either we run as many tests as we have cores available, or we empirically check how many we can run (with an acceptable number of failures).

So, do you prefer to run with two parallel tests and remove the re-run option? I can adapt the PR accordingly.

But in the case of Travis, the behaviour might be different. Maybe it can benefit from more computing power if the hypervisor is not too much loaded, depending on the configuration?

I guess you're right here. Just check build times for some other PRs. Sometimes there is a 2-3 minute difference for the same job.

Now that --output-on-failure option has been repaired, I think it could be useful to see what kind of failures do we get when some tests are failing.

Based on the Travis logs from my previous post, at least these tests are failing:
event::att_conf_event_buffer, event::per_event, event::pipe_event, CXX::cxx_fwd_att, old_tests::ConfEventBugClient, CXX::cxx_reconnection_zmq, old_tests::ring_depth (2x)

BTW we must do something with CXX::cxx_reconnection_zmq. It takes 90 s to complete, so even on a large multi-core workstation we won't go faster than that :)

@t-b (Collaborator) commented Dec 22, 2019

Very nice!

I got Total Test time (real) = 132.98 sec with 12 parallel tests (CTEST_PARALLEL_LEVEL=12 is set in my .bashrc) on a 6-core (plus HT) machine.
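
For reference, CTest reads the parallelism level either from the command line or from the CTEST_PARALLEL_LEVEL environment variable, so the two invocations below are equivalent (the level of 12 is just this example):

```bash
# Picked up automatically by every ctest run, e.g. when exported from ~/.bashrc:
export CTEST_PARALLEL_LEVEL=12
ctest --output-on-failure

# Or passed explicitly for a single run:
ctest -j 12 --output-on-failure
```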

t-b previously approved these changes Dec 22, 2019
@bourtemb (Member) commented Jan 7, 2020

I don't think that we can come up with a simple formula for calculating optimal number of parallel tests. Either we run as many tests as we have cores available or we empirically check how many we can run (with acceptable number of failures).

So, do you prefer to run with two parallel tests and remove the re-run option? I can adapt the PR accordingly.

I would prefer that indeed. In any case, AppVeyor builds are still much slower than the Travis builds.
This PR is a great improvement because it dramatically decreases the test execution time if you have a powerful computer, so developers can run the tests themselves and get results in a reasonable amount of time.

BTW we must do something with CXX::cxx_reconnection_zmq. It takes 90s to complete so even on a large multi-core workstation we won't go faster than that :)

This test is a bit special since we have to wait for the reconnection to happen, so there are some long sleep calls. But there is probably room for improvement indeed, since the reconnection should happen after 10 seconds and we have some sleep calls of 40 seconds.
Of course, this can be done in another PR.

@mliszcz (Collaborator, Author) commented Jan 7, 2020

@bourtemb I'll remove the re-run feature in the coming days then.

The test runner will start a fresh environment
(database and device servers) for each test case.
The scripts are not needed anymore as ctest command sets
the test environment up.
If some tests failed during parallel run, try to re-run
them sequentially to see if the failure is permanent.
Extract setup scripts so that it is possible to run the setup manually
and then attach the debugger or perform some additional configuration.
Run up to (num of CPUs) tests in parallel to avoid overloading
the machine and to reduce the number of failures of unstable tests.
Since we are running up to (num of CPUs) tests in parallel, no failures
are expected.
@mliszcz mliszcz force-pushed the reset-database-and-ds-for-each-test branch from ad161d9 to 0c13be4 Compare January 20, 2020 08:44
The owning group of the docker socket varies between different docker
packages and different operating systems. The official docker-ce package
uses gid 957 while the docker.io package on Ubuntu 20.04 uses gid 103.
A new group with a gid matching the docker host's group is added to the
cpp_tango container to allow the tango user to access the docker socket.
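
A minimal sketch of how such a gid match can be established (the group name and exact commands are illustrative, not necessarily what this PR does):

```bash
# Gid of the group owning the docker socket on the host.
DOCKER_GID="$(stat -c '%g' /var/run/docker.sock)"

# Inside the cpp_tango container, create a group with the same gid and
# add the tango user to it, so it can use the socket mounted from the host.
docker exec cpp_tango sh -c "
  groupadd --gid ${DOCKER_GID} host-docker &&
  usermod --append --groups host-docker tango
"
```
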
@mliszcz (Collaborator, Author) commented Jan 22, 2020

Hi @bourtemb and @t-b, I've removed the re-run option and rebased the PR on top of the current development branch (some changes were required to support Ubuntu Focal). The last three commits are new. Please re-review when you find some time.

bourtemb previously approved these changes Jan 22, 2020
@bourtemb (Member) left a comment

Thanks @mliszcz for the great work!
Looks good to me.
I think we should re-enable COVERALLS and SONAR_SCANNER at some point (unless I forgot a decision which was taken about that) for at least one Travis build, but this could be done in another PR.

@mliszcz (Collaborator, Author) commented Jan 22, 2020

@bourtemb this PR may have some interaction with Coveralls and Sonar. I'll enable them in this PR just to be sure they still work. I saw your comment in #607 and I think that we just missed this during the review.

Sonar and Coveralls were disabled in 226d70c and this went unnoticed
until now. Re-enable those integrations.
@t-b (Collaborator) commented Jan 22, 2020

@mliszcz Go for it!

@mliszcz mliszcz merged commit 34d034a into tango-controls:tango-9-lts Jan 22, 2020