
Conversation

@mliszcz (Collaborator) commented Dec 15, 2019

Another approach to #568. Changes (compared to #568):

  • more robust cleanup (servers and containers are stopped in a detached background process) - 217.26 sec for 93 passed tests with 12 parallel jobs on my laptop
  • run_with_fixture.sh script simplified (it now re-uses start_server.sh and kill_server.sh)
  • the first DevTest process logs to the DevTest_test.0.log file (instead of DevTest_test.log)
  • possibility to run a test case without setting up servers and containers (export TANGO_TEST_CASE_SKIP_FIXTURE=1; see the sketch below)
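
A minimal sketch of how this flag could be used to debug a single test against a fixture that was set up by hand (the environment variable comes from this PR; the setup-script invocation and test name are only illustrative assumptions):

```bash
#!/usr/bin/env bash
# Bring up the test environment once (database container and device
# servers), then run one test case against it without re-creating the
# fixture for every run.

# 1. manual setup, e.g. via the extracted setup scripts mentioned in this PR
./start_server.sh

# 2. tell the test runner not to set up servers/containers itself
export TANGO_TEST_CASE_SKIP_FIXTURE=1
ctest -R cxx_fwd_att --output-on-failure

# 3. tear everything down when finished
./kill_server.sh
```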

@mliszcz mliszcz requested review from bourtemb and t-b as code owners December 15, 2019 10:13
@mliszcz mliszcz requested a review from Ingvord December 15, 2019 13:27
@mliszcz mliszcz force-pushed the reset-database-and-ds-for-each-test branch 2 times, most recently from 05ebc53 to 5889306 Compare December 15, 2019 21:54
@bourtemb (Member) left a comment

@mliszcz , I love this PR.
It seems to really speed up the tests and should help to make them more reproducible, especially when we run them independently.
Since this PR is also about parallelizing things, please consider compiling with the -j 2 option to speed up the Travis builds even more (Travis machines have 2 cores), like in the following commit:
dbcef00

I did the test with make -j 2 in Travis and got some good results in terms of speed:
https://travis-ci.org/bourtemb/cppTango/builds/625802785
compared to:
https://travis-ci.org/tango-controls/cppTango/builds/625414079
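
For reference, a minimal sketch of the corresponding build invocation (illustrative only, not the actual cppTango Travis configuration):

```bash
# Compile with two parallel jobs, matching the two cores of a Travis machine.
make -j 2

# Equivalent when driving the build through CMake:
cmake --build . -- -j 2
```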

I don't much like the retry at the end for the tests which failed, because Travis will report green if the tests which previously failed pass the second time. This might hide some issues, and I'm afraid people won't look at the logs when Travis is green. I would prefer people to re-trigger the build manually in this case; it would encourage us to fix the tests which fail most often. I can understand such a situation could annoy people, so I can accept the proposed solution for the moment, especially because more tests might fail if we run them in parallel, depending on the load of the Travis machine at the time of the build.

I didn't find the time to test it on my machine. That's why I don't approve it for the moment.
But please feel free to merge if you feel confident enough.
This PR should not break anything in the Tango library.

@andygotz (Collaborator) commented:

Nice work! If I understand correctly there is roughly a 20% decrease (from 25 to 20 minutes) in the time needed to run the tests? This will help cool the planet and make developers happy! Well done!

@mliszcz (Collaborator, Author) commented Dec 19, 2019

Hi Andy,

With the current configuration (8 parallel tests), there is a 75% (3/4) decrease in test-suite time, e.g. on Debian 10:

  • before: Total Test time (real) = 1016.31 sec
  • this PR: Total Test time (real) = 247.26 sec

The overall build time decreases by ~10 minutes.

Previously:
https://travis-ci.org/tango-controls/cppTango/builds/626270998?utm_source=github_status&utm_medium=notification

Now:
https://travis-ci.org/tango-controls/cppTango/builds/625414079?utm_source=github_status&utm_medium=notification

@mliszcz (Collaborator, Author) commented Dec 19, 2019

I did some more tests:

-j12 https://travis-ci.org/tango-controls/cppTango/builds/627128708?utm_source=github_status&utm_medium=notification
-j8 https://travis-ci.org/tango-controls/cppTango/builds/625414079?utm_source=github_status&utm_medium=notification
-j4 https://travis-ci.org/tango-controls/cppTango/builds/627111183?utm_source=github_status&utm_medium=notification

If there are two numbers in the 'Time' column, it means: (all tests run in parallel) + (re-run of failed tests).

|                 | Deb 10         | Deb 10 (rel) | Deb 9          | Deb 8          | Deb 7           |
|-----------------|----------------|--------------|----------------|----------------|-----------------|
| Time [s] (-j12) | 187.45 + 34.45 | 183.12       | 192.53 + 67.32 | 194.51 + 11.65 | 200.33 + 98.96  |
| Failures (-j12) | 1              | 0            | 1              | 1              | 2               |
| Time [s] (-j8)  | 247.26         | 230.34       | 247.92         | 273.85         | 232.56 + 215.52 |
| Failures (-j8)  | 0              | 0            | 0              | 0              | 2               |
| Time [s] (-j4)  | 390.47         | 393.20       | 404.85         | 399.44         | 402.79 + 31.79  |
| Failures (-j4)  | 0              | 0            | 0              | 0              | 1               |

Manually re-triggering the build in case of a failure will be extremely inconvenient, because most of the time some tests are failing. I'm afraid that CI will be red all the time (if you restart the build, some other test may fail in the next run).

Reducing the number of parallel jobs helps a bit, but we need at least two parallel jobs, otherwise execution time will be longer than without this PR (database setup overhead).

@bourtemb what do you think? How many tests should we run in parallel? Should we re-run failures?

#!/usr/bin/env bash

if ! ctest \
-output-on-failure \
Member commented:

I think a dash is missing here: -output-on-failure => --output-on-failure

Collaborator Author replied:
Resolved now in c17e364

@mliszcz mliszcz force-pushed the reset-database-and-ds-for-each-test branch from 491ece4 to f5d0bf3 Compare December 20, 2019 14:46
@bourtemb (Member) commented Dec 20, 2019

(if you restart the build, some other test may fail in the next run).

Good point!

Reducing the number of parallel jobs helps a bit, but we need at least two parallel jobs, otherwise execution time will be longer than without this PR (database setup overhead).

I think it is reasonable to think that it should not fail with 2 parallel jobs on a machine with 2 cores.
Did you try with 2 parallel jobs already? Did you experience failures in this case? Is it really much slower?

I've tried on one of our machines (not a VM), a beast with 24 cores, and I got about the same duration when running 24 parallel tests (~95 s, all tests passed) as when running 22, 30 or 36 parallel tests (~96 s, all tests passed). Lightning fast compared to what we had before! It's even slightly longer when going over 24 parallel tests in my use case: ~100 s with 48 parallel tests on a real 24-core machine.
But in the case of Travis, the behaviour might be different. Maybe it can benefit from more computing power if the hypervisor is not too heavily loaded, depending on the configuration?
The only failure I experienced was with the pipe_event test (when running with 22 parallel jobs).
In this case, it is known that this test sometimes fails because there is a problem with the current Tango and zmq code that needs to be fixed at some point.

@bourtemb what do you think? How much tests we should run in parallel? Should we re-run failures?

Now that the --output-on-failure option has been repaired, I think it could be useful to see what kind of failures we get when some tests fail.
If we go for $(nproc) tests in parallel, I'm confident we should be able to remove the rerun-failed option. The build might be slower most of the time than when running 8 or 12 parallel tests, but I guess the behaviour should be more reproducible with $(nproc) tests in parallel, and we should depend less on the load of the Travis machine (I might be wrong of course).
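
A minimal sketch of what such a CI test step could look like under these assumptions (illustrative script, not the actual cppTango Travis configuration):

```bash
#!/usr/bin/env bash
# Run the test suite once, with as many parallel tests as the machine
# has cores, and do not re-run failures.
set -euo pipefail

cd build

# --output-on-failure prints the log of every failing test, so the
# Travis log alone should be enough to diagnose a red build.
ctest -j "$(nproc)" --output-on-failure
```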

@mliszcz (Collaborator, Author) commented Dec 20, 2019

I think it is reasonable to think that it should not fail with 2 parallel jobs on a machine with 2 cores.
Did you try with 2 parallel jobs already? Did you experience failures in this case? Is it really much slower?

Please remember that this PR adds ~5 s of overhead to each test (for starting the MySQL container). This is an extra ~500 s on top of the current test-suite run time with a single database (~1000 s). So running with -j2, the test run should take roughly 1500 s / 2 ≈ 750 s.

I've just tested with two tests in parallel (ad161d9). On Debian 10 the test suite took 725.86 sec, as expected. No failures in any job.

Now it gets complicated if you add more cores and more tests. On a 2-core VM on my laptop (and on a 2-core Travis machine) I can usually run 8 parallel tests with no failures. This is because our tests are IO-bound, sleeping most of the time - the client waits for the server's response, the server waits for a command, etc. On the other hand, if you try to run too many parallel tests, the processes fight for CPU time, context switches are frequent, and maybe there are some caching issues if a process is re-scheduled on another core, etc.

I don't think that we can come up with a simple formula for calculating the optimal number of parallel tests. Either we run as many tests as we have cores available, or we empirically check how many we can run (with an acceptable number of failures).

So, do you prefer to run with two parallel tests and remove the re-run option? I can adapt the PR accordingly.

But in the case of Travis, the behaviour might be different. Maybe it can benefit from more computing power if the hypervisor is not too much loaded, depending on the configuration?

I guess you're right here. Just check build times for some other PRs. Sometimes there is a 2-3 minute difference for the same job.

Now that --output-on-failure option has been repaired, I think it could be useful to see what kind of failures do we get when some tests are failing.

Based on the Travis logs from my previous post, at least these tests are failing:
event::att_conf_event_buffer, event::per_event, event::pipe_event, CXX::cxx_fwd_att, old_tests::ConfEventBugClient, CXX::cxx_reconnection_zmq, old_tests::ring_depth (2x)

BTW we must do something with CXX::cxx_reconnection_zmq. It takes 90 s to complete, so even on a large multi-core workstation we won't go faster than that :)

@t-b (Collaborator) commented Dec 22, 2019

Very nice!

I got Total Test time (real) = 132.98 sec with 12 parallel tests (CTEST_PARALLEL_LEVEL=12 is set in my .bashrc) on a 6-core (plus HT) machine.
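
For reference, CTest reads the parallelism level either from the command line or from the CTEST_PARALLEL_LEVEL environment variable, so the two invocations below are equivalent (the level of 12 is just this example):

```bash
# Picked up automatically by every ctest run, e.g. when exported from ~/.bashrc:
export CTEST_PARALLEL_LEVEL=12
ctest --output-on-failure

# Or passed explicitly for a single run:
ctest -j 12 --output-on-failure
```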

t-b previously approved these changes Dec 22, 2019
@bourtemb (Member) commented Jan 7, 2020

I don't think that we can come up with a simple formula for calculating optimal number of parallel tests. Either we run as many tests as we have cores available or we empirically check how many we can run (with acceptable number of failures).

So, do you prefer to run with two parallel tests and remove the re-run option? I can adapt the PR accordingly.

I would prefer that indeed. In any case, AppVeyor builds are still much slower than the Travis builds.
This PR is a great improvement because it dramatically decreases the test execution time if you have a powerful computer, so developers can run the tests themselves and get results in a reasonable amount of time.

BTW we must do something with CXX::cxx_reconnection_zmq. It takes 90s to complete so even on a large multi-core workstation we won't go faster than that :)

This test is a bit special since we have to wait for the reconnection to happen, so there are some long sleep calls. But there is probably room for improvement indeed, since the reconnection should happen after 10 seconds and we have some sleep calls of 40 seconds.
Of course, this can be done in another PR.

@mliszcz (Collaborator, Author) commented Jan 7, 2020

@bourtemb I'll remove the re-run feature in the coming days then.

The test runner will start a fresh environment
(database and device servers) for each test case.
The scripts are not needed anymore as ctest command sets
the test environment up.
If some tests failed during parallel run, try to re-run
them sequentially to see if the failure is permanent.
Extract setup scripts so that it is possible to run the setup manually
and then attach the debugger or perform some additional configuration.
Run up to (num of CPUs) tests in parallel to avoid overloading
the machine and to reduce the number of failures of unstable tests.
Since we are running up to (num of CPUs) tests in parallel, no failures
are expected.
@mliszcz mliszcz force-pushed the reset-database-and-ds-for-each-test branch from ad161d9 to 0c13be4 Compare January 20, 2020 08:44
The owning group of the docker socket varies between different docker
packages and different operating systems. The official docker-ce package
uses gid 957 while the docker.io package on Ubuntu 20.04 uses gid 103.
A new group with a gid matching the docker host's group is added to the
cpp_tango container to allow the tango user to access the docker socket.
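
A minimal sketch of how such a gid match can be established (the group name and exact commands are illustrative, not necessarily what this PR does):

```bash
# Gid of the group owning the docker socket on the host.
DOCKER_GID="$(stat -c '%g' /var/run/docker.sock)"

# Inside the cpp_tango container, create a group with the same gid and
# add the tango user to it, so it can use the socket mounted from the host.
docker exec cpp_tango sh -c "
  groupadd --gid ${DOCKER_GID} host-docker &&
  usermod --append --groups host-docker tango
"
```
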
@mliszcz (Collaborator, Author) commented Jan 22, 2020

Hi @bourtemb and @t-b, I've removed the re-run option and rebased the PR on top of the current development branch (some changes were required to support Ubuntu Focal). The last three commits are new. Please re-review when you find some time.

bourtemb previously approved these changes Jan 22, 2020
@bourtemb (Member) left a comment

Thanks @mliszcz for the great work!
Looks good to me.
I think we should re-enable COVERALLS and SONAR_SCANNER at some point (unless I forgot a decision which was taken about that) for at least one Travis build, but this could be done in another PR.

@mliszcz (Collaborator, Author) commented Jan 22, 2020

@bourtemb this PR may have some interaction with Coveralls and Sonar. I'll enable them in this PR just to be sure they still work. I saw your comment in #607 and I think that we just missed this during the review.

Sonar and Coveralls were disabled in 226d70c and this went unnoticed
until now. Re-enable those integrations.
@t-b (Collaborator) commented Jan 22, 2020

@mliszcz Go for it!

@mliszcz mliszcz merged commit 34d034a into tango-controls:tango-9-lts Jan 22, 2020