Need to make it possible to run flaky tests in parallel #80

Open
avtikhon opened this issue Dec 6, 2020 · 0 comments

avtikhon commented Dec 6, 2020

Parallel testing was initially implemented to make testing faster (tarantool/test-run#56). Later, more test suites were added to parallel testing, which produced a lot of flaky failures. To avoid flaky results, fragile test lists were introduced (tarantool/test-run#187), because most of the failures went away once the flaky tests were run inline rather than in parallel, as tests from the fragile lists are. Then the ability to rerun flaky tests based on checksums of their result files was added on top of the fragile lists (tarantool/test-run#189). Over time more and more tests were added to the fragile lists, and all of them ran inline, which dramatically increased the testing time from 2 minutes to 20 minutes. In general, the checksums feature makes it unnecessary to run tests inline, because a test can be rerun enough times to pass. (Note: this requires recording not only the first checksum found for a failing test, but the checksums for all flaky places in the test.) The current patch splits the combined feature, which reruns tests from the fragile lists inline by known checksums, into 2 separate features:

  • rerun flaky tests by checksums from flaky tests lists;
  • run tests inline from fragile tests lists.
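
The first of the two features can be sketched roughly as follows. This is a minimal illustration, not the test-run implementation: the function names, the `run_test` callback, the `known_checksums` mapping, and the use of MD5 are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, Set, Tuple


def result_checksum(result_text: str) -> str:
    """Checksum of a failed test's result output (MD5 is an assumption)."""
    return hashlib.md5(result_text.encode("utf-8")).hexdigest()


def run_with_reruns(
    test_name: str,
    run_test: Callable[[str], Tuple[bool, str]],
    known_checksums: Dict[str, Set[str]],
    max_retries: int = 3,
) -> bool:
    """Run a test, rerunning it only after a known flaky failure.

    run_test (hypothetical) returns (passed, result_text). A failure whose
    checksum is not listed for this test is treated as a real failure and
    is not retried.
    """
    allowed = known_checksums.get(test_name, set())
    for _attempt in range(1 + max_retries):
        passed, result_text = run_test(test_name)
        if passed:
            return True
        if result_checksum(result_text) not in allowed:
            return False  # unknown failure: report it immediately
    return False  # known flaky failure persisted through all retries
```

Because the decision to rerun depends only on the checksum of the failure, such a test does not have to be moved out of the parallel run; the second feature (running listed tests inline) stays independent of it.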
@avtikhon avtikhon self-assigned this Dec 6, 2020
avtikhon referenced this issue in tarantool/tarantool Dec 25, 2020
Preserved environment variables from gitlab-ci environment to packaging:
  REPLICATION_SYNC_TIMEOUT
  TEST_TIMEOUT
  NO_OUTPUT_TIMEOUT

Part of tarantool/test-run#251
avtikhon referenced this issue in tarantool/tarantool Dec 25, 2020
Preserved environment variables from gitlab-ci environment to packaging:
  PRESERVE_ENVVARS=REPLICATION_SYNC_TIMEOUT,TEST_TIMEOUT,NO_OUTPUT_TIMEOUT
  REPLICATION_SYNC_TIMEOUT
  TEST_TIMEOUT
  NO_OUTPUT_TIMEOUT

Part of tarantool/test-run#251
avtikhon referenced this issue in tarantool/tarantool Dec 25, 2020
Preserved environment variables from gitlab-ci environment to packaging:
  PRESERVE_ENVVARS=REPLICATION_SYNC_TIMEOUT,TEST_TIMEOUT,NO_OUTPUT_TIMEOUT
  REPLICATION_SYNC_TIMEOUT
  TEST_TIMEOUT
  NO_OUTPUT_TIMEOUT

Different jobs use the environment in different ways and formats:
 - base jobs run inside docker use the timeout variables as they are;
 - pack/deploy jobs additionally use the PRESERVE_ENVVARS variable to
   pass the timeout variables;
 - freebsd job uses additional exports for each timeout variable;
 - out-of-source build additionally sets these variables up for the
   docker run process it runs in;
 - default gcc centos7 job uses its local setup of these variables.

Part of tarantool/test-run#251
avtikhon referenced this issue in tarantool/tarantool Dec 25, 2020
Preserved environment variables from gitlab-ci environment to packaging:
  PRESERVE_ENVVARS=REPLICATION_SYNC_TIMEOUT,TEST_TIMEOUT,NO_OUTPUT_TIMEOUT

Different jobs use the environment in different ways and formats:
 - freebsd job uses additional exports for each timeout variable;
 - out-of-source build additionally sets these variables up for the
   docker run process it runs in;
 - pack/deploy/default_gcc_centos7 jobs additionally use the
   PRESERVE_ENVVARS variable to pass the timeout variables;
 - the rest of the jobs run inside docker use the timeout variables as
   they are.

Part of tarantool/test-run#251
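
As a rough illustration of the idea behind PRESERVE_ENVVARS in the commit message above, the sketch below turns a comma-separated list of variable names into `-e NAME=value` arguments for `docker run`. It assumes that the variable is a comma-separated list of names and that forwarding happens through `-e` flags; the helper name `docker_env_flags` and the timeout values are hypothetical, and the real packaging scripts may do this differently.

```python
from typing import List, Mapping


def docker_env_flags(preserve: str, environ: Mapping[str, str]) -> List[str]:
    """Build `docker run` -e arguments for the variables named in `preserve`.

    `preserve` is assumed to be a comma-separated list of variable names.
    Variables that are not set in `environ` are skipped.
    """
    names = [name.strip() for name in preserve.split(",") if name.strip()]
    flags: List[str] = []
    for name in names:
        if name in environ:
            flags += ["-e", f"{name}={environ[name]}"]
    return flags


# Hypothetical values, for illustration only.
env = {"TEST_TIMEOUT": "310", "NO_OUTPUT_TIMEOUT": "320"}
preserve = "REPLICATION_SYNC_TIMEOUT,TEST_TIMEOUT,NO_OUTPUT_TIMEOUT"
print(docker_env_flags(preserve, env))
```

In this sketch REPLICATION_SYNC_TIMEOUT is silently skipped because it is not set, so a job that does not define a timeout keeps the default of whatever runs inside the container.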
Totktonada referenced this issue in tarantool/tarantool Dec 26, 2020
Removed the Travis CI part of the RPM spec, since Travis CI is no
longer in use.

---- Comments from @Totktonada ----

This change is a kind of revert of commit
d48406d ('test: add more tests to
packaging testing'), which closed #4599.

Here I describe the story: why the change was made and why it is
being reverted now.

We run tests during an RPM package build: this may catch
distribution-specific problems. We used a reduced set of tests and
single-threaded test execution to keep the testing stable and to avoid
breaking package builds and deployment due to known fragile tests.

Our CI used to run on Travis CI, but we were in transition to GitLab CI
so that we could use our own machines and avoid hitting the Travis CI
limit of five jobs running in parallel.

We moved package builds to GitLab CI, but kept build+deploy jobs on
Travis CI for a while: GitLab CI was new to us and we wanted to make
this transition smooth for users of our APT / YUM repositories.

After enabling package builds on GitLab CI, we wanted to enable more
tests (to catch more problems) and to enable parallel test execution
to speed up testing (and reduce the time a developer waits for
results).

We observed that if we enabled more tests and parallel execution on
Travis CI, the testing results would become much less stable, and so we
would often have holes in deployed packages and a red CI.

So we decided to keep the old way of testing on Travis CI and to make
all the changes (more tests, more parallelism) only for GitLab CI.

We guessed that we had enough machine resources and would be able to do
some load balancing to overcome flaky fails on our own machines, but in
fact we picked another approach later (see below).

That's the whole story behind #4599. What has changed since those days?

We moved deployment jobs to GitLab CI[^1] and have now completely
disabled Travis CI (see #4410 and #4894). All jobs were moved either to
GitLab CI or directly to GitHub Actions[^2].

We revisited our approach to improving the stability of testing.
Attempts to do some load balancing while keeping the execution time
reasonably short failed. We would need to increase parallelism for
speed, but decrease it for stability at the same time. There is no
optimal balance.

So we decided to track flaky fails in the issue tracker and restart a
test after a known fail (see details in [1]). This way we don't need to
exclude tests and disable parallelism in order to get stable and fast
testing[^3]. At least in theory. We're on the way to verifying this
guess, but hopefully we'll settle on some adequate defaults that will
work everywhere[^4].

To sum up, there are several reasons to remove the old workaround, which
was implemented in the scope of #4599: no Travis CI, no foreseeable
reasons to exclude tests and reduce parallelism depending on a CI
provider.

Footnotes:

[^1]: This is a simplification. Travis CI deployment jobs were not moved
      as is. GitLab CI jobs push packages to the new repositories
      backend (#3380). Travis CI jobs were disabled later (as part of
      #4947), after proof that the new infrastructure works fine.
      However, this is another story.

[^2]: Now we're going to use GitHub Actions for all jobs, mainly because
      GitLab CI is poorly integrated with GitHub pull requests (when
      the source branch is in a forked repository).

[^3]: Some work in this direction is still to be done:

      First, the 'replication' test suite is still excluded from the
      testing under RPM package build. It seems we should just enable
      it back; this is tracked by #4798.

      Second, there is the issue [2] to get rid of ancient traces of the
      old attempts to keep the testing stable (from the test-run side).
      It'll give us more parallelism in testing.

[^4]: Of course, we investigate flaky fails and fix the code and
      testing problems they reveal. However, this appears to be a
      long-running activity.

References:

[1]: tarantool/test-run#217
[2]: https://github.com/tarantool/test-run/issues/251
Totktonada referenced this issue in tarantool/tarantool Dec 26, 2020
(cherry picked from commit d9c25b7)
avtikhon referenced this issue in tarantool/tarantool Dec 26, 2020
Preserved environment variables from gitlab-ci environment to packaging:
  PRESERVE_ENVVARS=REPLICATION_SYNC_TIMEOUT,TEST_TIMEOUT,NO_OUTPUT_TIMEOUT

Different jobs use the environment in different ways and formats:
 - freebsd job uses additional exports for each timeout variable;
 - out-of-source build additionally sets these variables up for the
   docker run process it runs in;
 - pack/deploy/default_gcc_centos7 jobs additionally use the
   PRESERVE_ENVVARS variable to pass the timeout variables;
 - the rest of the jobs run inside docker use the timeout variables as
   they are.

Part of tarantool/test-run#251
@kyukhin kyukhin transferred this issue from tarantool/test-run Feb 12, 2021
@kyukhin kyukhin added the teamQ label Apr 15, 2021
@kyukhin kyukhin added this to the wishlist milestone Oct 15, 2021