QA runner improvements #9580

Closed
7 of 11 tasks
Tracked by #9882
williambanfield opened this issue Oct 17, 2022 · 5 comments
Labels
qa (Quality assurance) · stale (for use by stalebot) · T:tracking (Tracking issue for other issues)

Comments


williambanfield commented Oct 17, 2022

Overview

This issue outlines a set of improvements that should be made over the near and longer term to allow the core team to run QA tests more quickly and with less effort.

During the release of Tendermint v0.37.x, we executed the steps of the new release process outlined in RELEASES.md to ensure there was no clear regression in the quality of the software. The steps performed were quite manual, requiring the operator to run a series of scripts from their local machine to set up the instances, generate the configuration files, start the processes, run the load, and capture the results. Having run this process once, we have demonstrated that large-scale testnets on virtual machines are a reasonable way to test Tendermint, and we have learned a lot about how to orchestrate a large Tendermint network. We should improve this process to reduce the effort required to run the QA tests and capture the results.

Near term improvements

This section suggests a set of changes that should be implemented within the next 1-1.5 quarters. These changes largely consist of migrating logic from scripts in the tendermint-testnet repository into the Tendermint e2e test runner, which already performs a similar set of functions using Docker instances on a local network. The logic in the testnet repository is written as a set of shell scripts and Ansible playbooks that are not very portable, are not tolerant of transient failures in the network or the Digital Ocean API, and are difficult to extend because of their already large degree of complexity.

Runner generates the network configuration files

Currently, a bash script and an Ansible playbook create the set of Tendermint configuration files for the test network and copy them to the testnet machines. This logic can and should be moved to the e2e runner. The runner is already used for most of the testnet configuration generation, with the bash script only updating a few config values and the IP addresses so that they match those of the Digital Ocean infrastructure.
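A rough sketch of what this could look like in the runner is shown below, assuming placeholder types (the `Node` type and field names are illustrative, not the actual e2e runner types): the runner builds the `persistent_peers` value for config.toml directly from the droplet IPs instead of the local Docker network addresses.

```go
// Illustrative only: the real e2e runner has its own manifest and config
// types; the names here are placeholders.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Node captures the minimum this sketch needs: a name, a Tendermint node ID,
// and the public IP of the droplet the node will run on.
type Node struct {
	Name string
	ID   string
	IP   string
}

// persistentPeers builds the config.toml `persistent_peers` value from the
// droplet IPs, replacing the Docker-network addresses used by the local
// e2e runner.
func persistentPeers(nodes []Node) string {
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].Name < nodes[j].Name })
	peers := make([]string, 0, len(nodes))
	for _, n := range nodes {
		peers = append(peers, fmt.Sprintf("%s@%s:26656", n.ID, n.IP))
	}
	return strings.Join(peers, ",")
}

func main() {
	nodes := []Node{
		{Name: "validator01", ID: "a1b2c3", IP: "203.0.113.10"},
		{Name: "validator02", ID: "d4e5f6", IP: "203.0.113.11"},
	}
	fmt.Println(persistentPeers(nodes))
}
```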

Runner adds the load to the network

The e2e test runner currently generates load for the e2e tests. This logic could be extended to generate transactions for the release testnets.

The release testnets require a transaction data format that is more specific than what the nightly tests currently use. That data format can be ported over for use by the e2e runner. Additional work will be needed to incorporate the tm-loadtest periodic load-generation logic into the runner.
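A minimal sketch of periodic load generation follows, assuming a placeholder `broadcastTx` function and an illustrative transaction payload (the real runner would submit transactions through a node's RPC endpoint and use the release-testnet data format):

```go
// Hypothetical sketch: send a fixed-rate burst of transactions every second
// for a configured duration. broadcastTx is a placeholder for however the
// runner actually submits transactions.
package main

import (
	"context"
	"fmt"
	"time"
)

func broadcastTx(ctx context.Context, tx []byte) error {
	// Placeholder: submit the transaction to a node here.
	_, _ = ctx, tx
	return nil
}

// generateLoad sends `rate` transactions per second until `duration` elapses
// or the context is cancelled.
func generateLoad(ctx context.Context, rate int, duration time.Duration) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	deadline := time.Now().Add(duration)
	seq := 0
	for time.Now().Before(deadline) {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			for i := 0; i < rate; i++ {
				// Illustrative payload; the release testnets would use
				// their own transaction data format.
				tx := []byte(fmt.Sprintf("load-%d=%d", seq, time.Now().UnixNano()))
				if err := broadcastTx(ctx, tx); err != nil {
					return err
				}
				seq++
			}
		}
	}
	return nil
}

func main() {
	if err := generateLoad(context.Background(), 200, 90*time.Second); err != nil {
		fmt.Println("load generation error:", err)
	}
}
```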

Runner starts and stops the processes

The nightly e2e runner currently starts and stops the Tendermint Docker instances during the nightly tests. This logic can be adapted to start and stop the Tendermint processes on remote nodes during a large testnet run spanning many machines.
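One possible shape for this is sketched below, under the assumption that the remote nodes run Tendermint as a systemd unit reachable over ssh (neither of which is confirmed by the current tooling; this is illustration only):

```go
// Sketch only: start/stop Tendermint on remote droplets by shelling out to
// ssh and systemctl. The unit name "tendermint" and the host names are
// assumptions for illustration.
package main

import (
	"fmt"
	"os/exec"
)

// remoteSystemctl runs `systemctl <action> tendermint` on the given host over ssh.
func remoteSystemctl(host, action string) error {
	cmd := exec.Command("ssh", host, "sudo", "systemctl", action, "tendermint")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s tendermint on %s: %w: %s", action, host, err, out)
	}
	return nil
}

func main() {
	hosts := []string{"validator01.example", "validator02.example"}
	for _, h := range hosts {
		if err := remoteSystemctl(h, "start"); err != nil {
			fmt.Println(err)
		}
	}
}
```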

Runner retrieves the data

Currently, retrieving the Tendermint blockstore and the Prometheus data captured during the large-scale testnet is a manual process performed with a pair of Ansible playbooks. The data is collected by the network operator upon completion of the test and then manually uploaded to Digital Ocean storage.

This procedure can be automated and folded into the runner process. Upon completion of the test, the runner can fetch the blockstore and the Prometheus database and automatically upload them to Digital Ocean, either by placing them on a mounted drive intended for reuse or by uploading them directly to a Digital Ocean Space.
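As an illustration only, the collect-and-upload step could shell out to `scp` and the S3-compatible `aws s3 cp` CLI; the host names, remote paths, bucket, and endpoint below are placeholders:

```go
// Sketch under assumptions: pull the archived blockstore from each node with
// scp and push it to a Digital Ocean Space via the S3-compatible aws CLI.
package main

import (
	"fmt"
	"os/exec"
)

func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %w: %s", name, args, err, out)
	}
	return nil
}

func collectAndUpload(host, runID string) error {
	local := fmt.Sprintf("./results/%s-%s-blockstore.tar.gz", runID, host)
	// Fetch the archived blockstore from the node (remote path is illustrative).
	if err := run("scp", fmt.Sprintf("%s:/home/tendermint/blockstore.tar.gz", host), local); err != nil {
		return err
	}
	// Upload it to the Space through the S3-compatible endpoint.
	return run("aws", "s3", "cp", local,
		fmt.Sprintf("s3://qa-testnet-results/%s/", runID),
		"--endpoint-url", "https://nyc3.digitaloceanspaces.com")
}

func main() {
	if err := collectAndUpload("validator01.example", "v0.37.x-run1"); err != nil {
		fmt.Println(err)
	}
}
```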

Long term improvements

Runner manages the infrastructure

In the long term, the runner should be improved to directly manage the infrastructure running the testnet. This means the runner, running on a single DO instance, should be updated to be able to spawn and destroy all of the necessary droplets.

Managing a fleet of infrastructure is complex, and existing tools and practices, such as Terraform run from the command line, have many advantages: Terraform implements a declarative syntax and idempotent requests for resource creation, and it already has built-in definitions for many Digital Ocean resource types.

A future version of the runner should be augmented to perform resource creation and destruction without operator intervention. This would need to be done carefully to avoid any scenario where the tool provisions too many resources, or fails to destroy resources and leaves them running indefinitely. This is listed as a long-term improvement because it is complex and will require more careful consideration.
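One low-risk option, sketched here with illustrative paths and variables, would be for the runner to keep using the existing Terraform configuration and drive `terraform apply` and `terraform destroy` itself, deferring teardown so that even a failed run destroys its droplets:

```go
// Sketch only: wrap the existing Terraform configuration rather than
// reimplementing provisioning. The working directory and variable name are
// assumptions for illustration.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func terraform(dir string, args ...string) error {
	cmd := exec.Command("terraform", args...)
	cmd.Dir = dir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	dir := "./tf" // directory holding the testnet's Terraform configuration

	// Always attempt to destroy the droplets, so that a failed run does not
	// leave resources running indefinitely.
	defer func() {
		if err := terraform(dir, "destroy", "-auto-approve", "-var", "nodes=200"); err != nil {
			fmt.Println("teardown failed, resources may still be running:", err)
		}
	}()

	if err := terraform(dir, "apply", "-auto-approve", "-var", "nodes=200"); err != nil {
		fmt.Println("provisioning failed:", err)
		return
	}

	// ... run the QA suite against the provisioned testnet here ...
}
```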

Runner triggered from a GitHub Action upon release

Once the runner is able to provision resources in Digital Ocean, run the entire suite automatically, and retrieve and upload the results, it should be enhanced so that it can be started from a GitHub Action when a release is triggered.

Overall TODO

@sergio-mena

Thanks for this write-up @williambanfield! A few comments:

  • Does it make sense to add the improvements I have in mind on the result-extraction part, or is it worth doing it in a separate issue?
  • in "Runner triggered from a GitHub action upon release", a factor to take into account is where the Terraform credentials will be stored
  • This write-up seems great to me as a tracking issue. I'd propose to add a checklist at the end of the description with a one-liner on each of these points, which can be converted into sub-issues later on... so that if someone in the team wants to contribute, they can easily pick up a chunk of the work.


williambanfield commented Oct 18, 2022

Does it make sense to add the improvements I have in mind on the result-extraction part, or is it worth doing it in a separate issue?

Mind elaborating? I'm not remembering exactly which improvements you're describing, but the answer is probably 'yes, it does make sense'.

in "Runner triggered from a GitHub action upon release", a factor to take into account is where the Terraform credentials will be stored

This is definitely true. It would be nice if Digital Ocean provided some facility for this as well. The likely answer is to use GitHub Secrets, which allow you to keep encrypted information that is accessible during workflows.

This write-up seems great to me as a tracking issue. I'd propose to add a checklist at the end of the description with a one-liner on each of these points, which can be converted into sub-issues later on... so that if someone in the team wants to contribute, they can easily pick up a chunk of the work.

This makes total sense. I'll do that.

EDIT: done.


sergio-mena commented Oct 18, 2022

A summary of the result-extraction improvements I have in mind:

  • Turn the (exploratory) Octave scripts into Python for automatic plotting
  • Find a better way to extract Prometheus graphs. By "better", I mean: a) less manual than Firefox's snapshot feature, b) allowing customization of things like labels, legends, axis setup, and titles (see the sketch at the end of this comment)
  • Modify the loadtime tool to sort the output of the experiments in the report
  • (stretch goal) Come up with a less manual way of generating the report (results presented more like a CI report, less like research results)

If you agree, I can add a section above with these (and any others I might have missed).
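For the Prometheus point, one less manual route is to script the Prometheus HTTP API's `query_range` endpoint and feed the raw series into a plotting script with full control over labels, legends, and axes. A minimal sketch, with the Prometheus address, metric, and time window as placeholders:

```go
// Minimal sketch of pulling raw series data from Prometheus' HTTP API
// (query_range). Address, metric, and window are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	end := time.Now()
	start := end.Add(-30 * time.Minute)

	q := url.Values{}
	q.Set("query", "tendermint_mempool_size") // placeholder metric
	q.Set("start", fmt.Sprintf("%d", start.Unix()))
	q.Set("end", fmt.Sprintf("%d", end.Unix()))
	q.Set("step", "15s")

	resp, err := http.Get("http://localhost:9090/api/v1/query_range?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The JSON response contains the raw time series, which can then be
	// plotted programmatically instead of via browser snapshots.
	fmt.Println(string(body))
}
```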

@thanethomson added the qa (Quality assurance) label on Oct 19, 2022
@williambanfield

Modify loadtime tool to sort the output of the experiments in the report

Can you elaborate on this specific goal?

The other goals definitely make sense.

@sergio-mena

Modify loadtime tool to sort the output of the experiments in the report

Can you elaborate on this specific goal?

Sure. See the report obtained the last time we ran the 200-node testnet for v0.37.x: report.txt. I'd like the entries to be ordered by connections and rate. I'm sure it's not a big deal to implement, but it is a source of extra manual adjustments when preparing the table in the "Finding the Saturation Point" section.
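For illustration, the desired ordering would look something like the following (the `Entry` type is a stand-in, not the actual loadtime report structure):

```go
// Illustrative only: sort report entries by connections, then by rate.
package main

import (
	"fmt"
	"sort"
)

type Entry struct {
	Connections int
	Rate        int
}

func sortEntries(entries []Entry) {
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].Connections != entries[j].Connections {
			return entries[i].Connections < entries[j].Connections
		}
		return entries[i].Rate < entries[j].Rate
	})
}

func main() {
	entries := []Entry{
		{Connections: 2, Rate: 200},
		{Connections: 1, Rate: 400},
		{Connections: 1, Rate: 200},
	}
	sortEntries(entries)
	fmt.Println(entries) // [{1 200} {1 400} {2 200}]
}
```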

The other goals definitely make sense.

Thanks. Let me add them, then.

@thanethomson added the T:tracking (Tracking issue for other issues) label on Dec 20, 2022
The github-actions bot added the stale (for use by stalebot) label on Feb 17, 2023
The github-actions bot closed this as not planned on Feb 22, 2023