QA runner improvements #9580

Closed
7 of 11 tasks
Tracked by #9882
williambanfield opened this issue Oct 17, 2022 · 5 comments
Labels
qa (Quality assurance) · stale (for use by stalebot) · T:tracking (Tracking issue for other issues)

Comments


williambanfield commented Oct 17, 2022

Overview

This issue outlines a set of improvements that should be made over the near and longer term to allow the core team to run QA tests more quickly and with less effort.

During the release of Tendermint v0.37.x, we executed the steps of the new release process outlined in RELEASES.md to ensure there was no clear regression in the quality of the software. The steps performed were quite manual, requiring the operator to run a series of scripts from their local machine to set up the instances, generate the configuration files, start the processes, run the load, and capture the results. Having run this process once, we have demonstrated that large-scale testnets on virtual machines are a reasonable way to test Tendermint, and we have learned a lot about how to orchestrate a large Tendermint network. We should improve this process to reduce the effort required to run the QA tests and capture the results.

Near term improvements

This section suggests a set of changes that should be implemented within the next 1-1.5 quarters. These changes largely consist of migrating logic from scripts in the tendermint-testnet repository into the Tendermint e2e test runner, which already performs a similar set of functions using Docker instances on a local network. The logic in the testnet repository is written as a set of shell scripts and Ansible playbooks that are not very portable, are not tolerant of transient failures in the network or the Digital Ocean API, and are difficult to extend because of their already large degree of complexity.

Runner generates the network configuration files

Currently, a bash script and an Ansible playbook create the set of Tendermint configuration files for the test network and copy them to the testnet machines. This logic can and should be moved to the e2e runner. The runner is already used for most of the testnet configuration generation, with the bash script only updating a few config values and the IP addresses so that they match those of the Digital Ocean infrastructure.
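A rough sketch of what this could look like in the runner is shown below, assuming placeholder types (the `Node` type and field names are illustrative, not the actual e2e runner types): the runner builds the `persistent_peers` value for config.toml directly from the droplet IPs instead of the local Docker network addresses.

```go
// Illustrative only: the real e2e runner has its own manifest and config
// types; the names here are placeholders.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Node captures the minimum this sketch needs: a name, a Tendermint node ID,
// and the public IP of the droplet the node will run on.
type Node struct {
	Name string
	ID   string
	IP   string
}

// persistentPeers builds the config.toml `persistent_peers` value from the
// droplet IPs, replacing the Docker-network addresses used by the local
// e2e runner.
func persistentPeers(nodes []Node) string {
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].Name < nodes[j].Name })
	peers := make([]string, 0, len(nodes))
	for _, n := range nodes {
		peers = append(peers, fmt.Sprintf("%s@%s:26656", n.ID, n.IP))
	}
	return strings.Join(peers, ",")
}

func main() {
	nodes := []Node{
		{Name: "validator01", ID: "a1b2c3", IP: "203.0.113.10"},
		{Name: "validator02", ID: "d4e5f6", IP: "203.0.113.11"},
	}
	fmt.Println(persistentPeers(nodes))
}
```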

Runner adds the load to the network

The e2e test runner currently generates load for the e2e tests. This logic could be extended to generate transactions for the release testnets.

The release testnets require a transaction data format that is more specific than what the nightly tests currently use. That data format can be ported over for use by the e2e runner. Additional work will be needed to incorporate the tm-loadtest periodic load-generation logic into the runner.
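A minimal sketch of periodic load generation follows, assuming a placeholder `broadcastTx` function and an illustrative transaction payload (the real runner would submit transactions through a node's RPC endpoint and use the release-testnet data format):

```go
// Hypothetical sketch: send a fixed-rate burst of transactions every second
// for a configured duration. broadcastTx is a placeholder for however the
// runner actually submits transactions.
package main

import (
	"context"
	"fmt"
	"time"
)

func broadcastTx(ctx context.Context, tx []byte) error {
	// Placeholder: submit the transaction to a node here.
	_, _ = ctx, tx
	return nil
}

// generateLoad sends `rate` transactions per second until `duration` elapses
// or the context is cancelled.
func generateLoad(ctx context.Context, rate int, duration time.Duration) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	deadline := time.Now().Add(duration)
	seq := 0
	for time.Now().Before(deadline) {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			for i := 0; i < rate; i++ {
				// Illustrative payload; the release testnets would use
				// their own transaction data format.
				tx := []byte(fmt.Sprintf("load-%d=%d", seq, time.Now().UnixNano()))
				if err := broadcastTx(ctx, tx); err != nil {
					return err
				}
				seq++
			}
		}
	}
	return nil
}

func main() {
	if err := generateLoad(context.Background(), 200, 90*time.Second); err != nil {
		fmt.Println("load generation error:", err)
	}
}
```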

Runner starts and stops the processes

The nightly e2e runner currently starts and stops the Tendermint Docker instances during the nightly tests. This logic can be adapted to start and stop the Tendermint processes on remote nodes during a large testnet run spanning many machines.
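One possible shape for this is sketched below, under the assumption that the remote nodes run Tendermint as a systemd unit reachable over ssh (neither of which is confirmed by the current tooling; this is illustration only):

```go
// Sketch only: start/stop Tendermint on remote droplets by shelling out to
// ssh and systemctl. The unit name "tendermint" and the host names are
// assumptions for illustration.
package main

import (
	"fmt"
	"os/exec"
)

// remoteSystemctl runs `systemctl <action> tendermint` on the given host over ssh.
func remoteSystemctl(host, action string) error {
	cmd := exec.Command("ssh", host, "sudo", "systemctl", action, "tendermint")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s tendermint on %s: %w: %s", action, host, err, out)
	}
	return nil
}

func main() {
	hosts := []string{"validator01.example", "validator02.example"}
	for _, h := range hosts {
		if err := remoteSystemctl(h, "start"); err != nil {
			fmt.Println(err)
		}
	}
}
```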

Runner retrieves the data

Currently, retrieving the Tendermint blockstore and the Prometheus data captured during the large-scale testnet is a manual process performed with a pair of Ansible playbooks. The data is collected by the network operator upon completion of the test and then manually uploaded to Digital Ocean storage.

This procedure can be automated and folded into the runner process. Upon completion of the test, the runner can fetch the blockstore and the Prometheus database and automatically upload them to Digital Ocean, either by placing them on a mounted drive intended for reuse or by uploading them directly to a Digital Ocean Space.
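As an illustration only, the collect-and-upload step could shell out to `scp` and the S3-compatible `aws s3 cp` CLI; the host names, remote paths, bucket, and endpoint below are placeholders:

```go
// Sketch under assumptions: pull the archived blockstore from each node with
// scp and push it to a Digital Ocean Space via the S3-compatible aws CLI.
package main

import (
	"fmt"
	"os/exec"
)

func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %w: %s", name, args, err, out)
	}
	return nil
}

func collectAndUpload(host, runID string) error {
	local := fmt.Sprintf("./results/%s-%s-blockstore.tar.gz", runID, host)
	// Fetch the archived blockstore from the node (remote path is illustrative).
	if err := run("scp", fmt.Sprintf("%s:/home/tendermint/blockstore.tar.gz", host), local); err != nil {
		return err
	}
	// Upload it to the Space through the S3-compatible endpoint.
	return run("aws", "s3", "cp", local,
		fmt.Sprintf("s3://qa-testnet-results/%s/", runID),
		"--endpoint-url", "https://nyc3.digitaloceanspaces.com")
}

func main() {
	if err := collectAndUpload("validator01.example", "v0.37.x-run1"); err != nil {
		fmt.Println(err)
	}
}
```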

Long term improvements

Runner manages the infrastructure

In the long term, the runner should be improved to directly manage the infrastructure running the testnet. This means the runner, running on a single DO instance, should be updated to be able to spawn and destroy all of the necessary droplets.

Managing a fleet of infrastructure is complex, and existing tools and practices, such as Terraform run from the command line, have many advantages: Terraform implements a declarative syntax and idempotent requests for resource creation, and it already has built-in definitions for many Digital Ocean resource types.

A future version of the runner should be augmented to perform resource creation and destruction without operator intervention. This would need to be done carefully to avoid any scenario where the tool provisions too many resources, or fails to destroy resources and leaves them running indefinitely. This is listed as a long-term improvement because it is complex and will require more careful consideration.
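One low-risk option, sketched here with illustrative paths and variables, would be for the runner to keep using the existing Terraform configuration and drive `terraform apply` and `terraform destroy` itself, deferring teardown so that even a failed run destroys its droplets:

```go
// Sketch only: wrap the existing Terraform configuration rather than
// reimplementing provisioning. The working directory and variable name are
// assumptions for illustration.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func terraform(dir string, args ...string) error {
	cmd := exec.Command("terraform", args...)
	cmd.Dir = dir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	dir := "./tf" // directory holding the testnet's Terraform configuration

	// Always attempt to destroy the droplets, so that a failed run does not
	// leave resources running indefinitely.
	defer func() {
		if err := terraform(dir, "destroy", "-auto-approve", "-var", "nodes=200"); err != nil {
			fmt.Println("teardown failed, resources may still be running:", err)
		}
	}()

	if err := terraform(dir, "apply", "-auto-approve", "-var", "nodes=200"); err != nil {
		fmt.Println("provisioning failed:", err)
		return
	}

	// ... run the QA suite against the provisioned testnet here ...
}
```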

Runner triggered from a GitHub Action upon release

Once the runner is able to provision resources in Digital Ocean, run the entire suite automatically, and retrieve and upload the results, it should be enhanced so that it can be started from a GitHub Action when a release is triggered.

Overall TODO

@sergio-mena

Thanks for this write-up @williambanfield! A few comments:

  • Does it make sense to add the improvements I have in mind on the result-extraction part, or is it worth doing it in a separate issue?
  • in "Runner triggered from a GitHub action upon release", a factor to take into account is where the Terraform credentials will be stored
  • This write-up seems great to me as a tracking issue. I'd propose to add a checklist at the end of the description with a one-liner on each of these points, which can be converted into sub-issues later on... so that if someone in the team wants to contribute, they can easily pick up a chunk of the work.


williambanfield commented Oct 18, 2022

Does it make sense to add the improvements I have in mind on the result-extraction part, or is it worth doing it in a separate issue?

Mind elaborating? I'm not remembering exactly which improvements you're describing, but the answer is probably 'yes, it does make sense'.

in "Runner triggered from a GitHub action upon release", a factor to take into account is where the Terraform credentials will be stored

This is definitely true. It would be nice if Digital Ocean provided some facility for this as well. The likely answer is to use GitHub Secrets, which allow you to keep encrypted information that is accessible during workflows.

This write-up seems great to me as a tracking issue. I'd propose to add a checklist at the end of the description with a one-liner on each of these points, which can be converted into sub-issues later on... so that if someone in the team wants to contribute, they can easily pick up a chunk of the work.

This makes total sense. I'll do that.

EDIT: done.


sergio-mena commented Oct 18, 2022

A summary of the result-extraction improvements I have in mind:

  • Turn the (exploratory) Octave scripts into Python for automatic plotting
  • Find a better way to extract Prometheus graphs. By "better", I mean: a) less manual than Firefox's snapshot feature, b) allowing customization of things like labels, legends, axis setup, and titles (see the sketch at the end of this comment)
  • Modify the loadtime tool to sort the output of the experiments in the report
  • (stretch goal) Come up with a less manual way of generating the report (results presented more like a CI report, less like research results)

If you agree, I can add a section above with these (and any others I might have missed).
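For the Prometheus point, one less manual route is to script the Prometheus HTTP API's `query_range` endpoint and feed the raw series into a plotting script with full control over labels, legends, and axes. A minimal sketch, with the Prometheus address, metric, and time window as placeholders:

```go
// Minimal sketch of pulling raw series data from Prometheus' HTTP API
// (query_range). Address, metric, and window are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	end := time.Now()
	start := end.Add(-30 * time.Minute)

	q := url.Values{}
	q.Set("query", "tendermint_mempool_size") // placeholder metric
	q.Set("start", fmt.Sprintf("%d", start.Unix()))
	q.Set("end", fmt.Sprintf("%d", end.Unix()))
	q.Set("step", "15s")

	resp, err := http.Get("http://localhost:9090/api/v1/query_range?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The JSON response contains the raw time series, which can then be
	// plotted programmatically instead of via browser snapshots.
	fmt.Println(string(body))
}
```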

@thanethomson added the qa (Quality assurance) label on Oct 19, 2022
@williambanfield

Modify loadtime tool to sort the output of the experiments in the report

Can you elaborate on this specific goal?

The other goals definitely make sense.

@sergio-mena

Modify loadtime tool to sort the output of the experiments in the report

Can you elaborate on this specific goal?

Sure. See the report obtained the last time we ran the 200-node testnet for v0.37.x: report.txt. I'd like the entries to be ordered by connections and rate. I'm sure it's not a big deal to implement, but it is a source of extra manual adjustments when preparing the table in the "Finding the Saturation Point" section.
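For illustration, the desired ordering would look something like the following (the `Entry` type is a stand-in, not the actual loadtime report structure):

```go
// Illustrative only: sort report entries by connections, then by rate.
package main

import (
	"fmt"
	"sort"
)

type Entry struct {
	Connections int
	Rate        int
}

func sortEntries(entries []Entry) {
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].Connections != entries[j].Connections {
			return entries[i].Connections < entries[j].Connections
		}
		return entries[i].Rate < entries[j].Rate
	})
}

func main() {
	entries := []Entry{
		{Connections: 2, Rate: 200},
		{Connections: 1, Rate: 400},
		{Connections: 1, Rate: 200},
	}
	sortEntries(entries)
	fmt.Println(entries) // [{1 200} {1 400} {2 200}]
}
```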

The other goals definitely make sense.

Thanks. Let me add them, then.

@thanethomson added the T:tracking (Tracking issue for other issues) label on Dec 20, 2022
The github-actions bot added the stale (for use by stalebot) label on Feb 17, 2023
The github-actions bot closed this as not planned on Feb 22, 2023