docs: add structure of support review process

docs: update dev-guide with filters for process cmd

doc: review support and install guides

doc: PR review; review support guide and formatting

doc: creating troubleshooting document and migrating from user guide

doc: review install-review guide

doc: overall review

doc: @rborst PR review

docs/review: update mkdocs and dev ToC after rebase

docs: review - ready for final review

doc/support-guide: review checklist reference

doc/support-guide: PR review for @bostrt

doc/support-guide: add insights cmdline

Dedicated mode now default and baseline results download (#1)

* doc: dedicated mode is now default

* doc/support-guide: steps on downloading baseline results

* Update docs/user.md

Co-authored-by: Marco Braga <braga@mtulio.eng.br>

* Update docs/user.md

Co-authored-by: Marco Braga <braga@mtulio.eng.br>

* docs: remove development env guidance from user guide

* docs: no longer need aws CLI and can reference HTML webpage hosted in S3

* docs: remove requirement for AWS access key

Co-authored-by: Marco Braga <braga@mtulio.eng.br>
mtulio and mtulio committed Dec 6, 2022
1 parent 6bed166 commit ce9f6b0
Showing 8 changed files with 755 additions and 139 deletions.
8 changes: 8 additions & 0 deletions docs/README.md
@@ -1,6 +1,14 @@
# OpenShift Provider Certification Tool

Welcome to the documentation for the OpenShift Provider Certification Tool!

The OpenShift Provider Certification Tool is used to evaluate whether an OpenShift installation on a provider or hardware is in conformance.

Here you can find the initial steps to use the OpenShift Provider Certification Tool.

- [User Guide](./user.md)
- [Installation Check List](./user-installation-checklist.md)
- [Installation Review](./user-installation-review.md)
- [Results Review](./user-results-review.md)
- [Support Guide](./support-guide.md)
- [Development Guide](./dev.md)
61 changes: 52 additions & 9 deletions docs/dev.md
@@ -1,6 +1,21 @@
# Provider Certification Tool
# Provider Certification Tool - Developer Guide

## Release
This document is a guide for developers detailing the Provider Certification Tool solution, design choices and the implementation references.

Table of Contents:

- [Release](#release)
- [Development Notes](#dev-notes)
- [Command Line Interface](#dev-cli)
- [Integration with Sonobuoy CLI](#dev-integration-cli)
- [Sonobuoy Plugins](#dev-sonobuoy-plugins)
- [Diagrams](#dev-diagrams)
- [CLI commands](#dev-diagram-cli)
- [CLI Result filters](#dev-diagram-filters)
- [Running Customized Certification Plugins](#dev-running-custom-plugins)
- [Project Documentation](#dev-project-docs)

## Release <a name="release"></a>

Releasing a new version of the provider certification tool is done automatically through [this GitHub Action](https://github.com/redhat-openshift-ecosystem/provider-certification-tool/blob/main/.github/workflows/release.yaml)
which runs on new tags. Tags should be named in the format `v0.1.0`.
@@ -9,7 +24,7 @@ Tags should only be created from the `main` branch which only accepts pull-reque

Note that any version in v0.* will be considered part of the preview release of the certification tool.
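
As a small illustration of the tag convention, the following shell sketch checks a tag name before it is pushed. It assumes the format means three dot-separated numeric components prefixed with `v` (the text only gives `v0.1.0` as an example). The function names `is_release_tag` and `is_preview_tag` are hypothetical and are not part of the release workflow.

```bash
# Hypothetical helpers; not part of the release workflow.
# Assumption: a valid tag is "v" followed by MAJOR.MINOR.PATCH, all numeric.
is_release_tag() {
    printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Tags in v0.* are treated as part of the preview release.
is_preview_tag() {
    is_release_tag "$1" && case "$1" in v0.*) return 0 ;; *) return 1 ;; esac
}
```

For example, `is_release_tag v0.1.0` succeeds, while `is_release_tag 0.1.0` fails because the `v` prefix is missing.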

## Development Notes
## Development Notes <a name="dev-notes"></a>

This tool builds heavily on
[Sonobuoy](https://sonobuoy.io) therefore at least
@@ -21,15 +36,15 @@ The OpenShift provider certification tool extends Sonobuoy in two places:
- Command line interface (CLI)
- Plugins

### Command Line Interface
### Command Line Interface <a name="dev-cli"></a>

Sonobuoy provides its own CLI, but it has a considerable number of flags and options
which can be overwhelming. This isn't an issue with Sonobuoy; it's just the result
of being a very flexible tool. However, for simplicity's sake, the OpenShift
certification tool extends the Sonobuoy CLI with some strong opinions specific
to the realm of certifying OpenShift on new infrastructure.

#### Integration with Sonobuoy CLI
#### Integration with Sonobuoy CLI <a name="dev-integration-cli"></a>
The OpenShift provider certification tool's CLI is written in Golang so that extending
Sonobuoy is easily done. Sonobuoy has two specific areas on which we build:

@@ -58,16 +73,44 @@ reader, ec, err := config.SonobuoyClient.RetrieveResults(&client.RetrieveConfig{
})
```

### Sonobuoy Plugins
### Sonobuoy Plugins <a name="dev-sonobuoy-plugins"></a>

*TODO* (Cert tool's plugin development is still in POC phase)

### Diagrams
### Diagrams <a name="dev-diagrams"></a>

#### CLI commands <a name="dev-diagram-cli"></a>

Here's the highest level diagram showing the filenames or packages for code:
![](./command-diagram.png)

### Running Customized Certification Plugins
#### CLI Result filters <a name="dev-diagram-filters"></a>

The CLI currently implements a few filters to help reviewers (Partners, Support, and Engineering teams) find the root cause of the failures. When running the `process` command, the filters consume the data sources below to improve the feedback for each plugin:

- A. `"Provider's Result"`: the original list of failures reported by the plugin, as shown by the `results` command
- B. `"Suite List"`: the list of e2e tests available in the respective suite. For example, the plugin `openshift-kubernetes-conformance` uses the suite `kubernetes/conformance`
- C. `"Baseline's Result"`: the list of e2e tests that failed in the baseline provider. That list is built from the same Certification Environment (OCP Agnostic Installation) on a known/supported platform (for example, AWS and vSphere). Red Hat has many teams dedicated to reviewing and improving the thousands of e2e tests running in CI, and that list is constantly reviewed to decrease the number of false negatives and to help look for the root cause.
- D. `"Sippy"`: Sippy is the system used to extract insights from the CI jobs. It can provide failure statistics for individual e2e tests across the entire CI ecosystem, providing a picture of the failures happening in the provider's environment. For each failed e2e test, the filter checks whether failures of that test have also occurred in CI for the version being certified.

Currently, this is the order of filters used to show the failures on the `process` command:

- `A intersection B` -> `Filter1`
- `Filter1 exclusion C` -> `Filter2`
- `Filter2 exclusion D` -> `Filter3`

The reviewers should look at the list of failures in the following order:

- `Filter3`
- `Filter2`
- `Filter1`
- `A`
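
The filter chain can be pictured with plain text files that hold one test name per line. The sketch below is only an illustration in shell, not the tool's implementation (the CLI itself is written in Go). The helper names `intersect` and `exclude` and the sample test names are made up, and source D (Sippy) is simplified to a static list of test names.

```bash
# Illustrative only: helper names and sample data are hypothetical.

# intersect A B: print the lines of file A that also appear in file B.
intersect() {
    grep -Fxf "$2" "$1" || true   # grep exits 1 when nothing matches
}

# exclude A B: print the lines of file A that do not appear in file B.
exclude() {
    grep -Fvxf "$2" "$1" || true
}

workdir=$(mktemp -d)
printf '%s\n' test-a test-b test-c test-d test-x > "$workdir/A.txt"  # provider's failures
printf '%s\n' test-a test-b test-c test-d test-e > "$workdir/B.txt"  # suite list
printf '%s\n' test-b > "$workdir/C.txt"                              # baseline's failures
printf '%s\n' test-c > "$workdir/D.txt"                              # failures also seen in CI

intersect "$workdir/A.txt" "$workdir/B.txt" > "$workdir/filter1.txt"
exclude "$workdir/filter1.txt" "$workdir/C.txt" > "$workdir/filter2.txt"
exclude "$workdir/filter2.txt" "$workdir/D.txt" > "$workdir/filter3.txt"

cat "$workdir/filter3.txt"   # test-a and test-d remain for review
```

In this toy data, `test-x` is dropped by `Filter1` because it is not in the suite, `test-b` by `Filter2` because it also fails in the baseline, and `test-c` by `Filter3` because it is a known CI failure.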

The diagram visualizing the filters is available on draw.io, stored on a shared Google Drive; a valid Red Hat account is needed to access it (we plan to make it public soon):
- https://app.diagrams.net/#G1NOhcF3jJtE1MjWCtbVgLEeD24oKr3IGa


### Running Customized Certification Plugins <a name="dev-running-custom-plugins"></a>

In some situations, you may need to modify the certification plugins that are run by the certification tool.
Running the certification tool with customized plugin manifests cannot be used for final certification of an OpenShift cluster!
@@ -88,7 +131,7 @@ vi /tmp/openshift-kube-conformance.yaml
openshift-provider-cert run --plugin /tmp/openshift-kube-conformance.yaml --plugin /tmp/openshift-conformance-validated.yaml
```

### Project Documentation
### Project Documentation <a name="dev-project-docs"></a>

The documentation is available in the directory `docs/`. You can render it as HTML locally using `mkdocs`; the HTML version is not yet published.

272 changes: 272 additions & 0 deletions docs/support-guide.md
@@ -0,0 +1,272 @@
# OpenShift Provider Certification Tool - Support Guide

- [Support Case Check List](#check-list)
- [New Support Cases](#check-list-new-case)
- [New Executions](#check-list-new-executions)
- [Review Environment](#setup)
- [Install Tools](#setup-install)
- [Download Baseline CI results](#setup-download-baseline)
- [Download Partner Results](#setup-download-results)
- [Review guide: exploring the failed tests](#review-process)
- [Exploring the failures](#review-process-exploring)
- [Extracting the failures to a local directory](#review-process-extracting)
- [Understanding the extracted results](#review-process-explain)
- [Review Guidelines](#review-process-guidelines)


## Support Case Check List <a name="check-list"></a>

### New Support Cases <a name="check-list-new-case"></a>

Checklist of items to request when a **new** support case has been opened:

- Documentation: installation steps, including the flavors/sizes of the infrastructure and the steps to install OCP
- Documentation: Diagram of the Architecture including zonal deployment
- Archive with Certification results
- Archive with must-gather
- [Installation Checklist (file `user-installation-checklist.md`)](./user-installation-checklist.md) with the partner's updates to sign off the post-installation items

### New Executions <a name="check-list-new-executions"></a>

The following certification assets should be updated when any of the conditions listed further below occurs:

- Certification Results
- Must Gather
- Install Documentation (when any item/flavor/configuration has been modified)


The following conditions require new certification assets:

- The version of the OpenShift Container Platform has been updated
- Any Infrastructure component(s) (e.g.: server size, disk category, ELB type/size/config) or cluster dependencies (e.g.: external storage backend for image registry) have been modified


## Review Environment <a name="setup"></a>

### Install Tools <a name="setup-install"></a>

- Download the [openshift-provider-cert](./user.md#install): OpenShift Provider Certification tool
- Download the [`omg`](https://github.com/kxr/o-must-gather): tool to analyse Must-gather archive
```bash
pip3 install o-must-gather --user
```

### Download Baseline CI results <a name="setup-download-baseline"></a>

The OpenShift provider certification tool is run periodically ([source code](https://github.com/openshift/release/blob/master/ci-operator/jobs/redhat-openshift-ecosystem/provider-certification-tool/redhat-openshift-ecosystem-provider-certification-tool-main-periodics.yaml)) in OpenShift CI using the latest stable release of OpenShift.
These baseline results are stored long-term in an AWS S3 bucket (`s3://openshift-provider-certification/baseline-results`). An HTML listing can be found here: https://openshift-provider-certification.s3.us-west-2.amazonaws.com/index.html.
These baseline results should be used as a reference when reviewing a partner's certification results.

1. Identify the cluster version in the partner's must-gather:
```bash
$ omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.13   True        False         11h     Cluster version is 4.11.13
```
2. Navigate to https://openshift-provider-certification.s3.us-west-2.amazonaws.com/index.html and find the latest results (by date) for the matching OpenShift version
3. Download the *latest* test results for the version (bottom of the list), using the results archive link copied from the webpage in the previous step.
```bash
$ curl --output 4.11.13-20221125.tar.gz https://openshift-provider-certification.s3.us-west-2.amazonaws.com/baseline-results/4.11.13-20221125.tar.gz
$ file 4.11.13-20221125.tar.gz
4.11.13-20221125.tar.gz: gzip compressed data, original size modulo 2^32 430269440
```

4. Proceed with comparing baseline results with actual provider results.
- Download the suite test list for the version used by the partner

```bash
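# Set this to the OpenShift version used by the partner (4.11.4 is only an example).
RELEASE_VERSION="4.11.4"
# Resolve the image that ships the e2e test binary for that release.
TESTS_IMG=$(oc adm release info "${RELEASE_VERSION}" --image-for='tests')
# Extract the openshift-tests binary into the current directory.
oc image extract "${TESTS_IMG}" --file="/usr/bin/openshift-tests"
chmod u+x ./openshift-tests
# List (without running) the tests of each suite.
./openshift-tests run --dry-run kubernetes/conformance > ./test-list_openshift-tests_kubernetes-conformance.txt
./openshift-tests run --dry-run openshift/conformance > ./test-list_openshift-tests_openshift-validated.txt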
```
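
Picking the latest baseline archive for a version (steps 2 and 3) can be partly scripted once the archive names from the listing page are saved as text. The sketch below assumes the names follow the `<version>-<YYYYMMDD>.tar.gz` pattern seen in the download example. The function name `latest_baseline` is hypothetical; the certification tool provides no such command.

```bash
# Hypothetical helper. Assumption: archive names look like
# <version>-<YYYYMMDD>.tar.gz, as in the example 4.11.13-20221125.tar.gz.
# Reads archive names from stdin (one per line) and prints the newest one
# for the version given as $1.
latest_baseline() {
    version_re=$(printf '%s' "$1" | sed 's/[.]/\\./g')   # escape dots for the regex
    grep -E "^${version_re}-[0-9]{8}[.]tar[.]gz\$" | sort | tail -n 1
}

printf '%s\n' \
    4.11.4-20221125.tar.gz \
    4.11.13-20221118.tar.gz \
    4.11.13-20221125.tar.gz |
    latest_baseline 4.11.13   # prints 4.11.13-20221125.tar.gz
```

Sorting the names as plain text is enough here because the assumed date suffix is in year-month-day order.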

### Download Partner Results <a name="setup-download-results"></a>

- Download the Provider certification archive from the Support Case. Example file name: `retrieved-archive.tar.gz`
- Download the Must-gather from the Support Case. Example file name: `must-gather.tar.gz`

## Review guide: exploring the failed tests <a name="review-process"></a>

The steps below use the subcommand `process` to apply filters to the failed tests, keeping the initial focus of the investigation on the failures that appear exclusively in the partner's results.

The filters consider only the tests included in the respective suite and set aside common failures identified in the baseline results or flakes from CI. For more details about the filters, read the [dev documentation describing the filters flow](./dev.md#dev-diagram-filters).

Requirements for this section:

- OPCT CLI downloaded to the current directory
- OpenShift e2e test suite exported to the current directory
- Baseline results exported to the current directory
- The Certification Result is in the current directory
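
Before running `process`, it can help to confirm that all of these inputs are actually present. This is a minimal sketch, assuming a hypothetical helper named `require_files` (it is not part of the OPCT CLI). The file names in the usage comment are the example names used in this guide and should be adjusted to the real ones.

```bash
# Hypothetical pre-flight helper; not part of the OPCT CLI.
# Prints every missing file to stderr and returns non-zero if any is missing.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (adjust the names to your files):
#   require_files ./openshift-provider-cert-linux-amd64 \
#       ./test-list_openshift-tests_openshift-validated.txt \
#       ./test-list_openshift-tests_kubernetes-conformance.txt \
#       ./retrieved-archive.tar.gz
```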


### Exploring the failures <a name="review-process-exploring"></a>

Compare the provider results with the baseline:

```bash
./openshift-provider-cert-linux-amd64 process \
--baseline ./opct_baseline-ocp_4.11.4-platform_none-provider-date_uuid.tar.gz \
--base-suite-ocp ./test-list_openshift-tests_openshift-validated.txt \
--base-suite-k8s ./test-list_openshift-tests_kubernetes-conformance.txt \
./<timestamp>_sonobuoy_<uuid>.tar.gz
```

### Extracting the failures to a local directory <a name="review-process-extracting"></a>

Compare the results and extract the files (option `--save-to`) to the local directory `./processed`:

```bash
./openshift-provider-cert-linux-amd64 process \
--baseline ./opct_baseline-ocp_4.11.4-platform_none-provider-date_uuid.tar.gz \
--base-suite-ocp ./test-list_openshift-tests_openshift-validated.txt \
--base-suite-k8s ./test-list_openshift-tests_kubernetes-conformance.txt \
--save-to processed \
./<timestamp>_sonobuoy_<uuid>.tar.gz
```

This is the expected output:

> Note: the column alignment of the output below was lost when it was pasted into Markdown
```bash
(...Header...)

> Processed Summary <

Total Tests suites:
- kubernetes/conformance: 353
- openshift/conformance: 3488

Total Tests by Certification Layer:
- openshift-kube-conformance:
- Status: failed
- Total: 675
- Passed: 654
- Failed: 21
- Timeout: 0
- Skipped: 0
- Failed (without filters) : 21
- Failed (Filter SuiteOnly): 2
- Failed (Filter Baseline : 2
- Failed (Filter CI Flakes): 0
- Status After Filters : pass
- openshift-conformance-validated:
- Status: failed
- Total: 3818
- Passed: 1708
- Failed: 61
- Timeout: 0
- Skipped: 2049
- Failed (without filters) : 61
- Failed (Filter SuiteOnly): 32
- Failed (Filter Baseline : 7
- Failed (Filter CI Flakes): 2
- Status After Filters : failed

Total Tests by Certification Layer:

=> openshift-kube-conformance: (2 failures, 2 flakes)

--> Failed tests to Review (without flakes) - Immediate action:
<empty>

--> Failed flake tests - Statistic from OpenShift CI
Flakes Perc TestName
1 0.138% [sig-api-machinery] CustomResourcePublishOpenAPI [Privileged:ClusterAdmin] works for multiple CRDs of same group and version but different kinds [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
2 0.275% [sig-api-machinery] ResourceQuota should create a ResourceQuota and capture the life of a secret. [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]


=> openshift-conformance-validated: (7 failures, 5 flakes)

--> Failed tests to Review (without flakes) - Immediate action:
[sig-network-edge][Feature:Idling] Unidling should handle many TCP connections by possibly dropping those over a certain bound [Serial] [Skipped:Network/OVNKubernetes] [Suite:openshift/conformance/serial]
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with pvc data source [Suite:openshift/conformance/parallel] [Suite:k8s]

--> Failed flake tests - Statistic from OpenShift CI
Flakes Perc TestName
101 10.576% [sig-arch][bz-DNS][Late] Alerts alert/KubePodNotReady should not be at or above pending in ns/openshift-dns [Suite:openshift/conformance/parallel]
67 7.016% [sig-arch][bz-Routing][Late] Alerts alert/KubePodNotReady should not be at or above pending in ns/openshift-ingress [Suite:openshift/conformance/parallel]
2 0.386% [sig-imageregistry] Image registry should redirect on blob pull [Suite:openshift/conformance/parallel]
32 4.848% [sig-network][Feature:EgressFirewall] egressFirewall should have no impact outside its namespace [Suite:openshift/conformance/parallel]
11 2.402% [sig-network][Feature:EgressFirewall] when using openshift-sdn should ensure egressnetworkpolicy is created [Suite:openshift/conformance/parallel]

Data Saved to directory './processed/'
```

> TODO: create the index with a legend with references to the output.

### Understanding the extracted results <a name="review-process-explain"></a>

The data extracted to local storage contains the following files for each plugin:

- `tests_${PLUGIN_NAME}_baseline_failures.txt`: list of test failures from the baseline execution
- `tests_${PLUGIN_NAME}_provider_failures.txt`: list of test failures from the provider execution
- `tests_${PLUGIN_NAME}_provider_filter1-suite.txt`: list of test failures that are included in the suite
- `tests_${PLUGIN_NAME}_provider_filter2-baseline.txt`: list of test failures remaining after applying all filters
- `tests_${PLUGIN_NAME}_suite_full.txt`: list of the suite's e2e tests

The base directory (`./processed`) also contains **all error messages (stdout and failure summary)** for each failed test. Those errors are saved in individual files in the following sub-directories (for each plugin):

- `failures-baseline/${PLUGIN_NAME}_${INDEX}-failure.txt`: the error summary
- `failures-baseline/${PLUGIN_NAME}_${INDEX}-systemOut.txt`: the entire stdout of the failed plugin

Considerations:

- `${PLUGIN_NAME}`: currently, these plugin names are valid: [`openshift-validated`, `kubernetes-conformance`]
- `${INDEX}`: a simple index ordered by test name in the list

Example of files in the extracted directory:
```bash
$ tree processed/
processed/
├── failures-baseline
[redacted]
├── failures-provider
[redacted]
├── failures-provider-filtered
│ ├── kubernetes-conformance_1-1-failure.txt
│ ├── kubernetes-conformance_1-1-systemOut.txt
│ ├── kubernetes-conformance_2-2-failure.txt
│ ├── kubernetes-conformance_2-2-systemOut.txt
│ ├── openshift-validated_1-31-failure.txt
│ ├── openshift-validated_1-31-systemOut.txt
[redacted]
│ ├── openshift-validated_7-1-failure.txt
│ └── openshift-validated_7-1-systemOut.txt
├── tests_kubernetes-conformance_baseline_failures.txt
├── tests_kubernetes-conformance_provider_failures.txt
├── tests_kubernetes-conformance_provider_filter1-suite.txt
├── tests_kubernetes-conformance_provider_filter2-baseline.txt
├── tests_kubernetes-conformance_suite_full.txt
├── tests_openshift-validated_baseline_failures.txt
├── tests_openshift-validated_provider_failures.txt
├── tests_openshift-validated_provider_filter1-suite.txt
├── tests_openshift-validated_provider_filter2-baseline.txt
└── tests_openshift-validated_suite_full.txt

3 directories, 300 files
```
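
For a quick view of how many entries each extracted list holds, the files can be counted with standard tools. This sketch assumes each `tests_*.txt` file contains one test name per line, which this guide does not confirm. `count_lists` is a hypothetical helper name.

```bash
# Hypothetical helper. Assumption: one test name per line in each list file.
# Prints "<count> <file name>" for every tests_*.txt file in directory $1.
count_lists() {
    for f in "$1"/tests_*.txt; do
        [ -f "$f" ] || continue          # skip when the glob matches nothing
        printf '%s %s\n' "$(grep -c . "$f")" "$(basename "$f")"
    done
}

# Usage: count_lists ./processed
```

Comparing these counts with the `Failed (...)` lines of the processed summary is a simple way to check that the right directory is being reviewed.
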

### Review Guidelines <a name="review-process-guidelines"></a>

> WIP: the idea here is to provide guidance on the main points/assets to review, pointing to the details in the respective dedicated sections.

This section is a guide to the initial files to review when you start exploring the resulting archive.

Items to review:

- The OCP version matches the certification request
- Review the result file
- Check whether the number of failures is 0; if not, review the failures one by one
- To make the review process easier to coordinate, a spreadsheet named `failures-index.xlsx` is created inside the extracted directory (`./processed/` in the example from the previous section). It can be used as a tool to review failures and take notes about them.
- Check the details of each failed test in the sub-directory `failures-provider-filtered/*.txt`.

Additional items to review:

- Explore the must-gather objects according to the findings in the failure files
- Run insights rules on the must-gather to check whether there is a new known issue: `insights run -p ccx_rules_ocp ${MUST_GATHER_PATH}`

> TODO: provide steps to install and run insights OCP rules (opct could provide one container with it installed to avoid overhead and environment issues)
