OPCT-226: Added check rules for etcd (#80)

Introducing etcd check rules based on the parsed etcd logs (etcd request took too long). The acceptance values are calibrated from the baseline providers:

- AWS (ocp414rc0_AWS_None_202309222127_sonobuoy_47efe9ef-06e4-48f3-a190-4e3523ff1ae0.tar.gz): ![Screenshot from 2023-09-25 13-41-29](https://github.com/redhat-openshift-ecosystem/provider-certification-tool/assets/3216894/1291fe8f-e0c9-41bc-9c99-82c543ab3e5f)
- AWS (4.13.9-20230925-HighlyAvailable-aws-None-202309250459_sonobuoy_f4e06587-e7b3-4cbd-bcf4-d1350ec35d9a.tar.gz): ![Screenshot from 2023-09-25 13-38-17](https://github.com/redhat-openshift-ecosystem/provider-certification-tool/assets/3216894/25027448-f141-448d-b6eb-f8932ff002d8)
- vSphere (4.13.9-20230821-HighlyAvailable-vsphere-None.tar.gz): ![Screenshot from 2023-09-25 13-34-52](https://github.com/redhat-openshift-ecosystem/provider-certification-tool/assets/3216894/d1d43dbb-31fd-40d9-a487-fc6a5d213a9f)

The check rules are implemented in the feature #76: https://github.com/redhat-openshift-ecosystem/provider-certification-tool/blob/dc2156317dfc92b47c582b494fef47876337d94b/internal/opct/report/checks.go#L321-L388

---------

Co-authored-by: Richard Vanderpool <49568690+rvanderp3@users.noreply.github.com>
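
To illustrate the acceptance criteria this commit documents (OPCT-010: average under 500 ms, OPCT-011: maximum under 1000 ms), here is a minimal sketch of the comparison. It is not the implementation in `checks.go`, which may differ. The function name `check_etcd_slow_requests` and its input format (one slow-request duration in milliseconds per line on stdin) are assumptions made for this example.

```sh
# Hypothetical helper, not part of OPCT: reads one duration in milliseconds
# per line on stdin and applies the two thresholds described in the rules.
check_etcd_slow_requests() {
  awk '
    {
      v = $1 + 0                      # force numeric interpretation
      sum += v
      if (NR == 1 || v > max) max = v
    }
    END {
      if (NR == 0) { print "no samples"; exit 2 }
      avg = sum / NR
      printf "OPCT-010 avg=%.3f want=<500 %s\n", avg, (avg < 500 ? "pass" : "fail")
      printf "OPCT-011 max=%.3f want=<1000 %s\n", max, (max < 1000 ? "pass" : "fail")
    }'
}

# Example: two slow requests, values chosen only for illustration.
printf '690.412\n3091.49\n' | check_etcd_slow_requests
```

The sketch keeps the two checks separate because a cluster can pass one and fail the other: a few very slow requests can exceed the maximum while the average stays low.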
redhat-openshift-ecosystem · Sep 25, 2023 · 32853e2
1 parent 4b2d829
commit 32853e2
Showing 1 changed file with 64 additions and 0 deletions.
diff --git a/docs/review/rules.md b/docs/review/rules.md
@@ -163,6 +163,70 @@ Explore the pods:
 ```sh
 $ omc get pods -A |egrep -v '(Running|Completed)'
 ```
+
+### OPCT-010 <a name="OPCT-010"></a>
+- **Name**: etcd logs: slow requests: average should be under 500ms
+- **Description**: etcd logs are reporting slow requests with an average above 500 milliseconds.
+- **Action**: Check whether the storage volume for control plane nodes, or the dedicated volume for etcd, has the performance required to run etcd in a production environment.
+- **Troubleshooting**:
+
+1) Review the documentation for the required storage for etcd:
+
+- A) [Product Documentation](https://docs.openshift.com/container-platform/4.13/installing/installing_platform_agnostic/installing-platform-agnostic.html#installation-minimum-resource-requirements_installing-platform-agnostic)
+- B) [Red Hat Article: Understanding etcd and the tunables/conditions affecting performance](https://access.redhat.com/articles/7010406#effects-of-network-latency--jitter-on-etcd-4)
+- C) [Red Hat Article: How to Use 'fio' to Check Etcd Disk Performance in OCP](https://access.redhat.com/solutions/4885641)
+- D) [etcd-operator: baseline speed for standard hardware](https://github.com/openshift/cluster-etcd-operator/blob/f68835306c2d6670697a5fd98ba8c6ffe197ab02/pkg/hwspeedhelpers/hwhelper.go#L21-L34)
+
+2) Check the performance as described in article (B)
+
+3) Review the processed values from your environment
+
+!!! danger "Requirement"
+    It is required to run a conformance validation in a new cluster.
+
+    The validation tests parse the etcd logs from the entire cluster, including historical data. If you changed
+    the storage and didn't recreate the cluster, the results will include slow requests from the
+    old storage, which skews the current view.
+
+Run the report with the debug flag `--loglevel=debug`:
+```text
+(...)
+DEBU[2023-09-25T12:52:05-03:00] Check OPCT-010 Failed Acceptance criteria: want=[<500] got=[690.412] 
+DEBU[2023-09-25T12:52:05-03:00] Check OPCT-011 Failed Acceptance criteria: want=[<1000] got=[3091.49]
+```
+
+Extract the information from the logs using the internal utility:
+
+```sh
+# Export the path of extracted must-gather. Example:
+export MUST_GATHER_PATH=${PWD}/must-gather.local.2905984348081335046
+
+# Extract the utility
+oc image extract quay.io/ocp-cert/tools:latest --file="/usr/bin/ocp-etcd-log-filters" &&\
+chmod u+x ocp-etcd-log-filters
+
+# Run the utility
+cat ${MUST_GATHER_PATH}/*/namespaces/openshift-etcd/pods/*/etcd/etcd/logs/current.log \
+    | ./ocp-etcd-log-filters
+```
+
+References:
+
+- [etcd: Hardware recommendations](https://etcd.io/docs/v3.5/op-guide/hardware/)
+- [OpenShift Docs: Planning your environment according to object maximums](https://docs.openshift.com/container-platform/4.11/scalability_and_performance/planning-your-environment-according-to-object-maximums.html)
+- [OpenShift KCS: Backend Performance Requirements for OpenShift etcd](https://access.redhat.com/solutions/4770281)
+- [IBM: Using Fio to Tell Whether Your Storage is Fast Enough for Etcd](https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd)
+
+
+### OPCT-011 <a name="OPCT-011"></a>
+
+- **Name**: etcd logs: slow requests: maximum should be under 1000ms
+- **Description**: etcd logs are reporting slow requests with a maximum above 1000 milliseconds.
+- **Action**: Check whether the storage volume for control plane nodes, or the dedicated volume for etcd, has the performance required to run etcd in a production environment.
+- **Troubleshooting**:
+
+Same as [`Troubleshooting` section of OPCT-010](#OPCT-010)
+
 ___
 <!-- 
 > Add new tests after "___" using the following template.