
test: add multus validation test routine to rook binary #12069

Merged — 1 commit merged into rook:master on May 2, 2023

Conversation

BlaineEXE (Member):

Add a more involved multus validation test to the Rook binary. Because this is intended to be end-user runnable, make sure operator-only commands are hidden.

Build this into the rook binary instead of creating a separate binary, both for ease and because any binary built against the Kubernetes API grows to 40+ megabytes. We save quite a bit of space by including this in the Rook binary, which is good for keeping container layers as small as possible.
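As a rough illustration of the command hiding mentioned above — a minimal sketch using the spf13/cobra library that Rook's CLI is built on, with command names assumed for illustration rather than taken from this PR:

```go
package main

import "github.com/spf13/cobra"

func main() {
	root := &cobra.Command{Use: "rook"}

	// Operator-only entrypoint: still runnable, but Hidden keeps it out of
	// `rook --help` so end users running the validation tool never see it.
	operatorCmd := &cobra.Command{
		Use:    "operator",
		Hidden: true,
		Run:    func(cmd *cobra.Command, args []string) { /* operator loop */ },
	}

	// End-user-facing command stays visible in the help output.
	multusCmd := &cobra.Command{
		Use:   "multus",
		Short: "Run a Multus validation test for Rook",
	}

	root.AddCommand(operatorCmd, multusCmd)
	_ = root.Execute()
}
```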

Sample output:

```console
2023-04-11 15:57:24.547256 I | multus-validation: starting multus validation test with the following config:
2023-04-11 15:57:24.547357 I | multus-validation:   namespace: "openshift-storage"
2023-04-11 15:57:24.547359 I | multus-validation:   public network: "public-net"
2023-04-11 15:57:24.547361 I | multus-validation:   cluster network: "cluster-net"
2023-04-11 15:57:24.547363 I | multus-validation:   daemons per node: 3
2023-04-11 15:57:24.547366 I | multus-validation:   resource timeout: 3m0s
2023-04-11 15:57:25.022325 I | multus-validation: continuing: no web server network info: pod has no network status yet: cannot find network status
2023-04-11 15:57:29.088309 I | multus-validation: starting 3 client DaemonSets
2023-04-11 15:57:31.454986 I | multus-validation: expecting 9 clients
2023-04-11 15:57:33.669491 I | multus-validation: continuing: number of running clients [7] is not the number expected [9]
2023-04-11 15:57:35.866094 I | multus-validation: all 9 clients are running - but may not be ready
2023-04-11 15:57:38.055692 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:40.247977 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:42.444894 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:44.664455 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:46.855260 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:49.053469 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:51.240288 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:53.450619 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:55.637100 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:57.825276 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:00.031465 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:02.245487 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:04.439255 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:06.663148 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:08.854578 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:11.046529 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:13.228783 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:15.421473 I | multus-validation: continuing: number of ready clients [5] is not the number expected [9]
2023-04-11 15:58:17.632599 I | multus-validation: continuing: number of ready clients [5] is not the number expected [9]
2023-04-11 15:58:19.821952 I | multus-validation: continuing: number of ready clients [7] is not the number expected [9]
2023-04-11 15:58:22.013965 I | multus-validation: all 9 clients are ready

RESULT: multus validation test succeeded!

cleaning up multus validation test resources in namespace "openshift-storage"
multus validation test resources were successfully cleaned up
```

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@BlaineEXE added the test (unit or integration testing) and multus labels on Apr 11, 2023
cmd/rook/ctl/multus/validation/validation.go — 2 resolved review threads
```go
report := results.SuggestedDebuggingReport()

// success/failure message
fmt.Print("\n")
```
Contributor: Suggested change — replace `fmt.Print("\n")` with `fmt.Println()`.

BlaineEXE (Member, Author): The output all uses `\n` for consistency of reading where newlines are placed.

```go
	return nil
}

func (t timeoutMinutes) Type() string {
	return "timeoutMinutes"
```
Contributor: Suggested change — replace `return "timeoutMinutes"` with `return "timeout Minutes"`.

BlaineEXE (Member, Author): It's a type, which can't have spaces in Go.
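For readers unfamiliar with the pattern under discussion: custom flag types like this implement spf13/pflag's `Value` interface, whose `Type()` method returns a single-word name shown in help text. A minimal sketch of an assumed shape for illustration, not the PR's exact code:

```go
package main

import (
	"fmt"
	"strconv"
	"time"

	"github.com/spf13/pflag"
)

// timeoutMinutes stores a timeout given on the command line as whole minutes.
type timeoutMinutes time.Duration

// String renders the current value, e.g. "3m0s".
func (t *timeoutMinutes) String() string { return time.Duration(*t).String() }

// Set parses the user-supplied flag value as a number of minutes.
func (t *timeoutMinutes) Set(v string) error {
	m, err := strconv.Atoi(v)
	if err != nil {
		return fmt.Errorf("invalid minutes value %q: %w", v, err)
	}
	*t = timeoutMinutes(time.Duration(m) * time.Minute)
	return nil
}

// Type is the flag's type name shown in help output; as a Go identifier,
// it cannot contain spaces.
func (t *timeoutMinutes) Type() string { return "timeoutMinutes" }

func main() {
	timeout := timeoutMinutes(3 * time.Minute)
	fs := pflag.NewFlagSet("validation", pflag.ContinueOnError)
	fs.Var(&timeout, "timeout-minutes", "resource timeout in minutes")
	_ = fs.Parse([]string{"--timeout-minutes", "5"})
	fmt.Println(timeout.String()) // prints "5m0s"
}
```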

Resolved review threads: cmd/rook/ctl/ctl.go; design/common/multi-net-multus.md (6 threads); pkg/daemon/multus/resources.go (2 threads)
@BlaineEXE force-pushed the multus-golang-tester branch 4 times, most recently from d74ee32 to 952541e on April 18, 2023
@BlaineEXE dismissed subhamkrai's stale review on April 27, 2023: "Dismissing to get this moving more quickly. Prior comments don't affect functionality and can be addressed in follow-up if necessary."

Documentation/CRDs/Cluster/ceph-cluster-crd.md — resolved review thread

```go
// 3 mons, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,
// (2 csi provisioners, 1 csi plugin) x3 for rbd, cephfs, and nfs CSI drivers
DefaultDaemonsPerNode = 22
```
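(For reference, that default works out as 3 + 3 + 2 + 1 + 1 + 1 + 1 + 1 = 13 Ceph daemons plus (2 + 1) × 3 = 9 CSI pods, totaling 22.)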
Member: All of those daemons aren't typically on the same node, so it seems the default could be lower.

BlaineEXE (Member, Author): Given the headache we've had with multus, false negatives seem a preferable starting point to me over false positives: #12069 (comment)

BlaineEXE (Member, Author): Reduced to 17 per discussion.

```go
Short: "Run a Multus validation test for rook.io",
Long: `
Run a validation test that determines whether the current Multus and system
configurations will support rook.io with Multus.
```
Member: Suggested change — replace "configurations will support rook.io with Multus." with "configurations will support Rook with Multus."

BlaineEXE (Member, Author): IMO, rook.io is the most clear we can be. It's short, and if users are seeing the help text outside of the Rook context for any reason, they can get a quick primer or reminder at the website.

Member: In the recent quest to have consistent docs I think my mindset is in a different place from others on this topic... I see Rook as the name of the product, and rook.io as a website. The user wants to know if the product is compatible with multus in their environment. They're not trying to see if the website is compatible. And if they're running this tool, seems like they would already know where the top-level docs are found. Let's discuss this philosophy more.

BlaineEXE (Member, Author): I opted to use Rook on all subcommands and use Rook (rook.io) in the help for the main rook command.

pkg/daemon/multus/nginx-config.yaml — resolved review thread
```yaml
  annotations:
    k8s.v1.cni.cncf.io/networks: "{{ .NetworksAnnotationValue }}"
spec:
  # TODO: selectors, affinities, tolerations
```
Member: Currently the tool will only test nodes where there are no taints?

BlaineEXE (Member, Author): Right. This was a fine assumption on a brand-new OpenShift cluster, but it's a good future to-do that should also be easy, low-hanging fruit.
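As a sketch of what that low-hanging fruit might look like — with an assumed helper name and field placement, not code from this PR — tolerating all taints would let the validation clients schedule onto every node:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// scheduleAnywhere is a hypothetical helper: an empty-key toleration with
// operator Exists matches every taint, so validation clients would land on
// tainted (e.g. control-plane) nodes as well.
func scheduleAnywhere(spec *corev1.PodSpec) {
	spec.Tolerations = append(spec.Tolerations, corev1.Toleration{
		Operator: corev1.TolerationOpExists,
	})
}

func main() {
	spec := &corev1.PodSpec{}
	scheduleAnywhere(spec)
	fmt.Printf("%+v\n", spec.Tolerations)
}
```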

pkg/daemon/multus/resources.go — resolved review thread

```go
// clients should all become ready within a pretty short amount of time since they all should start
// pretty simultaneously
var flakyThreshold = 20 * time.Second
```
Member: If any image is being pulled, we may need longer than this even if the image is small.

BlaineEXE (Member, Author): That's true. A good future improvement would be to allow this to be tuned, or for the test to ensure the image is pulled on all nodes first. Because there are multiple ways of moving forward, I figured it would be good to defer this for the future.
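One of those possible directions, sketched here with an assumed flag name that is not part of this PR, is simply making the threshold user-tunable:

```go
package main

import (
	"fmt"
	"time"

	"github.com/spf13/pflag"
)

func main() {
	// Hypothetical tunable: users on clusters that must pull the client image
	// could extend the window before slow-starting clients count as flaky.
	flakyThreshold := pflag.Duration("flaky-threshold", 20*time.Second,
		"window within which all clients should become ready")
	pflag.Parse()
	fmt.Println("using flaky threshold:", *flakyThreshold)
}
```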

pkg/daemon/multus/validation.go — resolved review thread
@BlaineEXE force-pushed the multus-golang-tester branch 5 times, most recently from 62ea742 to 53be976 on May 1, 2023
@travisn (Member) left a review: Looks good, just a few final questions that could be discarded or considered for separate PRs...

````diff
@@ -222,7 +222,7 @@ Based on the configuration, the operator will do the following:
   public: rook-ceph/rook-public-nw
 ```

-2. If only the `cluster` selector is specified, the internal cluster traffic* will happen on that network. All other traffic to mons, OSDs, and other daemons will be on the default network.
+2. If only the `cluster` selector is specified, the internal cluster traffic\* will happen on that network. All other traffic to mons, OSDs, and other daemons will be on the default network.
````
Member: The rendering of this looks fine here without the escaping. Where did you say it's not showing up correctly?

BlaineEXE (Member, Author): Not all markdown renderers share the same underlying implementation. This is the correct syntax to use if one desires a literal asterisk character, and the change will ensure it renders correctly with any rendering changes we or GitHub might make.

```diff
@@ -0,0 +1,43 @@
+/*
```
Member: Do we need the userfacing package? Or could it just be cmd/rook/client/client.go?

BlaineEXE (Member, Author) — May 2, 2023: I'll just use the rook.NewContext() command. It's kind of a leftover from the krew work anyway.

```go
var (
	DefaultValidationNamespace = "rook-ceph"

	// 1 mon, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,
```
Member: Shouldn't have two mgrs on the same node in production. Suggested change — replace `// 1 mon, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,` with `// 1 mon, 3 osds, 1 mgr, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,`

```yaml
app.kubernetes.io/instance: "client-{{ .ClientID }}"
app.kubernetes.io/component: "client"
app.kubernetes.io/part-of: "multus-validation-test"
app.kubernetes.io/managed-by: "kubectl-rook-ceph"
```
Member: Managed by the krew plugin?

Contributor: Maybe in the future we can add this with krew, since we are close to the Go transition.

Contributor: But currently, I don't think `kubectl-rook-ceph` is correct.

BlaineEXE (Member, Author): Good point. It seems to me that rook might imply it's managed by the operator. What about rook-cli?

The tool's CLI is designed to be as helpful as possible. Get help text for the multus validation tool like so:

```console
kubectl --namespace rook-ceph exec -it deploy/rook-ceph-operator -- rook ctl multus validation run --help
```
Member: Another approach to consider is to create a sample job manifest that could run the job, similar to our osd-purge.yaml, but perhaps as a separate PR if it makes sense.

BlaineEXE (Member, Author) — May 2, 2023: I'll create a follow-up issue for this. It seems a good way to make sure logs are preserved if the pod running the tool fails. This might be a good time to note that the tool is not idempotent; it will fail if a pre-existing test seems like it might be in progress. This is intentional, and subsequent runs will error with a note about how to run test cleanup if desired.
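For illustration, the non-idempotency guard described above might look roughly like this sketch — the resource name and error wording here are assumptions, not the PR's actual implementation:

```go
package validation

import (
	"context"
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureNoPriorTest fails fast if resources from a previous validation run
// still exist, pointing the user at cleanup instead of clobbering state.
func ensureNoPriorTest(ctx context.Context, clientset kubernetes.Interface, namespace string) error {
	_, err := clientset.CoreV1().ConfigMaps(namespace).Get(ctx, "multus-validation-test-config", metav1.GetOptions{})
	if err == nil {
		return fmt.Errorf("a multus validation test may already be in progress in namespace %q; run cleanup first to start over", namespace)
	}
	if kerrors.IsNotFound(err) {
		return nil // no leftover test resources; safe to proceed
	}
	return fmt.Errorf("failed to check for a pre-existing validation test: %w", err)
}
```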

```
@@ -287,7 +287,22 @@ spec:
* This format is required in order to use the NetworkAttachmentDefinition across namespaces.
* In Openshift, to use a NetworkAttachmentDefinition (NAD) across namespaces, the NAD must be deployed in the `default` namespace. The NAD is then referenced with the namespace: `default/rook-public-nw`

#### Known limitations with Multus
##### Validating Multus configuration
```
Member: This tool seems very useful even beyond multus. Could we generalize it to be a network validation tool, not just specifically for multus? For example, it could test mon and osd ports, or any other general configuration required in the Ceph network config reference. For a separate PR of course...

BlaineEXE (Member, Author): Potentially, but let's talk about what planning for that would look like and which specifics would be good to validate. One of the implied follow-ups that seems more worthwhile to me than port access is load testing. That could be useful for standard networking as well, to make sure the network doesn't crumble under the load of Ceph's replication and client traffic. But we would need some RADOS experts to help us craft a test that creates a reasonable estimate of client and replication traffic.


```yaml
data:
  server.conf: |
    server {
      listen 8080;
```
Contributor: Is 8080 required? Currently I have seen port 8080 conflict with the controller-runtime metrics port.

BlaineEXE (Member, Author): controller-runtime isn't running in the same pod as nginx, so there shouldn't be a conflict. Noted, though, in case we do see an unexpected issue in the future.

And as a note, this is configured to use the commonly-used 8080 instead of 80 because security policies often prevent pods from binding port 80.

The single commit's message:

Add a more involved multus validation test to the Rook binary. Because
this is intended to be end-user runnable, make sure operator-only
commands are hidden.

Build this into the rook binary instead of creating a separate binary
for ease, and because any binary built with the kube api becomes 40+
megabytes. We save quite a bit of space by including this in the Rook
binary, which is good for keeping container layers as small as possible.

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
@BlaineEXE merged commit 5a1d1f2 into rook:master on May 2, 2023 — 49 of 50 checks passed.
@BlaineEXE deleted the multus-golang-tester branch on May 2, 2023 17:37.
travisn added a commit referencing this pull request on May 2, 2023: "test: add multus validation test routine to rook binary (backport #12069)"

BlaineEXE added a commit to BlaineEXE/rook referencing this pull request on May 4, 2023: "Name was accidentally modified in PR rook#12069. Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>"

mergify bot pushed a commit referencing this pull request on May 4, 2023: "Name was accidentally modified in PR #12069. Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com> (cherry picked from commit d38d4d7)"
Labels: backport-release-1.11 · multus · test (unit or integration testing)