
test: add multus validation test routine to rook binary #12069

Merged — 1 commit merged into rook:master on May 2, 2023

Conversation

BlaineEXE (Member):

Add a more involved multus validation test to the Rook binary. Because this is intended to be end-user runnable, make sure operator-only commands are hidden.

Build this into the rook binary instead of creating a separate binary, both for ease and because any binary built against the Kubernetes API grows to 40+ megabytes. We save quite a bit of space by including this in the Rook binary, which is good for keeping container layers as small as possible.
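As a rough illustration of the command hiding mentioned above — a minimal sketch using the spf13/cobra library that Rook's CLI is built on, with command names assumed for illustration rather than taken from this PR:

```go
package main

import "github.com/spf13/cobra"

func main() {
	root := &cobra.Command{Use: "rook"}

	// Operator-only entrypoint: still runnable, but Hidden keeps it out of
	// `rook --help` so end users running the validation tool never see it.
	operatorCmd := &cobra.Command{
		Use:    "operator",
		Hidden: true,
		Run:    func(cmd *cobra.Command, args []string) { /* operator loop */ },
	}

	// End-user-facing command stays visible in the help output.
	multusCmd := &cobra.Command{
		Use:   "multus",
		Short: "Run a Multus validation test for Rook",
	}

	root.AddCommand(operatorCmd, multusCmd)
	_ = root.Execute()
}
```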

Sample output:

```console
2023-04-11 15:57:24.547256 I | multus-validation: starting multus validation test with the following config:
2023-04-11 15:57:24.547357 I | multus-validation:   namespace: "openshift-storage"
2023-04-11 15:57:24.547359 I | multus-validation:   public network: "public-net"
2023-04-11 15:57:24.547361 I | multus-validation:   cluster network: "cluster-net"
2023-04-11 15:57:24.547363 I | multus-validation:   daemons per node: 3
2023-04-11 15:57:24.547366 I | multus-validation:   resource timeout: 3m0s
2023-04-11 15:57:25.022325 I | multus-validation: continuing: no web server network info: pod has no network status yet: cannot find network status
2023-04-11 15:57:29.088309 I | multus-validation: starting 3 client DaemonSets
2023-04-11 15:57:31.454986 I | multus-validation: expecting 9 clients
2023-04-11 15:57:33.669491 I | multus-validation: continuing: number of running clients [7] is not the number expected [9]
2023-04-11 15:57:35.866094 I | multus-validation: all 9 clients are running - but may not be ready
2023-04-11 15:57:38.055692 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:40.247977 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:42.444894 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:44.664455 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:46.855260 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:49.053469 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:51.240288 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:53.450619 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:55.637100 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:57:57.825276 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:00.031465 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:02.245487 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:04.439255 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:06.663148 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:08.854578 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:11.046529 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:13.228783 I | multus-validation: continuing: number of ready clients [0] is not the number expected [9]
2023-04-11 15:58:15.421473 I | multus-validation: continuing: number of ready clients [5] is not the number expected [9]
2023-04-11 15:58:17.632599 I | multus-validation: continuing: number of ready clients [5] is not the number expected [9]
2023-04-11 15:58:19.821952 I | multus-validation: continuing: number of ready clients [7] is not the number expected [9]
2023-04-11 15:58:22.013965 I | multus-validation: all 9 clients are ready

RESULT: multus validation test succeeded!

cleaning up multus validation test resources in namespace "openshift-storage"
multus validation test resources were successfully cleaned up
```

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@BlaineEXE added the test (unit or integration testing) and multus labels on Apr 11, 2023
cmd/rook/ctl/multus/validation/validation.go — 2 resolved review threads
```go
report := results.SuggestedDebuggingReport()

// success/failure message
fmt.Print("\n")
```
Contributor: Suggested change — replace `fmt.Print("\n")` with `fmt.Println()`.

BlaineEXE (Member, Author): The output all uses `\n` for consistency of reading where newlines are placed.

```go
	return nil
}

func (t timeoutMinutes) Type() string {
	return "timeoutMinutes"
```
Contributor: Suggested change — replace `return "timeoutMinutes"` with `return "timeout Minutes"`.

BlaineEXE (Member, Author): It's a type, which can't have spaces in Go.
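For readers unfamiliar with the pattern under discussion: custom flag types like this implement spf13/pflag's `Value` interface, whose `Type()` method returns a single-word name shown in help text. A minimal sketch of an assumed shape for illustration, not the PR's exact code:

```go
package main

import (
	"fmt"
	"strconv"
	"time"

	"github.com/spf13/pflag"
)

// timeoutMinutes stores a timeout given on the command line as whole minutes.
type timeoutMinutes time.Duration

// String renders the current value, e.g. "3m0s".
func (t *timeoutMinutes) String() string { return time.Duration(*t).String() }

// Set parses the user-supplied flag value as a number of minutes.
func (t *timeoutMinutes) Set(v string) error {
	m, err := strconv.Atoi(v)
	if err != nil {
		return fmt.Errorf("invalid minutes value %q: %w", v, err)
	}
	*t = timeoutMinutes(time.Duration(m) * time.Minute)
	return nil
}

// Type is the flag's type name shown in help output; as a Go identifier,
// it cannot contain spaces.
func (t *timeoutMinutes) Type() string { return "timeoutMinutes" }

func main() {
	timeout := timeoutMinutes(3 * time.Minute)
	fs := pflag.NewFlagSet("validation", pflag.ContinueOnError)
	fs.Var(&timeout, "timeout-minutes", "resource timeout in minutes")
	_ = fs.Parse([]string{"--timeout-minutes", "5"})
	fmt.Println(timeout.String()) // prints "5m0s"
}
```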

Resolved review threads: cmd/rook/ctl/ctl.go; design/common/multi-net-multus.md (6 threads); pkg/daemon/multus/resources.go (2 threads)
@BlaineEXE force-pushed the multus-golang-tester branch 4 times, most recently from d74ee32 to 952541e on April 18, 2023
@BlaineEXE dismissed subhamkrai's stale review on April 27, 2023: "Dismissing to get this moving more quickly. Prior comments don't affect functionality and can be addressed in follow-up if necessary."

Documentation/CRDs/Cluster/ceph-cluster-crd.md — resolved review thread

```go
// 3 mons, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,
// (2 csi provisioners, 1 csi plugin) x3 for rbd, cephfs, and nfs CSI drivers
DefaultDaemonsPerNode = 22
```
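(For reference, that default works out as 3 + 3 + 2 + 1 + 1 + 1 + 1 + 1 = 13 Ceph daemons plus (2 + 1) × 3 = 9 CSI pods, totaling 22.)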
Member: All of those daemons aren't typically on the same node, so it seems the default could be lower.

BlaineEXE (Member, Author): Given the headache we've had with multus, false negatives seem a preferable starting point to me over false positives: #12069 (comment)

BlaineEXE (Member, Author): Reduced to 17 per discussion.

```go
Short: "Run a Multus validation test for rook.io",
Long: `
Run a validation test that determines whether the current Multus and system
configurations will support rook.io with Multus.
```
Member: Suggested change — replace "configurations will support rook.io with Multus." with "configurations will support Rook with Multus."

BlaineEXE (Member, Author): IMO, rook.io is the most clear we can be. It's short, and if users are seeing the help text outside of the Rook context for any reason, they can get a quick primer or reminder at the website.

Member: In the recent quest to have consistent docs I think my mindset is in a different place from others on this topic... I see Rook as the name of the product, and rook.io as a website. The user wants to know if the product is compatible with multus in their environment. They're not trying to see if the website is compatible. And if they're running this tool, seems like they would already know where the top-level docs are found. Let's discuss this philosophy more.

BlaineEXE (Member, Author): I opted to use Rook on all subcommands and use Rook (rook.io) in the help for the main rook command.

pkg/daemon/multus/nginx-config.yaml — resolved review thread
```yaml
  annotations:
    k8s.v1.cni.cncf.io/networks: "{{ .NetworksAnnotationValue }}"
spec:
  # TODO: selectors, affinities, tolerations
```
Member: Currently the tool will only test nodes where there are no taints?

BlaineEXE (Member, Author): Right. This was a fine assumption on a brand-new OpenShift cluster, but it's a good future to-do that should also be easy, low-hanging fruit.
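As a sketch of what that low-hanging fruit might look like — with an assumed helper name and field placement, not code from this PR — tolerating all taints would let the validation clients schedule onto every node:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// scheduleAnywhere is a hypothetical helper: an empty-key toleration with
// operator Exists matches every taint, so validation clients would land on
// tainted (e.g. control-plane) nodes as well.
func scheduleAnywhere(spec *corev1.PodSpec) {
	spec.Tolerations = append(spec.Tolerations, corev1.Toleration{
		Operator: corev1.TolerationOpExists,
	})
}

func main() {
	spec := &corev1.PodSpec{}
	scheduleAnywhere(spec)
	fmt.Printf("%+v\n", spec.Tolerations)
}
```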

pkg/daemon/multus/resources.go — resolved review thread

```go
// clients should all become ready within a pretty short amount of time since they all should start
// pretty simultaneously
var flakyThreshold = 20 * time.Second
```
Member: If any image is being pulled, we may need longer than this even if the image is small.

BlaineEXE (Member, Author): That's true. A good future improvement would be to allow this to be tuned, or for the test to ensure the image is pulled on all nodes first. Because there are multiple ways of moving forward, I figured it would be good to defer this for the future.
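One of those possible directions, sketched here with an assumed flag name that is not part of this PR, is simply making the threshold user-tunable:

```go
package main

import (
	"fmt"
	"time"

	"github.com/spf13/pflag"
)

func main() {
	// Hypothetical tunable: users on clusters that must pull the client image
	// could extend the window before slow-starting clients count as flaky.
	flakyThreshold := pflag.Duration("flaky-threshold", 20*time.Second,
		"window within which all clients should become ready")
	pflag.Parse()
	fmt.Println("using flaky threshold:", *flakyThreshold)
}
```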

pkg/daemon/multus/validation.go — resolved review thread
@BlaineEXE force-pushed the multus-golang-tester branch 5 times, most recently from 62ea742 to 53be976 on May 1, 2023
@travisn (Member) left a review: Looks good, just a few final questions that could be discarded or considered for separate PRs...

````diff
@@ -222,7 +222,7 @@ Based on the configuration, the operator will do the following:
   public: rook-ceph/rook-public-nw
 ```

-2. If only the `cluster` selector is specified, the internal cluster traffic* will happen on that network. All other traffic to mons, OSDs, and other daemons will be on the default network.
+2. If only the `cluster` selector is specified, the internal cluster traffic\* will happen on that network. All other traffic to mons, OSDs, and other daemons will be on the default network.
````
Member: The rendering of this looks fine here without the escaping. Where did you say it's not showing up correctly?

BlaineEXE (Member, Author): Not all markdown renderers share the same underlying implementation. This is the correct syntax to use if one desires a literal asterisk character, and the change will ensure it renders correctly with any rendering changes we or GitHub might make.

```diff
@@ -0,0 +1,43 @@
+/*
```
Member: Do we need the userfacing package? Or could it just be cmd/rook/client/client.go?

BlaineEXE (Member, Author) — May 2, 2023: I'll just use the rook.NewContext() command. It's kind of a leftover from the krew work anyway.

```go
var (
	DefaultValidationNamespace = "rook-ceph"

	// 1 mon, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,
```
Member: Shouldn't have two mgrs on the same node in production. Suggested change — replace `// 1 mon, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,` with `// 1 mon, 3 osds, 1 mgr, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,`

```yaml
app.kubernetes.io/instance: "client-{{ .ClientID }}"
app.kubernetes.io/component: "client"
app.kubernetes.io/part-of: "multus-validation-test"
app.kubernetes.io/managed-by: "kubectl-rook-ceph"
```
Member: Managed by the krew plugin?

Contributor: Maybe in the future we can add this with krew, since we are close to the Go transition.

Contributor: But currently, I don't think `kubectl-rook-ceph` is correct.

BlaineEXE (Member, Author): Good point. It seems to me that rook might imply it's managed by the operator. What about rook-cli?

The tool's CLI is designed to be as helpful as possible. Get help text for the multus validation tool like so:

```console
kubectl --namespace rook-ceph exec -it deploy/rook-ceph-operator -- rook ctl multus validation run --help
```
Member: Another approach to consider is to create a sample job manifest that could run the job, similar to our osd-purge.yaml, but perhaps as a separate PR if it makes sense.

BlaineEXE (Member, Author) — May 2, 2023: I'll create a follow-up issue for this. It seems a good way to make sure logs are preserved if the pod running the tool fails. This might be a good time to note that the tool is not idempotent; it will fail if a pre-existing test seems like it might be in progress. This is intentional, and subsequent runs will error with a note about how to run test cleanup if desired.
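For illustration, the non-idempotency guard described above might look roughly like this sketch — the resource name and error wording here are assumptions, not the PR's actual implementation:

```go
package validation

import (
	"context"
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureNoPriorTest fails fast if resources from a previous validation run
// still exist, pointing the user at cleanup instead of clobbering state.
func ensureNoPriorTest(ctx context.Context, clientset kubernetes.Interface, namespace string) error {
	_, err := clientset.CoreV1().ConfigMaps(namespace).Get(ctx, "multus-validation-test-config", metav1.GetOptions{})
	if err == nil {
		return fmt.Errorf("a multus validation test may already be in progress in namespace %q; run cleanup first to start over", namespace)
	}
	if kerrors.IsNotFound(err) {
		return nil // no leftover test resources; safe to proceed
	}
	return fmt.Errorf("failed to check for a pre-existing validation test: %w", err)
}
```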

```
@@ -287,7 +287,22 @@ spec:
* This format is required in order to use the NetworkAttachmentDefinition across namespaces.
* In Openshift, to use a NetworkAttachmentDefinition (NAD) across namespaces, the NAD must be deployed in the `default` namespace. The NAD is then referenced with the namespace: `default/rook-public-nw`

#### Known limitations with Multus
##### Validating Multus configuration
```
Member: This tool seems very useful even beyond multus. Could we generalize it to be a network validation tool, not just specifically for multus? For example, it could test mon and osd ports, or any other general configuration required in the Ceph network config reference. For a separate PR of course...

BlaineEXE (Member, Author): Potentially, but let's talk about what planning for that would look like and which specifics would be good to validate. One of the implied follow-ups that seems more worthwhile to me than port access is load testing. That could be useful for standard networking as well, to make sure the network doesn't crumble under the load of Ceph's replication and client traffic. But we would need some RADOS experts to help us craft a test that creates a reasonable estimate of client and replication traffic.


```yaml
data:
  server.conf: |
    server {
      listen 8080;
```
Contributor: Is 8080 required? Currently I have seen port 8080 conflict with the controller-runtime metrics port.

BlaineEXE (Member, Author): controller-runtime isn't running in the same pod as nginx, so there shouldn't be a conflict. Noted, though, in case we do see an unexpected issue in the future.

And as a note, this is configured to use the commonly-used 8080 instead of 80 because security policies often prevent pods from binding port 80.

The single commit's message:

Add a more involved multus validation test to the Rook binary. Because
this is intended to be end-user runnable, make sure operator-only
commands are hidden.

Build this into the rook binary instead of creating a separate binary
for ease, and because any binary built with the kube api becomes 40+
megabytes. We save quite a bit of space by including this in the Rook
binary, which is good for keeping container layers as small as possible.

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
@BlaineEXE merged commit 5a1d1f2 into rook:master on May 2, 2023 — 49 of 50 checks passed.
@BlaineEXE deleted the multus-golang-tester branch on May 2, 2023 17:37.
travisn added a commit referencing this pull request on May 2, 2023: "test: add multus validation test routine to rook binary (backport #12069)"

BlaineEXE added a commit to BlaineEXE/rook referencing this pull request on May 4, 2023: "Name was accidentally modified in PR rook#12069. Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>"

mergify bot pushed a commit referencing this pull request on May 4, 2023: "Name was accidentally modified in PR #12069. Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com> (cherry picked from commit d38d4d7)"
Labels: backport-release-1.11 · multus · test (unit or integration testing)