test: add multus validation test routine to rook binary #12069
Conversation
Force-pushed from 3a21355 to 6679a2c.
```go
report := results.SuggestedDebuggingReport()

// success/failure message
fmt.Print("\n")
```
Suggested change:

```diff
-fmt.Print("\n")
+fmt.Println()
```
The output uses `\n` everywhere for consistency, so it's easy to read where the newlines are.
```go
	return nil
}

func (t timeoutMinutes) Type() string {
	return "timeoutMinutes"
```
Suggested change:

```diff
-return "timeoutMinutes"
+return "timeout Minutes"
```
It's a type, which can't have spaces in Go
Force-pushed from 6679a2c to f578e36.
Force-pushed from f578e36 to a8492a1.
Force-pushed from d74ee32 to 952541e.
Dismissing to get this moving more quickly. Prior comments don't affect functionality and can be addressed in follow-up if necessary.
```go
// 3 mons, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,
// (2 csi provisioners, 1 csi plugin) x3 for rbd, cephfs, and nfs CSI drivers
DefaultDaemonsPerNode = 22
```
All of those daemons aren't typically on the same node, so it seems the default could be lower
Given the headache we've had with multus, false negatives seem like a preferable starting point to me over false positives: #12069 (comment)
Reduced to 17 per discussion
```go
Short: "Run a Multus validation test for rook.io",
Long: `
Run a validation test that determines whether the current Multus and system
configurations will support rook.io with Multus.
```
Suggested change:

```diff
-configurations will support rook.io with Multus.
+configurations will support Rook with Multus.
```
IMO, rook.io is the clearest we can be. It's short, and if users are seeing the help text outside of the Rook context for any reason, they can get a quick primer or reminder at the website.
In the recent quest to have consistent docs I think my mindset is in a different place from others on this topic... I see Rook as the name of the product, and rook.io as a website. The user wants to know if the product is compatible with multus in their environment. They're not trying to see if the website is compatible. And if they're running this tool, seems like they would already know where the top level docs are found. Let's discuss this philosophy more.
I opted to use `Rook` on all subcommands and use `Rook (rook.io)` in the help for the main `rook` command.
```yaml
  annotations:
    k8s.v1.cni.cncf.io/networks: "{{ .NetworksAnnotationValue }}"
spec:
  # TODO: selectors, affinities, tolerations
```
Currently the tool will only test nodes where there are no taints?
Right. This was a fine assumption on a brand new OpenShift cluster, but it is a good future to-do that should also be easy low-hanging fruit.
```go
// clients should all become ready within a pretty short amount of time since they all should start
// pretty simultaneously
var flakyThreshold = 20 * time.Second
```
If any image is being pulled, we may need longer than this even if the image is small.
That's true. A good future improvement would be to allow this to be tuned also, or for the test to ensure the binary is pulled on all nodes first. Because of the multiple ways of moving forward, I figured it would be good to defer for the future.
Force-pushed from 62ea742 to 53be976.
Force-pushed from 53be976 to 5e296a9.
Looks good, just a few final questions that could be discarded or considered for separate PRs...
````diff
@@ -222,7 +222,7 @@ Based on the configuration, the operator will do the following:
     public: rook-ceph/rook-public-nw
 ```
 
-2. If only the `cluster` selector is specified, the internal cluster traffic* will happen on that network. All other traffic to mons, OSDs, and other daemons will be on the default network.
+2. If only the `cluster` selector is specified, the internal cluster traffic\* will happen on that network. All other traffic to mons, OSDs, and other daemons will be on the default network.
````
The rendering of this looks fine here without the escaping. Where did you say it's not showing up correctly?
Not all markdown renderers use the same underlying implementation. This is the correct syntax to use if one desires a literal asterisk character, and the change will ensure it renders correctly with any rendering changes we or GitHub might make.
cmd/rook/userfacing/client/client.go (outdated)

```diff
@@ -0,0 +1,43 @@
+/*
```
Do we need the `userfacing` package? Or could it just be `cmd/rook/client/client.go`?
I'll just use the `rook.NewContext()` command. It's kind of a leftover from the krew work anyway.
```go
var (
	DefaultValidationNamespace = "rook-ceph"

	// 1 mon, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,
```
Shouldn't have two mgrs on the same node in production
Suggested change:

```diff
-// 1 mon, 3 osds, 2 mgrs, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,
+// 1 mon, 3 osds, 1 mgr, 1 mds, 1 nfs, 1 rgw, 1 rbdmirror, 1 cephfsmirror,
```
```yaml
    app.kubernetes.io/instance: "client-{{ .ClientID }}"
    app.kubernetes.io/component: "client"
    app.kubernetes.io/part-of: "multus-validation-test"
    app.kubernetes.io/managed-by: "kubectl-rook-ceph"
```
Managed by the krew plugin?
Maybe in the future we can add this with krew, since we are close to the Go transition.
But currently, I don't think `kubectl-rook-ceph` is correct.
Good point. It seems to me that `rook` might imply it's managed by the operator. What about `rook-cli`?
The tool's CLI is designed to be as helpful as possible. Get help text for the multus validation tool like so:

```console
kubectl --namespace rook-ceph exec -it deploy/rook-ceph-operator -- rook ctl multus validation run --help
```
Another approach to consider is to create a sample job manifest that could run the job, similar to our osd-purge.yaml, but perhaps as a separate PR if it makes sense.
I'll create a follow-up issue for this. It seems a good way to make sure logs are preserved if the pod running the tool fails. This might be a good time to note that the tool is not idempotent; it will fail if a pre-existing test seems like it might be in progress. This is intentional, and subsequent runs will error with a note of how to run test cleanup if desired.
```diff
@@ -287,7 +287,22 @@ spec:
 * This format is required in order to use the NetworkAttachmentDefinition across namespaces.
 * In Openshift, to use a NetworkAttachmentDefinition (NAD) across namespaces, the NAD must be deployed in the `default` namespace. The NAD is then referenced with the namespace: `default/rook-public-nw`
 
+#### Known limitations with Multus
+##### Validating Multus configuration
```
This tool seems very useful even beyond multus. Could we generalize it to be a network validation tool, not just specifically for multus? For example, it could test mon and osd ports, or any other general configuration required in the Ceph network config reference. For a separate PR of course...
I guess potentially, but let's talk about what planning for that would look like and what specifics would be good to validate. One of the implied follow-ups that seems more worthwhile to me than port access is to do load testing. That could be useful for standard networking as well, to make sure the network doesn't crumble under the load of Ceph's replication+client traffic. But we would need some RADOS experts to help us craft a test that creates a reasonable estimate of client+replication traffic.
```yaml
data:
  server.conf: |
    server {
        listen 8080;
```
Is 8080 required? Currently I have seen port 8080 conflict with the controller-runtime metrics port.
controller runtime isn't running in the same pod as nginx, so there shouldn't be a conflict. Noted though, in case we do see an unexpected issue in the future.
And as a note, this is configured to use the commonly-used 8080 instead of 80 because security policies often prevent pods running with port 80.
Add a more involved multus validation test to the Rook binary. Because this is intended to be end-user runnable, make sure operator-only commands are hidden.

Build this into the rook binary instead of creating a separate binary for ease, and because any binary built with the kube api becomes 40+ megabytes. We save quite a bit of space by including this in the Rook binary, which is good for keeping container layers as small as possible.

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
Force-pushed from 5e296a9 to 0f6e7ee.
test: add multus validation test routine to rook binary (backport #12069)
Name was accidentally modified in PR rook#12069. Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
Add a more involved multus validation test to the Rook binary. Because this is intended to be end-user runnable, make sure operator-only commands are hidden.
Build this into the rook binary instead of creating a separate binary for ease, and because any binary built with the kube api becomes 40+ megabytes. We save quite a bit of space by including this in the Rook binary, which is good for keeping container layers as small as possible.