ceph: run CSI pods in multus-connected host network namespace #8686
Conversation
Force-pushed d0af72b to ba2b882
This pull request has merge conflicts that must be resolved before it can be merged. @renan-campos please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork
Force-pushed ba2b882 to c9c1456
A couple of docs to update:
- The multus design doc
- Are any documentation updates needed for user configuration, or is this all under the covers?
allowPrivilegeEscalation: true
privileged: true
runAsUser: 0
capabilities:
These capabilities are needed in addition to privileged?
pkg/operator/ceph/csi/spec.go
Outdated
if err != nil {
	return errors.Wrapf(err, "failed to get daemonset template to set up the csi network")
}
err = k8sutil.CreateDaemonSet(csiMultusSetup, namespace, clientset, multusHostSetup)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will also need the ability to apply nodeAffinity, tolerations, and resource limits. See the other CSI settings for example.
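For illustration, a minimal sketch of what that could look like for the new daemonset, modeled on the other CSI daemonsets; the function name, toleration/affinity keys, and resource limit below are placeholders, not part of the PR:

```go
package csi

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyMultusPlacement applies tolerations, node affinity, and resource limits
// to the multus setup daemonset, the same knobs the other CSI daemonsets honor.
// The literal values stand in for whatever the operator's CSI settings provide.
func applyMultusPlacement(ds *appsv1.DaemonSet) {
	podSpec := &ds.Spec.Template.Spec
	podSpec.Tolerations = []corev1.Toleration{
		{Key: "storage-node", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoSchedule},
	}
	podSpec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{
						{Key: "role", Operator: corev1.NodeSelectorOpIn, Values: []string{"storage-node"}},
					},
				}},
			},
		},
	}
	for i := range podSpec.Containers {
		podSpec.Containers[i].Resources = corev1.ResourceRequirements{
			Limits: corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("256Mi")},
		}
	}
}
```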
This pull request has merge conflicts that must be resolved before it can be merged. @renan-campos please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork
Force-pushed c9c1456 to 3b771c5
Force-pushed 3e5bdc6 to e158db4
pkg/operator/ceph/csi/spec.go
Outdated
// Only looking for ceph clusters running on the same namespace as the operator.
// This appears to be the only supported configuration. (See call to applyCephClusterNetworkConfig)
cephClusters, err := r.context.RookClientset.CephV1().CephClusters(r.opConfig.OperatorNamespace).List(ctx, metav1.ListOptions{})
Don't assume the op namespace. The cephcluster is listed earlier in reconcile(), see if you can carry it.
Instead of carrying it, I just made the same client call in the function.
Force-pushed 93a39a5 to de9974b
I have addressed the present comments, and manually tested my solution on a local CRC cluster. I put the changes in separate commits, but will squash as necessary when it is ready to merge. Please re-review at your convenience. Thank you.
I don't think I can speak a lot to the inner workings of this. I trust that your testing will reveal any issues. Mostly, I have flagged things that would make the code follow Rook's internal patterns more closely. Notably, we really want evocative error messages that:
- help the user fix any problems that may be a user configuration issue
- help us developers find the exact place in the code where an error has come from based on returned error text
- contain enough information about the current state of execution that helps us understand what might be going wrong
pkg/daemon/ceph/multus/setup.go
Outdated
logger.Info("determining multus ip")
multusIpStr, found := os.LookupEnv(multusIpEnv)
if !found {
	return fmt.Errorf("environment variable %s not set.", multusIpEnv)
}
logger.Infof("multus ip: %s", multusIpStr)

logger.Info("determining multus link")
multusLinkName, found := os.LookupEnv(multusLinkEnv)
if !found {
	return fmt.Errorf("environment variable %s not set.", multusLinkEnv)
}
logger.Infof("multus link: %s", multusLinkName)
Logging is good, but I'm not sure the messages like "determining multus ip" and "determining multus link" are particularly helpful if we are also printing the results of the steps once they are complete.
I think maybe an initial message would be good "setting up multus" or something similar, and then outputting info about the setup results as it goes.
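As an illustration of the suggested flow, a minimal sketch; only the `multusIpEnv` and `multusLinkEnv` identifiers come from the quoted snippet, while the logger construction and the environment variable values are assumptions:

```go
package multus

import (
	"os"

	"github.com/coreos/pkg/capnslog"
	"github.com/pkg/errors"
)

var logger = capnslog.NewPackageLogger("github.com/rook/rook", "multus")

// The env var names are placeholders; only the constant identifiers match the quoted snippet.
const (
	multusIpEnv   = "MULTUS_IP"
	multusLinkEnv = "MULTUS_LINK"
)

// readMultusEnv logs one "setting up" message, then a single summary of the
// inputs, instead of a "determining ..." message per step.
func readMultusEnv() (string, string, error) {
	logger.Info("setting up multus")
	multusIP, found := os.LookupEnv(multusIpEnv)
	if !found {
		return "", "", errors.Errorf("environment variable %q is not set", multusIpEnv)
	}
	multusLink, found := os.LookupEnv(multusLinkEnv)
	if !found {
		return "", "", errors.Errorf("environment variable %q is not set", multusLinkEnv)
	}
	logger.Infof("multus setup inputs: ip=%q, link=%q", multusIP, multusLink)
	return multusIP, multusLink, nil
}
```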
pkg/daemon/ceph/multus/setup.go
Outdated
logger.Info("determining multus ip")
multusIpStr, found := os.LookupEnv(multusIpEnv)
if !found {
	return fmt.Errorf("environment variable %s not set.", multusIpEnv)
We try to use errors.Errorf and not fmt.Errorf
pkg/daemon/ceph/multus/utils.go
Outdated
logger.Info("finding the multus network connected link")
link, err := netlink.LinkByName(multusLinkName)
if err != nil {
	return err
Please add context to all returned error messages with errors.Wrapf(err, "message %s", stuff). Often, utilities Rook uses don't provide useful information for debugging, and we want to make sure from an error message that we can trace exactly where in the code the error originated.
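For example, a sketch of that wrapping applied to the quoted netlink call (github.com/vishvananda/netlink and github.com/pkg/errors; the function name is illustrative):

```go
package multus

import (
	"github.com/pkg/errors"
	"github.com/vishvananda/netlink"
)

// findMultusLink wraps the netlink error with context naming this step and the link.
func findMultusLink(multusLinkName string) (netlink.Link, error) {
	link, err := netlink.LinkByName(multusLinkName)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to find the multus-connected link %q", multusLinkName)
	}
	return link, nil
}
```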
pkg/daemon/ceph/multus/utils.go
Outdated
logger.Info("determining the IP address of the multus link")
addrs, err := netlink.AddrList(link, 0)
if err != nil {
	log.Fatal(err)
Why Fatal here? That's not how we have patterned our error handling in Rook. We want to be able to gracefully recover from failures and retry if it makes sense.
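A sketch of the alternative: return a wrapped error instead of exiting, so the caller can decide whether to retry or surface the failure (helper name is illustrative):

```go
package multus

import (
	"github.com/pkg/errors"
	"github.com/vishvananda/netlink"
)

// linkAddresses returns a wrapped error rather than calling log.Fatal,
// keeping recovery and retry decisions with the caller.
func linkAddresses(link netlink.Link) ([]netlink.Addr, error) {
	addrs, err := netlink.AddrList(link, 0) // 0 means all address families
	if err != nil {
		return nil, errors.Wrapf(err, "failed to list addresses on link %q", link.Attrs().Name)
	}
	return addrs, nil
}
```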
Force-pushed de9974b to 14fd634
Force-pushed 14fd634 to 521c9cf
Just a few thoughts :)
@@ -88,6 +88,7 @@ users:
- system:serviceaccount:rook-ceph:rook-csi-rbd-provisioner-sa # serviceaccount:namespace:operator
- system:serviceaccount:rook-ceph:rook-csi-cephfs-plugin-sa # serviceaccount:namespace:operator
- system:serviceaccount:rook-ceph:rook-csi-cephfs-provisioner-sa # serviceaccount:namespace:operator
- system:serviceaccount:rook-ceph:rook-ceph-multus
Let's add this to the rook scc (about line 45 above) instead of this csi scc
But I need SYS_ADMIN and NET_ADMIN capabilities, which the rook-ceph scc doesn't provide.
Since the service account is really related to the csi driver, how about we name it like this?
- system:serviceaccount:rook-ceph:rook-ceph-multus
- system:serviceaccount:rook-ceph:rook-csi-multus
pkg/operator/ceph/csi/spec.go
Outdated
if cephCluster.Spec.Network.IsMultus() {
	if publicNetwork, ok = cephCluster.Spec.Network.Selectors["public"]; !ok {
		logger.Debug("public network not provided; not performing multus configuration.")
		return nil
Shouldn't we continue in case another cluster is using multus?
return nil
continue
pkg/operator/ceph/csi/spec.go
Outdated
}

// Populate the host network namespace with a multus-connected interface.
tp := templateParam{
nit: it's preferred to give vars readable names and reserve the single letter names for the func receivers, loops, or other similarly simple places.
tp := templateParam{
template := templateParam{
pkg/operator/ceph/csi/spec.go
Outdated
	cleanupError = true
	continue
}
err = k8sutil.WaitForJobCompletion(r.context.Clientset, cleanupJob, 30*time.Second)
Shouldn't this wait for job be done in the goroutine below? Then the jobs would be deployed on all applicable nodes and wait for them as a group.
Yes, thank you for catching this.
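A possible shape for that change, waiting on the group of jobs from goroutines; `k8sutil.WaitForJobCompletion` matches the quoted call, while the surrounding names are illustrative:

```go
package csi

import (
	"sync"
	"time"

	"github.com/pkg/errors"
	"github.com/rook/rook/pkg/operator/k8sutil"
	batch "k8s.io/api/batch/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForCleanupJobs starts one waiter per job and blocks until the whole
// group has finished, rather than waiting serially before deploying the next job.
func waitForCleanupJobs(clientset kubernetes.Interface, cleanupJobs []*batch.Job) error {
	var wg sync.WaitGroup
	errCh := make(chan error, len(cleanupJobs))
	for _, job := range cleanupJobs {
		wg.Add(1)
		go func(job *batch.Job) {
			defer wg.Done()
			if err := k8sutil.WaitForJobCompletion(clientset, job, 30*time.Second); err != nil {
				errCh <- errors.Wrapf(err, "cleanup job %q did not complete", job.Name)
			}
		}(job)
	}
	wg.Wait()
	close(errCh)
	for err := range errCh {
		return err // report the first failure; the others are similar
	}
	return nil
}
```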
pkg/operator/ceph/csi/spec.go
Outdated
}

for _, cephCluster := range cephClusters.Items {
	// Assuming one CephCluster in the r.opConfig.OperatorNamespace
The cephcluster is not necessarily in the operator namespace. How about if we just iterate through all cephclusters? There is a ROOK_CURRENT_NAMESPACE_ONLY setting in the operator that we should also honor. If that's true, then we can assume it's in the same namespace as the operator.
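A sketch of what honoring that setting could look like when listing the clusters; the clientset and field names mirror the quoted code, the ROOK_CURRENT_NAMESPACE_ONLY name comes from this comment, and the exact lookup is illustrative:

```go
// Honor ROOK_CURRENT_NAMESPACE_ONLY: list CephClusters everywhere unless the
// operator is restricted to its own namespace (fragment of the reconcile code).
namespace := corev1.NamespaceAll
if os.Getenv("ROOK_CURRENT_NAMESPACE_ONLY") == "true" {
	namespace = r.opConfig.OperatorNamespace
}
cephClusters, err := r.context.RookClientset.CephV1().CephClusters(namespace).List(ctx, metav1.ListOptions{})
if err != nil {
	return errors.Wrap(err, "failed to list ceph clusters")
}
```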
I am iterating through all of the cephclusters, I missed removing that comment. I gather all the cephClusters on line 704.
pkg/operator/ceph/csi/spec.go
Outdated
	return errors.Wrap(err, "failed to list ceph clusters")
}

for _, cephCluster := range cephClusters.Items {
Detecting whether multus is needed appears in two different places. How about factoring it into a common method?
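For example, a hypothetical shared helper (not part of the PR) that both call sites could use:

```go
// clustersNeedMultus reports whether any CephCluster requests multus networking.
// cephv1 is github.com/rook/rook/pkg/apis/ceph.rook.io/v1; the helper name is illustrative.
func clustersNeedMultus(cephClusters []cephv1.CephCluster) bool {
	for _, cephCluster := range cephClusters {
		if cephCluster.Spec.Network.IsMultus() {
			return true
		}
	}
	return false
}
```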
design/ceph/multus-network.md
Outdated
If an error occurs in any of the migration jobs, the operator runs a teardown job on all of the nodes where migration was to have occurred. The teardown job runs on the host network namespace and is passed the multus IP of that node. It searches for an interface in the host namespace with the multus IP and removes it. If there is no such interface present, the job is considered complete.

Restarting the node will cause the multus interface in the host namespace to go away. The holder pod will once again have the interface, and will be remigrated once the job runs again on the node.
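A sketch of the teardown search described above, assuming the vishvananda/netlink library used elsewhere in this PR; it is not necessarily the PR's implementation, and it treats "no matching interface" as success:

```go
package multus

import (
	"net"

	"github.com/pkg/errors"
	"github.com/vishvananda/netlink"
)

// teardownMigratedInterface looks for a host-namespace link that carries the
// given multus IP and removes it; finding nothing means there is nothing to
// tear down, so the job is considered complete.
func teardownMigratedInterface(multusIP net.IP) error {
	links, err := netlink.LinkList()
	if err != nil {
		return errors.Wrap(err, "failed to list links in the host network namespace")
	}
	for _, link := range links {
		addrs, err := netlink.AddrList(link, 0)
		if err != nil {
			return errors.Wrapf(err, "failed to list addresses on link %q", link.Attrs().Name)
		}
		for _, addr := range addrs {
			if addr.IP.Equal(multusIP) {
				if err := netlink.LinkDel(link); err != nil {
					return errors.Wrapf(err, "failed to remove link %q", link.Attrs().Name)
				}
				return nil
			}
		}
	}
	return nil
}
```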
Is the operator watching for the restart of all nodes to trigger this job?
Good point. It is not. When I tested this out, the operator was on the node I restarted so it restarted as well and re-ran all the jobs.
In the latest commit, the csi controller watches the multus daemonset. Therefore the reconcile loop will be triggered when the node restarts.
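For reference, one way such a watch is typically wired with the controller-runtime API of that era (a generic sketch, not necessarily the PR's exact code); daemonset events then requeue the CSI reconcile, for example after a node restart recreates the holder pod:

```go
// Watch the multus daemonset from the CSI controller so changes to it
// retrigger reconciliation (c is the controller-runtime controller).
err = c.Watch(
	&source.Kind{Type: &appsv1.DaemonSet{}},
	&handler.EnqueueRequestForObject{},
)
if err != nil {
	return errors.Wrap(err, "failed to watch the multus daemonset")
}
```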
design/ceph/multus-network.md
Outdated
We can add annotations to these pods and they can reach out the Ceph public network, then the driver will expose the block or the filesystem normally.
The CSI pods will run in the node's host network namespace.

When deploying a CephCluster resource configured to use multus networks, a multus-connected network interface will be added to the host network namespace of all nodes that will run CSI plugin pods.
After the cluster is started, what happens when a new node is added to the cluster? Will it be a problem that the daemonset for the csi volume plugins will start before the job is triggered to configure the multus interface?
The csi plugin won't work properly if the job that migrates the multus interface isn't run first. The code doesn't handle the addition of new nodes; I need to add this.
My latest push adds this functionality. When a new node is added, the multus holder pod will be deployed on it because of the daemonset definition; the csi reconcile loop will then deploy the interface migration job on that node.
Force-pushed 521c9cf to 45dac40
The controller deploys a daemonset that holds the multus interface and controls the jobs responsible for migrating the interface to the host network namespace. When the daemonset pod is deleted, it deploys a teardown job to remove the migrated interface from the hostnet. This code has been tested on a single-node CRC cluster.
Signed-off-by: Renan Campos <rcampos@redhat.com>
build/rbac/rbac.yaml
Outdated
---
# Service account for multus actions related to the csi
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-csi-multus
  namespace: rook-ceph
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rook-csi-multus
  namespace: rook-ceph # namespace:operator
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "update"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "delete"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rook-csi-multus
  namespace: rook-ceph # namespace:operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-csi-multus
subjects:
  - kind: ServiceAccount
    name: rook-csi-multus
    namespace: rook-ceph # namespace:operator
The multus daemonset creates jobs for interface migration and cleanup. Information about the interface being migrated is gathered and placed in the pod metadata. This is the reason for this additional service account.
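A sketch of why the role needs "get" and "update" on pods: the holder records the migrated interface in its own pod metadata, roughly like this (the annotation key and helper name are hypothetical):

```go
package multus

import (
	"context"

	"github.com/pkg/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recordMigratedInterface stores the migrated interface name on the holder pod
// so the migration and cleanup jobs can read it later.
func recordMigratedInterface(ctx context.Context, clientset kubernetes.Interface, namespace, podName, linkName string) error {
	pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return errors.Wrapf(err, "failed to get pod %q", podName)
	}
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations["multus.rook.io/migrated-interface"] = linkName // hypothetical key
	if _, err := clientset.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		return errors.Wrapf(err, "failed to update pod %q", podName)
	}
	return nil
}
```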
@@ -702,6 +702,7 @@ spec:
  allowedCapabilities:
    # required by CSI
    - SYS_ADMIN
    - NET_ADMIN
Pod Security Policy is a deprecated resource, and I don't know if it is used. However, the pods running the multus migration need the NET_ADMIN capability to perform network-related operations. This capability is added to the appropriate SecurityContextConstraint below, and added here so that they match.
multusControllerCmd,
multusSetupCmd,
multusTeardownCmd,
multusControllerCmd runs a daemon that creates a job to migrate a multus interface into the host network namespace. If the daemon receives a SIGTERM, a cleanup job is run. multusSetupCmd is the migration job; multusTeardownCmd is the teardown job.
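A sketch of that controller flow; `multus.RunTeardownJob`, `k8sClient`, and `jobParams` appear in the quoted code below, while the signal wiring here is illustrative:

```go
// Fragment: the controller daemon blocks until SIGTERM, then launches the
// cleanup job before the pod's termination grace period runs out.
signalChan := make(chan os.Signal, 1)
signal.Notify(signalChan, syscall.SIGTERM)
<-signalChan
logger.Info("received SIGTERM; running multus teardown job")
if err := multus.RunTeardownJob(k8sClient, jobParams); err != nil {
	logger.Errorf("failed to run teardown job: %v", err)
}
```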
Thank you for your patience, this PR is ready for re-review.
Force-pushed f34d22d to 69539bb
It's been a while since we discussed this feature. Can you join the Rook huddle soon or set up another meeting to discuss? Thanks!
	return nil
}

if err := opcontroller.AddFinalizerWithNameIfNotPresent(r.opManagerContext, r.client, &cephCluster, multusFinalizer); err != nil {
Even if there are multiple resources to cleanup when a cluster CR is deleted, we only need a single finalizer so the operator can trigger all the cleanup code, so I don't believe we need this new finalizer.
	}
}

if !cluster.DeletionTimestamp.IsZero() {
Instead of triggering the teardown from the deletion timestamp, how about from within the stopDrivers() method when the entire csi driver is also stopped? There is just one daemonset to remove from the same namespace as the operator, right? If so, then no need to trigger this from the cluster CR.
logger.Errorf("failed to run teardown job: %v", err)
// Sleep so that the developer can view the log before the pod is destroyed.
// Pods being deleted have a grace period of 30 seconds after the SIGTERM.
time.Sleep(30 * time.Second)
Let's not include a sleep for production clusters
<-signalChan
logger.Info("Running teardown job")
logger.Infof("Removing multus interface %q", jobParams.MigratedInterface)
err = multus.RunTeardownJob(k8sClient, jobParams)
Jobs are very heavy-weight, so running a job during pod teardown just doesn't seem right. Let's discuss.
Yes, I'll make diagrams and join tomorrow's huddle to discuss if there is time; if not, we can pick a time for a separate meeting then.
This pull request has merge conflicts that must be resolved before it can be merged. @renan-campos please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork
I'm closing this PR. I don't have the time, competence, or desire for this to succeed.
When using multus networks, I/O operations on CephCluster volumes are blocked
when the node's CSI plugin pods are restarted. A similar bug was found
when using the default network configuration, and was solved by moving
the CSI plugin pods to the host network namespace. By being on the host
network namespace, the restarted CSI pods have the same network
configuration as their predecessors and reconnect to the volume. This
PR applies the same principle to solve the issue for multus
configurations by placing a multus-connected interface in the host
network namespace, and running the CSI plugin pods there.
To add a multus-connected interface to the host namespace, a new
daemonset is deployed when setting up the CSI drivers. This daemonset is
connected to the multus network and starts a privileged initContainer
that takes the multus-connected interface and moves it to the host
network namespace. This daemonset then runs indefinitely to keep the IP
address reserved. The program running in the daemonset pods sleeps continuously
to consume as few resources as possible.
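For illustration, a minimal sketch of the interface-migration step, under the assumption that the job uses the vishvananda/netlink and vishvananda/netns libraries; the actual implementation may differ. Note that moving a link between namespaces typically clears its IP addresses, which is presumably why the job is also passed the multus IP to re-apply afterward.

```go
package multus

import (
	"github.com/pkg/errors"
	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// moveLinkToHostNamespace moves the multus-connected link into the host
// network namespace, addressed here via PID 1; this requires hostPID plus
// NET_ADMIN/SYS_ADMIN, hence the privileged initContainer.
func moveLinkToHostNamespace(linkName string) error {
	link, err := netlink.LinkByName(linkName)
	if err != nil {
		return errors.Wrapf(err, "failed to find multus link %q", linkName)
	}
	hostNs, err := netns.GetFromPid(1)
	if err != nil {
		return errors.Wrap(err, "failed to get the host network namespace")
	}
	defer hostNs.Close()
	if err := netlink.LinkSetNsFd(link, int(hostNs)); err != nil {
		return errors.Wrapf(err, "failed to move link %q to the host network namespace", linkName)
	}
	return nil
}
```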
Signed-off-by: Renan Campos <rcampos@redhat.com>
Description of your changes:
Which issue is resolved by this Pull Request:
Resolves #
Checklist:
Code generation (make codegen) has been run to update object specifications, if necessary.