
ceph: run CSI pods in multus-connected host network namespace #8686

Closed
wants to merge 29 commits

Conversation

renan-campos (Contributor)

When using multus networks, I/O operations on CephCluster volumes are blocked
when the node's CSI plugin pods are restarted. A similar bug was found
when using the default network configuration, and was solved by moving
the CSI plugin pods to the host network namespace. Because they are in
the host network namespace, restarted CSI pods have the same network
configuration as their predecessors and can reconnect to the volumes.
This PR applies the same principle to solve the issue for multus
configurations by placing a multus-connected interface in the host
network namespace, and running the CSI plugin pods there.

To add a multus-connected interface to the host namespace, a new
daemonset is deployed when setting up the CSI drivers. This daemonset is
connected to the multus network, and starts a privileged initContainer
that moves the multus-connected interface into the host network
namespace. The daemonset then runs indefinitely to keep the IP address
reserved; the program running in its pods sleeps continuously so that it
consumes as few resources as possible.
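
For illustration only, here is a minimal sketch of the interface move the initContainer performs, using the vishvananda netlink/netns packages that the PR's code appears to rely on. Reaching the host namespace through /proc/1/ns/net is an assumption of this sketch, not necessarily how the PR does it:

package multus

import (
	"github.com/pkg/errors"
	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// moveLinkToHostNS moves the named multus interface from the pod's network
// namespace into the host network namespace. The host namespace is assumed to
// be reachable at /proc/1/ns/net (e.g. via hostPID or a hostPath mount).
func moveLinkToHostNS(linkName string) error {
	link, err := netlink.LinkByName(linkName)
	if err != nil {
		return errors.Wrapf(err, "failed to find multus link %q", linkName)
	}
	hostNS, err := netns.GetFromPath("/proc/1/ns/net")
	if err != nil {
		return errors.Wrap(err, "failed to open the host network namespace")
	}
	defer hostNS.Close()
	// Moving a link clears its addresses, so the caller must re-apply the
	// multus-assigned IP on the host side after the move.
	if err := netlink.LinkSetNsFd(link, int(hostNS)); err != nil {
		return errors.Wrapf(err, "failed to move link %q to the host network namespace", linkName)
	}
	return nil
}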

Signed-off-by: Renan Campos rcampos@redhat.com

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

mergify bot commented Sep 9, 2021

This pull request has merge conflicts that must be resolved before it can be merged. @renan-campos please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork

@travisn left a comment (Member)

A couple of docs to update:

allowPrivilegeEscalation: true
privileged: true
runAsUser: 0
capabilities:

Member

These capabilities are needed in addition to privileged?

if err != nil {
    return errors.Wrapf(err, "failed to get daemonset template to set up the csi network")
}
err = k8sutil.CreateDaemonSet(csiMultusSetup, namespace, clientset, multusHostSetup)

Member

We will also need the ability to apply nodeAffinity, tolerations, and resource limits. See the other CSI settings for example.
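
For example, a hypothetical helper (not the PR's code) could apply these settings to the parsed daemonset template before k8sutil.CreateDaemonSet is called; the values shown are placeholders:

package csi

import (
	apps "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyMultusSetupPlacement is a hypothetical helper showing how node
// affinity, tolerations, and resource limits could be applied to the multus
// setup daemonset, mirroring what the other CSI daemonsets support.
func applyMultusSetupPlacement(ds *apps.DaemonSet, affinity *corev1.Affinity, tolerations []corev1.Toleration) {
	podSpec := &ds.Spec.Template.Spec
	podSpec.Affinity = affinity
	podSpec.Tolerations = tolerations
	for i := range podSpec.Containers {
		// Placeholder limits; real values would come from the operator settings.
		podSpec.Containers[i].Resources = corev1.ResourceRequirements{
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("100m"),
				corev1.ResourceMemory: resource.MustParse("128Mi"),
			},
		}
	}
}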

mergify bot commented Sep 17, 2021

This pull request has merge conflicts that must be resolved before it can be merged. @renan-campos please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork


@renan-campos changed the title from "ceph: run CSI pods in multus-connected host network namespace" to "[WIP] ceph: run CSI pods in multus-connected host network namespace" on Sep 27, 2021

// Only looking for ceph clusters running on the same namespace as the operator.
// This appears to be the only supported configuration. (See call to applyCephClusterNetworkConfig)
cephClusters, err := r.context.RookClientset.CephV1().CephClusters(r.opConfig.OperatorNamespace).List(ctx, metav1.ListOptions{})

Member

Don't assume the op namespace. The cephcluster is listed earlier in reconcile(), see if you can carry it.

Contributor Author

Instead of carrying it, I just made the same client call in the function.

@leseb added this to In progress in v1.8 via automation on Sep 28, 2021
@renan-campos force-pushed the bz1979561 branch 3 times, most recently from 93a39a5 to de9974b on October 1, 2021 16:21
@renan-campos changed the title from "[WIP] ceph: run CSI pods in multus-connected host network namespace" to "ceph: run CSI pods in multus-connected host network namespace" on Oct 1, 2021

@renan-campos (Contributor Author)

I have addressed the present comments, and manually tested my solution on a local CRC cluster. I put the changes in separate commits, but will squash as necessary when it is ready to merge. Please re-review at your convenience. Thank you.

@BlaineEXE left a comment (Member)

I don't think I can speak a lot to the inner workings of this. I trust that your testing will reveal any issues. Mostly, I have flagged things that make the code follow Rook's internal patterns more closely. Notably, we really want evocative error messages that...

  1. help the user fix any problems that may be a user configuration issue
  2. help us developers find the exact place in the code where an error has come from based on returned error text
  3. contain enough information about the current state of execution that helps us understand what might be going wrong

Comment on lines 27 to 39
logger.Info("determining multus ip")
multusIpStr, found := os.LookupEnv(multusIpEnv)
if !found {
return fmt.Errorf("environment variable %s not set.", multusIpEnv)
}
logger.Infof("multus ip: %s", multusIpStr)

logger.Info("determining multus link")
multusLinkName, found := os.LookupEnv(multusLinkEnv)
if !found {
return fmt.Errorf("environment variable %s not set.", multusLinkEnv)
}
logger.Infof("multus link: %s", multusLinkName)

Member

Logging is good, but I'm not sure the messages like "determining multus ip" and "determining multus link" are particularly helpful if we are also printing the results of the steps once they are complete.

I think maybe an initial message would be good "setting up multus" or something similar, and then outputting info about the setup results as it goes.

logger.Info("determining multus ip")
multusIpStr, found := os.LookupEnv(multusIpEnv)
if !found {
return fmt.Errorf("environment variable %s not set.", multusIpEnv)

Member

We try to use errors.Errorf and not fmt.Errorf

logger.Info("finding the multus network connected link")
link, err := netlink.LinkByName(multusLinkName)
if err != nil {
return err

Member

Please add context to all returned error messages with errors.Wrapf(err, "message %s", stuff). Often, utilities Rook uses don't provide useful information for debugging, and we want to make sure from an error message that we can trace exactly where in the code the error originated.

logger.Info("determining the IP address of the multus link")
addrs, err := netlink.AddrList(link, 0)
if err != nil {
log.Fatal(err)

Member

Why Fatal here? That's not how we have patterned our error handling in Rook. We want to be able to gracefully recover from failures and retry if it makes sense.
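
For reference, a minimal sketch of the error-handling shape being asked for here, wrapping failures with github.com/pkg/errors instead of calling log.Fatal; the function name is illustrative and only mirrors the quoted snippet:

package multus

import (
	"github.com/pkg/errors"
	"github.com/vishvananda/netlink"
)

// multusLinkAddrs returns the addresses on the multus-connected link, adding
// context to every failure so the caller can decide whether to retry.
func multusLinkAddrs(multusLinkName string) ([]netlink.Addr, error) {
	link, err := netlink.LinkByName(multusLinkName)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to find the multus-connected link %q", multusLinkName)
	}
	addrs, err := netlink.AddrList(link, 0)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to list addresses on multus link %q", multusLinkName)
	}
	if len(addrs) == 0 {
		return nil, errors.Errorf("multus link %q has no IP addresses", multusLinkName)
	}
	return addrs, nil
}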

v1.8 automation moved this from In progress to Review in progress Oct 1, 2021

@travisn left a comment (Member)

Just a few thoughts :)

@@ -88,6 +88,7 @@ users:
- system:serviceaccount:rook-ceph:rook-csi-rbd-provisioner-sa # serviceaccount:namespace:operator
- system:serviceaccount:rook-ceph:rook-csi-cephfs-plugin-sa # serviceaccount:namespace:operator
- system:serviceaccount:rook-ceph:rook-csi-cephfs-provisioner-sa # serviceaccount:namespace:operator
- system:serviceaccount:rook-ceph:rook-ceph-multus

Member

Let's add this to the rook scc (about line 45 above) instead of this csi scc

Contributor Author

But I need SYS_ADMIN and NET_ADMIN capabilities, which the rook-ceph scc doesn't provide.

Member

Since the service account is really related to the csi driver, how about we name it like this?

Suggested change
- system:serviceaccount:rook-ceph:rook-ceph-multus
- system:serviceaccount:rook-ceph:rook-csi-multus

if cephCluster.Spec.Network.IsMultus() {
    if publicNetwork, ok = cephCluster.Spec.Network.Selectors["public"]; !ok {
        logger.Debug("public network not provided; not performing multus configuration.")
        return nil

Member

Shouldn't we continue in case another cluster is using multus?

Suggested change
return nil
continue

}

// Populate the host network namespace with a multus-connected interface.
tp := templateParam{

Member

nit: it's preferred to give vars readable names and reserve the single letter names for the func receivers, loops, or other similarly simple places.

Suggested change
tp := templateParam{
template := templateParam{

cleanupError = true
continue
}
err = k8sutil.WaitForJobCompletion(r.context.Clientset, cleanupJob, 30*time.Second)

Member

Shouldn't this wait-for-job call be done in the goroutine below? Then the jobs would be deployed on all applicable nodes and waited for as a group.

Contributor Author

Yes, thank you for catching this.
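
A sketch of what waiting for the per-node jobs as a group could look like; errgroup is an assumption (any join mechanism would do), the job slice is assumed to have been created already, and the WaitForJobCompletion call mirrors the quoted snippet:

package csi

import (
	"time"

	"golang.org/x/sync/errgroup"
	batch "k8s.io/api/batch/v1"
	"k8s.io/client-go/kubernetes"

	"github.com/rook/rook/pkg/operator/k8sutil"
)

// waitForMigrationJobs waits for all per-node migration jobs concurrently and
// returns the first error encountered, so a failure on any node is reported.
func waitForMigrationJobs(clientset kubernetes.Interface, jobs []*batch.Job) error {
	group := errgroup.Group{}
	for i := range jobs {
		job := jobs[i] // capture the loop variable for the goroutine
		group.Go(func() error {
			return k8sutil.WaitForJobCompletion(clientset, job, 30*time.Second)
		})
	}
	return group.Wait()
}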

}

for _, cephCluster := range cephClusters.Items {
// Assuming one CephCluster in the r.opConfig.OperatorNamespace

Member

The cephcluster is not necessarily in the operator namespace. How about if we just iterate through all cephclusters? There is a ROOK_CURRENT_NAMESPACE_ONLY setting in the operator that we should also honor. If that's true, then we can assume it's in the same namespace as the operator.

Contributor Author

I am iterating through all of the cephclusters; I missed removing that comment. I gather all the cephClusters on line 704.

return errors.Wrap(err, "failed to list ceph clusters")
}

for _, cephCluster := range cephClusters.Items {

Member

Detecting whether multus is needed happens in two different places. How about factoring it into a common method?


If an error occurs in any of the migration jobs, the operator runs a teardown job on all of the nodes where migration was to have occurred. The teardown job runs on the host network namespace and is passed the multus IP of that node. It searches for an interface in the host namespace with the multus IP and removes it. If there is no such interface present, the job is considered complete.

Restarting the node will cause the multus interface in the host namespace to go away. The holder pod will once again have the interface, which will be migrated again once the job reruns on the node.

Member

Is the operator watching for the restart of all nodes to trigger this job?

Contributor Author

Good point. It is not. When I tested this out, the operator was on the node I restarted so it restarted as well and re-ran all the jobs.

Contributor Author

In the latest commit, the csi controller watches the multus daemonset. Therefore the reconcile loop will be triggered when the node restarts.
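
To make the teardown behavior described in the quoted documentation above concrete, here is a minimal netlink-based sketch of removing a host-namespace interface by its multus IP; this is an illustration, not the PR's implementation:

package multus

import (
	"net"

	"github.com/pkg/errors"
	"github.com/vishvananda/netlink"
)

// removeInterfaceWithIP looks for a host-namespace link holding the given
// multus IP and deletes it. If no link holds the IP there is nothing to clean
// up, so the teardown is considered complete.
func removeInterfaceWithIP(multusIP net.IP) error {
	links, err := netlink.LinkList()
	if err != nil {
		return errors.Wrap(err, "failed to list host network links")
	}
	for _, link := range links {
		addrs, err := netlink.AddrList(link, 0)
		if err != nil {
			return errors.Wrapf(err, "failed to list addresses on link %q", link.Attrs().Name)
		}
		for _, addr := range addrs {
			if addr.IP.Equal(multusIP) {
				if err := netlink.LinkDel(link); err != nil {
					return errors.Wrapf(err, "failed to delete link %q holding the multus IP", link.Attrs().Name)
				}
				return nil
			}
		}
	}
	return nil // no interface has the multus IP; nothing to do
}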

We can add annotations to these pods so they can reach the Ceph public network; the driver will then expose the block device or the filesystem normally.
The CSI pods will run in the node's host network namespace.

When deploying a CephCluster resource configured to use multus networks, a multus-connected network interface will be added to the host network namespace of all nodes that will run CSI plugin pods.

Member

After the cluster is started, what happens when a new node is added to the cluster? Will it be a problem that the daemonset for the csi volume plugins will start before the job is triggered to configure the multus interface?

Contributor Author

The csi plugin won't work properly if the job that migrates the multus interface isn't run first. The code doesn't handle the addition of new nodes; I need to add this.

Contributor Author

My latest push adds this functionality. When a new node is added, the multus holder pod will be deployed on it because of the daemonset definition; the csi reconcile loop will then deploy the interface migration job on that node.

Comment on lines 1230 to 1263
---
# Service account for multus actions related to the csi
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-csi-multus
  namespace: rook-ceph
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rook-csi-multus
  namespace: rook-ceph # namespace:operator
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "update"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "delete"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rook-csi-multus
  namespace: rook-ceph # namespace:operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-csi-multus
subjects:
  - kind: ServiceAccount
    name: rook-csi-multus
    namespace: rook-ceph # namespace:operator

Contributor Author

The multus daemonset creates jobs for interface migration and cleanup. Information about the interface being migrated is gathered and placed in the pod metadata. This is the reason for this additional service account.
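
A sketch of what recording the interface details on the holder pod could look like, which is the kind of update the pods get/update rule above permits; the annotation keys and the helper name are hypothetical:

package multus

import (
	"context"

	"github.com/pkg/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// annotateHolderPod stores the migrated interface name and multus IP on the
// holder pod so the migration and teardown jobs can read them back later.
func annotateHolderPod(ctx context.Context, clientset kubernetes.Interface, namespace, podName, iface, ip string) error {
	pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return errors.Wrapf(err, "failed to get holder pod %q", podName)
	}
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations["multus.rook.io/migrated-interface"] = iface // hypothetical key
	pod.Annotations["multus.rook.io/multus-ip"] = ip             // hypothetical key
	if _, err := clientset.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		return errors.Wrapf(err, "failed to update annotations on holder pod %q", podName)
	}
	return nil
}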

@@ -702,6 +702,7 @@ spec:
allowedCapabilities:
# required by CSI
- SYS_ADMIN
- NET_ADMIN

Contributor Author

Pod Security Policy is a deprecated resource, and I don't know if this is used. However, the pods running the multus migration need the NET_ADMIN capability to perform network-related operations. This capability is added to the appropriate SecurityContextConstraint below, and added here so that they match.

Comment on lines +62 to +64
multusControllerCmd,
multusSetupCmd,
multusTeardownCmd,

Contributor Author

multusControllerCmd runs a daemon that creates a job to migrate a multus interface into the host network namespace. If the daemon receives a SIGTERM, a cleanup job is run. multusSetupCmd is the migration job; multusTeardownCmd is the teardown job.
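
An illustrative outline of that daemon's lifecycle (the two job helpers are hypothetical stand-ins for the setup and teardown jobs described above):

package multus

import (
	"os"
	"os/signal"
	"syscall"
)

// runController launches the interface migration job, then blocks until the
// pod receives SIGTERM and runs the cleanup job before exiting. While blocked
// it simply keeps the multus IP reserved for the node.
func runController(runSetupJob, runTeardownJob func() error) error {
	if err := runSetupJob(); err != nil {
		return err
	}

	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM)
	<-signalChan // sleep until the daemonset pod is asked to terminate

	return runTeardownJob()
}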

@renan-campos (Contributor Author)

Thank you for your patience, this PR is ready for re-review.

@travisn left a comment (Member)

It's been a while since we discussed this feature, can you join the Rook huddle soon or set up another meeting to discuss? Thanks!

return nil
}

if err := opcontroller.AddFinalizerWithNameIfNotPresent(r.opManagerContext, r.client, &cephCluster, multusFinalizer); err != nil {

Member

Even if there are multiple resources to clean up when a cluster CR is deleted, we only need a single finalizer so the operator can trigger all the cleanup code, so I don't believe we need this new finalizer.

}
}

if !cluster.DeletionTimestamp.IsZero() {

Member

Instead of triggering the teardown from the deletion timestamp, how about from within the stopDrivers() method when the entire csi driver is also stopped? There is just one daemonset to remove from the same namespace as the operator, right? If so, then no need to trigger this from the cluster CR.

logger.Errorf("failed to run teardown job: %v", err)
// Sleep so that the developer can view the log before the pod is destroyed.
// Pods being deleted have a grace period of 30 seconds after the SIGTERM.
time.Sleep(30 * time.Second)

Member

Let's not include a sleep for production clusters

<-signalChan
logger.Info("Running teardown job")
logger.Infof("Removing multus interface %q", jobParams.MigratedInterface)
err = multus.RunTeardownJob(k8sClient, jobParams)

Member

Jobs are very heavy-weight, so running a job during pod teardown just doesn't seem right. Let's discuss.

@renan-campos (Contributor Author)

> It's been a while since we discussed this feature, can you join the Rook huddle soon or set up another meeting to discuss? Thanks!

Yes, I'll make diagrams and join tomorrow's huddle to discuss if there is time; if not, we can pick a time for a separate meeting then.

mergify bot commented Jan 11, 2022

This pull request has merge conflicts that must be resolved before it can be merged. @renan-campos please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork

@renan-campos (Contributor Author)

I'm closing this PR. I don't have the time, competence, or desire for this to succeed.

v1.8 automation moved this from Review in progress to Done Jan 13, 2022
@BlaineEXE self-assigned this on Jan 14, 2022
@BlaineEXE moved this from Done to In progress in v1.8 on Jan 14, 2022
@travisn removed this from In progress in v1.8 on Feb 8, 2022