csi: update multus design to mitigate csi-plugin pod restart #9903

Merged
merged 1 commit into rook:master from multus-design-change-for-csi on Apr 12, 2022

Conversation

@leseb leseb commented Mar 14, 2022

Description of your changes:

This is the latest version of the design to mitigate the bug encountered
when restarting the csi-plugin pod.

Signed-off-by: Sébastien Han <seb@redhat.com>

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@BlaineEXE (Member) left a comment

Overall I think this looks good.

One overall comment is that it might be good to explain for those less up-to-speed how this design fixes the documented issue.

leseb commented Mar 17, 2022

Since cephcsi uses go-ceph and go-ceph uses cgo, we can hardly get a statically compiled cephcsi binary today. I started a discussion here: ceph/ceph-csi#2946. This puts that design at risk, to say the least...

leseb added a commit to leseb/rook that referenced this pull request Mar 17, 2022
Implementation of the design proposed in
rook#9903.

Introduction of a new binary to watch and execute a given binary.

Example run:

```
[leseb@tarox~/go/src/github.com/rook/rook/build/supertini][multus-plugin-restart-fix !?] k logs -f csi-cephfsplugin-62hl5
2022/03/16 15:01:12.344243 passed arguments [/usr/local/bin/supertini /var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi]
2022/03/16 15:01:12.345234 start watching for binary "cephcsi" changes in "/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com"
2022/03/16 15:01:12.345263 binary directory path "/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com" exists
2022/03/16 15:01:12.345365 starting command ["/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi" "--nodeid=minikube" "--type=cephfs" "--endpoint=unix:///csi/csi.sock" "--v=0" "--nodeserver=true" "--drivername=rook-ceph.cephfs.csi.ceph.com" "--pidlimit=-1" "--metricsport=9091" "--forcecephkernelclient=true" "--metricspath=/metrics" "--enablegrpcmetrics=false"]
2022/03/16 15:01:12.346304 started child process 12
2022/03/16 15:02:38.773859 "/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi": REMOVE
2022/03/16 15:02:38.773900 "/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi": CREATE
2022/03/16 15:02:38.773976 starting command ["/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi" "--nodeid=minikube" "--type=cephfs" "--endpoint=unix:///csi/csi.sock" "--v=0" "--nodeserver=true" "--drivername=rook-ceph.cephfs.csi.ceph.com" "--pidlimit=-1" "--metricsport=9091" "--forcecephkernelclient=true" "--metricspath=/metrics" "--enablegrpcmetrics=false"]
2022/03/16 15:02:38.782069 process 12 was killed exiting go routine
2022/03/16 15:02:38.817829 started child process 56
```

Signed-off-by: Sébastien Han <seb@redhat.com>
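
For readers following the thread, here is a minimal Go sketch of the watch-and-exec idea described in the commit message above, assuming the github.com/fsnotify/fsnotify package. It is illustrative only and not the actual supertini implementation; error handling and child reaping are simplified.

```
// Minimal sketch of a "watch and re-exec" wrapper, assuming fsnotify.
// Not the actual supertini code; paths and behavior are simplified.
package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

// start launches the wrapped binary and returns the running command.
func start(bin string, args []string) *exec.Cmd {
	cmd := exec.Command(bin, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start %q: %v", bin, err)
	}
	log.Printf("started child process %d", cmd.Process.Pid)
	return cmd
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: watcher <binary> [args...]")
	}
	bin, args := os.Args[1], os.Args[2:]

	w, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()
	// Watch the directory: on upgrade the binary is replaced (REMOVE then CREATE).
	if err := w.Add(filepath.Dir(bin)); err != nil {
		log.Fatal(err)
	}

	child := start(bin, args)
	for ev := range w.Events {
		if ev.Name == bin && ev.Op&fsnotify.Create == fsnotify.Create {
			// A new binary was dropped in place: kill the old child and run the new one.
			_ = child.Process.Kill()
			_ = child.Wait()
			child = start(bin, args)
		}
	}
}
```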
@leseb leseb added this to Review in Progress in v1.9 via automation Mar 23, 2022

@idryomov idryomov left a comment


I'm not a Kubernetes expert but this seems too fragile and really over-engineered to me. I'm ignoring the Multus aspect because it seems largely irrelevant here and focusing on the core issue of dealing with the csi-plugin pod configured to use pod networking (i.e. hostNetwork: false).

The proposed design essentially boils down to working around the fact that Kubernetes has no support for sharing a network namespace between pods. But if you are introducing a long-running "network holder" pod anyway, why not just make the csi-plugin use that pod's network namespace for mapping and mounting operations? I'm thinking of something along the following lines:

  • The new long-running pod does nothing, its only payload is the pause process which keeps its network alive.

  • In today's csi-plugin pod, instead of invoking rbd map ..., invoke nsenter -n -t <pid of pause in long-running pod> rbd map ... (and same for mount -t ceph ...). The csi-plugin pod is already privileged, it already adds CAP_SYS_ADMIN and is already deployed with hostPID: true so it should just work with no additional changes (CAP_SYS_ADMIN is required for manipulating RBD devices and would always need to be there on the entity that does that regardless). The only snag is passing the required pid into the csi-plugin pod but there are multiple ways to do it. For example, the long-running pod could create a symlink to an equivalent of /proc/$$/ns/net in the hostPath volume shared with the csi-plugin pod and then nsenter -n -t <pid of pause in long-running pod> ... would become nsenter --net=/path/to/symlink ....

With the above, it should be possible to terminate or restart/upgrade today's csi-plugin pod configured to use pod networking with no impact on kernel client I/O. No need to attempt statically linking Ceph shared libraries, re-implementing rbd map and other bits from scratch, or hot-swapping these binaries inside of a running container from another container (yuck!).
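
To make the suggestion more concrete, here is a rough Go sketch of wrapping a map command with nsenter, assuming the holder pod has published a handle to its network namespace (e.g. a bind-mount or symlink of /proc/<pause pid>/ns/net) at a hostPath shared with the csi-plugin pod. The path and helper name below are illustrative assumptions, not Ceph-CSI's actual API.

```
// Rough sketch of wrapping map/mount commands with nsenter (not Ceph-CSI's API).
package main

import (
	"fmt"
	"os/exec"
)

// netNSPath is where the holder pod is assumed to expose its netns handle.
const netNSPath = "/var/lib/rook/csi-holder/net-ns"

// runInHolderNetNS runs cmd inside the holder's network namespace, so the
// kernel RBD/CephFS client connections live in a namespace that survives
// csi-plugin pod restarts.
func runInHolderNetNS(cmd string, args ...string) ([]byte, error) {
	nsArgs := append([]string{"--net=" + netNSPath, cmd}, args...)
	out, err := exec.Command("nsenter", nsArgs...).CombinedOutput()
	if err != nil {
		return out, fmt.Errorf("%s %v failed: %w: %s", cmd, args, err, out)
	}
	return out, nil
}

func main() {
	// Example: instead of running `rbd map replicapool/myimage` directly,
	// run it in the holder's network namespace (pool/image name is illustrative).
	out, err := runInHolderNetNS("rbd", "map", "replicapool/myimage")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("mapped device: %s\n", out)
}
```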

@BlaineEXE

@idryomov I think your analysis is astute. What you propose will still require some changes to Ceph-CSI, but those changes do seem like a simpler way to achieve what we ultimately need, which is to have a stable, non-host network namespace for file/rbd mounts. I think Ilya's proposal is probably a better one unless we find some issue where it falls short. @leseb wdyt?

leseb commented Mar 28, 2022

I agree that @idryomov's proposal is the best and simplest so far. I can see a little bit of a timing issue if the "holder" is not available yet and cephcsi starts receiving requests before that, but that's probably fine given the internal retry.
Additionally, I think there is an issue with nsenter, at least for Multus. In the current design, the monitors are using a service IP which is bound to the default SDN nic present in the container, and the second interface is where the data path is. So essentially the mount will need access to both (mons and OSD public network). If we execute the nsenter only on the multus network we won't be able to reach the mons. Unless I'm missing something in how nsenter works. @idryomov what do you think?
Thanks!

@BlaineEXE

I agree that @idryomov's proposal is the best and simplest so far. I can see a little bit of a timing issue if the "holder" is not available yet and cephcsi starts receiving requests before that, but that's probably fine given the internal retry. Additionally, I think there is an issue with nsenter, at least for Multus. In the current design, the monitors are using a service IP which is bound to the default SDN nic present in the container, and the second interface is where the data path is. So essentially the mount will need access to both (mons and OSD public network). If we execute the nsenter only on the multus network we won't be able to reach the mons. Unless I'm missing something in how nsenter works. @idryomov what do you think? Thanks!

From my understanding, the holder should get the multus public network if multus is enabled. If the holder network namespace has access to the cluster via multus, it should just work.

That said, the reality is that we have multus issues where the mons don't actually listen on the multus network. Because of this, there is some possibility that it will "just work". I think this is a workaround we should fix if it becomes necessary, but IMO this is separate from the ideal case here. Also, from my experience with the CSI/Multus work, even pods using a multus network are still given access to Kubernetes' normal pod network. If that is the case, then I think there should be no connection issues from the holder's network namespace.

@idryomov

If we execute the nsenter only on the multus network we won't be able to reach the mons. Unless I'm missing something in how nsenter works. @idryomov what do you think?

nsenter is "executed" on a network namespace, not on an individual network (interface). If, with multus enabled, the holder's network namespace ends up having two network interfaces, it is totally fine. As long as the holder can talk to both monitors and OSDs, the kernel client would be able to do the same (and don't forget about MDSes for CephFS).

I don't foresee any issues here because if a multus-enabled pod couldn't talk to all Ceph daemons, nothing would have worked and you would have never gotten as far as worrying about the restart/upgrade case :-)

leseb commented Mar 29, 2022

If we execute the nsenter only on the multus network we won't be able to reach the mons. Unless I'm missing something in how nsenter works. @idryomov what do you think?

nsenter is "executed" on a network namespace, not on an individual network (interface). If, with multus enabled, the holder's network namespace ends up having two network interfaces, it is totally fine. As long as the holder can talk to both monitors and OSDs, the kernel client would be able to do the same (and don't forget about MDSes for CephFS).

I don't foresee any issues here because if a multus-enabled pod couldn't talk to all Ceph daemons, nothing would have worked and you would have never gotten as far as worrying about the restart/upgrade case :-)

My bad, I had in mind one net ns = one nic 🤦🏻 which is obviously not the case 👍🏻 . A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices.

@leseb leseb force-pushed the multus-design-change-for-csi branch 2 times, most recently from 5a661c4 to 95022e8 Compare March 31, 2022 08:04
@leseb leseb requested review from travisn and BlaineEXE March 31, 2022 08:04
@leseb leseb force-pushed the multus-design-change-for-csi branch from 95022e8 to f98fb9c Compare March 31, 2022 13:24

The initial implementation of this design is limited to supporting a single CephCluster with
Multus.
Until we stop using HostNetwork entirely it is impossible to support multiple CephClusters with and
Member


Do we need to support multiple cephclusters? Not sure we need to worry about it, or at least it's much lower priority.

Member Author


Rook supports deploying multiple clusters so it's natural to consider it :)

@BlaineEXE (Member) left a comment

Approved with a note for what I think is a typo.

@leseb leseb merged commit ebe2e10 into rook:master Apr 12, 2022
@leseb leseb deleted the multus-design-change-for-csi branch April 12, 2022 13:55
travisn added a commit that referenced this pull request Apr 13, 2022
csi: update multus design to mitigate csi-plugin pod restart (backport #9903)
@travisn travisn moved this from Review in Progress to Done in v1.9 Apr 13, 2022
leseb added a commit to leseb/rook that referenced this pull request Apr 14, 2022
Implementation of the design proposed in
rook#9903.

Signed-off-by: Sébastien Han <seb@redhat.com>
mergify bot pushed a commit that referenced this pull request Apr 27, 2022
Implementation of the design proposed in
#9903.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 73b1347)
leseb added a commit to red-hat-storage/rook that referenced this pull request May 24, 2022
Implementation of the design proposed in
rook#9903.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 73b1347)