csi: update multus design to mitigate csi-plugin pod restart #9903
Conversation
Overall I think this looks good.
One overall comment: it might be good to explain, for readers less up to speed, how this design fixes the documented issue.
Implementation of the design proposed in rook#9903. Introduction of a new binary to watch and execute a given binary. Example run:

```
[leseb@tarox~/go/src/github.com/rook/rook/build/supertini][multus-plugin-restart-fix !?] k logs -f csi-cephfsplugin-62hl5
2022/03/16 15:01:12.344243 passed arguments [/usr/local/bin/supertini /var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi]
2022/03/16 15:01:12.345234 start watching for binary "cephcsi" changes in "/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com"
2022/03/16 15:01:12.345263 binary directory path "/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com" exists
2022/03/16 15:01:12.345365 starting command ["/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi" "--nodeid=minikube" "--type=cephfs" "--endpoint=unix:///csi/csi.sock" "--v=0" "--nodeserver=true" "--drivername=rook-ceph.cephfs.csi.ceph.com" "--pidlimit=-1" "--metricsport=9091" "--forcecephkernelclient=true" "--metricspath=/metrics" "--enablegrpcmetrics=false"]
2022/03/16 15:01:12.346304 started child process 12
2022/03/16 15:02:38.773859 "/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi": REMOVE
2022/03/16 15:02:38.773900 "/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi": CREATE
2022/03/16 15:02:38.773976 starting command ["/var/lib/kubelet/plugins/rook-ceph.cephfs.csi.ceph.com/cephcsi" "--nodeid=minikube" "--type=cephfs" "--endpoint=unix:///csi/csi.sock" "--v=0" "--nodeserver=true" "--drivername=rook-ceph.cephfs.csi.ceph.com" "--pidlimit=-1" "--metricsport=9091" "--forcecephkernelclient=true" "--metricspath=/metrics" "--enablegrpcmetrics=false"]
2022/03/16 15:02:38.782069 process 12 was killed exiting go routine
2022/03/16 15:02:38.817829 started child process 56
```

Signed-off-by: Sébastien Han <seb@redhat.com>
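The log above shows the watch-and-restart behavior. As a rough illustration only (the real implementation is a Go binary using filesystem notifications; everything below, including the `child` script, the temp directory, and the polling interval, is invented for the sketch), the same idea can be approximated in shell with mtime polling:

```shell
# Toy sketch of "watch a binary and restart the child when it is replaced".
# Not the actual implementation; paths and the child payload are invented.
tmp=$(mktemp -d)
printf '#!/bin/sh\nsleep 60\n' > "$tmp/child"
chmod +x "$tmp/child"

"$tmp/child" & child=$!
last=$(stat -c %Y "$tmp/child")
echo "started child process $child"

# Simulate an image update replacing the binary on the shared volume.
sleep 2
touch "$tmp/child"

# Poll the binary's mtime; on change, kill and restart the child.
for _ in 1 2 3; do
  sleep 1
  cur=$(stat -c %Y "$tmp/child")
  if [ "$cur" != "$last" ]; then
    last=$cur
    kill "$child" 2>/dev/null
    wait "$child" 2>/dev/null
    "$tmp/child" & child=$!
    echo "restarted child process $child"
    break
  fi
done
kill "$child" 2>/dev/null
```

Polling is used here only to keep the sketch dependency-free; an inotify-based watcher reacts to the REMOVE/CREATE pair immediately, as the timestamps in the log show.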
I'm not a Kubernetes expert but this seems too fragile and really over-engineered to me. I'm ignoring the Multus aspect because it seems largely irrelevant here, and focusing on the core issue of dealing with a csi-plugin pod configured for pod networking (i.e. `hostNetwork: false`).
The proposed design essentially boils down to working around the fact that Kubernetes has no support for sharing a network namespace between pods. But if you are introducing a long-running "network holder" pod anyway, why not just make the csi-plugin use that pod's network namespace for mapping and mounting operations? I'm thinking of something along the following lines:
- The new long-running pod does nothing; its only payload is the `pause` process, which keeps its network namespace alive.
- In today's `csi-plugin` pod, instead of invoking `rbd map ...`, invoke `nsenter -n -t <pid of pause in long-running pod> rbd map ...` (and the same for `mount -t ceph ...`). The `csi-plugin` pod is already privileged, already adds `CAP_SYS_ADMIN`, and is already deployed with `hostPID: true`, so it should just work with no additional changes (`CAP_SYS_ADMIN` is required for manipulating RBD devices and would always need to be present on whatever entity does that, regardless). The only snag is passing the required pid into the `csi-plugin` pod, but there are multiple ways to do it. For example, the long-running pod could create a symlink to an equivalent of `/proc/$$/ns/net` in the `hostPath` volume shared with the `csi-plugin` pod, and then `nsenter -n -t <pid of pause in long-running pod> ...` would become `nsenter --net=/path/to/symlink ...`.
With the above, it should be possible to terminate or restart/upgrade today's csi-plugin pod configured for pod networking with no impact on kernel client I/O. No attempting to statically link Ceph shared libraries, reimplement `rbd map` and other bits from scratch, or hot-swap these binaries inside a running container from another container (yuck!) required.
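To make the symlink variant of the proposal concrete, here is a hedged sketch (the shared directory, symlink name, pool, and image are all illustrative, not from the design; `nsenter` itself needs privileges, so the final step is only printed rather than executed):

```shell
# Holder-pod side: publish this process's network namespace as a symlink on
# a volume shared with the csi-plugin pod ($SHARED stands in for that volume).
SHARED=$(mktemp -d)
ln -sf "/proc/$$/ns/net" "$SHARED/holder-netns"

# csi-plugin-pod side: run map/mount commands inside the holder's namespace.
# This needs CAP_SYS_ADMIN (which the csi-plugin pod already has), so this
# sketch only prints the command instead of executing it.
echo "would run: nsenter --net=$SHARED/holder-netns rbd map replicapool/myimage"
```

The point of the symlink is that `nsenter --net=<path>` takes a namespace file directly, so the csi-plugin pod never needs to learn the pause process's pid.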
@idryomov I think your analysis is astute. What you propose will still require some changes to Ceph-CSI, but those changes do seem like a simpler way to achieve what we ultimately need, which is a stable, non-host network namespace for file/RBD mounts. IMO, Ilya's proposal is probably the better one unless we find some issue where it falls short. @leseb wdyt?
I agree that @idryomov's proposal is the best and simplest so far. I can see a little bit of a timing issue if the "holder" is not available yet and
From my understanding, the holder should get the Multus public network if Multus is enabled. If the holder's network namespace has access to the cluster via Multus, it should just work. That said, the reality is that we have Multus issues where the mons don't actually listen on the Multus network; because of this, there is some possibility that it won't "just work". I think that is something we should fix if it becomes necessary, but IMO it is separate from the ideal case here. Also, from my experience with the CSI/Multus work, even pods using a Multus network are still given access to Kubernetes' normal pod network. If that is the case, then I think there should be no connection issues from the holder's network namespace.
I don't foresee any issues here because if a multus-enabled pod couldn't talk to all Ceph daemons, nothing would have worked and you would never have gotten as far as worrying about the restart/upgrade case :-)
My bad, I had in mind one net ns = one nic 🤦🏻 which is obviously not the case 👍🏻 . A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices. |
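For readers following along: a process's network namespace is exposed as the magic symlink `/proc/<pid>/ns/net`, and two processes share a namespace exactly when those links resolve to the same `net:[inode]` identifier. A quick way to inspect this (Linux-only; reading another user's `/proc/<pid>/ns` entries may require privileges):

```shell
# Print this shell's network-namespace identity; processes sharing the
# namespace print the same net:[inode] value.
readlink /proc/self/ns/net
```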
> The initial implementation of this design is limited to supporting a single CephCluster with
> Multus.
> Until we stop using HostNetwork entirely it is impossible to support multiple CephClusters with and
Do we need to support multiple cephclusters? Not sure we need to worry about it, or at least it's much lower priority.
Rook supports deploying multiple clusters so it's natural to consider it :)
Approved with a note for what I think is a typo.
csi: update multus design to mitigate csi-plugin pod restart (backport #9903)
Implementation of the design proposed in rook#9903. Signed-off-by: Sébastien Han <seb@redhat.com>
Description of your changes:
This is the latest version of the design to mitigate the bug encountered when restarting the `csi-plugin` pod.

Signed-off-by: Sébastien Han seb@redhat.com
Which issue is resolved by this Pull Request:
Resolves #
Checklist:
`skip-ci` on the PR.