
cmd/containerboot: fix unclean shutdown #10035

Merged (2 commits, Nov 16, 2023)
Conversation

@irbekrm (Contributor) commented Oct 31, 2023

This PR fixes a bug where containerboot does not shut down cleanly when SIGTERM is received.

It ensures that:

  • containerboot's long-running tailscaled watch shuts down when SIGTERM is received
  • the watch shuts down before tailscaled does

I have verified that the fix works by following the same steps used to reproduce the issue; see #10090.

Updates #10090

@irbekrm marked this pull request as draft on November 1, 2023 05:59
@irbekrm force-pushed the irbekrm/fixshutdown branch 4 times, most recently from ca71177 to bfc979c on November 2, 2023 18:58
@irbekrm marked this pull request as ready for review on November 2, 2023 18:59
Make sure that the tailscaled watcher returns when
SIGTERM is received, and also that it shuts down
before tailscaled exits.

Updates #10090

Signed-off-by: Irbe Krumina <irbe@tailscale.com>
@irbekrm changed the title from "WIP: cmd/containerboot: fix shutdown" to "cmd/containerboot: fix shutdown" on Nov 2, 2023
@irbekrm changed the title from "cmd/containerboot: fix shutdown" to "cmd/containerboot: fix unclean shutdown" on Nov 2, 2023
```go
// only start doing this once we've stopped shelling out to things
// `tailscale up`, otherwise this goroutine can reap the CLI subprocesses
// and wedge bringup.
reaper := func() {
```
@irbekrm (Contributor, Author) commented:

Gave this function a name so I can refer to it in a code comment above; I haven't changed the contents.

Comment on lines 313 to 411
```go
if n.NetMap != nil {
	addrs := n.NetMap.SelfNode.Addresses().AsSlice()
	newCurrentIPs := deephash.Hash(&addrs)
	ipsHaveChanged := newCurrentIPs != currentIPs
	if cfg.ProxyTo != "" && len(addrs) > 0 && ipsHaveChanged {
		log.Printf("Installing proxy rules")
		if err := installIngressForwardingRule(ctx, cfg.ProxyTo, addrs, nfr); err != nil {
			log.Fatalf("installing ingress proxy rules: %v", err)
		}
	}
	if cfg.ServeConfigPath != "" && len(n.NetMap.DNS.CertDomains) > 0 {
		cd := n.NetMap.DNS.CertDomains[0]
		prev := certDomain.Swap(ptr.To(cd))
		if prev == nil || *prev != cd {
			select {
			case certDomainChanged <- true:
			default:
			}
		}
	}
	if cfg.TailnetTargetIP != "" && ipsHaveChanged && len(addrs) > 0 {
		if err := installEgressForwardingRule(ctx, cfg.TailnetTargetIP, addrs, nfr); err != nil {
			log.Fatalf("installing egress proxy rules: %v", err)
		}
	}
	currentIPs = newCurrentIPs

	deviceInfo := []any{n.NetMap.SelfNode.StableID(), n.NetMap.SelfNode.Name()}
	if cfg.InKubernetes && cfg.KubernetesCanPatch && cfg.KubeSecret != "" && deephash.Update(&currentDeviceInfo, &deviceInfo) {
		if err := storeDeviceInfo(ctx, cfg.KubeSecret, n.NetMap.SelfNode.StableID(), n.NetMap.SelfNode.Name(), n.NetMap.SelfNode.Addresses().AsSlice()); err != nil {
			log.Fatalf("storing device ID in kube secret: %v", err)
		}
	}
}
if !startupTasksDone {
	if (!wantProxy || currentIPs != deephash.Sum{}) && (!wantDeviceInfo || currentDeviceInfo != deephash.Sum{}) {
		// This log message is used in tests to detect when all
		// post-auth configuration is done.
		log.Println("Startup complete, waiting for shutdown signal")
		startupTasksDone = true

		// Reap all processes, since we are PID1 and need to collect zombies. We can
		// only start doing this once we've stopped shelling out to things
		// `tailscale up`, otherwise this goroutine can reap the CLI subprocesses
		// and wedge bringup.
		reaper := func() {
			for {
				var status unix.WaitStatus
				pid, err := unix.Wait4(-1, &status, 0, nil)
				if errors.Is(err, unix.EINTR) {
					continue
				}
				if err != nil {
					log.Fatalf("Waiting for exited processes: %v", err)
				}
				if pid == daemonProcess.Pid {
					log.Printf("Tailscaled exited")
					os.Exit(0)
				}
			}
		}
		go reaper()
	}
}
```
@irbekrm (Contributor, Author) commented:

This is not a code change, just an indentation change.
@irbekrm requested review from maisem and knyar on November 2, 2023 19:11

cmd/containerboot/main.go (review thread outdated; resolved)
```go
for {
	n, err := w.Next()
	if err != nil {
		errChan <- err
```
@knyar (Contributor) commented:

I know it currently does not matter (since we are exiting the process anyway), but perhaps we should return from this loop on error, and on ctx.Done()?

@irbekrm (Contributor, Author) commented:

I've added a break for the error case. That should cover ctx.Done() too, since we pass the same context when creating the watcher, which should then return an error when the context is done:
https://github.com/tailscale/tailscale/blob/v1.52.1/client/tailscale/localclient.go#L1350-L1351

```go
// kill tailscaled and let reaper clean up child
// processes.
killTailscaled()
return
```
@knyar (Contributor) commented:

The comment above says that we need to let the reaper clean up child processes, but I don't think we have any synchronization in place that would actually guarantee this. As I understand it, the return here will race with the reaper, and there's a chance that the containerboot process will terminate before it gets the SIGCHLD about tailscaled.

I suspect we don't actually need to reap tailscaled after it's done, since we are exiting anyway and the whole container will terminate imminently. So perhaps simply returning here will be sufficient, letting killTailscaled run as part of the defer?

@irbekrm (Contributor, Author) commented:

I've swapped the return for a break plus a wait for the reaper.

@irbekrm (Contributor, Author) commented:

I read around a bit, and my understanding is that the container orchestrator won't necessarily reap zombies: krallin/tini#8 (comment) (couldn't find a more authoritative source).

@knyar (Contributor) commented:

> the container orchestrator won't necessarily reap zombies

I believe if containerboot runs as PID 1 in a given PID namespace and exits, the kernel will destroy the PID namespace alongside any unreaped zombies. If we share the PID namespace with other processes, then it will be init's job to reap them, so it's good to try to do that ourselves. Adding a wg to wait for the reaper makes sense, thanks!

cmd/containerboot/main.go (review thread outdated; resolved)

@deandre commented Nov 14, 2023

@irbekrm is there a timeline for when this will be merged and deployed? A customer is asking about it.

@DentonGentry (Contributor) commented Nov 14, 2023

It is too late for 1.54.0; platform testing has already substantially completed.
This can go in after 1.54 branches, and will be suitable for delivery in 1.56.

Signed-off-by: Irbe Krumina <irbe@tailscale.com>
@irbekrm (Contributor, Author) commented Nov 15, 2023

> This can go in after 1.54 branches, and be suitable for delivery in 1.56.

I think we should be able to cut a new tag once it merges, though (so after the 1.54 branch).

@irbekrm (Contributor, Author) commented Nov 15, 2023

Thanks for the review, @knyar! I think I've addressed the comments, PTAL.

@irbekrm merged commit 664ebb1 into main on Nov 16, 2023
45 checks passed
@irbekrm deleted the irbekrm/fixshutdown branch on November 16, 2023 19:23
@irbekrm (Contributor, Author) commented Nov 20, 2023

@deandre the latest containerboot image, tailscale/tailscale:unstable-v1.55.46, will have the fix.
