[Windows] Stop-Service rke2 does not stop all rke2 services #2204

Closed
mdrahman-suse opened this issue Dec 2, 2021 · 7 comments
Labels: kind/bug (Something isn't working), priority/low

@mdrahman-suse
Contributor

Extension of (#1755)

Environmental Info:
RKE2 Version:

rke2.exe version v1.22.4-rc2+rke2r1 (79dc33a)
go version go1.16.10b7

Node(s) CPU architecture, OS, and Version:

Server: Linux ip-172-31-47-190 5.4.0-1009-aws #9-Ubuntu SMP Sun Apr 12 19:46:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Agent: Windows 10 (1809)

Cluster Configuration:

1 server, 1 agent

Describe the bug:

Stop-Service rke2 does not stop all the rke2 services cleanly. One or two services remain running afterwards, and which ones varies from run to run.

PS C:\Users\Administrator> Stop-Service rke2
PS C:\Users\Administrator> Get-Process kubelet,kube-proxy,containerd,rke2
Get-Process : Cannot find a process with the name "kubelet". Verify the process name and call the cmdlet again.
At line:1 char:1
+ Get-Process kubelet,kube-proxy,containerd,rke2
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (kubelet:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand

Get-Process : Cannot find a process with the name "kube-proxy". Verify the process name and call the cmdlet again.
At line:1 char:1
+ Get-Process kubelet,kube-proxy,containerd,rke2
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (kube-proxy:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand

Get-Process : Cannot find a process with the name "rke2". Verify the process name and call the cmdlet again.
At line:1 char:1
+ Get-Process kubelet,kube-proxy,containerd,rke2
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (rke2:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand


Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    423      21    54756      65720     744.72   5220   0 containerd

Some progress was made as part of ticket (#1755), but additional code changes are needed to cleanly stop all the rke2 services.

Steps To Reproduce:

  • Install RKE2 on the server and agent nodes
  • Join the agent node to the cluster
  • Ensure the agent has joined and the rke2 services are running on the server and agent
  • Run Stop-Service rke2

Expected behavior:

  • Get-Process -Name containerd,kubelet,kube-proxy,rke2,calico-node should report that none of these processes are running

Actual behavior:

One or two services are always left running and are not stopped cleanly.

Additional context / logs:

Ticket (#1755) has more details on the earlier work.

@HarrisonWAffel
Contributor

This issue relates to how cancelation of the context created from the signals package within k3s is processed throughout rke2 and k3s. The k3s pkg/agent/run.go blocks on the completion of the context using a simple channel receive, and because it does no other work, this function handles the cancelation much faster than the executables spawned from exec.CommandContext (1, 2, 3).

Since the rke2 agent will return as soon as the k3s agent returns, there is not enough time for the exec.CommandContext statements to process the cancelation and thus executables (kubelet, containerd, etc.) will not be cleaned up. This is effectively a race condition, and explains the apparent randomness of processes left running after the rke2 service is stopped.

Ideally there would be a way to ensure that both rke2 and k3s fully process the context cancelation before rke2 exits. Currently, k3s will perform an os.Exit(1) if containerd is killed, which complicates this as that will also take out rke2.

@brandond
Contributor

brandond commented Dec 21, 2023

Yeah... if we did this properly we would set up wait groups and properly sequence the shutdown. I have made attempts at this in the past, but unfortunately there is a lot of code in Kubernetes, Wrangler, and K3s that calls exit or panic on error, and effectively breaks the ability to do a clean exit when the top-level context is cancelled.

@HarrisonWAffel
Contributor

HarrisonWAffel commented Jan 4, 2024

For this specific issue, the behavior of k3s is really what’s getting in the way. I can think of two solutions:

  1. We come up with a way to indicate to k3s that it should not exit when containerd exits. This would give rke2 the final say over when the binary terminates, giving us a bit of time to ensure that all processes are removed before we mark the Windows service as stopped.

    1. I'm thinking we could add an exported function (something like containerd.ExitFunc) which by default will call os.Exit, but could be overridden by the rke2 Windows service to perform process cleanup when the top-level context has completed.
  2. Each time the Windows service starts or stops, a small PowerShell script is run which kills the processes. This way we ensure that they’re all removed before the binary exits. In the event that processes are still left behind (due to a potential race condition with containerd/k3s exiting before the script completes), we clean them up on startup before attempting to run new instances.

    1. This feels like more of a brute-force approach, and there is a small potential for processes to stick around even after the service has exited (while I haven't seen this occur in my testing, I can't prove it won't happen).
    2. I’ve tested this approach out and, when combined with a brief pause after the k3s agent returns, it consistently removes the errant processes and allows for a proper restart of the rke2 service.

@HarrisonWAffel
Contributor

We talked about this over Slack; the better solution for this issue would be to enhance the k3s Executor interface so that custom executors (like the rke2 Windows pebinaryexecutor) can define how containerd should behave.

@manuelbuil
Contributor

@HarrisonWAffel
Contributor

@manuelbuil Yep, until this ticket is handled, that is the only way to get out of the crash loop (other than a full restart of the machine).

@fmoral2
Contributor

fmoral2 commented Feb 16, 2024

Validated on Version:

-$  rke2.exe version v1.29.2-rc1+rke2r1 (2bb7020162174863547a0b4773b74acf6fdab71c)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
SUSE Linux Enterprise Server 15 SP4

Cluster Configuration:
1 server node
1 agent
1 windows

Steps to validate the fix

  1. Install rke2 with a Windows agent node
  2. Validate that no rke2 services are running after Stop-Service rke2

Reproduction Issue:

 
 rke2 -v
rke2.exe version v1.26.13-rc2+rke2r1 (759709e78f0b5138a2d632aa5665d2b2c5dcdc10)
go version go1.20.13

PS C:\Users\Administrator> Stop-Service rke2


PS C:\Users\Administrator> Get-Process -Name containerd,kubelet,kube-proxy,rke2,calico-node

 

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    355      18    31696      39412       3.45   1188   0 calico-node
    187      15    34768      35884      10.72   7292   0 containerd
    364      23    51152      70420       9.05   6316   0 kubelet
    196      17    30848      36504       0.33   4948   0 kube-proxy
    484      37    51372      82324      18.20   3944   0 rke2

 

Validation Results:

       

   rke2 -v
rke2.exe version v1.29.2-rc1+rke2r1 (2bb7020162174863547a0b4773b74acf6fdab71c)
go version go1.21.7
  

PS C:\Users\Administrator> Stop-Service rke2



PS C:\Users\Administrator>  Get-Process -Name containerd,kubelet,kube-proxy,rke2,calico-node


Get-Process : Cannot find a process with the name "containerd". Verify the process name and call the cmdlet again.
At line:1 char:2
+  Get-Process -Name containerd,kubelet,kube-proxy,rke2,calico-node
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (containerd:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand

Get-Process : Cannot find a process with the name "kubelet". Verify the process name and call the cmdlet again.
At line:1 char:2
+  Get-Process -Name containerd,kubelet,kube-proxy,rke2,calico-node
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (kubelet:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand

Get-Process : Cannot find a process with the name "kube-proxy". Verify the process name and call the cmdlet again.
At line:1 char:2
+  Get-Process -Name containerd,kubelet,kube-proxy,rke2,calico-node
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (kube-proxy:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand

Get-Process : Cannot find a process with the name "rke2". Verify the process name and call the cmdlet again.
At line:1 char:2
+  Get-Process -Name containerd,kubelet,kube-proxy,rke2,calico-node
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (rke2:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand

Get-Process : Cannot find a process with the name "calico-node". Verify the process name and call the cmdlet again.
At line:1 char:2
+  Get-Process -Name containerd,kubelet,kube-proxy,rke2,calico-node
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (calico-node:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand
