kubelet fails to pull docker.io/rancher/pause:3.6 on Windows 11 Pro node #3915

Closed
hach-que opened this issue Feb 16, 2023 · 12 comments

@hach-que

hach-que commented Feb 16, 2023

NOTE: If you're coming to this issue and you just want something that works, see #3915 (comment).

Environmental Info:
RKE2 Version: v1.24.9+rke2r2 (installed via Rancher)

Node(s) CPU architecture, OS, and Version:

> kubectl get nodes
NAME        STATUS     ROLES                              AGE    VERSION
hawkeye     Ready      worker                             51m    v1.24.9
hulk        Ready      worker                             43m    v1.24.9
mobius      NotReady   worker                             42m    v1.24.9
ms-marvel   NotReady   worker                             43m    v1.24.9
sentry      Ready      control-plane,etcd,master,worker   102m   v1.24.9+rke2r2
> kubectl get nodes -o=custom-columns='NAME:metadata.name,OS IMAGE:status.nodeInfo.osImage,ARCH:status.nodeInfo.architecture,KERNEL VERSION:status.nodeInfo.kernelVersion'
NAME        OS IMAGE             ARCH    KERNEL VERSION
hawkeye     Windows 10 Pro       amd64   10.0.19044.2604
hulk        Windows 10 Pro       amd64   10.0.19044.2604
mobius      Windows 10 Pro       amd64   10.0.22000.1574
ms-marvel   Windows 10 Pro       amd64   10.0.22000.1574
sentry      Ubuntu 20.04.3 LTS   amd64   5.4.0-137-generic

(A few of these went NotReady during my attempts to work around this bug, but it was present even when all nodes were Ready.)

Cluster Configuration:
A single Linux master node and 4 Windows 10/11 Pro nodes. I had to slightly modify the feature checks in the install script to use Get-WindowsOptionalFeature instead of Get-WindowsFeature, but other than that everything seemed to install properly.

Describe the bug:
When scheduling pods onto the Windows nodes, the kubelet can't seem to pull the pause image, even though there's a windows/amd64 version of it:

Warning  FailedCreatePodSandBox  8s (x5 over 75s)       kubelet            Failed to create pod sandbox: rpc error: code = NotFound desc = failed to get sandbox image "index.docker.io/rancher/pause:3.6": failed to pull image "index.docker.io/rancher/pause:3.6": failed to pull and unpack image "docker.io/rancher/pause:3.6": no match for platform in manifest: not found

Steps To Reproduce:

  • Installed RKE2: via Rancher, using an Ubuntu VM as the first node
  • On a Windows machine that you want to add to the cluster:
    • Use this version of the install script instead of the one on GitHub: install-patched.txt. Rename the extension to .ps1 first; GitHub won't let me upload a file with a ps1 extension.
    • You'll also need to replace REPLACE_ME with the hostname of the Rancher server.
  • Run kubectl run win-test --image=mcr.microsoft.com/windows/server:ltsc2022

Expected behavior:
It should correctly pull the windows/amd64 version of the docker.io/rancher/pause:3.6 image. I can do docker pull docker.io/rancher/pause:3.6 when running against the Docker Engine for Windows on the same machine, so this is an RKE2/Kubernetes specific bug.

Actual behavior:
kubelet fails to pull the image.

I can successfully run both mcr.microsoft.com/windows/server:ltsc2022 and mcr.microsoft.com/windows/nanoserver:ltsc2022 through Docker on the machine, so this is not some kind of fundamental OS incompatibility. It just seems like kubelet is failing to pull the image properly. Unfortunately I couldn't find anything in the logs to indicate what type of platform kubelet is trying to pull for (i.e. is it incorrectly detecting it as Linux or something like that?). I also couldn't find a way to override the platform that kubelet tries to pull images for, so I can't force it to windows/amd64.

PS C:\WINDOWS\system32> docker run --isolation=process mcr.microsoft.com/windows/nanoserver:ltsc2022
Unable to find image 'mcr.microsoft.com/windows/nanoserver:ltsc2022' locally
ltsc2022: Pulling from windows/nanoserver
546fcac75d6a: Pull complete
Digest: sha256:786a24be2bd1945bee9701f95a71d8573ace8641c112dc27206f826bef0229c1
Status: Downloaded newer image for mcr.microsoft.com/windows/nanoserver:ltsc2022
Microsoft Windows [Version 10.0.22000.1547]
(c) Microsoft Corporation. All rights reserved.

C:\>
@hach-que
Author

hach-que commented Feb 16, 2023

Inspecting the manifest for docker.io/rancher/pause:3.6, it has the following Windows manifests:

      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 1157,
         "digest": "sha256:ed2fc02d3aafe19133e701596ca01200ec767593672721c4be9b14ae8cdb114d",
         "platform": {
            "architecture": "amd64",
            "os": "windows",
            "os.version": "10.0.17763.2114"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 1157,
         "digest": "sha256:c0b7a0c3e6b86dea4b8354fc1feb873ffc85e0bb814231903b912f144da51802",
         "platform": {
            "architecture": "amd64",
            "os": "windows",
            "os.version": "10.0.20348.169"
         }
      }

Checking how containerd evaluates platform compatibility, though, it looks like it matches os.version by string prefix. So even though these images are compatible with 10.0.22000 at runtime, containerd will never pick them, because neither listed os.version (10.0.17763.2114 or 10.0.20348.169) shares the host's 10.0.22000 prefix.
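To illustrate the prefix check described above, here is a minimal Python sketch. It is a simplified model, not containerd's actual code: the function names `build_prefix` and `matches_host` are hypothetical, and it assumes the host version is reduced to its major.minor.build prefix and that an entry matches only when its os.version starts with that prefix.

```python
# Simplified model of a Windows os.version prefix check (illustrative only;
# this is not containerd's implementation).

def build_prefix(os_version: str) -> str:
    """Reduce '10.0.22000.1574' to its 'major.minor.build' prefix '10.0.22000'."""
    return ".".join(os_version.split(".")[:3])


def matches_host(entry: dict, host_version: str) -> bool:
    """Return True if a manifest-list entry would pass the assumed prefix check."""
    platform = entry["platform"]
    if platform["os"] != "windows" or platform["architecture"] != "amd64":
        return False
    return platform.get("os.version", "").startswith(build_prefix(host_version))


# The two Windows entries from the rancher/pause:3.6 manifest list above.
PAUSE_ENTRIES = [
    {"platform": {"architecture": "amd64", "os": "windows", "os.version": "10.0.17763.2114"}},
    {"platform": {"architecture": "amd64", "os": "windows", "os.version": "10.0.20348.169"}},
]

# Kernel version reported by the Windows 11 nodes (mobius, ms-marvel).
HOST = "10.0.22000.1574"

matching = [e for e in PAUSE_ENTRIES if matches_host(e, HOST)]
print(matching)  # prints [] because neither entry starts with "10.0.22000"
```

Under the same assumption, an entry whose os.version is 10.0.22000 would pass the check, which is why a manifest that reports a different os.version is one possible way around the problem.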

I can see two paths to resolve this in the short term:

  • Is there a way to build my own patched containerd and use that with RKE2 on the Windows nodes instead of the default containerd build? I couldn't see a containerd executable on disk, so I'm guessing it might get compiled into one of the other RKE2 binaries that end up as part of a Windows install.
    • This would be my preferred path, since it means I can write a patch that I can then submit upstream to containerd to get this fixed properly.
  • Is there a way to override the pause container URL on a cluster? That way I could push my own manifest to Docker Hub that lies about the os.version so that it passes the string prefix check. I'd probably need to do the same for the Windows base image, but I can do that in my own private repository.
    • In my earlier testing I already tried adding the kubelet arg pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.6 under Cluster Configuration / Advanced in Rancher, and that didn't seem to have an effect.
    • I also tried setting RKE2_PAUSE_IMAGE to mcr.microsoft.com/oss/kubernetes/pause:3.6 under Cluster Configuration / Agent Environment Vars in Rancher and that also had no effect.

@hach-que
Author

Looks like containerd.exe ends up at C:\var\lib\rancher\rke2\data\v1.24.9-rke2r2-windows-amd64-bc939b774232\bin, but that very much looks like a "we'll manage it for you" location, so it'd be ideal if there was some way of overriding the path used for containerd (as well as information on what this containerd is built from - is it just the upstream containerd as-is?)

@brandond
Contributor

We don't technically support RKE2 on Windows 11 Pro, so we're not mirroring the image for that OS version.

If you want to change the pause image to pull directly from MCR, you should be able to set pause-image in the rke2 config.yaml on each node you want to change it on. If you're going to set the RKE2_PAUSE_IMAGE env var it needs to be set for the RKE2 process itself, not the cluster agent, and there's not an easy way to do that on Windows.

@hach-que
Author

hach-que commented Feb 16, 2023

There isn't an OS image for Windows 11 in any registry; the OS image that you are officially meant to use on Windows 11 is ltsc2022. It's just a case that containerd isn't applying the correct checks for this use case.

I couldn't figure out where the C:\var\lib\rancher\rke2\data\v1.24.9-rke2r2-windows-amd64-bc939b774232 folder is initialized from. It looks like a Docker tag, but rancher/system-agent-installer-rke2:v1.24.9-rke2r2 isn't set up for the OS either, so I don't know how the contents of that folder got populated.

I would really like to override the containerd binary that's being used so I can test a fix for this.

For now I'll test adding this entry to the machineSelectorConfig section of the cluster YAML in Rancher and see if that changes which pause image it uses. If it does, that at least gives me a viable workaround:

      - config:
          pause-image: mcr.microsoft.com/oss/kubernetes/pause:3.6
        machineLabelSelector:
          matchExpressions:
            - key: cattle.io/os
              operator: In
              values:
                - windows
          matchLabels:
            cattle.io/os: windows

@brandond
Contributor

brandond commented Feb 16, 2023

> I couldn't figure out where the C:\var\lib\rancher\rke2\data\v1.24.9-rke2r2-windows-amd64-bc939b774232 folder is initialized from.

This comes from the rancher/rke2-runtime image on Docker Hub, with a tag matching the running RKE2 version. You can override this with the --runtime-image flag.

> I would really like to override the containerd binary that's being used so I can test a fix for this.

You can't, short of replacing the whole runtime image. You can however install and start containerd on your own, and point RKE2 at its socket with the --container-runtime-endpoint flag.

> matchLabels:
>    cattle.io/os: windows

I would probably recommend using the standard kubernetes.io/os label, instead of the custom cattle-namespaced one.

@hach-que
Author

Ah, I see. Docker Hub doesn't show the Windows version in its UI for that image:

[screenshot]

but it does actually exist in the manifest:

[screenshot]

Presumably because it doesn't have an OS version filter on it. I'm guessing the image isn't a real image you can run, but instead just contains the files which are extracted onto the host for execution? If that's the case, I should still at least be able to use docker build to put my own containerd.exe file on top of the image and then use my own custom runtime image instead.

In this case I just need to know where containerd.exe is being built from - is it just the upstream containerd or is there a build script somewhere I can look at for how the rke2-runtime image is prepared?


Unfortunately, it looks like setting pause-image at the Rancher level doesn't work. None of these attempts result in pause-image propagating down to the nodes:

      - config:
          pause-image: mcr.microsoft.com/oss/kubernetes/pause:3.6
        machineLabelSelector:
          matchLabels:
            cattle.io/os: windows
      - config:
          pause-image: mcr.microsoft.com/oss/kubernetes/pause:3.6
        machineLabelSelector:
          matchLabels:
            kubernetes.io/os: windows

The file remains unchanged even after Rancher rolls out the changes:

[screenshot]

@hach-que
Author

Even this doesn't work:

    machineSelectorConfig:
      - config:
          protect-kernel-defaults: false
      - config:
          pause-image: mcr.microsoft.com/oss/kubernetes/pause:3.6

so maybe Rancher just can't apply the pause-image setting?

@brandond
Contributor

> In this case I just need to know where containerd.exe is being built from

We build it at https://github.com/rancher/image-build-containerd and push to rancher/hardened-containerd on Docker Hub. The binaries are then copied from that image into the rke2-runtime image in the RKE2 Dockerfile: https://github.com/rancher/rke2/blob/master/Dockerfile.windows#L100

> Unfortunately, it looks like setting pause-image at the Rancher level doesn't work.

I'm not sure I can help with that. I would probably just create c:/etc/rancher/rke2/config.yaml on the node itself, and put pause-image: mcr.microsoft.com/oss/kubernetes/pause:3.6 in there.

@hach-que
Author

hach-que commented Feb 16, 2023

> I'm not sure I can help with that. I would probably just create c:/etc/rancher/rke2/config.yaml on the node itself, and put pause-image: mcr.microsoft.com/oss/kubernetes/pause:3.6 in there.

Yeah, I think I might just have to do this. Rancher doesn't seem capable of setting that on nodes, which is a little disappointing. In any case, I can always patch the install script to write out the file manually and then apply it to all nodes once I have a working configuration.

@hach-que
Author

Huzzah, setting it manually into C:\etc\rancher\rke2\config.yaml.d\60-pause-image.yaml as part of the install script and letting RKE2 restart on the node works:

[screenshot]

This at least gives me a path forward to make my own manifest that targets the right OS version, to see if that at least gets things over the line in the short term.

@hach-que
Author

hach-que commented Feb 16, 2023

OK, so here's a guide for anyone else who runs into this situation, covering how to work around it for now:

Build some images with fixed OS versions (or use mine)

You only need to do this step if you don't want to use the images I provide. The images I provide are currently:

  • registry.redpoint.games/redpointgames/containers-for-windows-11/server:ltsc2022 (base image for Windows apps)
  • registry.redpoint.games/redpointgames/containers-for-windows-11/nanoserver:ltsc2022 (base image for nanoserver)
  • registry.redpoint.games/redpointgames/containers-for-windows-11/rke2-pause:3.6 (pause container)

These are identical to the originals; they just have a fixed up OS version in the metadata so they'll be pulled on Windows 11 hosts.

If you'd prefer to make your own images and push them to a registry, then create the following files:

ltsc2022.Dockerfile:

FROM mcr.microsoft.com/windows/server:ltsc2022

ltsc2022-nano.Dockerfile:

FROM mcr.microsoft.com/windows/nanoserver:ltsc2022

rke2-pause.Dockerfile:

FROM rancher/pause:3.6@sha256:c0b7a0c3e6b86dea4b8354fc1feb873ffc85e0bb814231903b912f144da51802

Then, on a Windows 11 machine with the Docker Engine for Windows installed, run something similar to the following, replacing hachque/rke2-ltsc2022-win11 and hachque/rke2-pause-win11 with your own image paths.

docker build . -f ltsc2022.Dockerfile --tag hachque/rke2-ltsc2022-win11:ltsc2022-10.0.22000
docker push hachque/rke2-ltsc2022-win11:ltsc2022-10.0.22000
docker manifest create --amend hachque/rke2-ltsc2022-win11:ltsc2022 hachque/rke2-ltsc2022-win11:ltsc2022-10.0.22000
docker manifest annotate hachque/rke2-ltsc2022-win11:ltsc2022 hachque/rke2-ltsc2022-win11:ltsc2022-10.0.22000 --os-version 10.0.22000
docker manifest push hachque/rke2-ltsc2022-win11:ltsc2022

docker build . -f ltsc2022-nano.Dockerfile --tag hachque/rke2-ltsc2022-win11:ltsc2022-nano-10.0.22000
docker push hachque/rke2-ltsc2022-win11:ltsc2022-nano-10.0.22000
docker manifest create --amend hachque/rke2-ltsc2022-win11:ltsc2022-nano hachque/rke2-ltsc2022-win11:ltsc2022-nano-10.0.22000
docker manifest annotate hachque/rke2-ltsc2022-win11:ltsc2022-nano hachque/rke2-ltsc2022-win11:ltsc2022-nano-10.0.22000 --os-version 10.0.22000
docker manifest push hachque/rke2-ltsc2022-win11:ltsc2022-nano

docker build . -f rke2-pause.Dockerfile --tag hachque/rke2-pause-win11:3.6-10.0.22000
docker push hachque/rke2-pause-win11:3.6-10.0.22000
docker manifest create --amend hachque/rke2-pause-win11:3.6 hachque/rke2-pause-win11:3.6-10.0.22000
docker manifest annotate hachque/rke2-pause-win11:3.6 hachque/rke2-pause-win11:3.6-10.0.22000 --os-version 10.0.22000
docker manifest push hachque/rke2-pause-win11:3.6
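
The three command groups above follow the same pattern, so they could be generated from a small table. The following is a sketch in Python, assuming the same hachque/... image names used above (replace them with your own). The helper name `manifest_commands` is hypothetical, and the script only prints the commands; it does not run docker.

```python
# Generate the build/push/manifest commands shown above for each image.
# Sketch only: it prints the commands and does not execute docker.

OS_VERSION = "10.0.22000"  # build prefix of the Windows 11 hosts in this issue

# (Dockerfile, repository, tag) triples taken from the commands above.
IMAGES = [
    ("ltsc2022.Dockerfile", "hachque/rke2-ltsc2022-win11", "ltsc2022"),
    ("ltsc2022-nano.Dockerfile", "hachque/rke2-ltsc2022-win11", "ltsc2022-nano"),
    ("rke2-pause.Dockerfile", "hachque/rke2-pause-win11", "3.6"),
]


def manifest_commands(dockerfile, repo, tag, os_version=OS_VERSION):
    """Return the five docker commands for one image, in execution order."""
    manifest = f"{repo}:{tag}"                # manifest list that nodes will pull
    versioned = f"{repo}:{tag}-{os_version}"  # concrete image behind the manifest
    return [
        f"docker build . -f {dockerfile} --tag {versioned}",
        f"docker push {versioned}",
        f"docker manifest create --amend {manifest} {versioned}",
        f"docker manifest annotate {manifest} {versioned} --os-version {os_version}",
        f"docker manifest push {manifest}",
    ]


for image in IMAGES:
    print("\n".join(manifest_commands(*image)))
    print()
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be reviewed and pasted into a shell on the Windows 11 build machine.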

Patch the install script to use Get-WindowsOptionalFeature and to set up the pause-image override

Download the patched script here: install-patched.txt

You need to replace REPLACE_ME with the hostname of the Rancher server that will be managing the cluster.

You then also need to add the following code after the call to Invoke-WinsInstaller. Change the pause-image name if you're using your own pause image.

if (!(Test-Path "C:\etc\rancher\rke2\config.yaml.d")) {
    New-Item -ItemType Directory -Path "C:\etc\rancher\rke2\config.yaml.d"
}
Set-Content -Path "C:\etc\rancher\rke2\config.yaml.d\60-pause-image.yaml" -Value @"
pause-image: registry.redpoint.games/redpointgames/containers-for-windows-11/rke2-pause:3.6
"@

Then run the install script on all your nodes. It'll take several minutes for them to update and restart, so don't get too eager on running the test below.

Test that Windows pods fetch images and work properly

After the nodes have restarted, you can create a Windows pod with:

kubectl run win-test --image=registry.redpoint.games/redpointgames/containers-for-windows-11/nanoserver:ltsc2022

If it is pulling correctly, you should see events like this when you run kubectl describe pod win-test:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  61s   default-scheduler  Successfully assigned default/win-test-echo to mobius
  Normal  Pulling    60s   kubelet            Pulling image "registry.redpoint.games/redpointgames/containers-for-windows-11/nanoserver:ltsc2022"

If you're using the full image, it will take a while to pull (with no progress) because it's several GBs. I'd recommend testing with the nano image to start with. Once the image pulls and the pod starts, running kubectl logs win-test should now give you the output from a cmd prompt, which indicates everything is working:

> kubectl logs win-test
Microsoft Windows [Version 10.0.22000.1547]
(c) Microsoft Corporation. All rights reserved.

C:\>

@hach-que
Author

I'm going to close this issue out, as I ended up writing my own Kubernetes manager and doing a pull request against containerd to support 2022 containers on Windows 11 properly.
