imagePull from private repo is slow second time #217

Closed
infa-ddeore opened this issue Oct 2, 2023 · 34 comments

@infa-ddeore

infa-ddeore commented Oct 2, 2023

An image pull run a second time from another node in the same cluster should be fast, but it took the same time as downloading from the remote repo, so it looks like caching didn't work.

Here are the logs from the Spegel pod for the xxx.yyy.io/app-image-k8:dev_123 image (repo and image details masked).

Any pointers on what the issue could be?

{"level":"error","ts":1696247240.4278097,"caller":"gin@v0.0.9/logger.go:62","msg":"","path":"/v2/app-image-k8/manifests/dev_123","status":404,"method":"HEAD","latency":5.000973923,"ip":"10.14.130.153","handler":"mirror","error":"could not resolve mirror for key: xxx.yyy.io/app-image-k8:dev_123","stacktrace":"github.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:62\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}

{"level":"error","ts":1696247731.0288842,"caller":"registry/registry.go:211","msg":"mirror failed attempting next","error":"expected mirror to respond with 200 OK but received: 500 Internal Server Error","stacktrace":"github.com/xenitab/spegel/internal/registry.(*Registry).handleMirror.func2\n\t/build/internal/registry/registry.go:211\nnet/http/httputil.(*ReverseProxy).modifyResponse\n\t/usr/local/go/src/net/http/httputil/reverseproxy.go:324\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\t/usr/local/go/src/net/http/httputil/reverseproxy.go:490\ngithub.com/xenitab/spegel/internal/registry.(*Registry).handleMirror\n\t/build/internal/registry/registry.go:217\ngithub.com/xenitab/spegel/internal/registry.(*Registry).registryHandler\n\t/build/internal/registry/registry.go:137\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/xenitab/spegel/internal/registry.(*Registry).metricsHandler\n\t/build/internal/registry/registry.go:271\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.CustomRecoveryWithWriter.func1\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/recovery.go:102\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/slok/go-http-metrics/middleware/gin.Handler.func1.1\n\t/go/pkg/mod/github.com/slok/go-http-metrics@v0.10.0/middleware/gin/gin.go:17\ngithub.com/slok/go-http-metrics/middleware.Middleware.Measure\n\t/go/pkg/mod/github.com/slok/go-http-metrics@v0.10.0/middleware/middleware.go:117\ngithub.com/slok/go-http-metrics/middleware/gin.Handler.func1\n\t/go/pkg/mod/github.com/slok/go-http-metrics@v0.10.0/middleware/gin/gin.go:16\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:28\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}

@infa-ddeore
Author

I see that caching isn't working at all: the same image consistently took ~1m59s from different nodes, which means the image was downloaded from the remote registry and not from the local cache.

@phillebaba
Member

phillebaba commented Oct 3, 2023

Could you share your values.yaml used to deploy Spegel?

It would also be good if you could share the logs from the Spegel instance that returns the 500. There may be some issue caused by the registry name that has not been observed before. A 500 means that something goes wrong when the instance attempts to serve the cached layer.

@phillebaba
Member

@infa-ddeore considering that the failing request is a HEAD and you are receiving a 500, my best guess is that something goes wrong when getting the size.
https://github.com/XenitAB/spegel/blob/8cf145f7cd0fb02b2a6be3ab13c255aae930745c/internal/registry/registry.go#L251-L256

To help you further I need the error message from the other Spegel instance. I have seen that OCI artifacts can differ a bit depending on how they are built; it may be, for example, that a specific field is missing from the artifact.

@infa-ddeore
Author

Here is the values.yaml, with private repo locations masked:

serviceMonitor:
  enabled: true

spegel:
  registries:
    - https://602401143452.dkr.ecr-fips.us-east-1.amazonaws.com
    - https://602401143452.dkr.ecr.us-west-2.amazonaws.com
    - https://containers.dl.xxx.io
    - https://xxx.artifacts.xxx.yyy
    - https://xxx.artifacts.aaa.yyy
    - https://docker.io
    - https://zzz.xxx.io
    - https://zzzz.xxx.io
    - https://qqq.xxx.io
    - https://qqqqq.xxx.io
    - https://qqqqqqq.xxx.io
    - https://qqq.xxx.io
    - https://qqqqqqq.xxx.io
    - https://k8s.gcr.io
    - https://public.ecr.aws
    - https://quay.io
    - https://registry.k8s.io
    - https://qqqqq.xxx.io

I downloaded the httpd image from Docker Hub with and without Spegel; Spegel actually adds more delay to the image download compared to pulling directly.

Attaching first.log and second.log from the pods where the image was pulled:

  1. The first image pull took 1m3.800998679s
  2. The second image pull, from cache, took 22.28909299s
  3. An image pull directly without Spegel took only 7.022743234s

So Spegel is adding delay to image downloads instead of helping. All the 404s take ~5 seconds, which looks like some timeout.

first.log
second.log
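
For reference, a rough way to reproduce such a timing comparison on a node (a sketch, not the exact commands used above; crictl pulls through the CRI, so the pull goes via Spegel's mirror configuration when it is in place):

    # Time a pull, drop the local copy, then time the re-pull.
    time crictl pull docker.io/library/httpd:latest
    crictl rmi docker.io/library/httpd:latest
    time crictl pull docker.io/library/httpd:latest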

@infa-ddeore
Author

Any pointers on the issue?

@phillebaba
Member

phillebaba commented Oct 6, 2023

It is obvious now that I did not think through the default value for the lookup timeout, especially for images with lots of layers. I should have set a much lower default, like the one I am currently using myself. This is probably a good time for me to get going on performance tests to find an optimal timeout value. For now I will just change the default lookup timeout to something a lot lower, similar to what I am using.

@infa-ddeore
Author

OK, is that what causes the slowness in image downloads? The non-Spegel cluster took 7 seconds for an image where the Spegel cluster took 1m3s without cache and 22s with cache. Even with the cache the download was slower; is that due to the timeout?

@tico24

tico24 commented Oct 9, 2023

If it helps, I am experiencing the same issue.

EKS 1.28. A mix of arm and amd64 nodes.

I thought the issue might be related to the mix of CPU architectures. I used a nodeselector to only use arm nodes, but got the same issue.

It's as if the spegel pods are not able to communicate with one another, but I can't work out what (if any) firewall/security group rules I may need to add.

I tried downgrading to spegel 0.0.11, but this didn't change things.

edit: to clarify, I already expose nodeports:

    egress_nodes_ephemeral_ports_tcp = {
      description                = "To node 1025-65535"
      protocol                   = "tcp"
      from_port                  = 1025
      to_port                    = 65535
      type                       = "egress"
      source_node_security_group = true
    }

@infa-ddeore
Author

I also have all ports open between worker nodes, so that doesn't look like the issue in my case.

I tried the same in AKS and saw no issue there; I could see the image being served from within the cluster, with no 5xx errors and no 5s timeouts.

So it looks like the issue is specific to EKS.

@tico24

tico24 commented Oct 10, 2023

I downgraded AWS CNI from 0.15.x to 0.14.x and 0.13.x in the hope that the newer feature whereby it obeys NetworkPolicies might be causing the issue.

Unfortunately it is still not working with AWS CNI 0.13.

spegel-78c2f registry {"level":"info","ts":1696918978.0747795,"caller":"gin@v0.0.9/logger.go:53","msg":"","path":"/v2/library/ubuntu/manifests/latest","status":200,"method":"HEAD","latency":0.003605546,"ip":"10.207.61.216","handler":"mirror"}
spegel-2kz5z registry {"level":"info","ts":1696918978.074557,"caller":"gin@v0.0.9/logger.go:53","msg":"","path":"/v2/library/ubuntu/manifests/latest","status":200,"method":"HEAD","latency":0.001763525,"ip":"10.207.61.216","handler":"manifest"}
spegel-78c2f registry {"level":"info","ts":1696918978.0826175,"caller":"gin@v0.0.9/logger.go:53","msg":"","path":"/v2/library/ubuntu/manifests/sha256:9b8dec3bf938bc80fbe758d856e96fdfab5f56c39d44b0cff351e847bb1b01ea","status":200,"method":"GET","latency":0.002599054,"ip":"10.207.61.216","handler":"mirror"}
spegel-2kz5z registry {"level":"info","ts":1696918978.0823524,"caller":"gin@v0.0.9/logger.go:53","msg":"","path":"/v2/library/ubuntu/manifests/sha256:9b8dec3bf938bc80fbe758d856e96fdfab5f56c39d44b0cff351e847bb1b01ea","status":200,"method":"GET","latency":0.0013331,"ip":"10.207.61.216","handler":"manifest"}
spegel-78c2f registry {"level":"error","ts":1696918983.0943055,"caller":"gin@v0.0.9/logger.go:62","msg":"","path":"/v2/library/ubuntu/manifests/sha256:f58b48967ecc767fc238144ffdb7eb668cefcc8438de8f8a59c4cefbbf29b323","status":404,"method":"GET","latency":5.000980193,"ip":"10.207.61.216","handler":"mirror","error":"could not resolve mirror for key: sha256:f58b48967ecc767fc238144ffdb7eb668cefcc8438de8f8a59c4cefbbf29b323","stacktrace":"github.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:62\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
spegel-2kz5z registry {"level":"info","ts":1696918983.0951376,"caller":"registry/registry.go:174","msg":"handling mirror request from external node","path":"/v2/library/ubuntu/manifests/sha256:f58b48967ecc767fc238144ffdb7eb668cefcc8438de8f8a59c4cefbbf29b323","ip":"10.207.61.216"}
spegel-2kz5z registry {"level":"error","ts":1696918988.095955,"caller":"gin@v0.0.9/logger.go:62","msg":"","path":"/v2/library/ubuntu/manifests/sha256:f58b48967ecc767fc238144ffdb7eb668cefcc8438de8f8a59c4cefbbf29b323","status":404,"method":"GET","latency":5.000852283,"ip":"10.207.61.216","handler":"mirror","error":"could not resolve mirror for key: sha256:f58b48967ecc767fc238144ffdb7eb668cefcc8438de8f8a59c4cefbbf29b323","stacktrace":"github.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:62\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
spegel-2kz5z registry {"level":"info","ts":1696918993.2968013,"caller":"registry/registry.go:174","msg":"handling mirror request from external node","path":"/v2/library/ubuntu/blobs/sha256:bf9e5b7213b4e0d99cddc039011cc60bfd76ed5ef63bbd837ab0b8416c305c39","ip":"10.207.61.216"}
spegel-78c2f registry {"level":"error","ts":1696918993.2962723,"caller":"gin@v0.0.9/logger.go:62","msg":"","path":"/v2/library/ubuntu/blobs/sha256:bf9e5b7213b4e0d99cddc039011cc60bfd76ed5ef63bbd837ab0b8416c305c39","status":404,"method":"GET","latency":5.00064066,"ip":"10.207.61.216","handler":"mirror","error":"could not resolve mirror for key: sha256:bf9e5b7213b4e0d99cddc039011cc60bfd76ed5ef63bbd837ab0b8416c305c39","stacktrace":"github.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:62\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
spegel-2kz5z registry {"level":"error","ts":1696918998.2970965,"caller":"gin@v0.0.9/logger.go:62","msg":"","path":"/v2/library/ubuntu/blobs/sha256:bf9e5b7213b4e0d99cddc039011cc60bfd76ed5ef63bbd837ab0b8416c305c39","status":404,"method":"GET","latency":5.000317961,"ip":"10.207.61.216","handler":"mirror","error":"could not resolve mirror for key: sha256:bf9e5b7213b4e0d99cddc039011cc60bfd76ed5ef63bbd837ab0b8416c305c39","stacktrace":"github.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:62\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
spegel-2kz5z registry {"level":"info","ts":1696919003.6734476,"caller":"registry/registry.go:174","msg":"handling mirror request from external node","path":"/v2/library/ubuntu/blobs/sha256:c391327a0f1df5790bcb00ac010a1a3924c9e4387ce36b290bc16fd9f4cc5740","ip":"10.207.61.216"}
spegel-78c2f registry {"level":"error","ts":1696919003.6729622,"caller":"gin@v0.0.9/logger.go:62","msg":"","path":"/v2/library/ubuntu/blobs/sha256:c391327a0f1df5790bcb00ac010a1a3924c9e4387ce36b290bc16fd9f4cc5740","status":404,"method":"GET","latency":5.001155381,"ip":"10.207.61.216","handler":"mirror","error":"could not resolve mirror for key: sha256:c391327a0f1df5790bcb00ac010a1a3924c9e4387ce36b290bc16fd9f4cc5740","stacktrace":"github.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:62\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
spegel-2kz5z registry {"level":"error","ts":1696919008.6743145,"caller":"gin@v0.0.9/logger.go:62","msg":"","path":"/v2/library/ubuntu/blobs/sha256:c391327a0f1df5790bcb00ac010a1a3924c9e4387ce36b290bc16fd9f4cc5740","status":404,"method":"GET","latency":5.000890982,"ip":"10.207.61.216","handler":"mirror","error":"could not resolve mirror for key: sha256:c391327a0f1df5790bcb00ac010a1a3924c9e4387ce36b290bc16fd9f4cc5740","stacktrace":"github.com/xenitab/pkg/gin.Logger.func1\n\t/go/pkg/mod/github.com/xenitab/pkg/gin@v0.0.9/logger.go:62\ngithub.com/gin-gonic/gin.(*Context).Next\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174\ngithub.com/gin-gonic/gin.(*Engine).handleHTTPRequest\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620\ngithub.com/gin-gonic/gin.(*Engine).ServeHTTP\n\t/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
spegel-78c2f registry {"level":"info","ts":1696919009.5786955,"caller":"state/state.go:49","msg":"received image event","image":"docker.io/library/ubuntu:latest@sha256:9b8dec3bf938bc80fbe758d856e96fdfab5f56c39d44b0cff351e847bb1b01ea"}
spegel-78c2f registry {"level":"info","ts":1696919009.5839336,"caller":"state/state.go:49","msg":"received image event","image":"docker.io/library/ubuntu:latest@sha256:9b8dec3bf938bc80fbe758d856e96fdfab5f56c39d44b0cff351e847bb1b01ea"}
spegel-78c2f registry {"level":"info","ts":1696919009.5862403,"caller":"state/state.go:49","msg":"received image event","image":"docker.io/library/ubuntu"}

@phillebaba
Member

I have found a regression causing this issue and am working on a fix. In the meantime the best solution is to reduce the mirrorResolveTimeout to a low value like "50ms".
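
A minimal sketch of that override via Helm (assuming the chart in use exposes the spegel.mirrorResolveTimeout value, and with the chart location as published by XenitAB at the time):

    helm upgrade --install spegel oci://ghcr.io/xenitab/helm-charts/spegel \
      --namespace spegel \
      --set spegel.mirrorResolveTimeout=50ms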

@tico24

tico24 commented Oct 10, 2023

Just to clarify, this speeds things up simply by timing out faster. But Spegel still isn't actually mirroring, right?

@phillebaba
Member

phillebaba commented Oct 10, 2023

Well, the issue that you are seeing is that layers which do not exist within the cluster require the timeout to occur before Containerd is allowed to pull from the original registry. This was introduced partly during some refactoring of the mirror component of the registry. I have a fix which exits a lot faster when the layer cannot be found. Will try to get a new version out soon.

Layers that do exist in the cluster will not cause these types of issues, so you should only see this during the first pull. There is also the situation where a layer has not been advertised yet, which causes the same issue.

@tico24

tico24 commented Oct 10, 2023

The layers definitely exist in the cluster. The logs are from the second pull (different node, same cluster).

@phillebaba
Member

I may have been looking at the wrong problem. The thing is that I am not seeing any major differences between my EKS and AKS deployments, but I am running Calico in both, so it may be caused by the AWS CNI.

The issue that you are seeing may have two parts. One is that Spegel can't find any peers due to network access issues. The other is that requests wait out the max timeout when no peers are available, when in reality they should fail immediately because there are no peers.

The PR I created is meant to fix the latter; I will see if I can find an easy way to debug peer discovery issues.

@tico24

tico24 commented Oct 10, 2023

In case it's useful, here are the exact steps I perform to test this (collected as shell commands after the list).

  1. Install spegel (via helm chart) and wait for everything to settle.
  2. Run kubectl run k8s-test --rm -i --tty --image ubuntu /bin/bash. This pulls the ubuntu image on a node and starts a pod.
  3. When loaded, I cordon the node that the pod has started on.
  4. ctrl+d in the command line to kill the pod and wait for it to properly die.
  5. Run kubectl run k8s-test --rm -i --tty --image ubuntu /bin/bash and observe the spegel logs.
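
The same steps as shell commands (the node name and the namespace/label selector are illustrative placeholders, not confirmed chart defaults):

    # 1) First pull: schedules on some node A and pulls ubuntu there.
    kubectl run k8s-test --rm -i --tty --image ubuntu /bin/bash
    # 2) Keep the next pod off node A so the image must be mirrored.
    kubectl cordon <node-A>
    # 3) Exit the shell (ctrl+d), wait for the pod to terminate, then pull again:
    kubectl run k8s-test --rm -i --tty --image ubuntu /bin/bash
    # 4) Watch the Spegel logs during the second pull.
    kubectl logs -n spegel -l app.kubernetes.io/name=spegel -f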

I also just tried the above with the following netpols, but no change. Still not mirroring.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
spec:
  podSelector: {}
  ingress:
  - {}
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-egress
spec:
  podSelector: {}
  egress:
  - {}
  policyTypes:
  - Egress

@infa-nang

infa-nang commented Oct 11, 2023

@phillebaba I'm also facing the same issue in EKS (v1.27) with AWS CNI.

But what I found was that Spegel successfully mirrors all non-blob layers; when the mirror request comes in for a blob layer, Spegel returns a 500 stating content digest sha256 not found <sha256_value>, and this happens because the containerd library Spegel is using returns 404 for all blob layers.

I SSH'd into the EKS node and used the ctr CLI; even with ctr I'm facing the same issue. I am able to get index/manifest/config layers using ./ctr content get sha256:<sha256_value_of_non_blob_layer>, but when I try to get a blob layer by digest, ctr returns not found.
I've used "quay.io/prometheus/alertmanager:v0.24.0" as a reference, but the behavior is the same with all registries.

  • Verifying the "alertmanager" image is already present on the node
    debug1

  • regctl CLI output listing the layers present for the "linux/amd" arch
    debug2

  • I was able to get the content of the image's "manifest" type digest
    debug3

  • When I try to retrieve a blob layer, ctr returns not found, which is the same behaviour we get from containerd's Golang package
    debug4

  • But if I use code with a function similar to this https://github.com/XenitAB/spegel/blob/4a190529ade0eabbfcdc107c8173ef39b6c2f3b8/internal/oci/containerd.go#L150, I am able to get the list of all layers, but still not able to retrieve the blob layers' data
    debug5

If I use ctr to pull the image, like ctr image pull quay.io/quay/busybox:latest, then everything works fine and I am able to retrieve the content of blob layers. But for an image that is already present on the machine (pulled via the kubelet), I can't get the blob data.

If I re-pull the same alertmanager image via ctr, it freshly pulls all the blob layers but reports all the non-blob layers as "exists":
(screenshot attached)
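
The checks above, as run on the node (a sketch; the digests are placeholders, and k8s.io is the containerd namespace the kubelet uses):

    # The image is known to containerd...
    ctr -n k8s.io images ls | grep alertmanager
    # ...and its manifest/config content can be fetched by digest...
    ctr -n k8s.io content get sha256:<manifest_or_config_digest>
    # ...but a layer blob comes back "not found" because the compressed
    # blob was discarded from the content store after unpacking.
    ctr -n k8s.io content get sha256:<layer_blob_digest>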

@infa-nang

discard_unpacked_layers = true in the containerd config is the issue.
It discards the layer blobs after unpacking, so they have to be re-pulled every time, which makes Spegel slow.

https://github.com/awslabs/amazon-eks-ami/blob/915ce2222e692a4d7b5904ab020d3c28e1e3e511/files/containerd-config.toml#L10
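
On the stock EKS AMI the setting sits under the CRI plugin section of /etc/containerd/config.toml; a quick way to confirm on a node:

    grep -n 'discard_unpacked_layers' /etc/containerd/config.toml
    # expected on stock EKS AMIs:
    #   [plugins."io.containerd.grpc.v1.cri".containerd]
    #     discard_unpacked_layers = true   # must be false for Spegel to serve blobs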

@tico24

tico24 commented Oct 16, 2023

Just confirming that my world is much happier after tweaking the containerd config. If it's useful, here's how I did it using the EKS Terraform module:

      pre_bootstrap_user_data = <<-EOT
        #!/bin/bash
        cat /etc/containerd/config.toml | sed -E 's/discard_unpacked_layers = true/discard_unpacked_layers = false/g' > /etc/containerd/config-new.toml

        set -ex
        cat <<-EOF > /etc/profile.d/bootstrap.sh
        export CONTAINERD_CONFIG_FILE=/etc/containerd/config-new.toml
        EOF
        # Source extra environment variables in bootstrap script
        sed -i '/^set -o errexit/a\\nsource /etc/profile.d/bootstrap.sh' /etc/eks/bootstrap.sh
      EOT

@danielloader

danielloader commented Oct 17, 2023

For Karpenter:

Details here.

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  amiFamily: AL2
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  userData: |
    #!/bin/bash
    set -ex
    cat /etc/containerd/config.toml | sed -E 's/discard_unpacked_layers = true/discard_unpacked_layers = false/g' > /etc/containerd/config-new.toml

    cat <<-EOF > /etc/profile.d/bootstrap.sh
    export CONTAINERD_CONFIG_FILE=/etc/containerd/config-new.toml
    EOF
    # Source extra environment variables in bootstrap script
    sed -i '/^set -o errexit/a source /etc/profile.d/bootstrap.sh' /etc/eks/bootstrap.sh

Note: This doesn't work any more.

@phillebaba
Member

phillebaba commented Oct 17, 2023

@infa-nang thanks for the great detective work determining the root cause of this issue!

The kind cluster used for the e2e tests sets the same configuration.
https://github.com/XenitAB/spegel/blob/4a190529ade0eabbfcdc107c8173ef39b6c2f3b8/e2e/kind-config.yaml#L7-L11

There are now four main tasks that need to be done.

  • Document the containerd configuration in the compatibility guide to help others.
  • Document configuration required to override setting in EKS.
  • Implement a check and maybe log a warning if this is detected, alternatively exit.
  • Create a performance test to document expected performance from Spegel.

@phillebaba
Member

It would be great if we could compile some examples for how to override the new default setting. That way end users do not have to go searching for a new solution every time.

@danielloader could you share how you solved this for Karpenter?

@tico24 in your example you create a new config file. Is this actually required? I think that it would be nice to have an example that overrides the existing configuration file.

@tico24

tico24 commented Oct 18, 2023

@tico24 in your example you create a new config file. Is this actually required? I think that it would be nice to have an example that overrides the existing configuration file.

Yup, it's required. The old config file gets re-written to the default after this script runs. Blame Jeff Bezos for that one.

@phillebaba
Member

Great, is it ok if I add it to the docs as a suggested solution?

@tico24

tico24 commented Oct 18, 2023

Of course. At the moment it's quite specific to that terraform module. It can probably be adjusted to work for any Managed Node Group.

Ultimately, I believe you need the following in your managed nodegroup launch template config:

Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Transfer-Encoding: 7bit
Content-Type: text/x-shellscript
Mime-Version: 1.0

#!/bin/bash
cat /etc/containerd/config.toml | sed -E 's/discard_unpacked_layers = true/discard_unpacked_layers = false/g' > /etc/containerd/config-new.toml

set -ex
cat <<-EOF > /etc/profile.d/bootstrap.sh
export CONTAINERD_CONFIG_FILE=/etc/containerd/config-new.toml
EOF
# Source extra environment variables in bootstrap script
sed -i '/^set -o errexit/a\\nsource /etc/profile.d/bootstrap.sh' /etc/eks/bootstrap.sh
--//--

This goes under Advanced details > User data.
(screenshot attached)

@tico24

tico24 commented Oct 18, 2023

The alternative would be to build custom AMIs... but I started to go down that route and decided I'd rather not :)

@danielloader

danielloader commented Oct 18, 2023

@phillebaba Updated above.
@tico24 I had to omit the newline character from the sed inline amend to get it to work. I'm not sure if Karpenter was mangling it when launching the instance, but I confirmed via an SSM shell on the instance that the field had a prefixed n character in the bootstrap.sh file.

Additionally, I can confirm Spegel's logs now show many more 200s and far fewer 500s/404s, so that's encouraging. Container start speed is anecdotally better. I'm also using NVMe-backed instances with the containerd layer cache on the local NVMe disk for even more performance.

Another caveat: I've only deployed AL2-backed AMIs, so this is going to need to be tested against other types...

@danielloader

Additionally for historical context for people coming to this issue in the future: awslabs/amazon-eks-ami#1359

@phillebaba
Member

I am closing this as v0.0.14 contains #237, which verifies the correct configuration. Please open a new issue for any other performance related issues.

@infa-ddeore
Author

Thanks @phillebaba. So it just checks whether discard_unpacked_layers is set to true and gives an error?

@phillebaba
Member

Yes, the check is similar to the one for the config path being set. Checking containerd configuration, especially in managed Kubernetes, is cumbersome. It also requires the end user to know that the configuration needs to be checked. My guess is that 99% of users want Spegel to exit if the configuration is wrong.

@danielloader

Update for the thread: I've had the kubelet fail to start with the above patch on the latest 1.29 AMI, amazon-eks-node-1.29-v20240129.

Not sure why yet; just thought I'd raise it, as commenting out the userData patch got the nodes up and running.

@danielloader

danielloader commented Feb 5, 2024

Redacted log attached:
spegel-userdata.log

@tico24

tico24 commented Feb 5, 2024

This probably warrants a new issue (seeing as this one is closed).

I also have the same issue. I can run the commands manually on the node and everything seems fine. This is almost certainly an upstream AWS issue, but good to raise for awareness.
