Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting high volumes of 404 and 500 errors. Any insight? #469

Open
tyu0912 opened this issue May 8, 2024 · 16 comments
Open

Getting high volumes of 404 and 500 errors. Any insight? #469

tyu0912 opened this issue May 8, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@tyu0912
Copy link

tyu0912 commented May 8, 2024

Spegel version

v0.0.22

Kubernetes distribution

EKS

Kubernetes version

1.29.0

CNI

AWS VPC CNI

Describe the bug

UPDATE 1: After some further investigation, it seems that adding my team's internal ECR registry caused a rise in errors and performance to degrade a bit (below). Seems like things aren't going as they should be... wondering what's happening and how to fix...🤔


Hello, apologies if this is very simple. I'm new to this tool. I just deployed Spegel after modifying the containerd settings in my EKS cluster according to the compatibility guide here: https://github.com/spegel-org/spegel/blob/main/docs/COMPATIBILITY.md.

The pods are running so I think on some levels it is working according to https://github.com/spegel-org/spegel/blob/main/docs/FAQ.md but I noticed that when a new node comes up I am getting a lot of 404/500 errors which leads to believe the mirroring isn't working as expected. I'm not familiar with what these errors indicate. Is this normal?

I did find #217 but not sure if it's related. Using containerd config dump, confirmed the settings should be ok.

sh-4.2$ containerd config dump | grep config_path -B 5

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
sh-4.2$ containerd config dump | grep discard_unpacked_layers -B 5
      setup_serially = false

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false

Any insight is much appreciated! Thank you. 😄

Error 1

{"time":"2024-05-08T20:28:40.362320827Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":135},"msg":"","err":"content digest sha256:1e3d9b7d145208fa8fa3ee1c9612d0adaac7255f1bbc9ddea7e461e0b317805c: not found","path":"/v2/jimmidyson/configmap-reload/blobs/sha256:1e3d9b7d145208fa8fa3ee1c9612d0adaac7255f1bbc9ddea7e461e0b317805c","status":500,"method":"GET","latency":"574.095µs","ip":"<some-ip>"}

Error 2

{"time":"2024-05-08T20:28:39.25413767Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handleMirror.func1","file":"/build/pkg/registry/registry.go","line":272},"msg":"proxy failed attempting next","key":"sha256:fe5ca62666f04366c8e7f605aa82997d71320183e99962fa76b3209fdfbb8b58","path":"/v2/jimmidyson/configmap-reload/blobs/sha256:fe5ca62666f04366c8e7f605aa82997d71320183e99962fa76b3209fdfbb8b58","ip":"<some-ip>","err":"expected mirror to respond with 200 OK but received: 500 Internal Server Error"}

Error 3

{"time":"2024-05-08T20:28:36.85743246Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":135},"msg":"","err":"request closed for key: sha256:566294b429823a40b102147b6742390ede89f1f30f9dd6c1f742aae4defe4504","path":"/v2/jetstack/cert-manager-webhook/blobs/sha256:566294b429823a40b102147b6742390ede89f1f30f9dd6c1f742aae4defe4504","status":404,"method":"GET","latency":"5.021653989s","ip":"<some-ip>"}
@tyu0912 tyu0912 added the bug Something isn't working label May 8, 2024
@tyu0912 tyu0912 changed the title Getting various 404 and 500 errors. Any insight? Getting 404 and 500 errors. Any insight? May 8, 2024
@tyu0912
Copy link
Author

tyu0912 commented May 8, 2024

OK, so most of the 500 errors disappeared after I added some more registries. Should've caught that! I am still getting a few 404s though. Are these normal? Example below:

{"time":"2024-05-08T21:19:19.740555437Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":135},"msg":"","err":"request closed for key: <registry>/<image-tag>","path":"<the-path>","status":404,"method":"HEAD","latency":"5.001274834s","ip":"10.2.116.31"}

@bittrance
Copy link
Contributor

The 404 should not really be at error level, but the logging is a bit simplistic. That is just Spegel telling the caller that it cannot help and that the caller should try the next configured mirror for that registry, so that is in some sense normal.

@tyu0912
Copy link
Author

tyu0912 commented May 8, 2024

Thanks @bittrance . That's reassuring. Out of curiosity, do you know about this one too? expected mirror to respond with 200 OK but received: 500 Internal Server Error . Did another look and these ones seem to spike every so often. I thought they were gone before...

image

{"time":"2024-05-08T20:28:39.25413767Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handleMirror.func1","file":"/build/pkg/registry/registry.go","line":272},"msg":"proxy failed attempting next","key":"sha256:fe5ca62666f04366c8e7f605aa82997d71320183e99962fa76b3209fdfbb8b58","path":"/v2/jimmidyson/configmap-reload/blobs/sha256:fe5ca62666f04366c8e7f605aa82997d71320183e99962fa76b3209fdfbb8b58","ip":"<some-ip>","err":"expected mirror to respond with 200 OK but received: 500 Internal Server Error"}

@tyu0912
Copy link
Author

tyu0912 commented May 8, 2024

Maybe it's normal actually? The msg does say proxy failed attempting next. Curious why it's failing though... is it just because the layer is not found? 🤔 Feels like there might be a deeper reason.

@bittrance
Copy link
Contributor

We had reports of the "expected 200 but was 500" before. I think those errors implies that the Spegel cluster is internally inconsistent. I think the error occurs because some Spegel node advertised the blob hash, but when the acting mirror proxied the request to it, it errored. I'm not sure whether that is exactly an error or just eventual consistency in action.

In your errors above, is it possible that Error 1 on one node could be resulting in Error 2 on another node? The timestamps suggests not, but maybe they are for different requests?

@tyu0912
Copy link
Author

tyu0912 commented May 8, 2024

Hmm, I'm not sure. I do see some errors like the below that might actually suggest that error 2 causes error 1 but it's quite noisy so it's a bit hard to tell. One interesting thing (might be coincidental) is that the ips are the same for the bottom 3 logs, 10.2.118.207...

{"time":"2024-05-08T21:59:23.818027756Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":135},"msg":"","err":"content digest sha256:ce9698c36317f0d048a2e692511e8a1ec371fa047493b809f71efe3af526b366: not found","path":"/v2/xxxx/jupyterlab/blobs/sha256:ce9698c36317f0d048a2e692511e8a1ec371fa047493b809f71efe3af526b366","status":500,"method":"GET","latency":"1.377626ms","ip":"10.2.118.195"}

{"time":"2024-05-08T21:58:28.740682271Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handle.func1","file":"/build/pkg/registry/registry.go","line":135},"msg":"","err":"content digest sha256:0375ff77c15f4d4479b0d09f1c25651c547f9d4e5587b1950d1897900866135f: not found","path":"/v2/xxxx/jupyterlab/blobs/sha256:0375ff77c15f4d4479b0d09f1c25651c547f9d4e5587b1950d1897900866135f","status":500,"method":"GET","latency":"4.20002ms","ip":"10.2.118.207"}

{"time":"2024-05-08T21:58:28.560816717Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handleMirror.func1","file":"/build/pkg/registry/registry.go","line":272},"msg":"proxy failed attempting next","key":"sha256:7b8992c144c770ea03fa9e25aec36807a3744fdff8f9902bc35e4eca3ddd0386","path":"/v2/xxxxx/jupyterlab/blobs/sha256:7b8992c144c770ea03fa9e25aec36807a3744fdff8f9902bc35e4eca3ddd0386","ip":"10.2.118.207","err":"expected mirror to respond with 200 OK but received: 500 Internal Server Error"}

{"time":"2024-05-08T21:58:28.560575057Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handleMirror.func2","file":"/build/pkg/registry/registry.go","line":277},"msg":"mirror failed attempting next","key":"sha256:7b8992c144c770ea03fa9e25aec36807a3744fdff8f9902bc35e4eca3ddd0386","path":"/v2/xxxx/jupyterlab/blobs/sha256:7b8992c144c770ea03fa9e25aec36807a3744fdff8f9902bc35e4eca3ddd0386","ip":"10.2.118.207","err":"expected mirror to respond with 200 OK but received: 500 Internal Server Error"}

@tyu0912
Copy link
Author

tyu0912 commented May 8, 2024

Wondering if your theory on eventual consistency is correct. It does seem the errors are dying down a bit now.
image

The way I'm doing the containerd update is through daemonsets. Is it possible the node already has some container layers from other services before containerd restarts and that's somehow causing issues? Maybe not as I'd expect a spike in errors when a new node comes up but doesn't to be the case...

@tyu0912
Copy link
Author

tyu0912 commented May 8, 2024

Another interesting observation. It seems I might not be getting the best performance either because of these errors...

Below is a graph of kube_pod_status_ready_time - kube_pod_status_scheduled_time. After I added my teams internal ECR registry ~5:30pm we can see the increased activity for the various pods. Eyeballing, the average seems a bit higher than before I started implementing Spegel. It is also around this time as well when the spike in error logs showed up.

image

@tyu0912 tyu0912 changed the title Getting 404 and 500 errors. Any insight? Getting high volumes of 404 and 500 errors. Any insight? May 9, 2024
@phillebaba
Copy link
Member

phillebaba commented May 9, 2024

I will start off by clarifying something about the Spegel. It is a best effort cache, meaning that images may or may not exist on other nodes when and image is pulled. On top of that images may not be discovered due to the distributed nature of the DHT.

Having said that I do not think that this is an issue that you are facing. I made some changes before v0.0.22 to make sure that 404 are returned when an image is not found and 500 errors are returned for other types of errors. The error content digest sha256:ce9698c36317f0d048a2e692511e8a1ec371fa047493b809f71efe3af526b366: not found that you are seeing is interesting however. This means that traffic is being routed to the node but the node is not able to find the layer. It is probably an indication of a larger problem as the layer should not have been advertised if it is not present on the node.

I saw that you wrote that the errors seems to come specifically from images pulled from ECR. I am not sure why this would be the case, but there might be something specific that is done when EKS pulls images from ECR?

Could you share your values configuration used to deploy Spegel?

@tyu0912
Copy link
Author

tyu0912 commented May 9, 2024

I will start off by clarifying something about the Spegel. It is a best effort cache, meaning that images may or may not exist on other nodes when and image is pulled. On top of that images may not be discovered due to the distributed nature of the DHT.

Understood. Thanks for clarifying @phillebaba

I saw that you wrote that the errors seems to come specifically from images pulled from ECR. I am not sure why this would be the case, but there might be something specific that is done when EKS pulls images from ECR?

That sounds plausible to me. Here are my values. Appreciate the help!

serviceMonitor:
  enabled: true

spegel:
  registries:
    - https://cgr.dev
    - https://docker.io
    - https://ghcr.io
    - https://quay.io
    - https://mcr.microsoft.com
    - https://public.ecr.aws
    - https://gcr.io
    - https://registry.k8s.io
    - https://k8s.gcr.io
    - https://lscr.io
    - https://<my-account-number>.dkr.ecr.<my-region>.amazonaws.com

@danielloader
Copy link

Have you applied the userData patch to the EKS nodes so that containerd keeps the intermediate layers after a pull?

@danielsand
Copy link

danielsand commented May 21, 2024

had the same problem - fixed it by allowing d39mqg4b1dx9z1.cloudfront.net DNS wise - Had blocked it out of safety concerns.

CF host belongs to registry.k8s.io

# Cloudfront endpoint for the distribution shielding
# the S3 buckets hosting the container images blobs
cloudfront.registry:
  type: CNAME
  value: d39mqg4b1dx9z1.cloudfront.net.
# Sandbox OCI redirector service. (@ameukam,@BenTheElder,@thockin,@dims)

https://github.com/kubernetes/k8s.io/blob/main/dns/zone-configs/k8s.io._0_base.yaml

error in logs

{"time":"2024-05-21T11:06:31.240376472+02:00","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handleMirror.func1","file":"/build/pkg/registry/registry.go","line":272},"msg":"proxy failed attempting next","key":"sha256:e006012862ce06d8bdbba2a6a1ddc0d2eb5cb2fa6565e1500ed58dc016bde00a","path":"/v2/nfd/node-feature-discovery/blobs/sha256:e006012862ce06d8bdbba2a6a1ddc0d2eb5cb2fa6565e1500ed58dc016bde00a","ip":"<REDACTED>","err":"expected mirror to respond with 200 OK but received: 500 Internal Server Error"}
{"time":"2024-05-21T11:06:31.242107206+02:00","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/registry.(*Registry).handleMirror.func2","file":"/build/pkg/registry/registry.go","line":277},"msg":"mirror failed attempting next","key":"sha256:e006012862ce06d8bdbba2a6a1ddc0d2eb5cb2fa6565e1500ed58dc016bde00a","path":"/v2/nfd/node-feature-discovery/blobs/sha256:e006012862ce06d8bdbba2a6a1ddc0d2eb5cb2fa6565e1500ed58dc016bde00a","ip":"<REDACTED>","err":"expected mirror to respond with 200 OK but received: 500 Internal Server Error"}

hope this helps - cheers.

@thockin
Copy link

thockin commented May 21, 2024

registry.k8s.io does not offer a fixed list of backend - if you need to filter aggressively, you need to pull from a mirror you control.

@BenTheElder
Copy link

https://registry.k8s.io#stability has pointers to mirroring guides and other details.

@phillebaba
Copy link
Member

I do not understand how changing DNS settings would affect Spegel. Spegel is a best effort mirror that does not do any upstream requests to any other registries. This specific expected mirror to respond with 200 OK but received: 500 Internal Server Error just means that a request was forwarded to a node that should have had the layer but did not. This may be caused by some internal Containerd errors, or because the layer actually does not exist on disk.

I refactored the logging in #488 and am working on #494 to help users understand a bit better what is going on when they see an error.

I was hoping to get #472 done before the next release but that has been blocked due to some account access preventing me from running the benchmarks. So might just cut a release now to see if the update error logs gives some more insight into the actual problem.

@danielsand
Copy link

@phillebaba

The 500 will most likely show up at the point when spegel checks for the S3 buckets hosting the container image blob pointers and landed on my end in an DNS error.

It's further down the road code wise.

and for the other comments.... yeah the "DOH ! READ DOCS !!11" signs are visible.

For a public K8S OCI registry the security of this setup is questionable at best. (supply chain)
it is what it is

cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants