
Thanos Receive: "Out of bounds metric" #4831

Open
Kampe opened this issue Nov 3, 2021 · 45 comments

Kampe commented Nov 3, 2021

Hello!

We run a handful of "edge" sites in a hub-and-spoke topology and feed metrics from remote Prometheus instances to a centralized Receive hash ring. We quite often see issues relating to "out of bounds" metrics in the receiver. Prometheus remote write begins to get 500s back, and receiver latency climbs, impacting other tenants. We're unsure whether the offending samples are too far in the future or in the past; we assume the past, since it's likely we experienced network or latency issues shipping remote metrics to the cloud. It seems this is related to TSDB and Prometheus WAL limitations? prometheus/prometheus#9607

The shipping Prometheus itself is set up with the prometheus-operator and has been configured with disableCompaction: true to ensure it doesn't ship old blocks to Thanos, too (even though we'd LOVE (read: need) to have them...).
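For reference, a trimmed sketch of that shipping-side setup, assuming the prometheus-operator Prometheus CRD; the name, namespace, and remote-write URL are illustrative (the URL mirrors the cloud-gateway endpoint in the logs below), not our exact manifests:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: edge                      # illustrative
  namespace: monitoring
spec:
  disableCompaction: true         # don't produce/ship compacted (older) blocks locally
  remoteWrite:
    - url: http://cloud-gateway:10908/api/v1/receive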

Worse, when Prometheus fails to remote write, it just keeps retrying, getting into a loop with the receiver until the block is discarded. This is 100% not intended, and there probably needs to be a feature/flag added on one side of the system to prevent it (data caps are a real thing, and this gets into 100s of GBs quickly if left unchecked).

Related and similar: #4767. However, we certainly have unique labels on our series, and we rarely get 409 conflicts; it is constant 500s.

Thanos Version: 0.22.0
Prometheus Version: 2.30.3

Object Storage Provider:
GCP

What happened:
Thanos Receive will on occasion get into a fit over the samples a Prometheus sends up to it for storage, throwing "out of bounds" errors and causing cascading issues for metric ingestion across the entire system.

What you expected to happen:
Thanos Receive accepts the samples and stores them in the bucket without issue.

How to reproduce it (as minimally and precisely as possible):
We see this most with shipments of kube-state-metrics; however, you can get any Prometheus into this state with a receiver if you wait long enough.

Full logs to relevant components:
Receiver Logs:

component=receive component=receive-writer msg="Error on ingesting out-of-order samples" numDropped=47
component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"apiserver_request_duration_seconds_bucket\", component=\"apiserver\", endpoint=\"https-metrics\", env=\"prd\", group=\"admissionregistration.k8s.io\", instance=\"172.22.3.33:10250\", job=\"kubelet\", le=\"0.45\", metrics_path=\"/metrics\", namespace=\"kube-system\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-0\", resource=\"validatingwebhookconfigurations\", scope=\"cluster\", service=\"kubelet\", verb=\"WATCH\", version=\"v1\"}" sample="unsupported value type"

Prometheus Logs:

component=remote level=warn remote_name=319b4e url=http://cloud-gateway:10908/api/v1/receive msg="Failed to send batch, retrying" err="Post \"http://cloud-gateway:10908/api/v1/receive\": context deadline exceeded"

Workaround:
Kill all reporting Prometheus instances and their relevant StatefulSets; then Receive is happy again. Hopefully this helps illustrate how that isn't scalable given (N) tenants :(

[Screenshot: Screen Shot 2021-11-03 at 11 28 25 AM]


Kampe commented Nov 5, 2021

Also, this happens constantly when rolling out updates to receivers: while one is taken down to be replaced, the hash ring can no longer contact that member of the ring:

level=debug ts=2021-11-05T19:31:59.654455194Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.654733445Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.654990715Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.655249505Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.655500684Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.655757974Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="backing off forward request for endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: target not available"
level=debug ts=2021-11-05T19:31:59.676009613Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_deployment_status_observed_generation\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", deployment=\"rfs-edge-redis\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"default\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.676375092Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_endpoint_labels\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_endpoint=\"kubernetes\", exported_namespace=\"default\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.676560331Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_pod_owner\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"vault-secrets-operator\", exported_pod=\"vault-secrets-operator-59769bc489-m4gwf\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", owner_is_controller=\"true\", owner_kind=\"ReplicaSet\", owner_name=\"vault-secrets-operator-59769bc489\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", uid=\"777a319a-a2a5-4c54-929d-fb1a10463262\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.676754415Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_pod_status_scheduled\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", condition=\"true\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"default\", exported_pod=\"server-5c7dfb95b6-n8pq5\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", uid=\"dc10dc19-ed31-476b-9204-f2b55339fe68\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.67690643Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_replicaset_spec_replicas\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"default\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", replicaset=\"other-server-7f65547c6\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2021-11-05T19:31:59.677049925Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_secret_annotations\", cluster_id=\"hardware-cluster\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"stg\", exported_namespace=\"monitoring\", instance=\"10.42.0.254:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-67998f59d4-qfz6n\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", secret=\"skupper-local-ca\", service=\"kube-state-metrics\", site_id=\"hardware-cluster\", version=\"v2\"}" sample="unsupported value type"
level=warn ts=2021-11-05T19:31:59.67757962Z caller=writer.go:131 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" numDropped=6
level=debug ts=2021-11-05T19:31:59.677975065Z caller=handler.go:491 component=receive component=receive-handler
level=debug ts=2021-11-05T19:31:59.680275619Z caller=handler.go:656 component=receive component=receive-handler msg="failed to handle request" err="store locally for endpoint thanos-receive-1.thanos-receive.monitoring


Kampe commented Nov 9, 2021

Also, we've noticed that when a receiver begins to report poorly, this is the behavior we see:

6m26s       Warning   Unhealthy                pod/thanos-receive-0                  Readiness probe failed: Get "http://10.188.18.44:10902/-/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
6m14s       Warning   Unhealthy                pod/thanos-receive-1                  Readiness probe failed: Get "http://10.188.10.61:10902/-/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
7m43s       Warning   Unhealthy                pod/thanos-receive-1                  Liveness probe failed: Get "http://10.188.10.61:10902/-/healthy": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
25m         Warning   NodeNotReady             pod/thanos-receive-3                  Node is not ready
20m         Normal    TaintManagerEviction     pod/thanos-receive-3                  Marking for deletion Pod monitoring/thanos-receive-3
18m         Normal    Scheduled                pod/thanos-receive-3                  Successfully assigned monitoring/thanos-receive-3 to gke-gke-cluster
18m         Normal    SuccessfulAttachVolume   pod/thanos-receive-3                  AttachVolume.Attach succeeded for volume "pvc-939040384"
18m         Normal    Pulled                   pod/thanos-receive-3                  Container image "quay.io/thanos/thanos:v0.22.0" already present on machine
18m         Normal    Created                  pod/thanos-receive-3                  Created container thanos-receive
18m         Normal    Started                  pod/thanos-receive-3                  Started container thanos-receive
7m7s        Warning   Unhealthy                pod/thanos-receive-4                  Readiness probe failed: Get "http://10.188.0.31:10902/-/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
18m         Normal    SuccessfulCreate         statefulset/thanos-receive            create Pod thanos-receive-3 in StatefulSet thanos-receive successful

Logs begin to accumulate:

level=debug ts=2021-11-09T19:37:25.380682831Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:25.380239604Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:25.380172697Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:25.170065509Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:25.169872207Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:25.169793555Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:23.923030423Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:23.922772065Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:23.922600212Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:20.110528423Z caller=handler.go:583 component=receive component=receive-handler
level=debug ts=2021-11-09T19:37:20.110112309Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:20.110235316Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:20.110036622Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=error ts=2021-11-09T19:37:20.109737187Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:20.109658655Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:19.358821136Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:19.358467832Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:19.358299121Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:18.753243593Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:18.752959602Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:18.752879604Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:17.110759193Z caller=receive.go:556 component=receive component=uploader msg="upload phase done" uploaded=0 elapsed=3.551445ms
level=debug ts=2021-11-09T19:37:17.107209642Z caller=multitsdb.go:217 component=receive component=multi-tsdb msg="uploading block for tenant" tenant=default-tenant
level=debug ts=2021-11-09T19:37:17.10713892Z caller=receive.go:548 component=receive component=uploader msg="upload phase starting"
level=debug ts=2021-11-09T19:37:14.893874219Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:14.893702865Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:14.893622202Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:14.891242383Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:14.891047254Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:14.890974851Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:14.21629258Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:14.216055536Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:14.215990688Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
level=debug ts=2021-11-09T19:37:13.080939563Z caller=handler.go:583 component=receive component=receive-handler
level=error ts=2021-11-09T19:37:13.080766949Z caller=handler.go:366 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=debug ts=2021-11-09T19:37:13.080693649Z caller=handler.go:351 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"

and then everything grinds to a halt:

[Screenshot: Screen Shot 2021-11-09 at 1 47 29 PM]

We are users of the thanos-receive-controller; it doesn't seem to help when the ring gets into this state.

Args we run receive with:

      - name: thanos-receive
        image: quay.io/thanos/thanos:v0.22.0
        args:
        - receive
        - --log.level=debug
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --remote-write.address=0.0.0.0:10908
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --tsdb.path=/var/thanos/receive
        - --label=receive_replica="$(NAME)"
        - --label=receive="true"
        - --receive.replication-factor=1
        - --tsdb.retention=1d
        - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
        - --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
        - --tsdb.allow-overlapping-blocks

We did recently find cortexproject/cortex#2366, which looks very related and potentially promising.

Is there anything we can do to potentially alleviate this issue for the time being in our environments?

@wiardvanrij
Member

Short term, we should check whether the receiver's response is correct. Out of bounds is unrecoverable; do we tell Prometheus that? That would at least resolve the infinite loop of retries.


Kampe commented Jan 10, 2022

The receiver response appears to be the culprit that triggers Prometheus to keep this loop going, sending things out of control quickly.

Very strange behavior:
[Screenshot: Screen Shot 2022-01-09 at 11 44 57 PM]


yeya24 commented Jan 10, 2022

Hit this issue quite a lot recently.

@jmichalek132
Contributor

> Hit this issue quite a lot recently.

We used to hit it too, but since switching to the receive router & ingestor mode we haven't hit it. I'm not 100% sure it's related, but it might be worth trying if you are not using that mode yet. Details on our setup here.

@wiardvanrij
Member

> Hit this issue quite a lot recently.
>
> We used to hit it too, but since switching to the receive router & ingestor mode we haven't hit it. I'm not 100% sure it's related, but it might be worth trying if you are not using that mode yet. Details on our setup here.

It can be related if your receive is struggling to keep up I'm sure.

For this issue, what I would really like is an easy-to-replicate setup for my dev environment. There was a related issue, which I now cannot find anymore. What I stated there is that I believe we do not handle the HTTP code correctly. The short version is that the receiver can basically return two things to Prometheus: 1: "error, try again" and 2: "error, unrecoverable". I think there is a chance we handle case 2 as case 1, causing basically a DoS until one resets Prometheus again.


Kampe commented Jan 10, 2022

> Hit this issue quite a lot recently.
>
> We used to hit it too, but since switching to the receive router & ingestor mode we haven't hit it. I'm not 100% sure it's related, but it might be worth trying if you are not using that mode yet. Details on our setup here.
>
> It can be related if your receive is struggling to keep up I'm sure.
>
> For this issue, what I would really like is an easy-to-replicate setup for my dev environment. There was a related issue, which I now cannot find anymore. What I stated there is that I believe we do not handle the HTTP code correctly. The short version is that the receiver can basically return two things to Prometheus: 1: "error, try again" and 2: "error, unrecoverable". I think there is a chance we handle case 2 as case 1, causing basically a DoS until one resets Prometheus again.

Exactly. The Prometheus team claims it handles the HTTP code properly and that it's on the Thanos Receive side: prometheus/prometheus#5649

Going to peel our Thanos config apart into the router and ingestor and report back.


Kampe commented Jan 21, 2022

After giving Thanos Receive double the resources, as well as breaking it down into a router/ingestor, we still see this issue using the latest release, 0.24.0.

What's interesting is that it will begin happening with a single tenant (picture below) and then cascade over time to ALL tenants reporting to Receive.

If we're not the only ones having this issue, this seems to be a massive problem that will hit any Receive user over time.

[Screenshot: Screen Shot 2022-01-21 at 10 27 51 AM]


Kampe commented Feb 2, 2022

More graphs of this happening with a very healthy reporting Prometheus. This is during normal operations, not during a rollout. The rollouts are MUCH more violent.
[Screenshot: Screen Shot 2022-02-02 at 12 42 03 PM]

@wiardvanrij
Member

What would help me is a very easy-to-reproduce example. I understand this is perhaps hard to do, but I've already spent some time on this without triggering the error even once, while actually pushing some really weird stuff to the receiver.

Could you help me with this?


Kampe commented Feb 3, 2022

Sure, absolutely. I'll see if I can help reproduce it in any way I can.

Cloud Provider: GKE v1.21.6-gke.1500
On Prem K8s (prometheus remote write client): k3s v1.21.5+k3s2
Prometheus version: v2.33.1
Thanos version: v0.24.0

Unfortunately, it seems time is a factor in being able to reproduce this issue, but we've found you can potentially force it if you get your Prometheus to fall behind in shipping its WAL for a bit by choking its connection throughput. The metrics it usually complains about tend to be related to the cAdvisor and kube-state-metrics we ship:

2022-02-03 14:30:31 | level=debug ts=2022-02-03T20:30:31.276011135Z caller=writer.go:94 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"container_memory_failures_total\", endpoint=\"https-metrics\", env=\"prd\", failure_type=\"pgmajfault\", id=\"/kubepods/besteffort/pod63be8c1b-7904-4c59-8de5-bbea62bd85e7/9f24f2a933f27bdef7d085c4e48b959b26b674345026e867f3f6c0fa99d3ad17\", image=\"docker.io/rancher/pause:3.1\", instance=\"172.22.3.33:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"9f24f2a933f27bdef7d085c4e48b959b26b674345026e867f3f6c0fa99d3ad17\", namespace=\"monitoring\", node=\"leader\", pod=\"skupper-router-54c8b89f9c-pf4zl\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", scope=\"container\", service=\"kubelet\"}" sample="unsupported value type"
2022-02-03 14:30:31 | level=debug ts=2022-02-03T20:30:31.275938653Z caller=writer.go:94 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"container_memory_failures_total\", endpoint=\"https-metrics\", env=\"prd\", failure_type=\"pgmajfault\", id=\"/kubepods/besteffort/pod6276d2da-7fd8-4b3b-b8a3-3e3ab0af5b20/47eafd1f33ea4f082968ac51f3c2c8ef1fb048c959e862c8442dfa9d7bad0c3f\", image=\"docker.io/rancher/pause:3.1\", instance=\"172.22.3.33:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"47eafd1f33ea4f082968ac51f3c2c8ef1fb048c959e862c8442dfa9d7bad0c3f\", namespace=\"kube-system\", node=\"leader\", pod=\"svclb-traefik-7jg46\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", scope=\"container\", service=\"kubelet\"}" sample="unsupported value type"
2022-02-03 14:30:31 | level=debug ts=2022-02-03T20:30:31.275881169Z caller=writer.go:94 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"container_memory_failures_total\", endpoint=\"https-metrics\", env=\"prd\", failure_type=\"pgmajfault\", id=\"/kubepods/besteffort/pod42b0eb0f-c10b-486b-afca-5906f03c64fd/205ba53320b547a7367dd1fe9a988c25d787de210fdb597195cce4bf8b54a902\", image=\"docker.io/rancher/pause:3.1\", instance=\"172.22.3.33:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"205ba53320b547a7367dd1fe9a988c25d787de210fdb597195cce4bf8b54a902\", namespace=\"argocd\", node=\"leader\", pod=\"argocd-redis-ha-server-2\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", scope=\"hierarchy\", service=\"kubelet\"}" sample="unsupported value type"
2022-02-03 14:30:31 | level=debug ts=2022-02-03T20:30:31.275821501Z caller=writer.go:94 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"container_memory_failures_total\", endpoint=\"https-metrics\", env=\"prd\", failure_type=\"pgmajfault\", id=\"/kubepods/besteffort/pod39136e8b-f1eb-4b8a-a971-446d71d8457d/18295222b54c835d5da2fb12e2972844f9e26dd9afe1bd6a85d6ae677b3ce9d4\", image=\"docker.io/rancher/pause:3.1\", instance=\"172.22.3.33:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"18295222b54c835d5da2fb12e2972844f9e26dd9afe1bd6a85d6ae677b3ce9d4\", namespace=\"argocd\", node=\"leader\", pod=\"argocd-redis-ha-server-1\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", scope=\"container\", service=\"kubelet\"}" sample="unsupported value type"
2022-02-03 14:30:31 | level=debug ts=2022-02-03T20:30:31.2757512Z caller=writer.go:94 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"container_memory_failures_total\", endpoint=\"https-metrics\", env=\"prd\", failure_type=\"pgmajfault\", id=\"/kubepods/besteffort/pod3326cd27-c028-4539-9d40-b3f4a1ff81a9/c5563b08966a0d32168f855256509a47e61039283713b4229f27b33d82c0bb38\", image=\"docker.io/rancher/pause:3.1\", instance=\"172.22.3.33:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"c5563b08966a0d32168f855256509a47e61039283713b4229f27b33d82c0bb38\", namespace=\"logging\", node=\"leader\", pod=\"skupper-service-controller-7544cf59f7-5bj62\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", scope=\"hierarchy\"}" sample="unsupported value type"
2022-02-03 14:30:31 | level=debug ts=2022-02-03T20:30:31.275657003Z caller=writer.go:94 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"container_memory_failures_total\", endpoint=\"https-metrics\", env=\"prd\", failure_type=\"pgmajfault\", id=\"/kubepods/besteffort/pod1d90f76f-0116-4b00-8a83-ff687e9e4683/f71ce97614932c09e97acf9728aa08cdb81ae5a8cc9221716e68aff31973c3b4\", image=\"docker.io/rancher/pause:3.1\", instance=\"172.22.3.33:10250\", job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", name=\"f71ce97614932c09e97acf9728aa08cdb81ae5a8cc9221716e68aff31973c3b4\", namespace=\"argocd\", node=\"leader\", pod=\"argocd-redis-ha-haproxy-66945855dd-76zr8\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-1\", scope=\"container\"}" sample="unsupported value type"

and sometimes we catch these:

level=warn ts=2022-02-03T22:43:22.016294171Z caller=writer.go:131 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" numDropped=2
level=debug ts=2022-02-03T22:43:22.016205474Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_secret_metadata_resource_version\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"prd\", exported_namespace=\"monitoring\", instance=\"10.42.0.21:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-ff6dbf54d-zg2t2\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-0\", secret=\"prometheus-operator-token-r5ccs\", service=\"kube-state-metrics\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2022-02-03T22:43:22.016100618Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_secret_annotations\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"prd\", exported_namespace=\"kube-system\", instance=\"10.42.0.21:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-ff6dbf54d-zg2t2\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-0\", secret=\"endpointslice-controller-token-647c8\", service=\"kube-state-metrics\", version=\"v2\"}" sample="unsupported value type"
level=debug ts=2022-02-03T22:43:22.01574206Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_secret_info\", cluster_type=\"edge\", container=\"kube-state-metrics\", endpoint=\"http-metrics\", env=\"prd\", exported_namespace=\"cattle-impersonation-system\", instance=\"10.42.0.21:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"kube-state-metrics-ff6dbf54d-zg2t2\", prometheus=\"monitoring/edge\", prometheus_replica=\"prometheus-edge-0\", secret=\"cattle-impersonation-u-nttkrtrdhm-token-fph5s\", service=\"kube-state-metrics\", version=\"v2\"}" sample="unsupported value type"

Let me know what other information would be useful for you to potentially reproduce this. We cannot reliably get it to happen, but currently it happens at least monthly. We're looking at building a system to mitigate it on sites with Prometheus if it begins spinning out of control.
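In case it helps with reproduction, here is a rough sketch of how one could deliberately under-provision remote write to make a Prometheus fall behind; the queue_config values are illustrative (not what we actually run), and with the prometheus-operator the equivalent settings live under spec.remoteWrite[].queueConfig:

# prometheus.yml (fragment) - intentionally starved remote write
remote_write:
  - url: http://cloud-gateway:10908/api/v1/receive   # endpoint from the logs above
    queue_config:
      max_shards: 1              # a single shard caps send throughput
      capacity: 500              # small per-shard buffer
      max_samples_per_send: 100  # tiny batches
      batch_send_deadline: 5s

With something like this, the WAL backlog should grow faster than it drains whenever scrape volume is non-trivial.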

@wiardvanrij
Member

Do you monitor prometheus with https://raw.githubusercontent.com/prometheus-operator/kube-prometheus/main/manifests/grafana-dashboardDefinitions.yaml - see prometheus-remote-write.json ?

If so, can you show me what it looks like?

I will try to reproduce this again tomorrow or so.

@wiardvanrij
Member

I could reproduce this by fiddling around with my local time, causing:

level=debug name=receive ts=2022-02-05T22:00:12.323668Z caller=writer.go:94 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"prometheus_tsdb_head_gc_duration_seconds_sum\", instance=\"localhost:9090\", job=\"prometheus\", prometheus=\"prom0\", replica=\"1\"}" sample="unsupported value type"

What I could not reproduce, however, is the behaviour from the Prometheus side. Perhaps this is because I'm using an older Prometheus version, since I'm having some issues installing Prometheus on my M1, lol.

@wiardvanrij
Member

@Kampe is there any way you perhaps could test my PR in some staging/dev environment?


Kampe commented Feb 7, 2022

Yes, absolutely. Is there an image SHA built for this anywhere that I can reference? I would be glad to deploy it into a development environment tomorrow!

@wiardvanrij
Member

We only have that for things in main: https://quay.io/repository/thanos/thanos?tab=tags - I've pinged for a review, so perhaps we can move this forward faster.


matej-g commented Feb 14, 2022

I looked a bit through the original report and I'm wondering whether there are two distinct things going on:

  1. we're seeing the out-of-bounds error on the local node
  2. when the request is fanned out to other replicas, somewhere along the way a timeout occurs, and this is what Prometheus sees (hence the context deadline exceeded on the Prometheus side, as originally reported); it therefore keeps retrying, since this would be considered a network (and therefore recoverable) error

@Kampe did you say you detected this behavior during receiver rollout (i.e. period of time when some replicas might be in non-ready state) or also outside of this?


Kampe commented Feb 14, 2022

Interesting; it seems you're correct that there are two events happening here. We also see the context deadline behavior during rollouts, but it subsides eventually once all pods are ready. We do see it crop back up over time, yes. We thought this behavior was related to the receiver's resourcing, but we still see the issue when giving Receive double the CPU we were requesting. We also rolled out the latest topology using a router and ingestor, hoping it would solve the issue. To no avail.

You can see the 409s start happening in my graphs above ^^


Kampe commented Feb 14, 2022

Currently, very content
[Screenshot: Screen Shot 2022-02-14 at 12 00 00 PM]


Kampe commented Feb 16, 2022

[Screenshot: Screen Shot 2022-02-16 at 3 01 30 PM]

And here we are again today; no changes have been made to the environment.


matej-g commented Feb 21, 2022

I see, so it seemingly happens at random times. It seems a receiver replica simply fails to forward the request to other receiver(s), and this manifests itself as context deadline exceeded, which would indicate one receiver cannot reach another. I'm not sure what to look at next here, since this could be a symptom of different issues.

You mention in the original post that killing Prometheus instances resolves it? (I'm wondering how that could be related, since if this is a connectivity issue between receivers, I would expect it to keep happening even after a Prometheus instance restarts.) Does it subside by itself or is that step still necessary?


Kampe commented Feb 23, 2022

> I see, so it seemingly happens at random times. It seems a receiver replica simply fails to forward the request to other receiver(s), and this manifests itself as context deadline exceeded, which would indicate one receiver cannot reach another. I'm not sure what to look at next here, since this could be a symptom of different issues.
>
> You mention in the original post that killing Prometheus instances resolves it? (I'm wondering how that could be related, since if this is a connectivity issue between receivers, I would expect it to keep happening even after a Prometheus instance restarts.) Does it subside by itself or is that step still necessary?

Yeah, it's very strange as to when we see it appear, and we are not able to reliably reproduce it. It not only requires that the Prometheus is restarted, but also that its entire WAL is dumped, or else you'll see the same behavior repeat on restart. (We utilize StatefulSets for Prometheus since we're using the prometheus-operator, so we delete the entire StatefulSet to get it to report to Thanos Receive properly again.)

This is our go-to strategy whenever we see any behavior like this (no matter whether 500s or 400s, as we judge based on the last reported block compared to the currently produced block).

We're currently working on an operator that detects when a Prometheus is behind in shipping its blocks and calls the local k8s API to kill the Prometheus instances, forcing remote writes to report properly again, because doing this manually at any scale is, as you can imagine, very cumbersome.
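For the detection side, a rough sketch of the kind of alerting rule we have in mind (thresholds, durations, and labels are illustrative); it compares the newest WAL sample timestamp against the newest timestamp remote write has actually sent:

groups:
  - name: remote-write-lag
    rules:
      - alert: PrometheusRemoteWriteBehind
        expr: |
          (
            max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
            - ignoring(remote_name, url) group_right
            max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
          ) > 300
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Prometheus remote write is falling behind the WAL

Anything acting on that signal (the operator, or a human) would then recycle the affected Prometheus and its WAL.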

@vanugrah
Contributor

Hey folks - just chiming in here. I am a colleague of @jmichalek132 and can confirm that we are seeing this behavior in our environment even after the move to running Thanos Receive in router and ingestor mode. The frequency is very irregular, and the behavior seems to affect only the Prometheus instances in a particular environment, while instances in other environments writing to the same endpoint see little to no degradation in their remote write.

We typically see increased latency and timeouts on the router. However, the ingestors do not exhibit any substantial latency. This was also validated by forking Thanos and adding more instrumentation (spans) around the TSDB write.

If you have any pointers as to what is worth investigating next, we are all ears.


Kampe commented Apr 7, 2022

Here you'll see it happen over the last day. Keep in mind these Prometheus instances are configured to retain only an hour of data. This is all running Thanos Receive v0.24.0 and Prometheus v2.34.0 using the router/ingester pattern for deployment.
[Screenshot: Screen Shot 2022-04-07 at 12 25 30 AM]


stale bot commented Jun 12, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 12, 2022
@joshua-ketcham

Our team is currently experiencing this issue too. Did you have any luck with that operator to kill Prometheus instances as a workaround?


stale bot commented Sep 21, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 21, 2022

zhangrj commented Nov 15, 2022

The same problem occurred last week in our system. Several Prometheus instances stopped remote writing to Thanos due to a network problem; when the network recovered after two days, the Thanos receive-distributor and receive components crashed. We had to restart all Prometheus instances and drop all WAL data. The version we use is 0.28.0.

@stale stale bot removed the stale label Nov 15, 2022
@fpetkovski
Contributor

This is a limitation of TSDB and should be solved once we have support for out-of-order samples.


douglascamata commented Dec 14, 2022

@fpetkovski @philipgough and I also had big troubles because of this and worked on a fix in #5970.

Basically, we return a 4xx if the request contained any out-of-bounds samples, so that Prometheus stops retrying that request forever, since it quickly snowballs and brings Receives to a halt.

This is still very much valid even when we support out-of-order samples, because:

  • Out-of-order ingestion should be an optional feature.
  • If out-of-order ingestion is enabled, samples can still fall outside the out-of-order window. These out-of-window samples shouldn't be sent again.
  • If out-of-order ingestion is disabled, we have to tell Prometheus (or other remote write clients) to stop retrying the request by returning a 4xx. We cannot return a 500 in this case, as it leads to infinite retries.

@fpetkovski
Contributor

Thanks for the explanation. I wonder why this is not covered by https://github.com/thanos-io/thanos/blob/main/pkg/receive/handler.go#L884

@douglascamata
Contributor

@fpetkovski same. I think some things got lost in translation between this issue and #5131.


douglascamata commented Dec 14, 2022

Also, according to this comment it looks like at some point "out of bounds" returned a 409 error.


douglascamata commented Dec 14, 2022

@fpetkovski I suspect it's because of the determineWriteErrorCause call here:

switch determineWriteErrorCause(err, 1) {
case errNotReady:
    responseStatusCode = http.StatusServiceUnavailable
case errUnavailable:
    responseStatusCode = http.StatusServiceUnavailable
case errConflict:
    responseStatusCode = http.StatusConflict
case errBadReplica:
    responseStatusCode = http.StatusBadRequest
default:
    level.Error(tLogger).Log("err", err, "msg", "internal server error")
    responseStatusCode = http.StatusInternalServerError
}

errOutOfBounds is not an expected error (see the snippet below), so it gets completely ignored, falls into the default clause of the switch, and becomes a 500.

func determineWriteErrorCause(err error, threshold int) error {
    if err == nil {
        return nil
    }
    unwrappedErr := errors.Cause(err)
    errs, ok := unwrappedErr.(errutil.NonNilMultiError)
    if !ok {
        errs = []error{unwrappedErr}
    }
    if len(errs) == 0 {
        return nil
    }
    if threshold < 1 {
        return err
    }
    expErrs := expectedErrors{
        {err: errConflict, cause: isConflict},
        {err: errNotReady, cause: isNotReady},
        {err: errUnavailable, cause: isUnavailable},
    }
    for _, exp := range expErrs {
        exp.count = 0
        for _, err := range errs {
            if exp.cause(errors.Cause(err)) {
                exp.count++
            }
        }
    }
    // Determine which error occurred most.
    sort.Sort(sort.Reverse(expErrs))
    if exp := expErrs[0]; exp.count >= threshold {
        return exp.err
    }
    return err
}

@douglascamata
Contributor

Maybe the underlying issue is that errors.Cause(err) is going too deep and never seeing the errOutOfBounds? I'll put some theories to the test tomorrow.

@fpetkovski
Contributor

Now that #5910 has been merged, could this issue be fixed?

@philipgough
Contributor

@fpetkovski does this issue not predate the ketama hashing algorithm?

@fpetkovski
Contributor

That's correct, but in that PR we revamped the way we calculate quorum. I suspect it should help, but I am not 100% sure. Do you have a way to reproduce this issue somehow?

@douglascamata
Contributor

As @philipgough could confirm, Thanos isn't at fault here. The error handling code is working fine. At least in our scenario, we attribute the 500s to being spammed (effectively DDoSed) by aggressive Prometheus remote write configurations (short min backoff, short client-side timeout, big max shards, etc.).

@philipgough
Contributor

We hit this issue again recently in production with Thanos v0.30.1 and the ketama hashing algorithm.
The problems started during receiver rollouts triggered by a cluster upgrade. I tried to investigate as best I could and describe the situation below. We are also using the receive-controller, and there is a reverse proxy written in Go in front of Thanos.

Initially I was thinking that there was potential for Thanos to return a 5xx when it should not, and that it could be related to the existing #5407.

However, adding some tests via #5976 appears to show that is not the case and the error handling is correct.

What I did manage to highlight with a test is that an unrecoverable error in a forward request that takes longer than the forward timeout, and thus results in a context cancellation, will return a 500.

So we can envisage a situation where one or more receivers becoming slow, or a slow network between some receivers, starts a loop in which remote write clients retry over and over requests that will never succeed.

We can see from the graphs below that as things become slow for whatever reason (in our case, we believe we were sending traffic to receivers that were not fully ready), a vicious cycle starts, with receivers crashing, appenders building up, and so on.

We mitigated the incident in the end by rolling out additional receivers, and once everything became healthy we started to handle requests and respond with 429s correctly.
[Screenshot: Screenshot 2023-02-16 at 18 02 37]

[Screenshot: Screenshot 2023-02-16 at 18 03 54]


vanugrah commented Apr 11, 2023

Hey folks - I wanted to chime in here with some relevant observations. Although I could write at length about this topic, I will keep it brief. Historically, when we did deployments on our receive cluster (specifically the ingesters), we would see all sorts of errors (500/503/409). The logs revealed several types of ingestion errors, from "Out of Bounds" to "Already Exists". However, I noticed that when the deployment began we'd also see errors on the routers claiming that quorum wasn't reached. This always perplexed me, since our replication factor was 3, rollouts of the ingester StatefulSets happen one pod at a time, and appropriate probes (startup, readiness and liveness) prevent the rollout from progressing until the previous ingester is ready to receive traffic. So theoretically quorum should have been reached, but it wasn't.

I then considered the possibility of a race condition and decided to update the ingester StatefulSets with the spec/minReadySeconds: 120 property (read more here), which adds some delay between consecutive deployments of pods in a StatefulSet. After making this change, I am no longer seeing ANY errors during rollouts! Quorum is maintained (validated by enabling debug logging on the routers) and the Thanos Receive routers respond to all remote write requests with status code 200. I am still investigating the cause of this race condition, but ever since we added this buffer between rollouts our ingest pipeline has been operating smoothly and our on-call schedule is silent 😄
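For anyone who wants to try the same thing, a minimal sketch of the relevant StatefulSet fragment (the name is illustrative, and minReadySeconds on StatefulSets requires a reasonably recent Kubernetes version):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-receive-ingestor   # illustrative
spec:
  minReadySeconds: 120            # wait 2 minutes after a pod becomes Ready before rolling the next one
  updateStrategy:
    type: RollingUpdate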

@vanugrah
Contributor

I do still recommend tuning the Prometheus remote write params to reduce its aggressiveness on retries, as well as the max samples per batch, to reduce the overall remote write request rate. In our experiments we noticed that doubling the number of samples per batch from the default 500 to 1000 had virtually no impact on remote write latency (from the Prometheus perspective). This is now our default configuration across the board. There is likely a tradeoff in the appropriate samples per batch, beyond which individual remote write requests start taking too long, so be sure to test your changes.
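As a rough sketch of what that tuning looks like (the endpoint and exact values are illustrative and should be validated per environment):

# prometheus.yml (fragment)
remote_write:
  - url: http://thanos-receive:10908/api/v1/receive   # illustrative endpoint
    queue_config:
      max_samples_per_send: 1000   # up from the 500 default mentioned above
      min_backoff: 100ms           # back off less aggressively than very short retry settings
      max_backoff: 10s
      max_shards: 50               # cap parallelism so a backlog cannot flood the receivers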

@gurumee92

If you are using a k8s environment with the thanos receive controller, make sure that the configmap is updated on each receiver pod at the same time.

In my case, it took between 1 and 5 minutes for the full hashring to synchronise (for all pods to see the changed config), at which point I noticed context deadline errors on the receiver, which pushed back Prometheus remote writes and increased their desired shard count.

As @vanugrah said, setting minReadySeconds to 120s reduced the errors somewhat, and setting the flag --receive-forward-timeout to 5m eliminated them.
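For reference, a sketch of how that flag fits into the receive args shown earlier in this thread (only the added line matters; everything else stays as-is):

        args:
        - receive
        # ... other flags as in the args block earlier in the thread ...
        - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
        - --receive-forward-timeout=5m   # give forwards more time while the hashring converges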

@gurumee92

I also think the Thanos receiver should support hashring endpoints based on URL, not IP, like Mimir or VictoriaMetrics. I'm not sure exactly whether there is a DNS lookup process, but I've seen a hashring configured in a k8s environment with pod endpoints work well, while a headless service did not.

A hashring with pod endpoints works well:

[
  {
    "hashring": "default",
    "endpoints": [
      "playground-prometheus-thanos-receiver-0.playground-prometheus-thanos-receiver.prometheus-playground-beta.svc.cluster.local:10901",
      "playground-prometheus-thanos-receiver-1.playground-prometheus-thanos-receiver.prometheus-playground-beta.svc.cluster.local:10901",
      "playground-prometheus-thanos-receiver-2.playground-prometheus-thanos-receiver.prometheus-playground-beta.svc.cluster.local:10901",
      "playground-prometheus-thanos-receiver-3.playground-prometheus-thanos-receiver.prometheus-playground-beta.svc.cluster.local:10901",
      "playground-prometheus-thanos-receiver-4.playground-prometheus-thanos-receiver.prometheus-playground-beta.svc.cluster.local:10901",
      "playground-prometheus-thanos-receiver-5.playground-prometheus-thanos-receiver.prometheus-playground-beta.svc.cluster.local:10901"
    ]
  }
]

but a headless service does not work:

[
  {
    "hashring": "default",
    "endpoints": [
      "playground-prometheus-thanos-receiver.prometheus-playground-beta.svc.cluster.local:10901"
    ]
  }
]
