
Thanos receive store locally for endpoint conflict #3913

Closed
AsherBoone opened this issue Mar 11, 2021 · 22 comments

AsherBoone commented Mar 11, 2021

Thanos and Prometheus version used:
Thanos: v0.18.0
Prometheus: v2.11.1

Object Storage Provider: S3

What happened:
Prometheus errors in the logs:

level=error ts=2021-03-04T08:02:54.350Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: 2 errors: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExi"
level=error ts=2021-03-04T08:03:16.065Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.099Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.286Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.318Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.350Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.573Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.637Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.671Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: 2 errors: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExi"

hashring.json (ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-hashrings
  namespace: cattle-prometheus
data:
  thanos-receive-hashrings.json: |
    [
      {
        "hashring": "soft-tenants",
        "endpoints":
        [
          "thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901",
          "thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901"
        ]
      }
    ]

thanos receive:

  - args:
    - receive
    - --log.level=info
    - --log.format=logfmt
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --remote-write.address=0.0.0.0:19291
    - --objstore.config-file=/etc/thanos/objectstorage.yaml
    - --receive.replication-factor=1
    - --tsdb.path=/var/thanos/receive
    - --tsdb.retention=12h
    - --http-grace-period=2m
    - --grpc-grace-period=2m
    - --label=receive_replica="$(NAME)"
    - --label=receive="true"
    - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
    - --receive.hashrings-file-refresh-interval=3m
    - --receive.local-endpoint=$(NAME).thanos-receive.cattle-prometheus.svc.cluster.local:10901

Prometheus remote_write target (NodePort mode): http://10.53.26.191:30021/api/v1/receive
The Prometheus log has a lot of these errors. I have tried modifying the Thanos receive configuration many times, but the conflicts still appear. Can anyone help me?
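
For reference, here is a minimal prometheus.yml sketch matching that remote_write URL (the external_labels block is illustrative, not taken from this setup; if Prometheus runs as an HA pair, each replica should carry a distinct replica-style external label so the receiver can tell the two write streams apart):

global:
  external_labels:
    cluster: my-cluster          # illustrative value
    prometheus_replica: prom-0   # illustrative; must differ per HA replica

remote_write:
  - url: http://10.53.26.191:30021/api/v1/receive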


dhohengassner commented Apr 19, 2021

Thanks @AsherBoone for raising this!

I'm probably seeing the same issue on my clusters. It always happens when I roll the receiver StatefulSet.

Error on Prometheus side:
level=error remote_name=fc9017 url=https://thanos-receiver.my.domain/api/v1/receive msg="non-recoverable error" count=7250 err="server returned HTTP status 409 Conflict: conflict"

Seeing this and several other errors on the Thanos receive pods:
"failed to handle request" err="5 errors: backing off forward request for endpoint thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901: target not available; store locally for endpoint thanos-receive-4.thanos-receive.thanos.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-3.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-3.thanos-receive.thanos.svc.cluster.local:10901: conflict"

Any help/hint is appreciated!


Kampe commented May 11, 2021

We see the same issues with our receiver and Prometheus setup, on prometheus:v2.26.0 and thanos:v0.20.1 respectively.

@liangyuanpeng

Same issue here: thanos, version 0.21.1 (branch: HEAD, revision: 3558f4a).


starleaffff commented Jul 23, 2021

We see a similar issue, which leaves a gap of hours of missing metrics. Thanos v0.19.0, Prometheus v2.27.1.

The issue happens occasionally when we roll thanos-receive (replication factor 2, replicas 3). During the period of missing metrics, I see streams of errors with "conflict" and "HTTP status 500", which is interesting. Here is one example (with new lines inserted and endpoints shortened):

err="server returned HTTP status 500 Internal Server Error:
  2 errors:
    replicate write request for endpoint thanos-receive-1: quorum not reached: forwarding request to endpoint thanos-receive-1: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1: conflict;
    replicate write request for endpoint thanos-receive-2: quorum not reached: forwarding request to endpoint thanos-receive-0: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-0: conflict"

And similar in thanos log:

err="
        2 errors:
        replicate write request for endpoint thanos-receive-2: quorum not reached: forwarding request to endpoint thanos-receive-0: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-0: conflict;
        replicate write request for endpoint thanos-receive-1: quorum not reached: forwarding request to endpoint thanos-receive-1: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1: conflict"
    msg="internal server error"

If I understand correctly, this means that both thanos-receive-1 and thanos-receive-0 already have the metrics sent by Prometheus. Why would Thanos respond with status 500, causing Prometheus to retry?
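
For what it's worth, here is the arithmetic I assume is in play (assuming the write quorum is computed as floor(replication_factor / 2) + 1):

    replication_factor = 2
    quorum             = floor(2 / 2) + 1 = 2   # both targeted replicas must report success

So if a replica's "conflict" (AlreadyExists) response is counted as a failure rather than as data that is already stored, the quorum check fails, the whole request is reported as a 500, and a 500 is retryable from Prometheus's point of view.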


stale bot commented Sep 22, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 22, 2021

stale bot commented Oct 11, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Oct 11, 2021

koktlzz commented Nov 12, 2021

Same issue in Thanos v0.23.

@dploeger

I think the only valid error here is what @starleaffff said (if you still have that).

Actually, the message says exactly what happened:

(...) Conflict: store locally (...)

Meaning: I already have the data you've sent me, please store it locally and don't bother me with it.

And the receiver does just that, unlike the sidecar, which provides the same data to the queriers and lets the queriers deduplicate.

So, yeah. 409 is okay, though very disturbing to see in the Prometheus log. Maybe this should be documented somewhere? (Or it is and I haven't come across it.)


koktlzz commented Nov 26, 2021

> I think the only valid error here is what @starleaffff said (if you still have that).
>
> Actually, the message says exactly what happened:
>
> (...) Conflict: store locally (...)
>
> Meaning: I already have the data you've sent me, please store it locally and don't bother me with it.
>
> And the receiver does just that, unlike the sidecar, which provides the same data to the queriers and lets the queriers deduplicate.
>
> So, yeah. 409 is okay, though very disturbing to see in the Prometheus log. Maybe this should be documented somewhere? (Or it is and I haven't come across it.)

Thank you for your reply. Maybe this error actually doesn't matter.
Anyway, my Prometheus and receiver work well. However, it really makes me panic, as you say. 😓

@enifeilio

Disturbing indeed.
It also makes me panic, as you say.


Kampe commented Apr 6, 2022

I too clench up with these errors.


zhangrj commented Apr 6, 2022

The same in v0.25, can anyone help?

@sharathfeb12

Seeing the same issue on v0.25.2 as well.

@FTwOoO

FTwOoO commented May 12, 2022

Seeing the same issue on v0.22.0 as well.

@sharathfeb12

Seeing the same issue on v0.26.0 as well.

@phillebaba
Contributor

@sharathfeb12 what version of Prometheus are you running, and are you running it in agent mode?

@sharathfeb12

I am running v2.30.1. Seeing the same issue on v2.36.1 as well.

@phillebaba
Contributor

@sharathfeb12 I am guessing you are running with a replication factor greater than 1? Out of interest, are you running Thanos as a Router-Ingestor split or just a single-stage Receiver?

Temporarily setting the replication factor to 1 seemed to solve the issues. I have created #5407 to track some of the debugging that I have been doing for this issue. My guess is that an incorrect status code is returned by Thanos, which causes Prometheus to keep retrying the same time series that Thanos already has. The reason setting the replication factor to 1 seems to help is that there is different error-handling logic for that case.
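
For anyone on a higher replication factor who wants to try that temporary workaround, it is only the one receive flag (a sketch reusing the arg style from the original post; note that a replication factor of 1 trades away write redundancy, so a single receiver restart can drop samples, and this is a debugging aid rather than a fix):

  - args:
    - receive
    - --receive.replication-factor=1   # temporary workaround while the status-code handling is investigated
    - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
    - --receive.local-endpoint=$(NAME).thanos-receive.cattle-prometheus.svc.cluster.local:10901
    # ...other flags unchanged from the original post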

@sharathfeb12

I am running with a replication factor of 2, because our GKE clusters go through node pool upgrades very often and we do not want an outage when that happens.

Due to the errors, the service teams think there is an issue on the server side and are not confident using the Thanos solution. I have also seen that the error count goes down when we run with one Prometheus replica instead of running in HA.

@cybervedaa

I see this issue in Thanos 0.28.0 as well. I am running with replication factor = 1, but still see this occasionally. I have to turn off remote write on all Prometheus instances and then re-enable it. It would be great if someone could work on a fix for this issue.

Collaborator

matej-g commented Oct 19, 2022

@cybervedaa there are a couple of related issues, namely #5407; we're looking at these actively 👍


cybervedaa commented Oct 19, 2022 via email
