
Thanos receive store locally for endpoint conflict #3913

Closed
AsherBoone opened this issue Mar 11, 2021 · 22 comments

AsherBoone commented Mar 11, 2021

Thanos and Prometheus version used:
Thanos: v0.18.0
Prometheus: v2.11.1

Object Storage Provider: S3

What happened:
Prometheus errors in the logs:

level=error ts=2021-03-04T08:02:54.350Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: 2 errors: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExi"
level=error ts=2021-03-04T08:03:16.065Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.099Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.286Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.318Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.350Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.573Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.637Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict"
level=error ts=2021-03-04T08:03:16.671Z caller=queue_manager.go:699 component=remote queue=0:http://10.53.26.191:30021/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: 2 errors: store locally for endpoint thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901: rpc error: code = AlreadyExi"

hashring.json (ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-hashrings
  namespace: cattle-prometheus
data:
  thanos-receive-hashrings.json: |
    [
      {
        "hashring": "soft-tenants",
        "endpoints":
        [
          "thanos-receive-0.thanos-receive.cattle-prometheus.svc.cluster.local:10901",
          "thanos-receive-1.thanos-receive.cattle-prometheus.svc.cluster.local:10901"
        ]
      }
    ]

thanos receive:

  - args:
    - receive
    - --log.level=info
    - --log.format=logfmt
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --remote-write.address=0.0.0.0:19291
    - --objstore.config-file=/etc/thanos/objectstorage.yaml
    - --receive.replication-factor=1
    - --tsdb.path=/var/thanos/receive
    - --tsdb.retention=12h
    - --http-grace-period=2m
    - --grpc-grace-period=2m
    - --label=receive_replica="$(NAME)"
    - --label=receive="true"
    - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
    - --receive.hashrings-file-refresh-interval=3m
    - --receive.local-endpoint=$(NAME).thanos-receive.cattle-prometheus.svc.cluster.local:10901

Prometheus remote_write target (NodePort mode): http://10.53.26.191:30021/api/v1/receive
The Prometheus log has a lot of these errors. I have tried modifying the Thanos receive configuration many times, but the conflicts still appear. Can anyone help me?
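
For reference, here is a minimal prometheus.yml sketch matching that remote_write URL (the external_labels block is illustrative, not taken from this setup; if Prometheus runs as an HA pair, each replica should carry a distinct replica-style external label so the receiver can tell the two write streams apart):

global:
  external_labels:
    cluster: my-cluster          # illustrative value
    prometheus_replica: prom-0   # illustrative; must differ per HA replica

remote_write:
  - url: http://10.53.26.191:30021/api/v1/receive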


dhohengassner commented Apr 19, 2021

Thanks @AsherBoone for raising this!

I'm probably seeing the same issue on my clusters. It always happens when I roll the receiver StatefulSet.

Error on Prometheus side:
level=error remote_name=fc9017 url=https://thanos-receiver.my.domain/api/v1/receive msg="non-recoverable error" count=7250 err="server returned HTTP status 409 Conflict: conflict"

Seeing this and several other errors on the Thanos receive pods:
"failed to handle request" err="5 errors: backing off forward request for endpoint thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901: target not available; store locally for endpoint thanos-receive-4.thanos-receive.thanos.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901: conflict; forwarding request to endpoint thanos-receive-3.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-3.thanos-receive.thanos.svc.cluster.local:10901: conflict"

Any help/hint is appreciated!


Kampe commented May 11, 2021

We see the same issues with our receiver and Prometheus setup, on prometheus:v2.26.0 and thanos:v0.20.1 respectively.

@liangyuanpeng

Same issue here: thanos, version 0.21.1 (branch: HEAD, revision: 3558f4a).


starleaffff commented Jul 23, 2021

We see a similar issue, which leaves a gap of hours of missing metrics. Thanos v0.19.0, Prometheus v2.27.1.

The issue happens occasionally when we roll thanos-receive (replication factor 2, replicas 3). During the period of missing metrics, I see streams of errors with "conflict" and "HTTP status 500", which is interesting. Here is one example (with new lines inserted and endpoints shortened):

err="server returned HTTP status 500 Internal Server Error:
  2 errors:
    replicate write request for endpoint thanos-receive-1: quorum not reached: forwarding request to endpoint thanos-receive-1: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1: conflict;
    replicate write request for endpoint thanos-receive-2: quorum not reached: forwarding request to endpoint thanos-receive-0: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-0: conflict"

And similar in thanos log:

err="
        2 errors:
        replicate write request for endpoint thanos-receive-2: quorum not reached: forwarding request to endpoint thanos-receive-0: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-0: conflict;
        replicate write request for endpoint thanos-receive-1: quorum not reached: forwarding request to endpoint thanos-receive-1: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1: conflict"
    msg="internal server error"

If I understand correctly, this means that both thanos-receive-1 and thanos-receive-0 already have the metrics sent by Prometheus. Why would Thanos respond with status 500, causing Prometheus to retry?
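
For what it's worth, here is the arithmetic I assume is in play (assuming the write quorum is computed as floor(replication_factor / 2) + 1):

    replication_factor = 2
    quorum             = floor(2 / 2) + 1 = 2   # both targeted replicas must report success

So if a replica's "conflict" (AlreadyExists) response is counted as a failure rather than as data that is already stored, the quorum check fails, the whole request is reported as a 500, and a 500 is retryable from Prometheus's point of view.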


stale bot commented Sep 22, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 22, 2021

stale bot commented Oct 11, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Oct 11, 2021

koktlzz commented Nov 12, 2021

Same issue in Thanos v0.23.

@dploeger

I think the only valid error here is what @starleaffff said (if you still have that).

Actually, the message says exactly what happened:

(...) Conflict: store locally (...)

Meaning: I already have the data you've sent me, please store it locally and don't bother me with it.

And the receiver does just that, unlike the sidecar, which provides the same data to the queriers and lets the queriers deduplicate.

So, yeah. 409 is okay, though very disturbing to see in the Prometheus log. Maybe this should be documented somewhere? (Or it is and I haven't come across it.)


koktlzz commented Nov 26, 2021

> I think the only valid error here is what @starleaffff said (if you still have that).
>
> Actually, the message says exactly what happened:
>
> (...) Conflict: store locally (...)
>
> Meaning: I already have the data you've sent me, please store it locally and don't bother me with it.
>
> And the receiver does just that, unlike the sidecar, which provides the same data to the queriers and lets the queriers deduplicate.
>
> So, yeah. 409 is okay, though very disturbing to see in the Prometheus log. Maybe this should be documented somewhere? (Or it is and I haven't come across it.)

Thank you for your reply. Maybe this error actually doesn't matter.
Anyway, my Prometheus and receiver work well. However, it really makes me panic, as you say. 😓

@enifeilio

Disturbing indeed.
It also makes me panic, as you say.


Kampe commented Apr 6, 2022

I too clench up with these errors.


zhangrj commented Apr 6, 2022

The same in v0.25, can anyone help?

@sharathfeb12

Seeing the same issue on v0.25.2 as well.

@FTwOoO

FTwOoO commented May 12, 2022

Seeing the same issue on v0.22.0 as well.

@sharathfeb12

Seeing the same issue on v0.26.0 as well.

@phillebaba
Contributor

@sharathfeb12 what version of Prometheus are you running, and are you running it in agent mode?

@sharathfeb12

I am running v2.30.1. Seeing the same issue on v2.36.1 as well.

@phillebaba
Contributor

@sharathfeb12 I am guessing you are running with a replication factor greater than 1? Out of interest, are you running Thanos as a Router-Ingestor split or just a single-stage Receiver?

Temporarily setting the replication factor to 1 seemed to solve the issues. I have created #5407 to track some of the debugging that I have been doing for this issue. My guess is that an incorrect status code is returned by Thanos, which causes Prometheus to keep retrying the same time series that Thanos already has. The reason setting the replication factor to 1 seems to help is that there is different error-handling logic for that case.
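
For anyone on a higher replication factor who wants to try that temporary workaround, it is only the one receive flag (a sketch reusing the arg style from the original post; note that a replication factor of 1 trades away write redundancy, so a single receiver restart can drop samples, and this is a debugging aid rather than a fix):

  - args:
    - receive
    - --receive.replication-factor=1   # temporary workaround while the status-code handling is investigated
    - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
    - --receive.local-endpoint=$(NAME).thanos-receive.cattle-prometheus.svc.cluster.local:10901
    # ...other flags unchanged from the original post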

@sharathfeb12

I am running with a replication factor of 2, because our GKE clusters go through node pool upgrades very often and we do not want an outage when that happens.

Due to the errors, the service teams think there is an issue on the server side and are not confident using the Thanos solution. I have also seen that the error count goes down when we run with one Prometheus replica instead of running in HA.

@cybervedaa

I see this issue in Thanos 0.28.0 as well. I am running with replication factor = 1, but still see this occasionally. I have to turn off remote write on all Prometheus instances and then re-enable it. It would be great if someone could work on a fix for this issue.

Collaborator

matej-g commented Oct 19, 2022

@cybervedaa there are a couple of related issues, namely #5407; we're looking at these actively 👍


cybervedaa commented Oct 19, 2022 via email
