receive: Incorrect error status code with multiple AlreadyExists #5407

phillebaba · 2022-06-07T16:40:16Z

Thanos, Prometheus and Golang version used:

Thanos: quay.io/thanos/thanos:v0.26.0
Prometheus: quay.io/prometheus/prometheus:2.36.0

Object Storage Provider:
Azure

What happened:

I am running a multi-tenant router-receiver setup with multiple Prometheus instances remote writing to it, and a replication factor of 2. At times an error occurs where remote writing fails, and Prometheus will have to resend time series. However when this occurs some time series may already be present in the receiver store and return a AlreadyExistserror. This does not seem to be an issue when a single receiver already has the time series, but a bigger issue when multiple receivers returns the same error. The Prometheus instance will receive a 500 error and try again, forcing it to end up in an infinite loop where it will never be able to recover from.

What you expected to happen:

I expect the correct error to be returned for this condition so that the Prometheus instance will stop sending the same time series which already exist in the receiver store.

How to reproduce it (as minimally and precisely as possible):

This error will only occur when receivers are being torn down and setup again one by one. While this is being done some replications may fail causing the Prometheus instance to resend the same data. It is hard to pin point more conditions to get this error to happen, as sometimes everything works out without any issues. When the error occurs every single Prometheus instance will fail to write as they will continue to attempt sending the same metrics over and over again.

Full logs to relevant components:

level=error ts=2022-06-07T14:16:31.548131665Z caller=handler.go:373 component=receive component=receive-handler tenant=*** err="2 errors: replicate write request for endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-ingestor-receiver-0.monitor.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-ingestor-receiver-5.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-ingestor-receiver-0.monitor.svc.cluster.local:10901: conflict" msg="internal server error"

Anything else we need to know:

I have done some analysis of what is occurring and my guess is that the determineWriteErrorCause function has a bug in it where it does not detect the conflict error and return it.

thanos/pkg/receive/handler.go

Line 730 in 17c5764

func determineWriteErrorCause(err error, threshold int) error {

Looking at the handler code it is apparent that if a wrapped error is returned a 500 status code will be given instead of a 409 which is the correct one for conflict errors.

thanos/pkg/receive/handler.go

Lines 361 to 375 in 17c5764

    
           switch determineWriteErrorCause(err, 1) { 
        
           case nil: 
        
           	return 
        
           case errNotReady: 
        
           	http.Error(w, err.Error(), http.StatusServiceUnavailable) 
        
           case errUnavailable: 
        
           	http.Error(w, err.Error(), http.StatusServiceUnavailable) 
        
           case errConflict: 
        
           	http.Error(w, err.Error(), http.StatusConflict) 
        
           case errBadReplica: 
        
           	http.Error(w, err.Error(), http.StatusBadRequest) 
        
           default: 
        
           	level.Error(tLogger).Log("err", err, "msg", "internal server error") 
        
           	http.Error(w, err.Error(), http.StatusInternalServerError) 
        
           }

Reading through the replication code it seems like the original AlreadyExists error is being wrapped two or three times into other errors. When this is done the function to determine if the error is a conflict will not work and instead it will return the whole error causing a 500 status.

thanos/pkg/receive/handler.go

Line 484 in 22aefd6

ec <- errors.Wrapf(err, "replicate write request for endpoint %v", endpoint)

thanos/pkg/receive/handler.go

Line 582 in 22aefd6

ec <- errors.Wrapf(err, "forwarding request to endpoint %v", endpoint)

thanos/pkg/receive/handler.go

Line 665 in 22aefd6

return errors.Wrap(determineWriteErrorCause(err, quorum), "quorum not reached")

We need to write a unit test specifically replicating the AlreadyExists error to verify that determineWriteErrorCause will return a conflict error.

The text was updated successfully, but these errors were encountered:

phillebaba · 2022-06-07T17:15:43Z

I think I have found the origin of this issue. As the comment in determineWriteErrorCause states it will not go deeper than the first multi error to determine the cause. The issue is that there are places where a returned multi error is flattened and places where it is not. If the multi error is not flattened the cause will never be determined. Resulting in a 500 status code.

This error is flattened with a comment explaining why it is.

thanos/pkg/receive/handler.go

Lines 510 to 513 in 22aefd6

    
           // When a MultiError is added to another MultiError, the error slices are concatenated, not nested. 
        
           // To avoid breaking the counting logic, we need to flatten the error. 
        
           level.Debug(tLogger).Log("msg", "local tsdb write failed", "err", err.Error()) 
        
           ec <- errors.Wrapf(determineWriteErrorCause(err, 1), "store locally for endpoint %v", endpoint)

In the case where replicate is called in fan out it is not flattened, meaning that there is a risk that a multi error is placed inside of a multi error.

thanos/pkg/receive/handler.go

Lines 475 to 490 in 22aefd6

    
           go func(endpoint string) { 
        
           	defer wg.Done() 
        
           	var err error 
        
           	tracing.DoInSpan(fctx, "receive_replicate", func(ctx context.Context) { 
        
           		err = h.replicate(ctx, tenant, wreqs[endpoint]) 
        
           	}) 
        
           	if err != nil { 
        
           		h.replications.WithLabelValues(labelError).Inc() 
        
           		ec <- errors.Wrapf(err, "replicate write request for endpoint %v", endpoint) 
        
           		return 
        
           	} 
        
           	h.replications.WithLabelValues(labelSuccess).Inc() 
        
           	ec <- nil 
        
           }(endpoint)

phillebaba · 2022-06-08T14:19:49Z

I really think that there is something weird going on when the replication factor is greater than one. Decreasing the replication factor temporarily gets all Prometheus agents out of their bad loop, and then it can be increased again. When decreasing the replication factor it starts returning 409 errors instead which gets them to backoff and decrease their shards.

The issue with testing this is that it is difficult to reproduce the error properly.

fpetkovski · 2022-06-09T05:53:59Z

I think we might be able to catch this with a good enough test. For clarification, do you only see this when running separate receivers and ingesters? You could run the same multi-tenant setup without having a split.

fpetkovski · 2022-06-09T09:11:35Z

This can also be a bug in how we verify quorum during replication:
https://github.com/thanos-io/thanos/blob/main/pkg/receive/handler.go#L438

The actual formula should be

(ReplicationFactor + 1) / 2

So with replication factor of 2, it should be sufficient for just one write to succeed.

That is also what is described in the design doc: https://thanos.io/tip/proposals-done/201812-thanos-remote-receive.md/

phillebaba · 2022-06-14T11:39:42Z

It might take some time to setup a new environment without the ingestor split. I will try to get some time to write a unit test to reproduce this.

Tontaube · 2022-06-28T06:52:31Z

Please ask if you need further information, we're experiencing this issue in a routing receive setup and obviously use multiple receive for scaling reasons, which makes applying the workaround quite hard.

phillebaba · 2022-06-29T14:04:09Z

@Tontaube are you seeing the same issue with a replication factor of 1? I think that is the interesting detail, which will help confirm the idea that the issue is related to the replication factor logic.

Kampe · 2022-07-03T01:54:41Z

Can confirm, we see the same issue with a replication factor of 1

Tontaube · 2022-07-12T12:26:26Z

I cannot tell, as our environment recovered due to solved other issues. We saw the problem always then, when data came in more than 2 hours too late. Reducing the receive instances was not tried, on one hand due to the fact it being a gitops only environment (no manual adaptions in prod) and the assumption, that one receiver would not be capable to handle the load, anyway.

enimath · 2022-08-12T11:02:38Z

I am not 100% sure but maybe I am strugling same issue on my HA setup with 3x receivers , 2x replica :
thanos, version 0.27.0 (branch: HEAD, revision: f68bb36)
build user: root@708554d4fa46
build date: 20220705-14:57:42
go version: go1.17.11
platform: linux/amd64
#5513

matej-g · 2022-09-16T15:49:43Z

Hey @phillebaba,

I revisited this and I think you're totally right, we are unable to properly determine error cause if there is a multi-error within a multi-error. This can happen, especially with replication factor 2, which you mention is your case. Looking at this portion of the method:

thanos/pkg/receive/handler.go

Lines 1126 to 1132 in 9237dd9

    
           // Determine which error occurred most. 
        
           sort.Sort(sort.Reverse(expErrs)) 
        
           if exp := expErrs[0]; exp.count >= threshold { 
        
           	return exp.err 
        
           } 
        
           return err

we see that the method returns a "cause" error (line 1129) if threshold is reached or it returns the original error (line 1132).

What I'm suspecting is that setting threshold to write quorum can cause this unwanted multi-error nesting. I considered two different scenarios:

with replication factor 2, the write quorum (threshold) is 2. This means that during a fanout, if 1 request succeeds and 1 request fails (what seems to be your case), we will never reach threshold. This means that even if error from the one failing request is a conflict error, we will actually return the multi-error (causing the "multi-error in multi-error" problem and resulting in 500).
with replication factor 3 (which for example uses our team), the write quorum (threshold) is still 2. Things can go a few ways:
a) If 1 request fails, we still reached quorum, so there will be no error returned to Prometheus client
b) if 2 requests fail with conflict error, we will reach the threshold and return conflict error = there will be no nested multi-error
c) if 2 requests fail with different causes errors, we will not reach threshold, returning multi-error = we will return 500 to Prometheus client and it will retry

This might explain why this issue is not very prevalent with (I believe) more commonly used quorums like 3 or 5, where we usually reach the threshold. However, with quorum 2, if 1 request fails, this automatically means we always nest multi-error = we always return 500 to client = infinite retries.

I hope this is making sense. If these assumptions are correct, I see two ways of fixing this:
a) We remove the threshold and simply return either a the most frequent expected cause error or if none are found, we return the original error
b) As @fpetkovski was suggesting, our write quorum formula might not be as was originally intended - if we switched to (n + 1) / 2, this would make difference for even number replication factors, since with replication factor 2, we would only need 1 successful write and threshold during fanout error would be 1 as well, meaning we would avoid the nested multi-error (this would be my preference)

squat · 2022-09-16T16:36:37Z

I can absolutely believe that there are some subtle issues in receive error handling. We've brought up discussions around this in the past and I fear that we haven't been completely rigorous in our assessment of receive errors :///
xref: #3833 (comment), #2899 (review)

just to add some historical context, the behavior of error handling was changed when we changed multi-error libraries and how we flattened/nested them. this behavior entirely changed how errors are counted and thus how quorum is validated. This challenge has been further highlighted in the past by speed optimizations to the error handling path that may have simplified error handling for speed reasons but maybe did not account for all of the edge-cases. That's not to say that the goals of correctness and performance are incompatible.

@matej-g I think your explanation sounds very very feasible however I'm worried that the we'll continue to play whackamole with receiver error issues if we aren't completely rigorous and exhaustive in how we analyze all of the potential error scenarios. I guess overall I'm mostly worried that we will implement heuristics for simplicity instead of being completely exhaustive.

a) We remove the threshold and simply return either a the most frequent expected cause error or if none are found, we return the original error

I'm somewhat confused by this; could you clarify it? The most frequent error cause isn't necessarily the failure we care about bubbling up to the top and thus not necessarily what should dictate the final error code, right?

I think we would be very well served to add explicit test cases for the scenarios you highlighted in handler_test.go so we can document the correct behavior in tests.

SushilSanjayBhile · 2024-03-26T16:56:05Z

Anyone working on this?
I am also facing same issue.

matej-g added bug component: receive needs-investigation labels Jun 8, 2022

phillebaba mentioned this issue Jun 21, 2022

Thanos receive store locally for endpoint conflict #3913

Closed

matej-g mentioned this issue Nov 24, 2022

Fix ketama quorum #5910

Merged

JayChanggithub mentioned this issue Jan 11, 2023

Receive component got "AlreadyExists desc = store" error. #6036

Open

philipgough mentioned this issue Feb 16, 2023

Thanos Receive: "Out of bounds metric" #4831

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

receive: Incorrect error status code with multiple AlreadyExists #5407

receive: Incorrect error status code with multiple AlreadyExists #5407

phillebaba commented Jun 7, 2022 •

edited

phillebaba commented Jun 7, 2022 •

edited

phillebaba commented Jun 8, 2022

fpetkovski commented Jun 9, 2022 •

edited

fpetkovski commented Jun 9, 2022 •

edited

phillebaba commented Jun 14, 2022

Tontaube commented Jun 28, 2022

phillebaba commented Jun 29, 2022

Kampe commented Jul 3, 2022

Tontaube commented Jul 12, 2022

enimath commented Aug 12, 2022

matej-g commented Sep 16, 2022

squat commented Sep 16, 2022

SushilSanjayBhile commented Mar 26, 2024

receive: Incorrect error status code with multiple AlreadyExists #5407

receive: Incorrect error status code with multiple AlreadyExists #5407

Comments

phillebaba commented Jun 7, 2022 • edited

phillebaba commented Jun 7, 2022 • edited

phillebaba commented Jun 8, 2022

fpetkovski commented Jun 9, 2022 • edited

fpetkovski commented Jun 9, 2022 • edited

phillebaba commented Jun 14, 2022

Tontaube commented Jun 28, 2022

phillebaba commented Jun 29, 2022

Kampe commented Jul 3, 2022

Tontaube commented Jul 12, 2022

enimath commented Aug 12, 2022

matej-g commented Sep 16, 2022

squat commented Sep 16, 2022

SushilSanjayBhile commented Mar 26, 2024

phillebaba commented Jun 7, 2022 •

edited

phillebaba commented Jun 7, 2022 •

edited

fpetkovski commented Jun 9, 2022 •

edited

fpetkovski commented Jun 9, 2022 •

edited