Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

receive: Incorrect error status code with multiple AlreadyExists #5407

Open
phillebaba opened this issue Jun 7, 2022 · 13 comments
Open

receive: Incorrect error status code with multiple AlreadyExists #5407

phillebaba opened this issue Jun 7, 2022 · 13 comments

Comments

@phillebaba
Copy link
Contributor

phillebaba commented Jun 7, 2022

Thanos, Prometheus and Golang version used:

Thanos: quay.io/thanos/thanos:v0.26.0
Prometheus: quay.io/prometheus/prometheus:2.36.0

Object Storage Provider:
Azure

What happened:

I am running a multi-tenant router-receiver setup with multiple Prometheus instances remote writing to it, and a replication factor of 2. At times an error occurs where remote writing fails, and Prometheus will have to resend time series. However when this occurs some time series may already be present in the receiver store and return a AlreadyExistserror. This does not seem to be an issue when a single receiver already has the time series, but a bigger issue when multiple receivers returns the same error. The Prometheus instance will receive a 500 error and try again, forcing it to end up in an infinite loop where it will never be able to recover from.

What you expected to happen:

I expect the correct error to be returned for this condition so that the Prometheus instance will stop sending the same time series which already exist in the receiver store.

How to reproduce it (as minimally and precisely as possible):

This error will only occur when receivers are being torn down and setup again one by one. While this is being done some replications may fail causing the Prometheus instance to resend the same data. It is hard to pin point more conditions to get this error to happen, as sometimes everything works out without any issues. When the error occurs every single Prometheus instance will fail to write as they will continue to attempt sending the same metrics over and over again.

Full logs to relevant components:

level=error ts=2022-06-07T14:16:31.548131665Z caller=handler.go:373 component=receive component=receive-handler tenant=*** err="2 errors: replicate write request for endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-ingestor-receiver-0.monitor.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-ingestor-receiver-5.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-ingestor-receiver-0.monitor.svc.cluster.local:10901: conflict" msg="internal server error"

Anything else we need to know:

I have done some analysis of what is occurring and my guess is that the determineWriteErrorCause function has a bug in it where it does not detect the conflict error and return it.

func determineWriteErrorCause(err error, threshold int) error {

Looking at the handler code it is apparent that if a wrapped error is returned a 500 status code will be given instead of a 409 which is the correct one for conflict errors.

switch determineWriteErrorCause(err, 1) {
case nil:
return
case errNotReady:
http.Error(w, err.Error(), http.StatusServiceUnavailable)
case errUnavailable:
http.Error(w, err.Error(), http.StatusServiceUnavailable)
case errConflict:
http.Error(w, err.Error(), http.StatusConflict)
case errBadReplica:
http.Error(w, err.Error(), http.StatusBadRequest)
default:
level.Error(tLogger).Log("err", err, "msg", "internal server error")
http.Error(w, err.Error(), http.StatusInternalServerError)
}

Reading through the replication code it seems like the original AlreadyExists error is being wrapped two or three times into other errors. When this is done the function to determine if the error is a conflict will not work and instead it will return the whole error causing a 500 status.

ec <- errors.Wrapf(err, "replicate write request for endpoint %v", endpoint)

ec <- errors.Wrapf(err, "forwarding request to endpoint %v", endpoint)

return errors.Wrap(determineWriteErrorCause(err, quorum), "quorum not reached")

We need to write a unit test specifically replicating the AlreadyExists error to verify that determineWriteErrorCause will return a conflict error.

@phillebaba
Copy link
Contributor Author

phillebaba commented Jun 7, 2022

I think I have found the origin of this issue. As the comment in determineWriteErrorCause states it will not go deeper than the first multi error to determine the cause. The issue is that there are places where a returned multi error is flattened and places where it is not. If the multi error is not flattened the cause will never be determined. Resulting in a 500 status code.

This error is flattened with a comment explaining why it is.

// When a MultiError is added to another MultiError, the error slices are concatenated, not nested.
// To avoid breaking the counting logic, we need to flatten the error.
level.Debug(tLogger).Log("msg", "local tsdb write failed", "err", err.Error())
ec <- errors.Wrapf(determineWriteErrorCause(err, 1), "store locally for endpoint %v", endpoint)

In the case where replicate is called in fan out it is not flattened, meaning that there is a risk that a multi error is placed inside of a multi error.

go func(endpoint string) {
defer wg.Done()
var err error
tracing.DoInSpan(fctx, "receive_replicate", func(ctx context.Context) {
err = h.replicate(ctx, tenant, wreqs[endpoint])
})
if err != nil {
h.replications.WithLabelValues(labelError).Inc()
ec <- errors.Wrapf(err, "replicate write request for endpoint %v", endpoint)
return
}
h.replications.WithLabelValues(labelSuccess).Inc()
ec <- nil
}(endpoint)

@phillebaba
Copy link
Contributor Author

I really think that there is something weird going on when the replication factor is greater than one. Decreasing the replication factor temporarily gets all Prometheus agents out of their bad loop, and then it can be increased again. When decreasing the replication factor it starts returning 409 errors instead which gets them to backoff and decrease their shards.

The issue with testing this is that it is difficult to reproduce the error properly.

@fpetkovski
Copy link
Contributor

fpetkovski commented Jun 9, 2022

I think we might be able to catch this with a good enough test. For clarification, do you only see this when running separate receivers and ingesters? You could run the same multi-tenant setup without having a split.

@fpetkovski
Copy link
Contributor

fpetkovski commented Jun 9, 2022

This can also be a bug in how we verify quorum during replication:
https://github.com/thanos-io/thanos/blob/main/pkg/receive/handler.go#L438

The actual formula should be

(ReplicationFactor + 1) / 2

So with replication factor of 2, it should be sufficient for just one write to succeed.

That is also what is described in the design doc: https://thanos.io/tip/proposals-done/201812-thanos-remote-receive.md/

@phillebaba
Copy link
Contributor Author

It might take some time to setup a new environment without the ingestor split. I will try to get some time to write a unit test to reproduce this.

@Tontaube
Copy link

Please ask if you need further information, we're experiencing this issue in a routing receive setup and obviously use multiple receive for scaling reasons, which makes applying the workaround quite hard.

@phillebaba
Copy link
Contributor Author

@Tontaube are you seeing the same issue with a replication factor of 1? I think that is the interesting detail, which will help confirm the idea that the issue is related to the replication factor logic.

@Kampe
Copy link

Kampe commented Jul 3, 2022

Can confirm, we see the same issue with a replication factor of 1

@Tontaube
Copy link

I cannot tell, as our environment recovered due to solved other issues. We saw the problem always then, when data came in more than 2 hours too late. Reducing the receive instances was not tried, on one hand due to the fact it being a gitops only environment (no manual adaptions in prod) and the assumption, that one receiver would not be capable to handle the load, anyway.

@enimath
Copy link

enimath commented Aug 12, 2022

I am not 100% sure but maybe I am strugling same issue on my HA setup with 3x receivers , 2x replica :
thanos, version 0.27.0 (branch: HEAD, revision: f68bb36)
build user: root@708554d4fa46
build date: 20220705-14:57:42
go version: go1.17.11
platform: linux/amd64
#5513

@matej-g
Copy link
Collaborator

matej-g commented Sep 16, 2022

Hey @phillebaba,

I revisited this and I think you're totally right, we are unable to properly determine error cause if there is a multi-error within a multi-error. This can happen, especially with replication factor 2, which you mention is your case. Looking at this portion of the method:

thanos/pkg/receive/handler.go

Lines 1126 to 1132 in 9237dd9

// Determine which error occurred most.
sort.Sort(sort.Reverse(expErrs))
if exp := expErrs[0]; exp.count >= threshold {
return exp.err
}
return err

we see that the method returns a "cause" error (line 1129) if threshold is reached or it returns the original error (line 1132).

What I'm suspecting is that setting threshold to write quorum can cause this unwanted multi-error nesting. I considered two different scenarios:

  • with replication factor 2, the write quorum (threshold) is 2. This means that during a fanout, if 1 request succeeds and 1 request fails (what seems to be your case), we will never reach threshold. This means that even if error from the one failing request is a conflict error, we will actually return the multi-error (causing the "multi-error in multi-error" problem and resulting in 500).

  • with replication factor 3 (which for example uses our team), the write quorum (threshold) is still 2. Things can go a few ways:
    a) If 1 request fails, we still reached quorum, so there will be no error returned to Prometheus client
    b) if 2 requests fail with conflict error, we will reach the threshold and return conflict error = there will be no nested multi-error
    c) if 2 requests fail with different causes errors, we will not reach threshold, returning multi-error = we will return 500 to Prometheus client and it will retry

This might explain why this issue is not very prevalent with (I believe) more commonly used quorums like 3 or 5, where we usually reach the threshold. However, with quorum 2, if 1 request fails, this automatically means we always nest multi-error = we always return 500 to client = infinite retries.

I hope this is making sense. If these assumptions are correct, I see two ways of fixing this:
a) We remove the threshold and simply return either a the most frequent expected cause error or if none are found, we return the original error
b) As @fpetkovski was suggesting, our write quorum formula might not be as was originally intended - if we switched to (n + 1) / 2, this would make difference for even number replication factors, since with replication factor 2, we would only need 1 successful write and threshold during fanout error would be 1 as well, meaning we would avoid the nested multi-error (this would be my preference)

@squat
Copy link
Member

squat commented Sep 16, 2022

I can absolutely believe that there are some subtle issues in receive error handling. We've brought up discussions around this in the past and I fear that we haven't been completely rigorous in our assessment of receive errors :///
xref: #3833 (comment), #2899 (review)

just to add some historical context, the behavior of error handling was changed when we changed multi-error libraries and how we flattened/nested them. this behavior entirely changed how errors are counted and thus how quorum is validated. This challenge has been further highlighted in the past by speed optimizations to the error handling path that may have simplified error handling for speed reasons but maybe did not account for all of the edge-cases. That's not to say that the goals of correctness and performance are incompatible.

@matej-g I think your explanation sounds very very feasible however I'm worried that the we'll continue to play whackamole with receiver error issues if we aren't completely rigorous and exhaustive in how we analyze all of the potential error scenarios. I guess overall I'm mostly worried that we will implement heuristics for simplicity instead of being completely exhaustive.

a) We remove the threshold and simply return either a the most frequent expected cause error or if none are found, we return the original error

I'm somewhat confused by this; could you clarify it? The most frequent error cause isn't necessarily the failure we care about bubbling up to the top and thus not necessarily what should dictate the final error code, right?

I think we would be very well served to add explicit test cases for the scenarios you highlighted in handler_test.go so we can document the correct behavior in tests.

@SushilSanjayBhile
Copy link

Anyone working on this?
I am also facing same issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants