Fix ketama quorum #5910

Merged: 7 commits merged into thanos-io:main from ketama-quorum on Dec 15, 2022

Conversation

@fpetkovski (Contributor) commented Nov 20, 2022

The quorum calculation is currently broken when using the Ketama
hashring. The reasons are explained in detail in issue #5784.

This commit fixes quorum calculation by tracking successful writes
for each individual time-series inside a remote-write request.

The commit also removes the replicate() method inside the Handler
and moves the entire logic of fanning out and calculating success
into the fanoutForward() method.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Fixes #5784
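
For illustration only (not the actual handler code; all names below are made up), the core idea is that write quorum has to hold for every individual series in the request rather than for whole replication batches:

package main

import "fmt"

// quorumReached reports whether every series in a remote-write request
// reached write quorum. seriesWrites[i] is the number of successful writes
// for the i-th series.
func quorumReached(seriesWrites []int, replicationFactor int) bool {
	quorum := replicationFactor/2 + 1
	for _, successes := range seriesWrites {
		// A single series missing quorum fails the whole request.
		if successes < quorum {
			return false
		}
	}
	return true
}

func main() {
	// With replication factor 3, every series needs at least 2 successful writes.
	fmt.Println(quorumReached([]int{3, 2, 2}, 3)) // true
	fmt.Println(quorumReached([]int{3, 1, 2}, 3)) // false
}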

@matej-g (Collaborator) left a comment:

Thanks for this @fpetkovski, I think the approach in general looks good and somewhat converges with #5791, as you mentioned. A couple of thoughts:

  • I'm not sure if error determination works correctly here. It might happen that we get a mix of different failure reasons for different replication batches (some may end up with 409, some with 500) - in such a case I think we have no other option but to tell the client to retry (i.e. return a server error).
  • It would be good to add more test cases with different numbers of nodes / replication factors, plus E2E tests, perhaps taken over from #5791 (Receiver: Fix quorum handling for all hashing algorithms).

@fpetkovski force-pushed the ketama-quorum branch 6 times, most recently from 7542349 to 2ec06ca on November 24, 2022 11:02
@fpetkovski marked this pull request as ready for review on November 24, 2022 11:04
The quorum calculation is currently broken when using the Ketama
hashring. The reasons are explained in detail in issue
thanos-io#5784.

This commit fixes quorum calculation by tracking successful writes
for each individual time-series inside a remote-write request.

The commit also removes the replicate() method inside the Handler
and moves the entire logic of fanning out and calculating success
into the fanoutForward() method.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski (Contributor, author) commented:

Thanks everyone for the review. We had a sync with @matej-g and it seems like the only correct way to verify quorum is to track successful writes for each individual time-series. I've updated this PR to reflect that.

@matej-g (Collaborator) left a comment:

As we spoke, overall this approach seems fine and more understandable than with replicating batches. One more part to figure out is the error handling / determination.

On the other hand, I'm uncertain about the performance implications, as we're changing the characteristics of how replication in the receiver works. That's on both the micro level (we'll now track replication for each series instead of for batches) and the macro level (we'll send fewer but bigger requests). It would be nice to run some of the benchmarks we have for the handler, as well as to see this in action on a cluster with some real traffic or a synthetic load test.

if seriesReplicated {
	errs.Add(rerr.Err())
} else if uint64(len(rerr)) >= failureThreshold {
	cause := determineWriteErrorCause(rerr.Err(), quorum)
Collaborator:

I think we'll also have to change how we determine the HTTP error we return to the client when this cause error bubbles back up to handleRequest. Right now we return the error that occurs the most, or the original multi-error, since we use a threshold of 1. But this might be incorrect: if the cause error for any individual series replication is a server error, we have to retry the whole request. I think the solution would be:

  • Return a server error if any of the cause errors is an unknown error / unavailable / not ready (the cases where we have to retry). A tricky but less important part is exactly which error to return when we have a mixed bag of server errors - the client's behavior should be the same regardless of the error message we decide to return.
  • Otherwise we should only have conflict errors and can return a conflict.
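
A minimal sketch of this decision rule, with made-up names and plain HTTP status codes standing in for the real error handling:

package main

import (
	"errors"
	"fmt"
)

var (
	errUnavailable = errors.New("unavailable")
	errNotReady    = errors.New("not ready")
	errConflict    = errors.New("conflict")
)

// finalStatus picks one HTTP status for the whole request from the
// per-series causes.
func finalStatus(causes []error) int {
	conflict := false
	for _, cause := range causes {
		switch {
		case cause == nil:
			// Series replicated successfully.
		case errors.Is(cause, errUnavailable), errors.Is(cause, errNotReady):
			// Any retryable cause means the client has to retry the whole request.
			return 503
		case errors.Is(cause, errConflict):
			conflict = true
		default:
			// Unknown errors are treated as retryable server errors as well.
			return 500
		}
	}
	if conflict {
		return 409
	}
	return 200
}

func main() {
	fmt.Println(finalStatus([]error{nil, errConflict, errUnavailable})) // 503
	fmt.Println(finalStatus([]error{nil, errConflict}))                 // 409
	fmt.Println(finalStatus([]error{nil, nil}))                         // 200
}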

Collaborator:

Another thing to be mindful of here: when the cause returns the original multi-error (and the same applies above for the if branch), we are putting a multi-error inside the errs multi-error, which can lead to erroneous 5xx responses as described in #5407 (comment).

Contributor (author):

Yes, you are correct. However, I wonder if we already have this issue in main, because we calculate the top-level cause the same way, using threshold=1. So if we have 2 batches with a conflict and 1 batch with a server error, we will return a conflict to the user and not retry the request.

In any case, I would also prefer to solve this problem now since it can lead to data loss.

Contributor (author):

One thing I am not sure about is what the error code should be when we try to replicate a series and we get one success, one server error and one client error. Right now I believe we return a client error, but if we change the rules, we would return a server error. It also means that in the case of 2 conflicts (samples already exist in TSDB) and 1 server error, we would still return a server error even though that might not be necessary.

Maybe for replicating an individual series we can treat client errors as success and only return a 5xx when two replicas fail. For the overall response, we can return a 5xx if any series has a 5xx.

Collaborator:

Yes, I believe we basically have to treat a conflict as a 'success'. It's just important to return the correct status upstream, so if we have any conflicts during replication we'll want to return that to the client. Otherwise 5xx and OK should be clear (5xx if any series fails quorum; OK if there are no failed quorums or conflicts).
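
A small sketch of that per-series rule (illustrative types only, not the PR's code): a conflict counts toward quorum but is remembered, so a 409 can still be surfaced when nothing needs to be retried:

package main

import "fmt"

// seriesResult summarizes how a single series fared across its replicas.
type seriesResult struct {
	successes int // replicas that accepted the write
	conflicts int // replicas that reported a conflict (sample already exists)
	replicas  int // replication factor
}

func (r seriesResult) outcome() string {
	quorum := r.replicas/2 + 1
	// Conflicts count as success for the purpose of reaching quorum.
	if r.successes+r.conflicts < quorum {
		return "retry" // quorum failed: the client must resend
	}
	if r.conflicts > 0 {
		return "conflict" // quorum met, but report the conflict upstream
	}
	return "ok"
}

func main() {
	fmt.Println(seriesResult{successes: 1, conflicts: 1, replicas: 3}.outcome()) // conflict
	fmt.Println(seriesResult{successes: 1, conflicts: 0, replicas: 3}.outcome()) // retry
	fmt.Println(seriesResult{successes: 2, conflicts: 0, replicas: 3}.outcome()) // ok
}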

Contributor (author):

Makes sense. I think MultiError and determineWriteErrorCause are not good abstractions for this; the determineWriteErrorCause function is overloaded and tries to determine the error for both cases.

Because of this, I added two error types, writeErrors and replicationErrors, each with its own Cause() method. The writeErrors cause prioritizes server errors, while the one from replicationErrors is mostly identical to determineWriteErrorCause and is used for determining the error of replicating a single series.

This way we always use the Cause method, and depending on the error type we bubble up the appropriate error.
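
A hedged sketch of what a writeErrors-style type with its own Cause() could look like; the actual types added in this PR live in pkg/receive and differ in detail, and replicationErrors would be analogous with a threshold-based cause:

package main

import (
	"errors"
	"fmt"
)

var (
	errUnavailable = errors.New("unavailable")
	errConflict    = errors.New("conflict")
)

// writeErrors collects the errors from writing one request to its targets.
type writeErrors struct {
	errs []error
}

func (w *writeErrors) add(err error) {
	if err != nil {
		w.errs = append(w.errs, err)
	}
}

// Cause prioritizes server errors: a retryable error wins over a conflict,
// and nil is returned only if nothing failed.
func (w *writeErrors) Cause() error {
	var cause error
	for _, err := range w.errs {
		if errors.Is(err, errUnavailable) {
			return errUnavailable
		}
		if errors.Is(err, errConflict) && cause == nil {
			cause = errConflict
		}
	}
	return cause
}

func main() {
	we := &writeErrors{}
	we.add(errConflict)
	we.add(errUnavailable)
	fmt.Println(we.Cause()) // unavailable: the server error takes priority
}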

@fpetkovski force-pushed the ketama-quorum branch 4 times, most recently from 12dd5d5 to b0a227b on November 26, 2022 10:19
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
@fpetkovski force-pushed the ketama-quorum branch 3 times, most recently from 4c7536b to 0c6087b on November 26, 2022 18:18
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
@fpetkovski force-pushed the ketama-quorum branch 2 times, most recently from 2c4ed70 to a19fa93 on November 27, 2022 11:25
@fpetkovski (Contributor, author) commented:

Here are the benchmark results with the per-series error tracking:

name                                                                               old time/op    new time/op    delta
HandlerReceiveHTTP/typical_labels_under_1KB,_500_of_them/OK-8                         576µs ± 2%     611µs ± 3%    +6.20%  (p=0.000 n=9+9)
HandlerReceiveHTTP/typical_labels_under_1KB,_500_of_them/conflict_errors-8            649µs ± 3%    1057µs ± 2%   +62.75%  (p=0.000 n=10+9)
HandlerReceiveHTTP/typical_labels_under_1KB,_5000_of_them/OK-8                       5.60ms ± 2%    5.95ms ± 1%    +6.10%  (p=0.000 n=9+8)
HandlerReceiveHTTP/typical_labels_under_1KB,_5000_of_them/conflict_errors-8          6.34ms ± 1%   10.57ms ± 1%   +66.61%  (p=0.000 n=10+9)
HandlerReceiveHTTP/typical_labels_under_1KB,_20000_of_them/OK-8                      22.6ms ± 1%    23.6ms ± 1%    +4.44%  (p=0.000 n=8+9)
HandlerReceiveHTTP/typical_labels_under_1KB,_20000_of_them/conflict_errors-8         25.6ms ± 2%    41.3ms ± 2%   +61.48%  (p=0.000 n=10+9)
HandlerReceiveHTTP/extremely_large_label_value_10MB,_10_of_them/OK-8                 51.7ms ± 1%    51.7ms ± 1%      ~     (p=0.796 n=10+10)
HandlerReceiveHTTP/extremely_large_label_value_10MB,_10_of_them/conflict_errors-8    52.2ms ± 1%    52.0ms ± 2%      ~     (p=0.436 n=9+9)

name                                                                               old alloc/op   new alloc/op   delta
HandlerReceiveHTTP/typical_labels_under_1KB,_500_of_them/OK-8                        1.15MB ± 0%    1.20MB ± 0%    +3.57%  (p=0.000 n=9+10)
HandlerReceiveHTTP/typical_labels_under_1KB,_500_of_them/conflict_errors-8           1.41MB ± 0%    1.79MB ± 0%   +27.16%  (p=0.000 n=10+10)
HandlerReceiveHTTP/typical_labels_under_1KB,_5000_of_them/OK-8                       13.0MB ± 0%    13.6MB ± 0%    +4.36%  (p=0.000 n=9+9)
HandlerReceiveHTTP/typical_labels_under_1KB,_5000_of_them/conflict_errors-8          15.5MB ± 0%    19.5MB ± 1%   +25.67%  (p=0.000 n=10+10)
HandlerReceiveHTTP/typical_labels_under_1KB,_20000_of_them/OK-8                      52.6MB ± 1%    56.9MB ± 1%    +8.29%  (p=0.000 n=9+10)
HandlerReceiveHTTP/typical_labels_under_1KB,_20000_of_them/conflict_errors-8         62.0MB ± 0%    78.6MB ± 1%   +26.80%  (p=0.000 n=10+10)
HandlerReceiveHTTP/extremely_large_label_value_10MB,_10_of_them/OK-8                  110MB ± 0%     110MB ± 0%      ~     (p=0.105 n=10+10)
HandlerReceiveHTTP/extremely_large_label_value_10MB,_10_of_them/conflict_errors-8     110MB ± 0%     110MB ± 0%    +0.00%  (p=0.050 n=10+10)

name                                                                               old allocs/op  new allocs/op  delta
HandlerReceiveHTTP/typical_labels_under_1KB,_500_of_them/OK-8                         3.10k ± 0%     3.61k ± 0%   +16.63%  (p=0.000 n=9+10)
HandlerReceiveHTTP/typical_labels_under_1KB,_500_of_them/conflict_errors-8            6.63k ± 0%    14.65k ± 0%  +120.83%  (p=0.000 n=10+10)
HandlerReceiveHTTP/typical_labels_under_1KB,_5000_of_them/OK-8                        30.3k ± 0%     35.4k ± 0%   +16.57%  (p=0.000 n=9+9)
HandlerReceiveHTTP/typical_labels_under_1KB,_5000_of_them/conflict_errors-8           65.1k ± 0%    145.2k ± 0%  +122.84%  (p=0.000 n=10+9)
HandlerReceiveHTTP/typical_labels_under_1KB,_20000_of_them/OK-8                        121k ± 0%      141k ± 0%   +16.56%  (p=0.000 n=9+9)
HandlerReceiveHTTP/typical_labels_under_1KB,_20000_of_them/conflict_errors-8           260k ± 0%      580k ± 0%  +123.02%  (p=0.000 n=7+9)
HandlerReceiveHTTP/extremely_large_label_value_10MB,_10_of_them/OK-8                   87.1 ± 1%     107.4 ± 1%   +23.29%  (p=0.000 n=9+10)
HandlerReceiveHTTP/extremely_large_label_value_10MB,_10_of_them/conflict_errors-8       219 ± 1%       384 ± 0%   +75.33%  (p=0.000 n=10+9)

There is a notable difference when we have actual errors, but this is likely expected because we have more errors to work with, and more objects to manage.

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski (Contributor, author) commented:

Rolled this out in staging today to see if there are any differences in resource usage. We have a set of routers, and this is what CPU and memory look like before and after the rollout:

[image: CPU and memory usage of the routers before and after the rollout]

The spike indicates when the rollout took place.

@matej-g (Collaborator) left a comment:

Latest changes are looking good; differentiating between write and replication errors makes the error handling more digestible 👍 I have a couple more nits, and it would be good if we could add a few more comments here and there in the forward method to make the 'funneling' from write errors -> replication errors -> final error a bit more obvious. Since we now have quite a lot of error-handling code, I'm also wondering if it would make sense to extract these types and methods into a separate file (e.g. receive/errors.go).

We also load tested the changes with @philipgough on our test cluster but could not see any difference in performance. The microbenchmark runs also look acceptable to me. So performance-wise I'd expect this to be all good.

}

expErrs := expectedErrors{
	{err: errUnavailable, cause: isUnavailable},
Collaborator:

Technically, we should not expect unavailable here, as that is expected at the node level. I think we can only expect not ready (if the TSDB appender is not ready) or conflict.

Contributor (author):

I think we can still have an unavailable error, because write errors can come either from writing to a local TSDB or from sending a replication request to a different node. And that node can return unavailable in various cases:

switch determineWriteErrorCause(err, 1) {
case nil:
	return &storepb.WriteResponse{}, nil
case errNotReady:
	return nil, status.Error(codes.Unavailable, err.Error())
case errUnavailable:
	return nil, status.Error(codes.Unavailable, err.Error())
case errConflict:
	return nil, status.Error(codes.AlreadyExists, err.Error())
case errBadReplica:
	return nil, status.Error(codes.InvalidArgument, err.Error())
default:
	return nil, status.Error(codes.Internal, err.Error())
}

If the cause of a replicationErr is an unavailable error, then this error will bubble up to the write errors and we need to be able to detect it.

Collaborator:

Got it, you're right, I see the flow now. I got confused because I associated write errors only in the narrow sense (i.e. TSDB write errors), but we're also using them to capture remote-write errors on line 626, which can originate from a node's unavailability, etc.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@matej-g previously approved these changes Nov 29, 2022
@matej-g (Collaborator) left a comment:

This PR looks good to me now 👍, great job @fpetkovski.

One more theoretical concern I discussed with @fpetkovski is what effect the increased resource usage for error handling would have in an 'unhappy path' scenario (e.g. some nodes in our hashring are down, or clients keep sending us invalid data, resulting in an increased error rate in the system). Since the microbenchmarks show this could consume ~20% more memory, would that translate into an overall increase in memory usage in a receive replica? Could that lead to further destabilization of the hashring? We could run an additional load test to try out this hypothesis (cc @philipgough).

With this in mind, I'm still happy to go forward and iterate on this solution if any performance issues pop up.

Still I'd also like more eyes on this, nominating @bwplotka @philipgough @douglascamata 😜

@@ -51,196 +47,6 @@ import (
	"github.com/thanos-io/thanos/pkg/testutil"
)

func TestDetermineWriteErrorCause(t *testing.T) {
Collaborator:

I wonder if we could replace this with a couple of test cases for the replicationErrors and writeErrors causes?

@douglascamata (Contributor) left a comment:

Suggesting some changes to variable names to make understanding this code slightly easier.

return err
}
key := endpointReplica{endpoint: endpoint, replica: rn}
er, ok := wreqs[key]
Contributor:

Could this variable named er receive a better name? I have no clue what an er is, and it's easy to mistake it for err or even endpointReplica (variables of this type often have the name er, which is something else I think we should slowly move away from).

Contributor (author):

Makes sense, I renamed this variable to writeTarget for clarity.

Comment on lines 659 to 660:

if er.endpoint == h.options.Endpoint {
	go func(er endpointReplica) {
Contributor:

Similar comment here about the er variable name, which also applies to other occurrences: it gives no clue of what it is in this context and can easily be confused with err. Could we rename it? Some suggestions: replicationKey, replicaKey, replicationID, endpointReplica.

@@ -607,68 +644,41 @@ func (h *Handler) fanoutForward(pctx context.Context, tenant string, wreqs map[e
	tLogger = log.With(h.logger, logTags)
}

-	ec := make(chan error)
+	ec := make(chan writeResponse)
Contributor:

Could ec receive a better name? It's used many times in the next hundreds of lines and the name isn't clear. A suggestion: errorChannel, if that's even what it actually is. 😅

Contributor (author):

Renamed to responses.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@douglascamata (Contributor) left a comment:

Thanks a lot for the work, @fpetkovski. 🙇

This is a LATM (looks amazing to me)! 🚀

@bwplotka (Member) left a comment:

Nice job, especially on tests. LGTM 👍🏽

Although I would really want to batch those requests at some point.

-// It will return cause of each contained error but will not traverse any deeper.
-func determineWriteErrorCause(err error, threshold int) error {
+// errorSet is a set of errors.
+type errorSet struct {
Member:

Long term, perhaps it would be better to just use merrors and some Dedup function?

Contributor:

Let's also consider compatibility with, or usage of, the official error (un)wrapping coming in Go 1.20: https://tip.golang.org/doc/go1.20#errors
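
For reference, a minimal example of the Go 1.20 multi-error support mentioned here (errors.Join wraps several errors, and errors.Is traverses the joined tree), independent of the Thanos code in this PR:

package main

import (
	"errors"
	"fmt"
)

var (
	errConflict    = errors.New("conflict")
	errUnavailable = errors.New("unavailable")
)

func main() {
	// errors.Join wraps both errors; errors.Is finds each of them in the tree.
	err := errors.Join(errConflict, errUnavailable)
	fmt.Println(errors.Is(err, errConflict))    // true
	fmt.Println(errors.Is(err, errUnavailable)) // true
}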

Contributor (author):

That looks awesome 👍

@bwplotka merged commit d76c723 into thanos-io:main on Dec 15, 2022

ngraham20 pushed a commit to ngraham20/thanos that referenced this pull request on May 18, 2023:
* Fix quorum calculation for Ketama hashring

The quorum calculation is currently broken when using the Ketama
hashring. The reasons are explained in detail in issue
thanos-io#5784.

This commit fixes quorum calculation by tracking successful writes
for each individual time-series inside a remote-write request.

The commit also removes the replicate() method inside the Handler
and moves the entire logic of fanning out and calculating success
into the fanoutForward() method.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix error propagation

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>

* Fix writer errors

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>

* Separate write from replication errors

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>

* Add back replication metric

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Address PR comments

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Address code review comments

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>

Successfully merging this pull request may close these issues:

Receive: Ketama replication quorum handling is incorrect (#5784)

4 participants