query: add 'sticky' store nodes #2072

GiedriusS · 2020-01-28T21:15:54Z

Add the ability to have 'sticky' store nodes: we always consider them available and retain their clients. This makes it possible to get consistent partial responses when a node that we know must be up at all times goes down. The relevant part of the documentation explains more:

Thanos Query periodically checks the health of the StoreAPI nodes that it knows about via the `Info()` gRPC call. However, if one is failing then it is not considered a part of the active set of StoreAPI nodes. To make them always part of the active set, you need to make them "sticky" - their last available information will be retained even in the face of a failure of a health-check.

This is useful when you have some kind of caching layer in front of Thanos Query, e.g. Cortex's `query-frontend`, and you know that certain nodes must always be alive. It allows you to get a partial response when one of the sticky nodes goes down.

To make a node sticky you need to add a suffix `+sticky` to the end of the address.
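As a rough illustration of the suffix handling, here is a minimal sketch. The helper name `splitSticky` and the example addresses are hypothetical and are not taken from the PR's code; only the `+sticky` suffix itself comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// stickySuffix is the marker described above; the rest of this file is a
// hypothetical illustration, not the PR's implementation.
const stickySuffix = "+sticky"

// splitSticky strips an optional "+sticky" suffix from a store address and
// reports whether the suffix was present.
func splitSticky(addr string) (string, bool) {
	if strings.HasSuffix(addr, stickySuffix) {
		return strings.TrimSuffix(addr, stickySuffix), true
	}
	return addr, false
}

func main() {
	for _, a := range []string{"store-1:10901+sticky", "dns+store-2:10901"} {
		addr, sticky := splitSticky(a)
		fmt.Printf("%s sticky=%t\n", addr, sticky)
	}
}
```

Checking for the suffix before any other parsing keeps the rest of the address untouched, so a prefixed address such as `dns+store-2:10901` goes through the usual path unchanged.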

Sticky nodes have a yellow `UP` status in the `Stores` page if we have failed to check their data but we still consider them available.

Ad-hoc tests:

Tested locally. Turned off one store node which made it get removed from the healthyStores. Sending queries for a short time still gave an error but it went away after the refresh delay.

Added a +sticky suffix then. Shut off the same node and tried sending queries again. Got errors that Thanos Query couldn't connect to that node. Disabling partial response made it into an error.

Looking for feedback on:

Does it even make sense?! 😄
The code, ofc
Is it okay to have this as a suffix? Or should we make static targets 'sticky' by default? But that would change Thanos Query's behavior for our users and maybe someone wants to stay with the current behavior. IMHO we should leave this possibility to add a suffix.

WIP:

Still want to add a test for the parsing part
Looking for feedback

Relevant issue: #1651.

Add the ability to specify a '+sticky' suffix via the `--store` parameter. If it has been specified then the store node's information is retained between loops of checking and we always consider it healthy. It is useful in the cases where we run Cortex's query-frontend in front of Thanos Query and we want to always consider certain store nodes. We might want to avoid the suffix here and always consider all statically specified store nodes as sticky but that will be a backward incompatible change. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

bwplotka

> Does it even make sense?! 😄

Yes, I think so! Nice idea. However, I am not sure if I would not go even further: Make ALL stores "sticky"...

> Is it okay to have this as a suffix? Or should we make static targets 'sticky' by default? But that would change Thanos Query's behavior for our users and maybe someone wants to stay with the current behavior. IMHO we should leave this possibility to add a suffix.

I don't like the suffix - I would say we should make ALL targets "sticky" if you pass an extra flag to the querier. I think this feature makes a lot of sense.

What I don't like is the name "sticky" - I think we can do better in naming this feature. I would say maybe `--querier.store-strict-mode`, with a huge description to explain it properly (:

I would also change the implementation. It is a bit sketchy to put the broken address into healthyStores, e.g. we unnecessarily dial it and wait for the timeout, etc. I think we need a separate mechanism in the storeset that allows consumers to find out about the unhealthy stores, which would then trigger a partial response with the correct error (: But I like the idea overall!

bwplotka · 2020-01-29T07:30:17Z

pkg/query/storeset.go

@@ -36,6 +36,8 @@ type StoreSpec interface {
 	// NOTE: It is implementation responsibility to retry until context timeout, but a caller responsibility to manage
 	// given store connection.
 	Metadata(ctx context.Context, client storepb.StoreClient) (labelSets []storepb.LabelSet, mint int64, maxt int64, storeType component.StoreAPI, err error)
+	// Sticky tells us whether the StoreSpec should always be considered healthy.
+	Sticky() bool


Suggested change:

-	Sticky() bool
+	IsSticky() bool

To be consistent.

bwplotka · 2020-01-29T07:33:44Z

pkg/query/storeset.go

+				level.Warn(s.logger).Log("msg", "update of store node failed", "err", errors.Wrap(err, "getting metadata"), "address", addr, "sticky", spec.Sticky())
+				if seenAlready && spec.Sticky() {
+					mtx.Lock()
+					healthyStores[addr] = st


This feels very wrong (: We are putting an unhealthy thing into the healthy map.

bwplotka · 2020-01-29T07:49:56Z

I would love to hear others' opinions, because this is something we should probably have done from the start. @brancz @squat @povilasv @FUSAKLA @domgreen @IKSIN @d-ulyanov

GiedriusS · 2020-01-29T15:16:18Z

> I would love to hear others' opinions, because this is something we should probably have done from the start. @brancz @squat @povilasv @FUSAKLA @domgreen @IKSIN @d-ulyanov

Since there seem to be different options and opinions here, let's put this PR on hold for a bit. I will make a proposal PR instead, where we can decide how to proceed.

GiedriusS · 2020-01-31T21:17:52Z

The proposal which suggests going further with this: #2086.

d-ulyanov · 2020-02-01T11:25:03Z

Hi guys, and thank you for mentioning us!
I totally agree with @bwplotka - I would make all stores "sticky".

As a user I expect the following behaviour:
Since we add stores to the Thanos Query config as a simple list, we expect all of them to work.
If one of them is not responding, any response should be marked as partial: we don't know for sure whether the retrieved data is consistent (hopefully in the future we'll find a way to tell whether the requested metrics are located on the unhealthy store).

If partial response is enabled, I expect to get a warning as usual; otherwise the request must fail.
The implementation could be simple: let StoreSet keep two lists, all stores and healthy stores. If their lengths are not equal, mark the response as partial in all cases (we can improve this logic in the future).

Also, I would not add any "sticky"-related logic to DNSProvider. I'm not sure it should care about anything except DNS resolution; maybe it would be better to add a wrapper around the DNS provider.
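The two-list idea from the comment above might look roughly like this. It is only a sketch under assumptions: `storeSet`, `unhealthy` and `checkQuery` are hypothetical names, and the real StoreSet tracks far more state than a list of addresses.

```go
package main

import "fmt"

// storeSet is a hypothetical, simplified model: it tracks every configured
// store and, separately, the ones that passed the last health check.
type storeSet struct {
	all     []string
	healthy map[string]bool
}

// unhealthy returns the configured stores that failed the last health check.
func (s *storeSet) unhealthy() []string {
	var out []string
	for _, addr := range s.all {
		if !s.healthy[addr] {
			out = append(out, addr)
		}
	}
	return out
}

// checkQuery returns warnings when stores are missing and partial response
// is enabled, and an error when stores are missing and it is disabled.
func (s *storeSet) checkQuery(partialResponse bool) ([]string, error) {
	missing := s.unhealthy()
	if len(missing) == 0 {
		return nil, nil
	}
	if !partialResponse {
		return nil, fmt.Errorf("stores unavailable: %v", missing)
	}
	warnings := make([]string, 0, len(missing))
	for _, addr := range missing {
		warnings = append(warnings, "store unavailable: "+addr)
	}
	return warnings, nil
}

func main() {
	s := &storeSet{
		all:     []string{"a:10901", "b:10901"},
		healthy: map[string]bool{"a:10901": true},
	}
	warnings, err := s.checkQuery(true)
	fmt.Println(warnings, err)
	_, err = s.checkQuery(false)
	fmt.Println(err)
}
```

Keeping the unhealthy stores out of the healthy set means the querier never dials them, while the query path can still report them as warnings or errors.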

GiedriusS · 2020-02-20T15:36:44Z

I will close this and open another one with the implementation agreed on in the proposal (:

GiedriusS added 5 commits January 28, 2020 22:25

query: add indication about sticky nodes to stores page

90b6e0e

Make the 'UP' text yellow if it is a sticky node. Add verbiage about this to the documentation. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

ui: fix logical error

fd862dd

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

dns: provider: fix parsing

0b386b8

First try to see if there is a special suffix `+sticky`. Then if it is not present continue with the usual parsing. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

query: storeset: add test for a sticky node

9c61e72

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

GiedriusS added the component: query label Jan 28, 2020

bwplotka requested changes Jan 29, 2020

bwplotka mentioned this pull request Feb 1, 2020

Add proposal for improving Thanos Query healthiness handling #2086

Merged

GiedriusS closed this Feb 20, 2020
