Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query: add 'sticky' store nodes #2072

Closed
wants to merge 5 commits into from

Conversation

GiedriusS
Copy link
Member

@GiedriusS GiedriusS commented Jan 28, 2020

Add the ability to have 'sticky' store nodes - we will always consider them available and retain their clients. This makes it possible to have consistent partial responses when a node goes down when we know that it must be up at all times. The relevant part from the documentation which explains more:

Thanos Query periodically checks the health of the StoreAPI nodes that it knows about via the `Info()` gRPC call. However, if one is failing then it is not considered a part of the active set of StoreAPI nodes. To make them always part of the active set, you need to make them "sticky" - their last available information will be retained even in the face of a failure of a health-check.

This is useful in the cases where you have some kind of caching layer in front of Thanos Query i.e. Cortex's `query-frontend` and you know that certain nodes must always be alive. It allows you to get a partial response when one of the sticky nodes goes down.

To make a node sticky you need to add a suffix `+sticky` to the end of the address.

Sticky nodes have a yellow `UP` status in the `Stores` page if we have failed to check their data but we still consider them available.

Ad-hoc tests:

Tested locally. Turned off one store node which made it get removed from the healthyStores. Sending queries for a short time still gave an error but it went away after the refresh delay.

Added a +sticky suffix then. Shut off the same node and tried sending queries again. Got errors that Thanos Query couldn't connect to that node. Disabling partial response made it into an error.

Looking for feedback on:

  • Does it even make sense?! 😄
  • The code, ofc
  • Is it okay to have this as a suffix? Or should we make static targets 'sticky' by default? But that would change Thanos Query's behavior for our users and maybe someone wants to stay with the current behavior. IMHO we should leave this possibility to add a suffix.

WIP:

  • Still want to add a test for the parsing part
  • Looking for feedback

Relevant issue: #1651.

Add the ability to specify a '+sticky' suffix via the `--store`
parameter. If it has been specified then the store node's information is
retained between loops of checking and we always consider it healthy.

It is useful in the cases where we run Cortex's query-frontend in front
of Thanos Query and we want to always consider certain store nodes.

We might want to avoid the suffix here and always consider all
statically specified store nodes as sticky but that will be a backward
incompatible change.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Make the 'UP' text yellow if it is a sticky node. Add verbiage about
this to the documentation.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
First try to see if there is a special suffix `+sticky`. Then if it is
not present continue with the usual parsing.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Does it even make sense?! smile

Yes, I think so! Nice idea. However, I am not sure if I would not go even further: Make ALL stores "sticky"...

  • Is it okay to have this as a suffix? Or should we make static targets 'sticky' by default? But that would change Thanos Query's behavior for our users and maybe someone wants to stay with the current behavior. IMHO we should leave this possibility to add a suffix.

I don't like the suffix - I would say we should make ALL targets "sticky" if you specify extra flag to querier X. I think this feature makes sense a lot.

What I don't like is sticky - I think we can do better in naming this feature. I would say maybe --querier.store-strict-mode with huge description to explain that properly (:

I would also change the implementation. This is bit sketchy to put the broken address to healthyStores e.g we unnecessary dial/timeout wait for it etc. I think we need separate mechanism in storeset that will allow consumers to find out the unhealthy stores that will trigger partial resposne with correct error (: But I like the idea overall!

@@ -36,6 +36,8 @@ type StoreSpec interface {
// NOTE: It is implementation responsibility to retry until context timeout, but a caller responsibility to manage
// given store connection.
Metadata(ctx context.Context, client storepb.StoreClient) (labelSets []storepb.LabelSet, mint int64, maxt int64, storeType component.StoreAPI, err error)
// Sticky tells us whether the StoreSpec should always be considered healthy.
Sticky() bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Sticky() bool
IsSticky() bool

to be consistent.

level.Warn(s.logger).Log("msg", "update of store node failed", "err", errors.Wrap(err, "getting metadata"), "address", addr, "sticky", spec.Sticky())
if seenAlready && spec.Sticky() {
mtx.Lock()
healthyStores[addr] = st
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels very wrong (:

Putting unhealthy thing in healthy map

@bwplotka
Copy link
Member

I would love to hear others' opinions because this is something we should do from start probably even. @brancz @squat @povilasv @FUSAKLA @domgreen @IKSIN @d-ulyanov

@GiedriusS
Copy link
Member Author

I would love to hear others' opinions because this is something we should do from start probably even. @brancz @squat @povilasv @FUSAKLA @domgreen @IKSIN @d-ulyanov

Since it seems like there are different options and opinions here so lets put this PR on hold for a bit and let me make a proposal PR instead where we would make the decision on how to proceed.

@GiedriusS
Copy link
Member Author

The proposal which suggests going further with this: #2086.

@d-ulyanov
Copy link
Contributor

Hi, guys and thank you for mentioning us!
Totally agree with @bwplotka - I would make all stores "sticky".

As a user I expect following behaviuor:
As we're adding stores just as a simple list to Thanos Query config - we expect all of them work.
If one of them is not responding - any response should be marked as partial: we don't know for sure is retrieved data consistent or not (hope in the future we'll invent some approach how to know are requested metrics located on unhealthy Store or not).

If partial response enabled - as usual I'm expecting to get warning, otherwise - request must fail.
Implementation could be simple - let's make StoreSet keep 2 lists - all stores and healthy stores. If their lenght is not equal - mark response as partial in all cases (in the future we can improve this logic).

Also, I would not add any logic related to "sticky" to DNSProvider - I'm not sure that its should think about anything except for DNS resolution, maybe it would be better to add some wrapper around DNS provider.

@GiedriusS
Copy link
Member Author

I will close this and open another one with the agreed implementation on the proposal (:

@GiedriusS GiedriusS closed this Feb 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants