-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
query: add 'sticky' store nodes #2072
Conversation
Add the ability to specify a '+sticky' suffix via the `--store` parameter. If it has been specified then the store node's information is retained between loops of checking and we always consider it healthy. It is useful in the cases where we run Cortex's query-frontend in front of Thanos Query and we want to always consider certain store nodes. We might want to avoid the suffix here and always consider all statically specified store nodes as sticky but that will be a backward incompatible change. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Make the 'UP' text yellow if it is a sticky node. Add verbiage about this to the documentation. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
First try to see if there is a special suffix `+sticky`. Then if it is not present continue with the usual parsing. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Does it even make sense?! smile
Yes, I think so! Nice idea. However, I am not sure if I would not go even further: Make ALL stores "sticky"...
- Is it okay to have this as a suffix? Or should we make static targets 'sticky' by default? But that would change Thanos Query's behavior for our users and maybe someone wants to stay with the current behavior. IMHO we should leave this possibility to add a suffix.
I don't like the suffix - I would say we should make ALL targets "sticky" if you specify extra flag to querier X. I think this feature makes sense a lot.
What I don't like is sticky
- I think we can do better in naming this feature. I would say maybe --querier.store-strict-mode
with huge description to explain that properly (:
I would also change the implementation. This is bit sketchy to put the broken address to healthyStores
e.g we unnecessary dial/timeout wait for it etc. I think we need separate mechanism in storeset that will allow consumers to find out the unhealthy stores that will trigger partial resposne with correct error (: But I like the idea overall!
@@ -36,6 +36,8 @@ type StoreSpec interface { | |||
// NOTE: It is implementation responsibility to retry until context timeout, but a caller responsibility to manage | |||
// given store connection. | |||
Metadata(ctx context.Context, client storepb.StoreClient) (labelSets []storepb.LabelSet, mint int64, maxt int64, storeType component.StoreAPI, err error) | |||
// Sticky tells us whether the StoreSpec should always be considered healthy. | |||
Sticky() bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sticky() bool | |
IsSticky() bool |
to be consistent.
level.Warn(s.logger).Log("msg", "update of store node failed", "err", errors.Wrap(err, "getting metadata"), "address", addr, "sticky", spec.Sticky()) | ||
if seenAlready && spec.Sticky() { | ||
mtx.Lock() | ||
healthyStores[addr] = st |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels very wrong (:
Putting unhealthy thing in healthy map
Since it seems like there are different options and opinions here so lets put this PR on hold for a bit and let me make a proposal PR instead where we would make the decision on how to proceed. |
The proposal which suggests going further with this: #2086. |
Hi, guys and thank you for mentioning us! As a user I expect following behaviuor: If partial response enabled - as usual I'm expecting to get warning, otherwise - request must fail. Also, I would not add any logic related to "sticky" to DNSProvider - I'm not sure that its should think about anything except for DNS resolution, maybe it would be better to add some wrapper around DNS provider. |
I will close this and open another one with the agreed implementation on the proposal (: |
Add the ability to have 'sticky' store nodes - we will always consider them available and retain their clients. This makes it possible to have consistent partial responses when a node goes down when we know that it must be up at all times. The relevant part from the documentation which explains more:
Ad-hoc tests:
Tested locally. Turned off one store node which made it get removed from the
healthyStores
. Sending queries for a short time still gave an error but it went away after the refresh delay.Added a
+sticky
suffix then. Shut off the same node and tried sending queries again. Got errors that Thanos Query couldn't connect to that node. Disabling partial response made it into an error.Looking for feedback on:
WIP:
Relevant issue: #1651.