-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add proposal for store instance high availability #404
Merged
Merged
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
158 changes: 158 additions & 0 deletions
158
docs/proposals/201807_store_instance_high_availability.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,158 @@ | ||
# High-availability for store instances | ||
|
||
Status: draft | **in-review** | rejected | accepted | complete | ||
|
||
Proposal author: [@mattbostock](https://github.com/mattbostock) | ||
Implementation owner: [@mattbostock](https://github.com/mattbostock) | ||
|
||
## Motivation | ||
|
||
Thanos store instances currently have no explicit support for | ||
high-availability; query instances treat all store instances equally. If | ||
multiple store instances are used as gateways to a single bucket in an object | ||
store, Thanos query instances will wait for all instances to respond (subject | ||
to timeouts) before returning a response. | ||
|
||
## Goals | ||
|
||
- Explicitly support and document high availability for store instances. | ||
|
||
- Reduce the query latency incurred by failing store instances when other store | ||
instances could return the same response faster. | ||
|
||
## Proposal | ||
|
||
Thanos supports deduplication of metrics retrieved from multiple Prometheus | ||
servers to avoid gaps in query responses where a single Prometheus server | ||
failed but similar data was recorded by another Prometheus server in the same | ||
failure domain. To support deduplication, Thanos must wait for all Thanos | ||
sidecar servers to return their data (subject to timeouts) before returning a | ||
response to a client. | ||
|
||
When retrieving data from Thanos bucket store instances, however, the desired | ||
behaviour is different; we want Thanos use the first successful response it | ||
receives, on the assumption that all bucket store instances that communicate | ||
with the same bucket have access to the same data. | ||
|
||
To support the desired behaviour for bucket store instances while still | ||
allowing for deduplication, we propose to expand the [InfoResponse | ||
Protobuf](https://github.com/improbable-eng/thanos/blob/b67aa3a709062be97215045f7488df67a9af2c66/pkg/store/storepb/rpc.proto#L28-L32) | ||
used by the Store API by adding two fields: | ||
|
||
- a string identifier that can be used to group store instances | ||
|
||
- an enum representing the [peer type as defined in the cluster | ||
package](https://github.com/improbable-eng/thanos/blob/673614d9310f3f90fdb4585ca6201496ff92c697/pkg/cluster/cluster.go#L51-L64) | ||
|
||
For example; | ||
|
||
```diff | ||
--- before 2018-07-02 15:49:09.000000000 +0100 | ||
+++ after 2018-07-02 15:49:13.000000000 +0100 | ||
@@ -1,5 +1,6 @@ | ||
message InfoResponse { | ||
repeated Label labels = 1 [(gogoproto.nullable) = false]; | ||
int64 min_time = 2; | ||
int64 max_time = 3; | ||
+ string store_group_id = 4; | ||
+ enum PeerType { | ||
+ STORE = 0; | ||
+ SOURCE = 1; | ||
+ QUERY = 2; | ||
+ } | ||
+ PeerType peer_type = 5; | ||
} | ||
``` | ||
|
||
For the purpose of querying data from store instances, stores instance will be | ||
grouped by: | ||
|
||
- labels, as returned as part of `InfoResponse` | ||
- the new `store_group_id` string identifier | ||
|
||
Therefore, stores having identical sets of labels and identical values for | ||
`store_group_id` will belong in the same group for the purpose of querying | ||
data. Stores having an empty `store_group_id` field and matching labels will be | ||
considered to be part of the same group. Stores having an empty | ||
`store_group_id` field and empty label sets will also be considered part of the | ||
same group. | ||
|
||
If a service implementing the store API (a 'store instance') has a `STORE` or | ||
`QUERY` peer type, query instances will treat each store instance in the same | ||
group as having access to the same data. Query instances will randomly pick any | ||
two store instances[1][] from the same group and use the first response | ||
returned. | ||
|
||
[1]: https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf | ||
|
||
Otherwise, for the `SOURCE` peer type, query instances will wait for all | ||
instances within the same group to respond (subject to existing timeouts) | ||
before returning a response, consistent with the current behaviour. This is | ||
necessary to collect all data available for the purposes of deduplication and | ||
to fill gaps in data where an individual Prometheus server failed to ingest | ||
data for a period of time. | ||
|
||
Each service implementing the store API must determine what value the | ||
`store_group_id` should return. For bucket stores, `store_group_id` should | ||
contain the concatenation of the object store URL and bucket name. For all | ||
other existing services implementing the store API, we will use an empty string | ||
for `store_group_id` until a reason exists to use it. | ||
|
||
Multiple buckets or object stores will be supported by setting the | ||
`store_group_id`. | ||
|
||
Existing instances running older versions of Thanos will be assumed to have | ||
an empty string for `store_group_id` and a `SOURCE` peer type, which will | ||
retain existing behaviour when awaiting responses. | ||
|
||
### Scope | ||
|
||
Horizontal scaling should be handled separately and is out of scope for this | ||
proposal. | ||
|
||
## User experience | ||
|
||
From a user's point of view, query responses should be faster and more reliable: | ||
|
||
- Running multiple bucket store instances will allow the query to be served even | ||
if a single store instance fails. | ||
|
||
- Query latency should be lower since the response will be served from the | ||
first bucket store instance to reply. | ||
|
||
The user experience for query responses involving only Thanos sidecars will be | ||
unaffected. | ||
|
||
## Alternatives considered | ||
|
||
### Implicitly relying on store labels | ||
|
||
Rather than expanding the `InfoResponse` Protobuf, we had originally considered | ||
relying on an empty set of store labels to determine that a store instance was | ||
acting as a gateway. | ||
|
||
We decided against this approach as it would make debugging harder due to its | ||
implicit nature, and is likely to cause bugs in future. | ||
|
||
### Using boolean fields to determine query behaviour | ||
|
||
We rejected the idea of adding a `gateway` or `deduplicated` boolean field to | ||
`InfoResponse` in the store RPC API. The value of these fields would have had | ||
the same effect on query behaviour as returning the peer type field as proposed | ||
above and would be more explicit, but were specific to this use case. | ||
|
||
The peer type field in `InfoResponse` proposed above could be used for other | ||
use cases aside from determining query behaviour. | ||
|
||
## Related future work | ||
|
||
### Sharing data between store instances | ||
|
||
Thanos bucket stores download index and metadata from the object store on | ||
start-up. If multiple instances of a bucket store are used to provide high | ||
availability, each instance will download the same files for its own use. These | ||
file sizes can be in the order of gigabytes. | ||
|
||
Ideally, the overhead of each store instance downloading its own data would be | ||
avoided. We decided that it would be more appropriate to tackle sharing data as | ||
part of future work to support the horizontal scaling of store instances. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is worth to mention, that we DO not ignore these timeouts, even with the HA sidecar scenario we treat one replica being inaccessible as an "error case" - which might be wrong or confusing and definitely different to store gateway case (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout, I'll make that explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thoughts, I'm not sure if we need to expand on this as I want to avoid confusing the reader by giving too much detail. Maybe I should remove the 'subject to timeouts' part?