Skip to content

Conversation

@fpetkovski
Copy link
Contributor

The distributed query engine queries both store and ingest components in parallel and deduplicates samples by prioritizing those coming from components with a higher max time. Different Prometheus/Receive components can apply retention independetly of each other which can cause remote aggregates to be executed on partial data. This will lead to incorrect results because these partial aggregates will be prioritized over data coming from stores. The problem is first described in thanos-io/promql-engine#187.

To address this problem, this commit exposes an info field called guaranteed_min_time (better names are welcome :) ) from all stores APIs. The field indicates what is the minimum time that a store API guarantees to hold across all TSDBs. For stores APIs holding a single TSDB (Prometheus, Ruler) this field is calculated by using the retention and min block duration parameters. The Querier/Proxy component calculate the value by taking the highest guaranteed_min_time across all TSDBs.

Using this field, the distributed engine can scope the start time of remote queries to a time for which data is guaranteed to be complete.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • Expose guaranteed_min_time info field.

Verification

  • Tested manually and using unit tests. The field is also exposed in the UI.

The distributed query engine queries both store and ingest components
in parallel and deduplicates samples by prioritizing those coming
from components with a higher max time. Different Prometheus/Receive
components can apply retention independetly of each other which can cause
remote aggregates to be executed on partial data. This will lead to incorrect
results because these partial aggregates will be prioritized over data
coming from stores. The problem is first described in thanos-io/promql-engine#187.

To address this problem, this commit exposes an info field called guaranteed_min_time from all stores APIs.
The field indicates what is the minimum time that the store guarantees to hold across all TSDBs.
For stores holding a single TSDB (Prometheus, Ruler) this field is calculated by using the retention and min block duration parameters.
The Querier/Proxy component calculate the value by taking the highest guaranteed_min_time across all TSDBs.

Using this field, the distributed engine can scope the start time of remote queries to
a time for which data is guaranteed to be complete.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski fpetkovski force-pushed the guaranteed-min-time branch 2 times, most recently from 2f6cd59 to 209d0b3 Compare April 12, 2023 14:09
The retention flag in Prometheus has been renamed to `storage.tsdb.retention.time`
while Thanos still uses `storage.tsdb.retention`. This leads to the guaranteed min time
being calculated incorrectly for Prometheus instances.

This commit fixes the issue.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski fpetkovski force-pushed the guaranteed-min-time branch from 209d0b3 to fce6436 Compare April 12, 2023 14:09
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit like a hack because it sounds to me like the core of the problem is that InfoResponse only contains one StoreInfo whereas nowadays multiple TSDBs are supported in one component. 🤔 I wonder if it would be a lot of work to instead introduce repeated StoreInfo. Also, what makes this StoreInfo special? Perhaps we would like to have repeated TargetsInfo and so on?

@fpetkovski
Copy link
Contributor Author

Yeah the root of the problem is that StoreInfo has multiple LabelSets but only one min and max time for all of them. Ideally we want each LabelSet to have its own range. We could then propagate this information to engine so it can decide how to scope a query. In practice we want to make sure that replicas of a label set are always queried on the same time range.

So maybe StoreInfo needs to have a repeated TSDBInfo, and TSDBInfo would have ZLabelSet and min and max time.

@fpetkovski
Copy link
Contributor Author

@matej-g @GiedriusS this is what this suggestion would look like: #6329

@stale
Copy link

stale bot commented Jun 18, 2023

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days.
Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants