Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support max source resolution for instant queries #1431

Conversation

0robustus1
Copy link
Contributor

  • [] Added thanos query --query.instant.default.max_source_resolution flag to set a default value for maxSourceResolution on instant queries (as are for example used by the thanos rule component)
  • [] Added support for the max_source_resolution query parameter for /api/v1/query

Reasoning

We have configured our compactor with different retentions for different resolution levels:

thanos
  compact
  --log.level=debug
  --data-dir=/var/thanos/store
  --objstore.config-file=/s3/thanos.yaml
  --retention.resolution-raw=7d
  --retention.resolution-5m=30d
  --retention.resolution-1h=180d
  --wait

This causes a situation where the result of a thanos rule recording rule was
only taking the last 7-8 days of data into account, because instant queries were
always using a maxSourceResolution of 0 and therefore only using raw resolution
data.

In our case the query looked like this and was executed on a thanos ruler
because our prometheuses only have a relatively short retention of a few days:

(100 -
  sum_over_time (
    some_metric:errors_count:increase1d[30d]
  ) /
  sum_over_time (
    some_metric:total_count:increase1d[30d]
  ) * 100
)

The thanos rule component runs an instant query in order to retrieve a single
datapoint to set for the recording rule metric. Our query requires that data
from at least 30 days is taken into account. But by default it would only select
raw data and therefore only provide data from the last 7 days.

Changes

This allows for /api/v1/query to accept the ?max_source_resolution query parameter
(the documentation did not clearly state that it was only accepted for
/api/v1/query_range prior to this).
We also added a command line argument to set a default should the query parameter not be set.

We also propose the following (for later PRs):

  • Amend the UI to allow setting the max_source_resolution for "Console"
    queries (which are executing instant queries).
  • Add a new field to the rule-yaml for thanos ruler to set the
    max_source_resolution on a per-recording-rule basis.

Verification

  • We've used a fork to fire individual instant queries with the max_source_resolution
    parameter set and compared it with the last result we got from a range query.
  • We also deployed a version with the default value changed and observed that the recording
    rule was now producing the expected value.

@0robustus1 0robustus1 force-pushed the support_max_source_resolution_for_instant_queries branch 2 times, most recently from dc7a3c3 to 765ac90 Compare August 16, 2019 14:06
@brancz
Copy link
Member

brancz commented Aug 20, 2019

This looks right to me, but it'd be good if @bwplotka and or @GiedriusS have a look as well.

pkg/query/api/v1.go Outdated Show resolved Hide resolved
cmd/thanos/query.go Outdated Show resolved Hide resolved
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea to add that parameter there is good but not sure about the flag. Perhaps we could re-use --query.auto-downsampling here and automatically select it depending on the range vectors?

@0robustus1 0robustus1 force-pushed the support_max_source_resolution_for_instant_queries branch 2 times, most recently from a430f7e to 4765b2a Compare August 21, 2019 08:16
@0robustus1
Copy link
Contributor Author

Hi @GiedriusS, that might be a significantly more comprehensive and powerful solution.
It sounds however like a larger-scale change (that should then be adopted for all query types).

There is just one concern that comes to mind. Wouldn't the querier need to be aware of the retention configured for the different compaction levels? How else would it have translated the 30d range selector to the correct maxSourceResolution (1h) in a retention setup like this:

thanos
  compact
  --log.level=debug
  --data-dir=/var/thanos/store
  --objstore.config-file=/s3/thanos.yaml
  --retention.resolution-raw=7d
  --retention.resolution-5m=11d
  --retention.resolution-1h=365d
  --wait

Configuring a maxSourceResolution via the flag would enable that.

Copy link
Member

@povilasv povilasv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe we should change the default --query.instant.default.max_source_resolution=0s to 1h.

As it has a bit buggy behavior if you configure retention like in the example.

Thoughts?

@0robustus1
Copy link
Contributor Author

That's what we set it to in our setup now and would probably leave it at. I just wasn't sure whether to include it in the Pull-Request.

@GiedriusS
Copy link
Member

Hi @GiedriusS, that might be a significantly more comprehensive and powerful solution.
It sounds however like a larger-scale change (that should then be adopted for all query types).

There is just one concern that comes to mind. Wouldn't the querier need to be aware of the retention configured for the different compaction levels? How else would it have translated the 30d range selector to the correct maxSourceResolution (1h) in a retention setup like this:

thanos
  compact
  --log.level=debug
  --data-dir=/var/thanos/store
  --objstore.config-file=/s3/thanos.yaml
  --retention.resolution-raw=7d
  --retention.resolution-5m=11d
  --retention.resolution-1h=365d
  --wait

Configuring a maxSourceResolution via the flag would enable that.

Indeed, it would be a much bigger change but not sure that introducing a new flag is the best solution. It's always easiest to do that but then after some time you end with a thing which has hundreds of knobs which is bad.

I think my original suggestion would work out because actually we do downsampling at fixed intervals. The options that you've pasted only control the retention part. I have outlined this in my other issue what I have started about how Thanos Store should always prefer higher resolution data.

I still would very much like @bwplotka to comment on this.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 Thanks for this all!

I am fine with unblocking you @0robustus1 on this, if we make this flag hidden for now. Once that is done LGTM 👍 (: Essentially for long term direction we need bit more design/discussion. IMO there might be some work on PromQL to improve this flow (more below).

Overall, I agree with @GiedriusS and I would vote for query instant asking for range vector using exactly the same logic as the range query. I would really reuse --query.auto-downsampling and same flow.

There is just one concern that comes to mind. Wouldn't the querier need to be aware of the retention configured for the different compaction levels? How else would it have translated the 30d

Technically it would be nice to control all by the step you provide to the query... which you cannot on the instant query. I think there were discussions on enabling step for range vector selector [ .. ] and that would be helpful. I would really love to follow on that with Prometheus team since at some point we want to propose downsampling to Prometheus itself.

Anyway, from Thanos perspective IMO querier should not be aware of anything on storage layer, especially retention - all it knows about is StoreAPI

The options that you've pasted only control the retention part. I have outlined this in my other issue what I have started about how Thanos Store should always prefer higher resolution data.

This is an interesting one. Especially for range vectors and functions, it might be really faster and equally accurate (: in most cases. (: But I don't think for all cases. I think still having granular control for each range vector selector would be the solution here, but let's discuss.

cmd/thanos/query.go Outdated Show resolved Hide resolved
cmd/thanos/query.go Outdated Show resolved Hide resolved
pkg/query/api/v1.go Outdated Show resolved Hide resolved
pkg/query/api/v1_test.go Outdated Show resolved Hide resolved
@0robustus1 0robustus1 force-pushed the support_max_source_resolution_for_instant_queries branch from 8e2c093 to 1fc56e5 Compare August 26, 2019 14:00
@0robustus1
Copy link
Contributor Author

I adjusted the code to the remaining review comments. The flag is now hidden.

@0robustus1 0robustus1 force-pushed the support_max_source_resolution_for_instant_queries branch from 1fc56e5 to 5e69042 Compare August 27, 2019 07:40
This adds support for the ?max_source_resolution query
param for instant queries.

Instant queries had the issue that when there was a subquery requesting wide
data ranges e.g. sum_over_time(metric_name[30d])) and the retention of raw data
was for example only 7d, the query would (silently) only take data from the last
7 days into account (ignoring the downsampled 5m and 1h timeseries that were
available with longer retention).

We default to 1h max_source_resolution for now as this will cover
the highest/broadest resolution available and should include
all retention. It is configurable per query (although no known
clients send it).

Signed-off-by: Tim Reddehase <tim.reddehase@xing.com>
The flag only affects instant queries.

Signed-off-by: Tim Reddehase <tim.reddehase@xing.com>
Signed-off-by: Tim Reddehase <tim.reddehase@xing.com>
Instead of defining an additional function we
do the adjustment of maxSourceResolution directly in query_range.

Signed-off-by: Tim Reddehase <tim.reddehase@xing.com>
Signed-off-by: Tim Reddehase <tim.reddehase@xing.com>
Signed-off-by: Tim Reddehase <tim.reddehase@xing.com>
Signed-off-by: Tim Reddehase <tim.reddehase@xing.com>
Signed-off-by: Tim Reddehase <tim.reddehase@xing.com>
@0robustus1 0robustus1 force-pushed the support_max_source_resolution_for_instant_queries branch from 5e69042 to 2433524 Compare August 27, 2019 12:06
@bwplotka
Copy link
Member

Actually, I was super wrong.. All things I mentioned with resolution per range works right now: https://prometheus.io/blog/2019/01/28/subquery-support/#subqueries

I might think we need to change slightly implementation then for querier and use per selection Step int64 // Query step size in milliseconds. instead of global, query one.

@bwplotka
Copy link
Member

Failed quick experiment here so merging this "hidden" workaround for now.

Thanks for doing this! Would love to have your input on further work in this area @0robustus1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants