Query frontend: large step queries are shifted #3353

Closed
darkweaver87 opened this issue Oct 22, 2020 · 10 comments · Fixed by #3356

Comments

@darkweaver87

darkweaver87 commented Oct 22, 2020

Thanos, Prometheus and Golang version used:
thanos, version 0.15.0 (branch: HEAD, revision: fbd14b4)
build user: circleci@b18149728583
build date: 20200907-09:47:14
go version: go1.14.2
(docker image: quay.io/thanos/thanos:v0.15.0)

prometheus: prom/prometheus:v2.20.1

Object Storage Provider:
s3

What happened:
When I use a large step in range queries, the returned points are shifted.
Examples:

  • thanos-query 7200s step (ok):
curl http://127.0.0.1:10902/api/v1/query_range -d 'end=1603353600&query=count(push_time_seconds)&start=1602748800&step=7200' | jq '.data.result[]|.values|.[0]'
[
  1602748800,
  "1043"
]
  • thanos-query-frontend 7200s step (ok):
curl http://127.0.0.1:10903/api/v1/query_range -d 'end=1603353600&query=count(push_time_seconds)&start=1602748800&step=7200' | jq '.data.result[]|.values|.[0]'
[
  1602748800,
  "1043"
]

Then with larger step:

  • thanos-query 10800s step (ok):
curl http://127.0.0.1:10902/api/v1/query_range -d 'end=1603353600&query=count(push_time_seconds)&start=1602748800&step=10800' | jq '.data.result[]|.values|.[0]'
[
  1602748800,
  "1043"
]
  • thanos-query-frontend 10800s step (ko):
curl http://127.0.0.1:10903/api/v1/query_range -d 'end=1603353600&query=count(push_time_seconds)&start=1602748800&step=10800' | jq '.data.result[]|.values|.[0]'
[
  1602741600,
  "1019"
]
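
The gap between the two 10800s answers can be measured directly from the responses. The following is a minimal sketch, assuming the standard Prometheus `query_range` JSON envelope; the two payloads are trimmed stand-ins rebuilt from the values shown above, not captured output, and `first_sample` is a hypothetical helper name.

```python
import json


def first_sample(response_text: str):
    """Return (timestamp, value) of the first point of the first series."""
    body = json.loads(response_text)
    ts, value = body["data"]["result"][0]["values"][0]
    return int(ts), value


# Trimmed stand-ins for the two 10800s responses (assumption: standard envelope).
querier = '{"status":"success","data":{"resultType":"matrix","result":[{"metric":{},"values":[[1602748800,"1043"]]}]}}'
frontend = '{"status":"success","data":{"resultType":"matrix","result":[{"metric":{},"values":[[1602741600,"1019"]]}]}}'

querier_ts, _ = first_sample(querier)
frontend_ts, _ = first_sample(frontend)

# Difference in seconds between the first point of each answer.
shift_seconds = querier_ts - frontend_ts
print(f"first point shifted by {shift_seconds}s")
```

With the values reported above, the first point from the query frontend is 7200 seconds earlier than the one from thanos-query.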

What you expected to happen:
I would expect to have the same answers between queries on thanos query and query frontend :-)

How to reproduce it (as minimally and precisely as possible):
cf. queries above

Full logs to relevant components:

Anything else we need to know:

Many thanks :-)

@yeya24
Contributor

yeya24 commented Oct 22, 2020

Hello, thanks for the issue.
How do you set up query frontend? Do you use cache?
If you use cache, then this is expected because the query result will be cached at the frontend when you query. This is fine because there is a cache validation time.

@darkweaver87
Author

darkweaver87 commented Oct 22, 2020

Hello thanks for your reply :-)
My query frontend is set up this way (a pod in my Kubernetes cluster):

      containers:
      - args:
        - query-frontend
        - --query-frontend.compress-responses
        - --http-address=0.0.0.0:10902
        - --query-frontend.downstream-url=http://thanos-query.thanos.svc.cluster.local:10902
        - --query-range.split-interval=24h
        - --query-range.max-retries-per-request=5
        - --query-frontend.log_queries_longer_than=5s
        - |-
          --query-range.response-cache-config="config":
            "max_size": "512MB"
            "max_size_items": 0
            "validity": "6h"
          "type": "in-memory"
        image: quay.io/thanos/thanos:v0.15.0

I use the in-memory cache, but when I disable it I get exactly the same behavior.

@yeya24
Contributor

yeya24 commented Oct 22, 2020

OK. After checking the code I think this is caused by the step align middleware. It re-calculates the start and end time based on your step so that the requests can be better cached.
The code is https://github.com/cortexproject/cortex/blob/master/pkg/querier/queryrange/step_align.go#L19-L22.

So the start time of your last query becomes 1602748800/10800*10800 = 1602741600 (using integer division), while the other queries keep a start time of 1602748800.
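
The arithmetic above can be checked with a short sketch. It is not the Cortex middleware itself, only an illustration of rounding a start time down to a multiple of the step; `align_to_step` is a hypothetical helper name, and timestamps are taken in seconds as in the curl examples.

```python
def align_to_step(timestamp_s: int, step_s: int) -> int:
    """Round a Unix timestamp down to the nearest multiple of the step."""
    return (timestamp_s // step_s) * step_s


start = 1602748800  # start time used in the queries from this issue

for step in (7200, 10800, 86400):
    aligned = align_to_step(start, step)
    # The shift is simply the remainder of start divided by step.
    print(f"step={step}s aligned_start={aligned} shift={start - aligned}s")
```

A start time that is already a multiple of the step is left unchanged, which is why the 7200s query is not affected.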

@darkweaver87
Author

Thanks for pointing this out :-)
Well, it seems that the shift increases with the step (with or without cache):

  • step 7200s -> shift = 0s
  • step 10800s -> shift = 7200s
  • step 86400s -> shift = 28800s

Over a 7-day range, an 8-hour shift starts to be noticeable :-)
Screenshot from 2020-10-22 15-40-42

@yeya24
Contributor

yeya24 commented Oct 22, 2020

Yes, the shift is related to your step as pointed out in the code.

So if you want to run a range query over 7d, why do you use such a large step value? For a standard 7d query, the step would be 2419s, which is about 40m. What's your use case for using 86400s (1d) as your resolution?
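
For reference, the 2419s figure is consistent with dividing the query range by a target of 250 points. The sketch below shows that arithmetic; the divisor of 250 is an assumption inferred from the number quoted here, not something stated in this thread, and `default_step_seconds` is a hypothetical helper name.

```python
def default_step_seconds(range_s: int, target_points: int = 250) -> int:
    """Derive a step so the range yields roughly target_points samples.

    target_points=250 is an assumption chosen because it reproduces 2419s.
    """
    return max(range_s // target_points, 1)


seven_days = 7 * 24 * 3600  # 604800 seconds
step = default_step_seconds(seven_days)
print(f"step={step}s (~{step / 60:.1f} minutes)")

# Point counts for the resolutions mentioned in this issue.
for s in (step, 10800, 86400):
    print(f"step={s}s -> about {seven_days // s + 1} points over 7d")
```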

@yeya24
Contributor

yeya24 commented Oct 22, 2020

But anyway, if this step alignment is not what you want, since it returns more data than expected, we can provide a flag to disable this middleware. Users can then decide whether to disable it, keeping in mind that cacheability will suffer without this middleware.

Does this sound good to you?

@darkweaver87
Author

Actually, I'm embedding these graphs (not this exact query, but anyway) in a web app that uses a JS graphing library whose CPU usage grows with every point it plots :-/ My client doesn't need a fine resolution, just 1 point per day for a month and 1 point every 6 hours for a week.
What you propose seems to be a good compromise :-)

@yeya24
Contributor

yeya24 commented Oct 22, 2020

Thanks, I will open a PR soon.

@darkweaver87
Author

Many thanks :-)
I understand this can be a special way of using the frontend :-)

@bwplotka
Member

Agreed, let's expose it. Thanks, Ben and @darkweaver87, for the investigation 👍

3 participants