query: Add store.response-timeout #928

Merged
merged 6 commits into from
Mar 26, 2019

Conversation

povilasv
Member

@povilasv povilasv commented Mar 15, 2019

Changes

Improves handling of timed-out data.
Adds --store.response-timeout to put a timeout on a single stream operation.

Sometimes gRPC streams get stuck and don't send any data for long periods of time,
e.g. a simple gRPC Store API implementation that just sleeps.

Right now the only way to time them out in gRPC is to set context.WithTimeout().
The issue is that it is a global timeout for the whole stream:
if we set it to 10s, we won't ever get results from Thanos Store (backed by object store);
if we set it to 120s, we will be waiting 120s for the HTTP queries to finish whenever a store is stuck, which is :/
as we will be waiting for the slowest one to finish.

This fix adds a timeout per single data receive operation.

Ref: #928

Thanos Query logs a warning if the timeout is reached:

thanos-query-66f5864584-kh6wp thanos-query level=warn ts=2019-03-19T11:12:41.659624416Z caller=proxy.go:373 err="failed to receive any data in 500ms from Addr: prometheus-0.thanos-sidecar.sys-mon:10901 Labels: [{cloud_provider aws {} [] 0} {kubernetes_cluster dev-aws {} [] 0} {monitor sys-mon-prometheus {} [] 0} {replica sys-mon-prometheus-0 {} [] 0} {uw_environment dev {} [] 0}] Mint: 1552824000000 Maxt: 9223372036854775807:: context deadline exceeded" msg="returning partial response"
thanos-query-66f5864584-kh6wp thanos-query level=warn ts=2019-03-19T11:13:50.291321886Z caller=proxy.go:373 err="failed to receive any data in 500ms from Addr: thanos-store.telecom:10901 Labels: [] Mint: 1535673600000 Maxt: 1552989600000:: context deadline exceeded" msg="returning partial response"
thanos-query-66f5864584-kh6wp thanos-query level=warn ts=2019-03-19T11:15:37.430210112Z caller=proxy.go:373 err="failed to receive any data in 500ms from Addr: thanos-store.customer-platform:10901 Labels: [] Mint: 1549807838621 Maxt: 1552989600000:: context deadline exceeded" msg="returning partial response"

Verification

Tested locally on our cluster, both turned on and turned off (default).

Unit tests.

@povilasv povilasv force-pushed the add-series-timeout branch 3 times, most recently from 87bc810 to b36db84 Compare March 15, 2019 13:08
Member

@bwplotka bwplotka left a comment

Generally I am fine with that (added suggestions), but curious what @devnev and @mjd95 think about this. Can we have your opinions?

Contributor

@mjd95 mjd95 left a comment

👍 looks like a sensible mitigation. Would be great to look into the underlying store issue as a follow-up.

@bwplotka bwplotka requested a review from devnev March 16, 2019 12:48
@povilasv povilasv force-pushed the add-series-timeout branch 2 times, most recently from 4c822b4 to 2207f9d Compare March 18, 2019 14:05
@povilasv povilasv changed the title WIP: query: Add store.receive-timeout WIP: query: Add store.read-timeout Mar 18, 2019
@povilasv povilasv force-pushed the add-series-timeout branch 13 times, most recently from a69a9c5 to c230d23 Compare March 19, 2019 14:03
@povilasv povilasv changed the title WIP: query: Add store.read-timeout query: Add store.read-timeout Mar 19, 2019
Member

@bwplotka bwplotka left a comment

LGTM just:

  • flag name suggestion
  • CI test suggestion

@GiedriusS GiedriusS removed the WIP label Mar 19, 2019
@povilasv povilasv changed the title query: Add store.read-timeout query: Add store.response-timeout Mar 22, 2019
@povilasv
Member Author

@bwplotka I've introduced the THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS environment variable to run the timeout tests. Maybe you could introduce some kind of CircleCI config that runs the tests once a day with that variable set? I would like to know if somebody introduces a regression in this timeout.

https://github.com/improbable-eng/thanos/pull/928/files#diff-e7ba964cd2afa54767a2adc1bac73e67R438

Member

@bwplotka bwplotka left a comment

All good, and I would be happy to merge if not for the confusion around the gRPC metric we add here. See my comment.

@bwplotka
Member

Also, I am happy to add this periodic test in CircleCI for this env. Can you open a GitHub issue and assign it to me for that?

@povilasv
Member Author

Also, I am happy to add this periodic test in CircleCI for this env. Can you open a GitHub issue and assign it to me for that?

@bwplotka #960

@bwplotka
Member

Still before merging we need to remove the metrics from here I guess?

@povilasv povilasv force-pushed the add-series-timeout branch 2 times, most recently from d2bba2e to 0519103 Compare March 25, 2019 09:30
@povilasv
Member Author

Still before merging we need to remove the metrics from here I guess?

Yup, also made a PR for metrics grpc-ecosystem/go-grpc-prometheus#71

@povilasv
Member Author

povilasv commented Mar 25, 2019

@bwplotka Do you want me to squash commits before merge? :)

I've squashed it

povilasv and others added 3 commits March 26, 2019 10:13
Co-Authored-By: povilasv <p.versockas@gmail.com>
Member

@bwplotka bwplotka left a comment

Nice, just one minor nit, not a blocker though. LGTM, thanks!

6 participants