Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat sicedar: added readiness prober #1395

Merged
merged 7 commits into from
Aug 28, 2019

Conversation

FUSAKLA
Copy link
Member

@FUSAKLA FUSAKLA commented Aug 10, 2019

Signed-off-by: Martin Chodur m.chodur@seznam.cz

  • CHANGELOG entry if change is relevant to the end user.

Changes

Next PR in the from the series of adding readiness and liveness probes.

Added readiness and liveness prober to the thanos sidecar.
It follows the logic of setting the promUp metric so when it successfully fetches config of the Prometheus it becomes ready.

The question is should it take into account status of the GRPC interface also?

Verification

  1. started sidecar without running the Prometheus instance results in
$ curl http://0.0.0.0:10902/-/ready
thanos sidecar is not ready. Reason: request flags against http://localhost:9090/api/v1/status/config: Get http://localhost:9090/api/v1/status/config: dial tcp 127.0.0.1:9090: connect: connection refused
  1. started the prometheus sidecar becomes ready
$ curl http://0.0.0.0:10902/-/ready
thanos sidecar is ready
  1. stopped the prometheus instance again sidecar goes again to not ready state after max 30s which is the sidecar check interval
$ curl http://0.0.0.0:10902/-/ready
thanos sidecar is not ready. Reason: request flags against http://localhost:9090/api/v1/status/config: Get http://localhost:9090/api/v1/status/config: dial tcp 127.0.0.1:9090: connect: connection refused

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
Copy link
Member

@povilasv povilasv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seal

cmd/thanos/sidecar.go Outdated Show resolved Hide resolved
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions (: Thanks overall!

func registerSidecar(m map[string]setupFunc, app *kingpin.Application, name string) {
cmd := app.Command(name, "sidecar for Prometheus server")
func registerSidecar(m map[string]setupFunc, app *kingpin.Application) {
comp := component.Sidecar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe inline those, would be more verbose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I inlined it in the registerSidecar function, not sure if you want to inline it everywhere even in the runSidecar ?

cmd/thanos/sidecar.go Outdated Show resolved Hide resolved
cmd/thanos/sidecar.go Outdated Show resolved Hide resolved
Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
@FUSAKLA
Copy link
Member Author

FUSAKLA commented Aug 14, 2019

@povilasv thanks! there are some new changes just so you don't approve something you didn't see so maybe check that out :)

@bwplotka Thanks for the suggestions! I tried some refactor of the sidecar liveness so PTAL.
Also I applied your other suggestions on the compactor also since the changes are the same as in the previous PR.

@FUSAKLA
Copy link
Member Author

FUSAKLA commented Aug 14, 2019

those failing test are weird, not sure if related to my changes?

cmd/thanos/sidecar.go Outdated Show resolved Hide resolved
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it generally make sense.. I am bit worried that those values for sinceLastSuccessfulHeartbeatLimit might be generally nice.. but might not match some unique use cases user may have. Also this means any flake on connection to Prometheus (single heartbeat failed) might mean unavailability. I am a bit skeptical, I would mark it as non ready after 3 heartbeats or something (?)

cmd/thanos/sidecar.go Outdated Show resolved Hide resolved
pkg/prober/prober.go Outdated Show resolved Hide resolved
@bwplotka
Copy link
Member

Let's chat offline @FUSAKLA but overall problem with those probes is that we need to be careful. @povilasv mentioned that all of this can be configured anyone on Kubernetes side, but that's not that simple. Problems:

  1. You can control those things.. timeouts, thresholds etc in Kubernetes, but maybe not in thousands of other solutions/load balancers
  2. You can control those things... But only per container. This means that if you have 4 different cases in the application (like sidecar) when you make something not ready or not healthy.... You cannot do it yolo and ask someone to configure something on top. As they can only adjust all cases in same time there is No granularity.

@bwplotka
Copy link
Member

bwplotka commented Aug 16, 2019

Regarding healthyness: https://twitter.com/miekg/status/1152911058627153920

We are I think on the same page that we should do it ONLY set healthyness to NOT healthy if we are sure that restart will help.

@FUSAKLA
Copy link
Member Author

FUSAKLA commented Aug 16, 2019

Yeah, so should we get back to the original state of just modifying the readiness based on the halthchecks?

I wanted originally to avoid touching the liveness probe because it felt bit harsh but I understood that this was what you suggested. But I probably did not clearly understood what you meant, sorry for the confusement 😕

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
@FUSAKLA
Copy link
Member Author

FUSAKLA commented Aug 20, 2019

Along with the offline agreement, the readiness probe is only set after the initial Prometheus external labels fetch from the Prometheus. Once they are fetched it stays ready all the time.

If we hit any case in the future when more complex readiness handling will be needed it can be added then.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good 👍

One important thing before merge. We never set prober to be healthy. And in prober constructor we set healthyness to be initialError. Maybe we should be healthy by default in NewProber?

@FUSAKLA
Copy link
Member Author

FUSAKLA commented Aug 27, 2019

@bwplotka this is done in the defaultHTTPListener constructor where the mux is started actually if that is ok?

thanos/cmd/thanos/main.go

Lines 362 to 369 in 0a30a2e

g.Add(func() error {
level.Info(logger).Log("msg", "listening for metrics", "address", httpBindAddr)
readinessProber.SetHealthy()
return errors.Wrap(http.Serve(l, mux), "serve metrics")
}, func(err error) {
readinessProber.SetNotHealthy(err)
runutil.CloseWithLogOnErr(logger, l, "metric listener")
})

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
@FUSAKLA
Copy link
Member Author

FUSAKLA commented Aug 27, 2019

But I found one bug with opposite condition on error logging in the Prober, fixed in 60897a8 🙄

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -37,6 +37,19 @@ type Prober struct {
func NewProber(component component.Component, logger log.Logger, reg prometheus.Registerer) *Prober {
initialErr := fmt.Errorf(initialErrorFmt, component)

// From Kubernetes documentation https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/ :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but this should part of struct TBH (for GoDocs)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, is it ok this way?

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bwplotka bwplotka merged commit c83332f into thanos-io:master Aug 28, 2019
@bwplotka
Copy link
Member

Thanks @FUSAKLA !

@FUSAKLA
Copy link
Member Author

FUSAKLA commented Aug 28, 2019

Great, thank you all for the comments and going through the discussions around the readiness!

There is only few components left 😄
If you have the time and will discuss another one, then I made next PR for the store #1460

wbh1 pushed a commit to wbh1/thanos that referenced this pull request Sep 17, 2019
* feat sicedar: added readiness prober

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants