Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ruler: probes are not started while WAL is being replayed #4280

Closed
OGKevin opened this issue May 27, 2021 · 8 comments
Closed

ruler: probes are not started while WAL is being replayed #4280

OGKevin opened this issue May 27, 2021 · 8 comments

Comments

@OGKevin
Copy link
Contributor

OGKevin commented May 27, 2021

Thanos, Prometheus and Golang version used:
Thanos: 20e8105ec5b697a4755f2ff735509924173bd5cf

What happened:
Ruler crashed and was restarted, it now needs to replay the WAL. However, the probes are started after the WAL has been read, This code

thanos/cmd/thanos/rule.go

Lines 320 to 323 in f2564a7

db, err := tsdb.Open(conf.dataDir, log.With(logger, "component", "tsdb"), reg, tsdbOpts)
if err != nil {
return errors.Wrap(err, "open TSDB")
}

eventually calls head.Init in tsdb that replays the wall. The health probes are started afterwards here:

thanos/cmd/thanos/rule.go

Lines 531 to 537 in f2564a7

g.Add(func() error {
statusProber.Ready()
return s.ListenAndServe()
}, func(err error) {
statusProber.NotReady(err)
s.Shutdown(err)
})

This means that while ruler is reading the WAL, the liveness probe will fail and you end up in a crash loop unless you significantly increase the initial delay.

What you expected to happen:
Probe be the first thing to get started while the rest of ruler starts up. Reading WAL should be considered healthy but not ready. While failure to read WAL should be unhealthy.

@bwplotka
Copy link
Member

bwplotka commented Jun 1, 2021

Valid bug. Maybe @ianbillett want to try to fix it? 🤗

@bwplotka
Copy link
Member

bwplotka commented Jun 1, 2021

BTW note that we are close to make stateless ruler happen (#4250), but WAL will be still there 🙃

@OGKevin
Copy link
Contributor Author

OGKevin commented Jun 2, 2021

Cool, I'm also happy to work on this. Was just not sure if this was an actual bug or not.

@bill3tt
Copy link
Contributor

bill3tt commented Jun 2, 2021

@OGKevin nice report! Since you reported the bug, you have the right to fix it. It doesn't seem too complicated, so a nice opportunity to get a commit into the Thanos code base :)

To fix, I think we just need to change the ordering of the ruler startup (as you point out above) so that we initialise the probes, and mark ourselves as healthy before then initialising the TSDB.

@OGKevin
Copy link
Contributor Author

OGKevin commented Jun 2, 2021

So due to the current dependency chain, the current order makes sense.

HTTP and GRPC server need Rule Manager, Rule Manager needs tsdb.

I can think of 2 solutions atm:

  1. Start HTTP server first with probes, and later register the additional /api routes to HTTP and start GRPC server once Rule Manager has been set up.
  2. Create some "complex" implementations for HTTP and GRPC server that run fully on start and return errors until the status probe is set to ready. So basically make them status probe aware. But this most likely would require more discussion to come up with a design. This one could also introduce null pointer bugs.

@bwplotka
Copy link
Member

I think we need 2 but it won't be complex. It will be exactly the same as

type ReadyStorage struct {
(: so some readiness struct that will tell API when to return success.

@stale
Copy link

stale bot commented Aug 21, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 21, 2021
@stale
Copy link

stale bot commented Sep 4, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Sep 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants