
Reload Config without downtime #264

Open
Smithx10 opened this issue Jan 17, 2020 · 7 comments

Comments

@Smithx10

I'm actively rewriting the zrepl config clients list when new clients come up, and would like zrepl to see the new clients without a full restart.

@genofire commented Oct 2, 2020

And without stopping currently running jobs ;)

@problame (Member) commented Oct 4, 2020

Given that #202 also only asks for reloads of transport config, I think we could start by just implementing reloads for transports.

I'm unsure about the user interface / user experience for reloads, though. Here are some options:

  • zrepl config reload transport JOB parses the on-disk config and calls a new method on a job, e.g. Job.ReloadTransport().

    • easy to implement because the reload logic doesn't need to deal with job renames
    • fine-grained + specific in what is to be reloaded
      • we don't have to figure out, based on a diff, what should be reloaded. That diff would probably be the bulk of the work.
    • if a config change is incompatible, the error reporting is ok because the user has enough context (they specified JOB themselves, after all)
  • zrepl config reload (or a SIGHUP) could load the new config from disk and determine what has changed compared to the currently loaded config. If the set of job names is the same, we assume no renames occurred; otherwise we bail out. Then we pass each new job config to its job via a new method Job.ReloadConfig(config.jobs[i]), which again compares what has changed and decides whether it can perform the reload. (A rough sketch of both ideas follows after this list.)

    • what should happen if we have a mechanism to reload the transport but the user also changed the snapshot interval? Error and do nothing? Just reload the transport and log a warning?
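
To make this more concrete, here is a rough sketch of what a Job.ReloadTransport() hook and the rename check from the second option could look like. None of these types or helpers exist in zrepl today; the names and the TransportConfig fields are purely illustrative.

```go
// Hypothetical sketch only: neither Job.ReloadTransport nor these config
// helpers exist in zrepl today; names are made up for illustration.
package main

import "fmt"

// TransportConfig stands in for whatever subset of the job config
// describes the transport (TLS certs, listen address, client list, ...).
type TransportConfig struct {
	ListenAddr string
	Clients    []string
}

// Job is the interface a reload-capable job would have to satisfy.
type Job interface {
	Name() string
	// ReloadTransport swaps the transport config in place and reports
	// whether the change could be applied without a restart.
	ReloadTransport(newConf TransportConfig) error
}

// sameJobNames implements the "bail out on renames" check from the second
// option: a reload is only attempted if the set of job names is unchanged.
func sameJobNames(oldNames, newNames []string) bool {
	if len(oldNames) != len(newNames) {
		return false
	}
	seen := make(map[string]struct{}, len(oldNames))
	for _, n := range oldNames {
		seen[n] = struct{}{}
	}
	for _, n := range newNames {
		if _, ok := seen[n]; !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(sameJobNames([]string{"push", "snap"}, []string{"snap", "push"})) // true
	fmt.Println(sameJobNames([]string{"push"}, []string{"push_renamed"}))         // false
}
```

The point of sameJobNames is just to implement the "bail out on renames" rule cheaply, without trying to guess which job was renamed.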

I'm looking forward to your feedback!

@3nprob commented Mar 2, 2021

I think my perspective is a somewhat common one. I run zrepl as a systemd service (using the default unit that comes with the deb package). I configure all jobs through Ansible.

The ideal scenario here is that I could issue a systemd service reload (which I guess would best be a SIGHUP, but a command is no big deal either, as long as it doesn't require a JOB as argument).

This would then behave as if a restart had happened, except without interrupting ongoing operations.
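
For what it's worth, the systemd side of this would presumably just be an ExecReload= line that sends SIGHUP to the main PID; on the daemon side, the wiring could look roughly like this (daemonReload() is a made-up placeholder, not existing zrepl code):

```go
// Sketch of a SIGHUP-driven reload: systemd's ExecReload= would send SIGHUP,
// and the daemon would re-read the config. daemonReload() is hypothetical.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func daemonReload() error {
	// Hypothetical hook: re-parse the on-disk config and hand the new job
	// configs to the running jobs, leaving in-flight replications alone.
	log.Println("reloading config")
	return nil
}

func main() {
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	for range hup {
		if err := daemonReload(); err != nil {
			log.Printf("config reload failed, keeping old config: %v", err)
		}
	}
}
```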

what should happen if we have a mechanism to reload the transport but the user also changed the snapshot interval? Error and do nothing? Just reload the transport and log a warning?

I may be missing something about how things work internally, but couldn't it just switch over to the new schedule (and, in case anything is currently running, let it finish regardless, after which the new interval takes effect)?

If there's any complicating factor in any of this, I think even ignoring changes on existing jobs and only considering new ones is still useful in itself, though being able to change the interval would be great.

@InsanePrawn (Contributor)

I may be missing something about how things work internally, but couldn't it just switch over to the new schedule (and, in case anything is currently running, let it finish regardless, after which the new interval takes effect)?

If there's any complicating factor in any of this, I think even ignoring changes on existing jobs and only considering new ones is still useful in itself, though being able to change the interval would be great.

I'd agree: letting everything ongoing finish and then reloading/replacing (or deleting) the jobs once they're idle seems sane enough to me from a conceptual perspective; of course, implementing it is a different problem.
[Or, as my people say: a Zehnzeiler ("ten-liner") :^)]
We can then argue about whether deletions should stop running jobs.

@mister2d

Any progress on this?

@problame (Member)

I haven't seen any proposal for how we would detect job renames.
And in general, I think we can't easily implement a generic "reload config" hook right now.

That being said, I think it's feasible to start small, e.g., zrepl config reload transport [JOB], where JOB is optional.
Or, even more fine-grained, to address the two use cases we've seen in this thread:

  • zrepl config reload transport-tls [JOB]
  • zrepl config reload transport-tcp [JOB]

I think these are good beginner tasks and I'm happy to review draft-level pull requests.
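
For the transport-tls case specifically, one option that would avoid restarting the listener entirely is to resolve the certificate per handshake via crypto/tls's GetCertificate hook, so that a reload only swaps the stored certificate. A minimal sketch under that assumption (reloadableCert is not an existing zrepl type, and the paths are illustrative):

```go
// Sketch of hot-reloading the TLS certificate without restarting the listener.
// reloadableCert is hypothetical, not zrepl code.
package main

import (
	"crypto/tls"
	"sync"
)

type reloadableCert struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// Reload re-reads the key pair from disk, e.g. after certificate rotation.
func (r *reloadableCert) Reload(certFile, keyFile string) error {
	c, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = &c
	r.mu.Unlock()
	return nil
}

// TLSConfig returns a config whose certificate is looked up per handshake,
// so new connections see the reloaded certificate without a listener restart.
func (r *reloadableCert) TLSConfig() *tls.Config {
	return &tls.Config{
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			r.mu.RLock()
			defer r.mu.RUnlock()
			return r.cert, nil
		},
	}
}

func main() {
	var rc reloadableCert
	_ = rc.Reload("/etc/zrepl/zrepl.crt", "/etc/zrepl/zrepl.key") // illustrative paths
	_ = rc.TLSConfig()
}
```

Already-established connections keep the certificate they handshook with; new handshakes pick up the reloaded one.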

problame added this to the 0.7 milestone Sep 25, 2022
@InsanePrawn (Contributor)

I haven't seen any proposal for how we would detect job renames.

Do we need to?

IMO, this should be handled the same as deleting one job and creating a new one (a rough sketch of the name-based diff follows the list):

  1. Detect deletion: the job name is missing from the new config.
  2. As was previously discussed, we should probably let ongoing jobs finish. We might print a warning during config reloading about the jobs this applies to.
  3. Detect creation: the job name was not present in the old config; bring up the new job.
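
A rough sketch of that name-based diff, with JobConfig and the drain/start decisions as placeholders rather than existing zrepl types:

```go
// Sketch of the rename-agnostic reconcile described above: a rename is just
// delete + create, keyed purely by job name. JobConfig is a placeholder.
package main

import "fmt"

type JobConfig struct {
	Name string
	// ... rest of the per-job config
}

type diff struct {
	deleted, created, kept []string
}

func diffJobs(oldCfg, newCfg map[string]JobConfig) diff {
	var d diff
	for name := range oldCfg {
		if _, ok := newCfg[name]; !ok {
			d.deleted = append(d.deleted, name) // let it finish, then tear down
		} else {
			d.kept = append(d.kept, name)
		}
	}
	for name := range newCfg {
		if _, ok := oldCfg[name]; !ok {
			d.created = append(d.created, name) // bring up immediately
		}
	}
	return d
}

func main() {
	before := map[string]JobConfig{"push_old": {Name: "push_old"}, "snap": {Name: "snap"}}
	after := map[string]JobConfig{"push_new": {Name: "push_new"}, "snap": {Name: "snap"}}
	fmt.Printf("%+v\n", diffJobs(before, after)) // push_old deleted, push_new created, snap kept
}
```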

What kinds of things could actually conflict or go wrong here? I can imagine the following:

  • unhealthy job parallelism: a leftover job could keep running in parallel to a new or renamed job instance, and weirdness could ensue. Maybe the config reload should have a --stop-orphaned-jobs flag, or default to that behaviour.
  • resource conflicts: right now I can only imagine transport conflicts, e.g. trying to bind to the same port. For transports, zrepl could potentially detect conflicts caused by a config reload, e.g. the same transport specified both in a deleted job that remains active (because we let it finish) and in a new or renamed job. (To be clear: I don't think zrepl should examine the system's open ports, but rather only its internal state; a rough sketch of such a check follows below.)
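
And a sketch of the kind of internal-state conflict check meant above, keyed on listen addresses only (purely illustrative; zrepl has no such function today):

```go
// Sketch of an internal-state conflict check: before starting new jobs,
// compare their listen addresses against those of jobs that stay around
// (including deleted-but-still-draining ones). listenConflicts is hypothetical.
package main

import "fmt"

func listenConflicts(stillActive, toStart map[string]string) []string {
	inUse := make(map[string]string) // listen addr -> job name
	for job, addr := range stillActive {
		inUse[addr] = job
	}
	var conflicts []string
	for job, addr := range toStart {
		if other, ok := inUse[addr]; ok {
			conflicts = append(conflicts,
				fmt.Sprintf("job %q wants %s, still held by draining job %q", job, addr, other))
		}
	}
	return conflicts
}

func main() {
	active := map[string]string{"sink_old": ":8888"}
	starting := map[string]string{"sink_new": ":8888"}
	fmt.Println(listenConflicts(active, starting))
}
```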
