
Reload Config without downtime #264

Open
Smithx10 opened this issue Jan 17, 2020 · 7 comments

Comments

@Smithx10

I'm actively rewriting the zrepl config clients list when new clients come up, and would like zrepl to see the new clients without a full restart.

@genofire commented Oct 2, 2020

And without stopping currently running jobs ;)

@problame (Member) commented Oct 4, 2020

Given that #202 also only asks for reloads of transport config, I think we could start by just implementing reloads for transports.

I'm unsure about the user interface / user experience for reloads, though. Here are some options:

  • zrepl config reload transport JOB parses the on-disk config and calls a new method on a job, e.g. Job.ReloadTransport().

    • easy to implement because the reload logic doesn't need to deal with job renames
    • fine-grained + specific in what is to be reloaded
      • we don't have to figure out, based on a diff, what should be reloaded. That diff would probably be the bulk of the work.
    • if a config change is incompatible, the error reporting is ok because the user has enough context (they specified JOB themselves, after all)
  • zrepl config reload (or a SIGHUP) could load the new config from disk and determine what has changed compared to the currently loaded config. If the set of job names is the same, we assume no renames occurred; otherwise we bail out. Then we pass each new job config to its job via a new method Job.ReloadConfig(config.jobs[i]), which again compares what has changed and decides whether it can perform the reload. (A rough sketch of both ideas follows after this list.)

    • what should happen if we have a mechanism to reload the transport but the user also changed the snapshot interval? Error and do nothing? Just reload the transport and log a warning?
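
To make this more concrete, here is a rough sketch of what a Job.ReloadTransport() hook and the rename check from the second option could look like. None of these types or helpers exist in zrepl today; the names and the TransportConfig fields are purely illustrative.

```go
// Hypothetical sketch only: neither Job.ReloadTransport nor these config
// helpers exist in zrepl today; names are made up for illustration.
package main

import "fmt"

// TransportConfig stands in for whatever subset of the job config
// describes the transport (TLS certs, listen address, client list, ...).
type TransportConfig struct {
	ListenAddr string
	Clients    []string
}

// Job is the interface a reload-capable job would have to satisfy.
type Job interface {
	Name() string
	// ReloadTransport swaps the transport config in place and reports
	// whether the change could be applied without a restart.
	ReloadTransport(newConf TransportConfig) error
}

// sameJobNames implements the "bail out on renames" check from the second
// option: a reload is only attempted if the set of job names is unchanged.
func sameJobNames(oldNames, newNames []string) bool {
	if len(oldNames) != len(newNames) {
		return false
	}
	seen := make(map[string]struct{}, len(oldNames))
	for _, n := range oldNames {
		seen[n] = struct{}{}
	}
	for _, n := range newNames {
		if _, ok := seen[n]; !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(sameJobNames([]string{"push", "snap"}, []string{"snap", "push"})) // true
	fmt.Println(sameJobNames([]string{"push"}, []string{"push_renamed"}))         // false
}
```

The point of sameJobNames is just to implement the "bail out on renames" rule cheaply, without trying to guess which job was renamed.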

I'm looking forward to your feedback!

@3nprob commented Mar 2, 2021

I think my perspective is a somewhat common one. I run zrepl as a systemd service (using the default unit that comes with the deb package). I configure all jobs through Ansible.

The ideal scenario here is that I could issue a systemd service reload (which I guess would best be a SIGHUP, but a command is no big deal either, as long as it doesn't require a JOB as argument).

This would then behave as if a restart had happened, except without interrupting ongoing operations.
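
For what it's worth, the systemd side of this would presumably just be an ExecReload= line that sends SIGHUP to the main PID; on the daemon side, the wiring could look roughly like this (daemonReload() is a made-up placeholder, not existing zrepl code):

```go
// Sketch of a SIGHUP-driven reload: systemd's ExecReload= would send SIGHUP,
// and the daemon would re-read the config. daemonReload() is hypothetical.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func daemonReload() error {
	// Hypothetical hook: re-parse the on-disk config and hand the new job
	// configs to the running jobs, leaving in-flight replications alone.
	log.Println("reloading config")
	return nil
}

func main() {
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	for range hup {
		if err := daemonReload(); err != nil {
			log.Printf("config reload failed, keeping old config: %v", err)
		}
	}
}
```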

what should happen if we have a mechanism to reload the transport but the user also changed the snapshot interval? Error and do nothing? Just reload the transport and log a warning?

I may be missing something about how things work internally, but couldn't it just switch over to the new schedule (and, in case anything is currently running, let it finish regardless, after which the new interval takes effect)?

If there's any complicating factor in any of this, I think even ignoring changes on existing jobs and only considering new ones is still useful in itself, though being able to change the interval would be great.

@InsanePrawn (Contributor)

I may be missing something about how things work internally, but couldn't it just switch over to the new schedule (and, in case anything is currently running, let it finish regardless, after which the new interval takes effect)?

If there's any complicating factor in any of this, I think even ignoring changes on existing jobs and only considering new ones is still useful in itself, though being able to change the interval would be great.

I'd agree: letting everything ongoing finish and then reloading/replacing (or deleting) the jobs once they're idle seems sane enough to me from a conceptual perspective; of course, implementing it is a different problem.
[Or, as my people say: a Zehnzeiler ("ten-liner") :^)]
We can then argue about whether deletions should stop running jobs.

@mister2d

Any progress on this?

@problame (Member)

I haven't seen any proposal for how we would detect job renames.
And in general, I think we can't easily implement a generic "reload config" hook right now.

That being said, I think it's feasible to start small, e.g., zrepl config reload transport [JOB], where JOB is optional.
Or, even more fine-grained, to address the two use cases we've seen in this thread:

  • zrepl config reload transport-tls [JOB]
  • zrepl config reload transport-tcp [JOB]

I think these are good beginner tasks and I'm happy to review draft-level pull requests.
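
For the transport-tls case specifically, one option that would avoid restarting the listener entirely is to resolve the certificate per handshake via crypto/tls's GetCertificate hook, so that a reload only swaps the stored certificate. A minimal sketch under that assumption (reloadableCert is not an existing zrepl type, and the paths are illustrative):

```go
// Sketch of hot-reloading the TLS certificate without restarting the listener.
// reloadableCert is hypothetical, not zrepl code.
package main

import (
	"crypto/tls"
	"sync"
)

type reloadableCert struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// Reload re-reads the key pair from disk, e.g. after certificate rotation.
func (r *reloadableCert) Reload(certFile, keyFile string) error {
	c, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = &c
	r.mu.Unlock()
	return nil
}

// TLSConfig returns a config whose certificate is looked up per handshake,
// so new connections see the reloaded certificate without a listener restart.
func (r *reloadableCert) TLSConfig() *tls.Config {
	return &tls.Config{
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			r.mu.RLock()
			defer r.mu.RUnlock()
			return r.cert, nil
		},
	}
}

func main() {
	var rc reloadableCert
	_ = rc.Reload("/etc/zrepl/zrepl.crt", "/etc/zrepl/zrepl.key") // illustrative paths
	_ = rc.TLSConfig()
}
```

Already-established connections keep the certificate they handshook with; new handshakes pick up the reloaded one.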

problame added this to the 0.7 milestone Sep 25, 2022
@InsanePrawn (Contributor)

I haven't seen any proposal for how we would detect job renames.

Do we need to?

IMO, this should be handled the same as deleting one job and creating a new one (a rough sketch of the name-based diff follows the list):

  1. Detect deletion: the job name is missing from the new config.
  2. As was previously discussed, we should probably let ongoing jobs finish. We might print a warning during config reloading about the jobs this applies to.
  3. Detect creation: the job name was not present in the old config; bring up the new job.
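
A rough sketch of that name-based diff, with JobConfig and the drain/start decisions as placeholders rather than existing zrepl types:

```go
// Sketch of the rename-agnostic reconcile described above: a rename is just
// delete + create, keyed purely by job name. JobConfig is a placeholder.
package main

import "fmt"

type JobConfig struct {
	Name string
	// ... rest of the per-job config
}

type diff struct {
	deleted, created, kept []string
}

func diffJobs(oldCfg, newCfg map[string]JobConfig) diff {
	var d diff
	for name := range oldCfg {
		if _, ok := newCfg[name]; !ok {
			d.deleted = append(d.deleted, name) // let it finish, then tear down
		} else {
			d.kept = append(d.kept, name)
		}
	}
	for name := range newCfg {
		if _, ok := oldCfg[name]; !ok {
			d.created = append(d.created, name) // bring up immediately
		}
	}
	return d
}

func main() {
	before := map[string]JobConfig{"push_old": {Name: "push_old"}, "snap": {Name: "snap"}}
	after := map[string]JobConfig{"push_new": {Name: "push_new"}, "snap": {Name: "snap"}}
	fmt.Printf("%+v\n", diffJobs(before, after)) // push_old deleted, push_new created, snap kept
}
```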

What kinds of things could actually conflict or go wrong here? I can imagine the following:

  • unhealthy job parallelism: a leftover job could keep running in parallel to a new or renamed job instance, and weirdness could ensue. Maybe the config reload should have a --stop-orphaned-jobs flag, or default to that behaviour.
  • resource conflicts: right now I can only imagine transport conflicts, e.g. trying to bind to the same port. For transports, zrepl could potentially detect conflicts caused by a config reload, e.g. the same transport specified both in a deleted job that remains active (because we let it finish) and in a new or renamed job. (To be clear: I don't think zrepl should examine the system's open ports, but rather only its internal state; a rough sketch of such a check follows below.)
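
And a sketch of the kind of internal-state conflict check meant above, keyed on listen addresses only (purely illustrative; zrepl has no such function today):

```go
// Sketch of an internal-state conflict check: before starting new jobs,
// compare their listen addresses against those of jobs that stay around
// (including deleted-but-still-draining ones). listenConflicts is hypothetical.
package main

import "fmt"

func listenConflicts(stillActive, toStart map[string]string) []string {
	inUse := make(map[string]string) // listen addr -> job name
	for job, addr := range stillActive {
		inUse[addr] = job
	}
	var conflicts []string
	for job, addr := range toStart {
		if other, ok := inUse[addr]; ok {
			conflicts = append(conflicts,
				fmt.Sprintf("job %q wants %s, still held by draining job %q", job, addr, other))
		}
	}
	return conflicts
}

func main() {
	active := map[string]string{"sink_old": ":8888"}
	starting := map[string]string{"sink_new": ":8888"}
	fmt.Println(listenConflicts(active, starting))
}
```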
