
restart workers online (graceful restart) to hot reload, in production environment. #2619

Closed
yangbo1024 opened this issue Dec 8, 2022 · 17 comments · Fixed by #2622 or #2632

Comments

@yangbo1024

yangbo1024 commented Dec 8, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

Requirement: keep the service available at all times.

---- client code ----

# `post` stands for any async HTTP POST helper (e.g. from aiohttp or httpx)
async def run():
    while True:
        await post('http://127.0.0.1:8000/')

When app.m.restart('__ALL_PROCESSES__') is called inside a worker, Sanic crashes.

---- server code ----

@app.post("/")
async def handler(request):
    app.m.restart('__ALL_PROCESSES__')
    return response.text('ok')


if __name__ == "__main__":
    app.run(debug=True, workers=2)

Describe the solution you'd like

Graceful restarting that minimizes the impact of a restart. Roughly:

  1. Graceful restart of workers: restarting all workers should not crash; with only one worker, blocking briefly (while the new worker is not yet started) is acceptable.
  2. A way to gracefully restart workers one by one, e.g.:
    worker_names = tuple(app.m.workers.keys())
    for worker_name in worker_names:
        ret_val = app.m.restart(worker_name)
        # here the worker has been gracefully restarted and ret_val is meaningful
  3. Possibly a combination of the two: when restarting all workers, restart 50% at a time while the other 50% keep serving.

Additional context

Simplify the API:

app.m.restart('__ALL_PROCESSES__')   => app.m.restart_all()

thanks.

@ahopkins
Member

ahopkins commented Dec 8, 2022

Thanks, I just opened a PR for this. In the meantime, you can do this manually:

@app.get("/restart")
async def restart_handler(request: Request):
    for name, worker in request.app.m.workers.items():
        if worker.get("server"):
            request.app.m.restart(name)
    return json(None, status=202)

@yangbo1024
Author

thanks, it works

@yangbo1024
Author

In older versions of Sanic, I used gunicorn to get graceful online hot updates.
gunicorn's steps:

  1. start a new worker,
  2. then kill the old worker.

This sounds good, but after step 1 no notification is sent to the arbiter (main process) to tell it that the new worker is ready to accept connections, so the whole service blocks until the first worker finishes initializing. If a worker has a lot of init work, that block is a big 'moment'.
So I have to stop the workers one by one with an external script or manually, which is painful.
I really hope Sanic's WorkerManager solves this problem and avoids the server blocking during a restart.

@ahopkins
Member

ahopkins commented Dec 8, 2022

@yangbo1024 See linked PR: #2623

LMK if this is what you had in mind.

@yangbo1024
Author

Amazing! That's it.

@Tronic
Member

Tronic commented Dec 8, 2022

A truly hot reload (zero downtime) is also not implemented AFAIK. It would require either a UNIX socket (which can be atomically replaced on server restart) or passing the TCP listening socket from the old WorkerManager to the new one. Ideally one of those would also be implemented, but it is quite complex to do (especially passing the old socket, which requires checking that the address/port didn't change, etc.), and the benefit is minimal: the downtime is extremely short, but a request that lands in the brief window where nothing is bound and listening may still get a 502 or connection-refused error that client code might not expect.

@ahopkins
Member

ahopkins commented Dec 8, 2022

@Tronic The sockets are opened and listen is called in the master process, and they are passed to each subprocess. Also, any existing requests will complete and the old worker does a graceful shutdown. So there should NOT be any time between when the new worker starts and the old one shuts down.

🤔 Well, I guess that is not 100% true. The ack happens in app._startup(), which runs before before_server_start and after_server_start. Maybe we need to delay the ack until the process is fully ready to receive requests.

@Tronic
Member

Tronic commented Dec 8, 2022

During a restart, doesn't the old master process first close its listening socket? Presumably this happens right after a restart is signaled, and soon after that the new master process binds and listens on a new socket. In between, new connections are refused, even though old workers are still serving their existing connections.

If the master process were made either not to restart itself at all (only workers get replaced) or to pass the existing listening socket to the new master process, there would be only one listening socket (of which each process shares a copy). Since a new worker eventually starts accepting on that socket, the system is truly zero-downtime.

In zero-downtime restarts, workers of two or more app versions may be running simultaneously, serving existing requests or websocket connections for up to the graceful shutdown timeout, while the new server and workers are already serving new requests. This may lead to some 500 errors due to database incompatibilities, or other minor confusion within the "gracefully" handled requests, if the new version updated the db schema.

@ahopkins
Member

ahopkins commented Dec 8, 2022

Correct, once the replacement is running, there are no new requests to the old worker. I'll test this some more, but in my testing I could not reach the old worker once the new one started; it would only continue to complete in-flight requests.

@ahopkins
Member

ahopkins commented Dec 8, 2022

It should be pointed out that this is not a breaking change. Reloads have only ever been supported for development use in the past, and there has been no method to trigger them externally.

@yangbo1024
Author

yangbo1024 commented Dec 9, 2022

I think it's enough to handle HTTP connections with short responses well, and that also matches my personal needs. Restarting the main process or changing the IP address and port is not a consideration. Long-lived connections such as websockets and streams can be handled by the Sanic user, including (but not limited to) deliberately increasing the graceful-shutdown wait time. Otherwise Sanic becomes too complex and may even pick up more bugs, which is not friendly to the majority of users.

@yangbo1024
Author

yangbo1024 commented Dec 9, 2022

In my experience releasing an older-version Sanic app with gunicorn, my approach was to use Sanic's 'before_server_stop' listener to make the old worker wait a short time before it actually starts shutting down; the wait is an estimate of the new worker's startup time plus a small margin, usually 5 seconds (see the sketch below).
Even if the old worker has stopped and the new worker has not yet started, requests are not lost, because the main process still listens on the port without accepting, so an inaccurate estimate only means requests block during the handover between the old and new workers rather than being lost.
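
A minimal sketch of that workaround, assuming a plain asyncio.sleep delay and that 5 seconds covers the new worker's startup time (the app name and constant are illustrative):

import asyncio

from sanic import Sanic

app = Sanic("App")  # illustrative app name

NEW_WORKER_STARTUP_ESTIMATE = 5  # seconds; tune to your app's init time

@app.before_server_stop
async def delay_shutdown(app, loop):
    # Give the replacement worker time to finish initializing before this
    # (old) worker stops accepting and drains its in-flight connections.
    await asyncio.sleep(NEW_WORKER_STARTUP_ESTIMATE)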

@Tronic
Member

Tronic commented Dec 9, 2022

Correct, once the replacement is running, there are no new requests to the old worker. I'll test this some more, but in my testing I could not reach the old worker once the new one started; it would only continue to complete in-flight requests.

I didn't look at the implementation yet, but are you:

  1. employing reuseaddr/reuseport to bind another socket to the same address,
  2. closing the old listening socket first and then opening a new one, or
  3. passing the old socket to the new processes (after a reload no bind or listen is needed, because it is the same socket)?

I would say that 1 is buggy, behaves very differently across operating systems, and is also a security problem; 2 is not exactly zero downtime (though it may be so close that it doesn't matter); and 3 is the elegant way to do it (a rough sketch follows).
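
For illustration, option 3 at the OS level could look roughly like this in plain Python. This is not Sanic's implementation, and all names here are made up for the sketch:

import os
import socket
import sys

def make_inheritable_listener(host: str = "127.0.0.1", port: int = 8000) -> socket.socket:
    # Bind and listen once, in the original master process.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(128)
    sock.set_inheritable(True)  # keep the FD open across exec()
    return sock

def reexec_with_listener(sock: socket.socket) -> None:
    # Hand the already-listening FD to the replacement master process via an
    # environment variable; the new process never has to bind() again, so
    # there is no window in which the port is unbound.
    os.environ["LISTENER_FD"] = str(sock.fileno())
    os.execv(sys.executable, [sys.executable, *sys.argv])

def adopt_listener() -> socket.socket:
    # In the new master process: wrap the inherited FD in a socket object
    # instead of binding a fresh one.
    return socket.socket(fileno=int(os.environ["LISTENER_FD"]))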

@Tronic
Member

Tronic commented Dec 9, 2022

Restarting the main process or changing the IP address and port is not a consideration.

Address, port or unix socket modification needs another socket, and in such a case no zero downtime is expected. Development-mode auto reloads obviously need to cope with such changes (specifically if app.run is used and its arguments are modified). Production-mode reloads could offer options to change the socket address, TLS settings, etc., which would then not be zero-downtime.

Another production-mode reload option that could be useful is to make Sanic wait until all old workers have exited before starting the new server and accepting new connections (to avoid problems if the app cannot cope with different versions of itself running simultaneously, due to the database or other interprocess communication). Combined with option 3 from my previous message, this can still avoid any connection-refused errors; connections are only delayed during the grace period.

@ahopkins
Member

specifically if app.run is used and its arguments are modified

This is not supported and I do not have any plans to add that.

@ahopkins
Member

The last commit I pushed to the PR (#2623) adds an app.config.RESTART_ORDER config value. It defaults to the existing order, so no change there, but it will allow someone like @yangbo1024 to opt into the new pattern of starting a new process and waiting for its ack before terminating the old one.

I think this should resolve any remaining concerns. @Tronic?
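
For reference, opting in would look roughly like this; the value name below is my reading of the PR and may not match the final constant, so treat it as an assumption:

from sanic import Sanic

app = Sanic("App")  # illustrative app name

# The default keeps the existing behavior (shut the old worker down first).
# "STARTUP_FIRST" (assumed name) starts the replacement, waits for its ack,
# and only then terminates the old worker.
app.config.RESTART_ORDER = "STARTUP_FIRST"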

@Tronic
Member

Tronic commented Dec 12, 2022

The last commit I pushed to the PR (#2623) adds an app.config.RESTART_ORDER config value. It defaults to the existing order, so no change there, but it will allow someone like @yangbo1024 to opt into the new pattern of starting a new process and waiting for its ack before terminating the old one.

I think this should resolve any remaining concerns. @Tronic?

Adding config options is not my preferred approach, and I still haven't worked out exactly how this is implemented (namely, which of the three options from my previous message it uses). But given that I don't have the time to dig into the implementation, I'll concur with the current implementation that was merged, even if it isn't perfect. My biggest concern is that future improvements may be impaired, e.g. because the config option cannot be removed without breaking something.
