restart workers online (graceful restart) to hot reload, in production environment. #2619
Comments
Thanks, I just opened a PR for this. In the meantime, you can do this manually:

```python
from sanic import Request
from sanic.response import json

@app.get("/restart")
async def restart_handler(request: Request):
    for name, worker in request.app.m.workers.items():
        if worker.get("server"):
            request.app.m.restart(name)
    return json(None, status=202)
```
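For illustration, a hypothetical way to trigger that endpoint from a script (the host and port are assumptions; a 202 response means the restart was accepted):

```python
# Hypothetical client call; localhost:8000 is only an assumed address.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/restart") as resp:
    print(resp.status)  # expect 202 Accepted
```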
Thanks, it works.
In older versions of Sanic, I used Gunicorn to get graceful online hot reloading.
@yangbo1024 See the linked PR: #2623. LMK if this is what you had in mind.
Amazing! That's it.
A true hot reload (zero downtime) is also not implemented, AFAIK. It would require either a UNIX socket (for an atomic replace on server restart) or passing the TCP listening socket from the old WorkerManager to the new one. Ideally one of those would also be implemented, but it is quite complex to do (especially passing the old socket, which requires checking that the address/port didn't change, etc.), and the benefits are minimal: the downtime is extremely short, but it may still cause some 502 or connection-refused errors that client code might not expect, if a request happens within that short timeframe where nothing is bound/listening.
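For illustration only (this is not Sanic's implementation): a minimal sketch of the socket-passing idea, where the old process hands its listening socket's file descriptor to a replacement process so the port is never unbound. The script name and environment variable are assumptions.

```python
import os
import socket

# "Old" process: create and bind the listening socket once.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("127.0.0.1", 8000))
sock.listen(128)
sock.set_inheritable(True)  # let the fd survive exec()

# Hand the fd number to the replacement process via the environment
# (hypothetical new_server.py); the kernel socket stays bound the
# whole time, so no connection is refused during the swap.
env = {**os.environ, "LISTEN_FD": str(sock.fileno())}
# os.execve("/usr/bin/python3", ["python3", "new_server.py"], env)

# In new_server.py, the replacement would re-wrap the same socket:
#     listener = socket.socket(fileno=int(os.environ["LISTEN_FD"]))
```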
@Tronic The sockets are opened and … 🤔 Well, I guess that is not 100% true.
During a restart, doesn't the old master process first close its listening socket? Presumably this happens right after a restart is signaled, and soon after that the new master process binds and listens on a new socket. In between, new connections are refused, even if old workers are still working with their existing connections.

If the master process were made either to not restart itself at all (only workers get replaced) or to pass the existing listening socket to the new master process, there'd be only one listening socket (of which each process shares copies). Because eventually a new worker starts accepting on that socket, the system is truly zero downtime.

In zero-downtime restarts, workers of two or more app versions may be running simultaneously, serving existing requests or websocket connections up to the graceful shutdown timeout, while the new server and workers are already serving any new requests. This may lead to some 500 errors within the "gracefully" handled requests, due to database incompatibilities or other minor confusion, if the new version updated the db schema.
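As a sketch of the "shared copies of one listening socket" point (again, not Sanic code; the address/port are assumptions and this is POSIX-only): after fork(), both processes hold the same bound socket, so the port stays listened on regardless of which process accepts next.

```python
import os
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 8001))  # assumed address/port
listener.listen(16)

if os.fork() == 0:
    # "New" worker: starts accepting on the shared socket right away.
    conn, _ = listener.accept()
    conn.sendall(b"handled by the new worker\n")
    conn.close()
    os._exit(0)
else:
    # "Old" worker: could finish in-flight work and then exit; the
    # socket remains bound as long as any copy of it stays open.
    os.wait()
```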
Correct, once the replacement is running, there are no new requests to the old worker. I'll test this some more, but in my testing I could not reach the old worker once the new one started; it would continue to complete the in-flight requests.
It should be pointed out that this is not a breaking change. Reloads have only ever been supported for development use in the past, and there has been no method to trigger them externally.
I think it's enough to handle short HTTP request/response cycles well, and that also covers my personal needs. Restarting the main process or changing the IP address and port is out of scope. Long-lived services such as websockets and streams can be handled by the Sanic user, including but not limited to deliberately increasing the waiting time for graceful closing. Otherwise Sanic becomes too complex and may even gain more bugs, which is not friendly to the majority of users.
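For example, one way a user can lengthen that graceful-close window is Sanic's GRACEFUL_SHUTDOWN_TIMEOUT setting; the 30-second value and app name below are only illustrative choices:

```python
from sanic import Sanic

app = Sanic("MyApp")

# Give long-lived connections (websockets, streams) more time to finish
# before a worker is torn down during a restart.
app.config.GRACEFUL_SHUTDOWN_TIMEOUT = 30.0  # seconds; example value
```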
In my experience releasing an old version of Sanic behind Gunicorn, my approach was to use Sanic's before_server_stop listener to make the old worker wait a short time before actually starting to shut down. The wait time was an estimate of the new worker's actual startup time plus a little extra; I usually use 5 seconds.
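A minimal sketch of that approach (the app name is an assumption; the 5-second delay is the value mentioned above, not a Sanic default):

```python
import asyncio
from sanic import Sanic

app = Sanic("MyApp")

@app.listener("before_server_stop")
async def delay_shutdown(app, loop):
    # Hold the old worker open long enough for the replacement worker
    # to start accepting new connections.
    await asyncio.sleep(5)
```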
I didn't look at the implementation yet, but are you …
I would say that 1 is buggy, behaves very differently across OSes, and is also a security problem; 2 is not exactly zero downtime (but it may be so close that it doesn't matter); and 3 is the elegant way to do it.
Address, port, or unix socket modification needs another socket, and in that case no zero downtime is expected. Development mode auto reloads obviously need to cope with such changes (specifically if …).

Another production mode reload option that could be of use is to make Sanic wait until all old workers have exited before starting the new server and accepting new connections (to avoid any problems if the app cannot cope with different versions of itself running simultaneously, due to database or other interprocess communication). If used with option 3 of my previous message, this can still avoid any connection-refused errors, with connections only delayed during the grace period.
This is not supported and I do not have any plans to add that.
The last commit I pushed to the PR (#2623) adds a … I think this should resolve any remaining concerns. @Tronic?
Adding config options is not my preferred option, and I still don't get how exactly this is implemented (namely, which of the three options in my previous message). But given that I don't have the time to dig into the implementation, I'll have to concur with the current implementation that was merged, even if it isn't perfect. My biggest concern is whether future improvements are impaired, e.g. because that config option cannot be removed without breaking something.
Is there an existing issue for this?
Is your feature request related to a problem? Please describe.
Condition: keep the service available at all times.
---- client code ----
When app.m.restart("ALL_PROCESSES") was called in a worker, Sanic crashed.
---- server code ----
Describe the solution you'd like
Graceful restarting, and reducing the impact of a restart.
My rough description:
```python
worker_names = tuple(app.m.workers.keys())
for worker_name in worker_names:
    ret_val = app.m.restart(worker_name)
    # Here, the worker has been gracefully restarted and ret_val is meaningful.
```
Additional context
Simplify the API. Thanks.