
restart workers online (graceful restart) to hot reload, in production environment. #2619

Closed
yangbo1024 opened this issue Dec 8, 2022 · 17 comments · Fixed by #2622 or #2632

Comments

@yangbo1024

yangbo1024 commented Dec 8, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

Requirement: keep the service available at all times.

---- client code ----

# `post` stands for any async HTTP POST helper (e.g. from aiohttp or httpx)
async def run():
    while True:
        await post('http://127.0.0.1:8000/')

When app.m.restart('__ALL_PROCESSES__') is called inside a worker, Sanic crashes.

---- server code ----

@app.post("/")
async def handler(request):
    app.m.restart('__ALL_PROCESSES__')
    return response.text('ok')


if __name__ == "__main__":
    app.run(debug=True, workers=2)

Describe the solution you'd like

Graceful restarting that minimizes the impact of a restart. Roughly:

  1. Graceful restart of workers: restarting all workers should not crash; with only one worker, blocking briefly (while the new worker is not yet started) is acceptable.
  2. A way to gracefully restart workers one by one, e.g.:
    worker_names = tuple(app.m.workers.keys())
    for worker_name in worker_names:
        ret_val = app.m.restart(worker_name)
        # here the worker has been gracefully restarted and ret_val is meaningful
  3. Possibly a combination of the two: when restarting all workers, restart 50% at a time while the other 50% keep serving.

Additional context

Simplify the API:

app.m.restart('__ALL_PROCESSES__')   => app.m.restart_all()

thanks.

@ahopkins
Member

ahopkins commented Dec 8, 2022

Thanks, I just opened a PR for this. In the meantime, you can do this manually:

@app.get("/restart")
async def restart_handler(request: Request):
    for name, worker in request.app.m.workers.items():
        if worker.get("server"):
            request.app.m.restart(name)
    return json(None, status=202)

@yangbo1024
Author

thanks, it works

@yangbo1024
Author

In older versions of Sanic, I used gunicorn to get graceful online hot updates.
gunicorn's steps:

  1. start a new worker,
  2. then kill the old worker.

This sounds good, but after step 1 no notification is sent to the arbiter (main process) to tell it that the new worker is ready to accept connections, so the whole service blocks until the first worker finishes initializing. If a worker has a lot of init work, that block is a big 'moment'.
So I have to stop the workers one by one with an external script or manually, which is painful.
I really hope Sanic's WorkerManager solves this problem and avoids the server blocking during a restart.

@ahopkins
Member

ahopkins commented Dec 8, 2022

@yangbo1024 See linked PR: #2623

LMK if this is what you had in mind.

@yangbo1024
Author

Amazing! That's it.

@Tronic
Member

Tronic commented Dec 8, 2022

A truly hot reload (zero downtime) is also not implemented AFAIK. It would require either a UNIX socket (which can be atomically replaced on server restart) or passing the TCP listening socket from the old WorkerManager to the new one. Ideally one of those would also be implemented, but it is quite complex to do (especially passing the old socket, which requires checking that the address/port didn't change, etc.), and the benefit is minimal: the downtime is extremely short, but a request that lands in the brief window where nothing is bound and listening may still get a 502 or connection-refused error that client code might not expect.

@ahopkins
Member

ahopkins commented Dec 8, 2022

@Tronic The sockets are opened and listen is called in the master process, and they are passed to each subprocess. Also, any existing requests will complete and the old worker does a graceful shutdown. So there should NOT be any time between when the new worker starts and the old one shuts down.

🤔 Well, I guess that is not 100% true. The ack happens in app._startup(), which runs before before_server_start and after_server_start. Maybe we need to delay the ack until the process is fully ready to receive requests.

@Tronic
Member

Tronic commented Dec 8, 2022

During a restart, doesn't the old master process first close its listening socket? Presumably this happens right after a restart is signaled, and soon after that the new master process binds and listens on a new socket. In between, new connections are refused, even though old workers are still serving their existing connections.

If the master process were made either not to restart itself at all (only workers get replaced) or to pass the existing listening socket to the new master process, there would be only one listening socket (of which each process shares a copy). Since a new worker eventually starts accepting on that socket, the system is truly zero-downtime.

In zero-downtime restarts, workers of two or more app versions may be running simultaneously, serving existing requests or websocket connections for up to the graceful shutdown timeout, while the new server and workers are already serving new requests. This may lead to some 500 errors due to database incompatibilities, or other minor confusion within the "gracefully" handled requests, if the new version updated the db schema.

@ahopkins
Member

ahopkins commented Dec 8, 2022

Correct, once the replacement is running, there are no new requests to the old worker. I'll test this some more, but in my testing I could not reach the old worker once the new one started; it would only continue to complete in-flight requests.

@ahopkins
Member

ahopkins commented Dec 8, 2022

It should be pointed out that this is not a breaking change. Reloads have only ever been supported for development use in the past, and there has been no method to trigger them externally.

@yangbo1024
Author

yangbo1024 commented Dec 9, 2022

I think it's enough to handle HTTP connections with short responses well, and that also matches my personal needs. Restarting the main process or changing the IP address and port is not a consideration. Long-lived connections such as websockets and streams can be handled by the Sanic user, including (but not limited to) deliberately increasing the graceful-shutdown wait time. Otherwise Sanic becomes too complex and may even pick up more bugs, which is not friendly to the majority of users.

@yangbo1024
Author

yangbo1024 commented Dec 9, 2022

In my experience releasing an older-version Sanic app with gunicorn, my approach was to use Sanic's 'before_server_stop' listener to make the old worker wait a short time before it actually starts shutting down; the wait is an estimate of the new worker's startup time plus a small margin, usually 5 seconds (see the sketch below).
Even if the old worker has stopped and the new worker has not yet started, requests are not lost, because the main process still listens on the port without accepting, so an inaccurate estimate only means requests block during the handover between the old and new workers rather than being lost.
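
A minimal sketch of that workaround, assuming a plain asyncio.sleep delay and that 5 seconds covers the new worker's startup time (the app name and constant are illustrative):

import asyncio

from sanic import Sanic

app = Sanic("App")  # illustrative app name

NEW_WORKER_STARTUP_ESTIMATE = 5  # seconds; tune to your app's init time

@app.before_server_stop
async def delay_shutdown(app, loop):
    # Give the replacement worker time to finish initializing before this
    # (old) worker stops accepting and drains its in-flight connections.
    await asyncio.sleep(NEW_WORKER_STARTUP_ESTIMATE)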

@Tronic
Member

Tronic commented Dec 9, 2022

Correct, once the replacement is running, there are no new requests to the old worker. I'll test this some more, but in my testing I could not reach the old worker once the new one started; it would only continue to complete in-flight requests.

I didn't look at the implementation yet, but are you:

  1. employing reuseaddr/reuseport to bind another socket to the same address,
  2. closing the old listening socket first and then opening a new one, or
  3. passing the old socket to the new processes (after a reload no bind or listen is needed, because it is the same socket)?

I would say that 1 is buggy, behaves very differently across operating systems, and is also a security problem; 2 is not exactly zero downtime (though it may be so close that it doesn't matter); and 3 is the elegant way to do it (a rough sketch follows).
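
For illustration, option 3 at the OS level could look roughly like this in plain Python. This is not Sanic's implementation, and all names here are made up for the sketch:

import os
import socket
import sys

def make_inheritable_listener(host: str = "127.0.0.1", port: int = 8000) -> socket.socket:
    # Bind and listen once, in the original master process.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(128)
    sock.set_inheritable(True)  # keep the FD open across exec()
    return sock

def reexec_with_listener(sock: socket.socket) -> None:
    # Hand the already-listening FD to the replacement master process via an
    # environment variable; the new process never has to bind() again, so
    # there is no window in which the port is unbound.
    os.environ["LISTENER_FD"] = str(sock.fileno())
    os.execv(sys.executable, [sys.executable, *sys.argv])

def adopt_listener() -> socket.socket:
    # In the new master process: wrap the inherited FD in a socket object
    # instead of binding a fresh one.
    return socket.socket(fileno=int(os.environ["LISTENER_FD"]))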

@Tronic
Member

Tronic commented Dec 9, 2022

Restarting the main process or changing the IP address and port is not a consideration.

Address, port or unix socket modification needs another socket, and in such a case no zero downtime is expected. Development-mode auto reloads obviously need to cope with such changes (specifically if app.run is used and its arguments are modified). Production-mode reloads could offer options to change the socket address, TLS settings, etc., which would then not be zero-downtime.

Another production-mode reload option that could be useful is to make Sanic wait until all old workers have exited before starting the new server and accepting new connections (to avoid problems if the app cannot cope with different versions of itself running simultaneously, due to the database or other interprocess communication). Combined with option 3 from my previous message, this can still avoid any connection-refused errors; connections are only delayed during the grace period.

@ahopkins
Member

specifically if app.run is used and its arguments are modified

This is not supported and I do not have any plans to add that.

@ahopkins
Member

The last commit I pushed to the PR (#2623) adds an app.config.RESTART_ORDER config value. It defaults to the existing order, so no change there, but it will allow someone like @yangbo1024 to opt into the new pattern of starting a new process and waiting for its ack before terminating the old one.

I think this should resolve any remaining concerns. @Tronic?
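
For reference, opting in would look roughly like this; the value name below is my reading of the PR and may not match the final constant, so treat it as an assumption:

from sanic import Sanic

app = Sanic("App")  # illustrative app name

# The default keeps the existing behavior (shut the old worker down first).
# "STARTUP_FIRST" (assumed name) starts the replacement, waits for its ack,
# and only then terminates the old worker.
app.config.RESTART_ORDER = "STARTUP_FIRST"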

@Tronic
Member

Tronic commented Dec 12, 2022

The last commit I pushed to the PR (#2623) adds an app.config.RESTART_ORDER config value. It defaults to the existing order, so no change there, but it will allow someone like @yangbo1024 to opt into the new pattern of starting a new process and waiting for its ack before terminating the old one.

I think this should resolve any remaining concerns. @Tronic?

Adding config options is not my preferred approach, and I still haven't worked out exactly how this is implemented (namely, which of the three options from my previous message it uses). But given that I don't have the time to dig into the implementation, I'll concur with the current implementation that was merged, even if it isn't perfect. My biggest concern is that future improvements may be impaired, e.g. because the config option cannot be removed without breaking something.
