
Add a Status Endpoint and Gracefully Shutdown #2907

Merged: 13 commits, Mar 3, 2021
Conversation

kevinkreiser (Member) commented Mar 1, 2021:

This is a first stab at adding the skeleton of a status endpoint to the service. The idea is that we discuss over in #2736 what should actually be in there, and then a future PR can add that info.

This is targeting new prime_server functionality summarized in #2908.

First we hook up to prime_server's new SIGTERM handling, which allows our daemon processes to gracefully exit when they are asked to shut down. The shutdown happens in two phases: the drain phase and the shutdown phase. The drain phase is meant to allow traffic to drain off (finish outstanding requests), while the shutdown phase is meant for our worker threads to stop their event loops. Both phases have a configurable amount of time to allow for these operations to happen. In practice an upstream load balancer will hopefully stop sending new requests immediately, so really we just need to set the drain time based on the p99 response time we expect for an outstanding request. The shutdown time can be really small, so we set it to the smallest value, 1 second, by default.
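For reference, a minimal sketch of the relevant keys in valhalla.json (the key names match what the service reads; 28 and 1 are the defaults this PR adds to config generation, and all other keys are omitted here):

```json
{
  "httpd": {
    "service": {
      "drain_seconds": 28,
      "shutdown_seconds": 1
    }
  }
}
```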

We also add a /status endpoint (currently empty json) where we can put very basic status information in the future. This is useful to check that the workers have actually loaded and can take traffic. At the moment I haven't configured it to need to go from loki->thor->odin, but I probably should, just to prove the whole system is up. Consider that a TODO, and possibly also useful if we want to add information that only each of those modules is privy to. This is now done. The important part about the /status endpoint, though, is that if you opted into the SIGTERM handling above and you are either in a draining state or a shutting-down state, the endpoint will return HTTP 503. So basically what happens is: someone sends SIGTERM to our processes, and they begin to drain traffic (finish up outstanding requests). While that is happening, new status requests will return 503, i.e. "hey buddy, you told me to shut down, I don't want any new requests, I'm draining the ones I had when you told me to quit."
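As a hedged sketch of how a deployment might react to the endpoint's two responses (the /status path and the 200/503 semantics come from this PR; `status_to_state`, the port 8002, and the curl invocation are illustrative assumptions, not part of the change):

```shell
# Hypothetical helper mapping the /status HTTP code to a coarse state.
# 200/503 semantics come from this PR; everything else is illustrative.
status_to_state() {
  case "$1" in
    200) echo "healthy" ;;     # workers loaded and taking traffic
    503) echo "draining" ;;    # SIGTERM received, finishing outstanding requests
    *)   echo "unreachable" ;; # service not up yet (or mid-restart)
  esac
}

# In a real probe the code would come from something like:
#   code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8002/status)
status_to_state 200   # prints: healthy
status_to_state 503   # prints: draining
```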

Because the status check goes through all of the workers, it's harder to unit test. I'm going to make a new unit test that runs the whole service to confirm this behavior.

One common way to run valhalla is in multiprocess mode using a program like supervisord. Doing so has many benefits, as it protects you from crashes that wipe out a whole machine, for example. The changes here make it so that valhalla plays nice with supervisord when it requests that valhalla shut down. Here is a minimal example of running valhalla with supervisord. Pay attention to the inline comment in the file, as it's very important.

First make some environment variables that supervisor can use:

# we need these because we don't have a way for prime_server and supervisord to read valhalla configs directly
export CONFIG_FILE=../valhalla.json
export PRIME_LISTEN=$(jq -r ".httpd.service.listen" ${CONFIG_FILE})
export PRIME_PROXY=$(jq -r ".loki.service.proxy" ${CONFIG_FILE})_in
export PRIME_LOOPBACK=$(jq -r ".httpd.service.loopback" ${CONFIG_FILE})
export PRIME_INTERRUPT=$(jq -r ".httpd.service.interrupt" ${CONFIG_FILE})
export LOKI_PROXY_IN=$(jq -r ".loki.service.proxy" ${CONFIG_FILE})_in
export LOKI_PROXY_OUT=$(jq -r ".loki.service.proxy" ${CONFIG_FILE})_out
export ODIN_PROXY_IN=$(jq -r ".odin.service.proxy" ${CONFIG_FILE})_in
export ODIN_PROXY_OUT=$(jq -r ".odin.service.proxy" ${CONFIG_FILE})_out
export THOR_PROXY_IN=$(jq -r ".thor.service.proxy" ${CONFIG_FILE})_in
export THOR_PROXY_OUT=$(jq -r ".thor.service.proxy" ${CONFIG_FILE})_out
export DRAIN=$(jq -r ".httpd.service.drain_seconds" ${CONFIG_FILE})
export SHUTDOWN=$(jq -r ".httpd.service.shutdown_seconds" ${CONFIG_FILE})
export WORKER_PARALLELISM=2
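To sanity-check the jq extraction, here is a sketch against a tiny made-up config (requires jq; the endpoint values are invented, and only the key paths match the real exports above):

```shell
# Made-up minimal config exercising the same key paths the exports read.
cat > /tmp/valhalla_sample.json <<'EOF'
{
  "httpd": {"service": {"listen": "tcp://*:8002", "drain_seconds": 28, "shutdown_seconds": 1}},
  "loki": {"service": {"proxy": "ipc:///tmp/loki"}}
}
EOF

CONFIG_FILE=/tmp/valhalla_sample.json
PRIME_LISTEN=$(jq -r ".httpd.service.listen" ${CONFIG_FILE})
LOKI_PROXY_IN=$(jq -r ".loki.service.proxy" ${CONFIG_FILE})_in
DRAIN=$(jq -r ".httpd.service.drain_seconds" ${CONFIG_FILE})

echo "${PRIME_LISTEN} ${LOKI_PROXY_IN} ${DRAIN}"
# prints: tcp://*:8002 ipc:///tmp/loki_in 28
```

Note the `_in`/`_out` suffixing: the config stores one proxy endpoint name, and the suffix selects which side of that proxy a process binds to.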

Then make your supervisor config like so:

[supervisord]
nodaemon=true
loglevel=debug
pidfile=/tmp/supervisord.pid

[program:prime_httpd]
command=prime_httpd %(ENV_PRIME_LISTEN)s %(ENV_PRIME_PROXY)s %(ENV_PRIME_LOOPBACK)s %(ENV_PRIME_INTERRUPT)s true 10485760 '-1' %(ENV_DRAIN)s,%(ENV_SHUTDOWN)s /alive

[program:loki_proxy]
command=prime_proxyd %(ENV_LOKI_PROXY_IN)s %(ENV_LOKI_PROXY_OUT)s %(ENV_DRAIN)s,%(ENV_SHUTDOWN)s

[program:thor_proxy]
command=prime_proxyd %(ENV_THOR_PROXY_IN)s %(ENV_THOR_PROXY_OUT)s %(ENV_DRAIN)s,%(ENV_SHUTDOWN)s

[program:odin_proxy]
command=prime_proxyd %(ENV_ODIN_PROXY_IN)s %(ENV_ODIN_PROXY_OUT)s %(ENV_DRAIN)s,%(ENV_SHUTDOWN)s

[program:loki_worker]
environment=LOCPATH=/usr/local/locales
command=valhalla_loki_worker %(ENV_CONFIG_FILE)s
numprocs=%(ENV_WORKER_PARALLELISM)s
process_name=loki_worker_%(process_num)02d

[program:thor_worker]
environment=LOCPATH=/usr/local/locales
command=valhalla_thor_worker %(ENV_CONFIG_FILE)s
numprocs=%(ENV_WORKER_PARALLELISM)s
process_name=thor_worker_%(process_num)02d

[program:odin_worker]
environment=LOCPATH=/usr/local/locales
command=valhalla_odin_worker %(ENV_CONFIG_FILE)s
numprocs=%(ENV_WORKER_PARALLELISM)s
process_name=odin_worker_%(process_num)02d

# there is a feature/bug in supervisord: https://github.com/Supervisor/supervisor/issues/723#issuecomment-788980891
# when you send SIGTERM to supervisord, it forwards the signal to each program in a random order but not async:
# it sends it to one program, waits for it to exit, then does the next one, and so on. because of that, the
# /status endpoint won't start returning 503 unless you put all of the workers in a group, so that all of
# them get the signal at once and start shutting down.
[group:service]
programs=prime_httpd,loki_proxy,thor_proxy,odin_proxy,loki_worker,thor_worker,odin_worker

Then you can test your whole setup by running:

supervisord -nc supervisord.conf
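To see the drain idea in isolation, here is a toy shell sketch (not valhalla code, purely an illustration of the two-phase idea): a background job that, on SIGTERM, keeps "serving" for a drain period before exiting, the way the workers here finish outstanding requests before stopping their event loops:

```shell
# Toy illustration only (not valhalla code): a "worker" that keeps serving for a
# drain period after SIGTERM before exiting, mirroring the two-phase shutdown.
DRAIN=1   # seconds to drain after SIGTERM (cf. drain_seconds above)
(
  trap 'echo "draining for ${DRAIN}s"; sleep ${DRAIN}; echo "shut down"; exit 0' TERM
  while :; do sleep 0.1; done   # pretend to serve requests
) &
PID=$!
sleep 0.3          # give the "worker" time to start
kill -TERM ${PID}  # ask it to quit, as supervisord would
wait ${PID}
```

In the real setup the equivalent trigger would be sending SIGTERM to supervisord (whose pidfile is configured above), which then forwards the signal to the whole group.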

kevinkreiser marked this pull request as ready for review March 2, 2021 16:25
@kevinkreiser (Member Author):

The build is failing because I need to push a tag to update the docker build image. Oops 😄

@kevinkreiser (Member Author):

Actually it looks like I was wrong about that; it was just a timing thing. I ssh'd to one of the CI jobs, and on that run the new version of prime_server was there.

@kevinkreiser (Member Author):

OSX was failing because I hadn't updated its dependency yet. I've just done that and will restart it: valhalla/homebrew-valhalla@f5a0348


namespace valhalla {
namespace loki {
void loki_worker_t::status(Api&) const {
Contributor:

Does this not need to return something like odin::status? I guess it's not clear to me why loki & thor don't return (even a dummy) status.

Member Author:

This is just me adding the endpoint. As there is a lot of room for discussion as to what should be in the endpoint, I decided to just leave it blank. I suspect that what we want to do is change the proto to have a spot for a status object; then each piece of the pipeline can add information about itself. Loki can mention something about its status (maybe some statistics about requests it's handled?) and Thor and Odin can do the same. There was such a big question mark as to what should actually be in the status that I just wanted to add the hook here and then discuss further what should be in it elsewhere, in a separate PR. In fact, maybe I shouldn't mark that issue as fixed by this until we at least put some status-y info into the return.

Anyway the design, once we have something to fill out in the proto, will be that each of these will modify the Api object to add its status information and then at the very end Odin will serialize it to json. I hope this makes sense!

Comment on lines +157 to +158
'drain_seconds': 28,
'shutdown_seconds': 1
@kevinkreiser (Member Author) Mar 3, 2021:

Two new options added to the default config generation.

@@ -158,6 +158,7 @@ message Options {
transit_available = 9;
expansion = 10;
centroid = 11;
status = 12;
Member Author:

For now we only have the action. In a future PR we should make a status message and fill it with fields that we expect to populate in loki/thor/odin.

// should react by draining traffic (though they are likely doing this as they are usually the ones
// who sent us the request to shutdown)
if (prime_server::draining() || prime_server::shutting_down()) {
throw valhalla_exception_t{102};
Member Author:

If any process detects that it's in the process of shutting down, we throw this error, which turns into an HTTP 503 with a message about the server shutting down.

Comment on lines +119 to +121
auto* avoid_shortcut = co->add_avoid_edges();
avoid_shortcut->set_id(shortcut);
avoid_shortcut->set_percent_along(0);
Member Author:

This variable was shadowed, so I fixed that.

@@ -318,6 +322,10 @@ loki_worker_t::work(const std::list<zmq::message_t>& job,
}

void run_service(const boost::property_tree::ptree& config) {
// gracefully shutdown when asked via SIGTERM
prime_server::quiesce(config.get<unsigned int>("httpd.service.drain_seconds", 28),
Member Author:

The loki service will listen for SIGTERM and act accordingly.

return to_response(status(request), info, request);
}
default: {
// narrate them and serialize them along
Member Author:

This code cov is nuts! Literally all the tests would fail if it didn't get in here.

Member Author:

Oh, actually this is the service and not the actor, so scratch that; this code cov is correct 😄 It's because we don't actually run the service against actual data in unit tests. We totally could and should though, against Utrecht. I'll add an issue for that, and we can coerce the loki_service test into more of an integration test.

@@ -40,7 +40,7 @@ if(ENABLE_DATA_TOOLS)
endif()

if(ENABLE_SERVICES)
list(APPEND tests loki_service skadi_service thor_service)
Member Author:

The thor service test got disabled when we switched to completely using protobuf between the service layers.

@@ -14,13 +14,16 @@

Member Author:

I added tests for both valhalla and osrm formats for the status endpoint, and I copy-pasted the tests from thor service into here for posterity. At some point we can test those too, when we refactor this into an integration service test.

Comment on lines +448 to +451
// proxies and workers
STAGE(loki);
STAGE(thor);
STAGE(odin);
Member Author:

Since the status request goes all the way through the service instead of just loki, we have to actually run the other stages.

mandeepsandhu (Contributor) previously approved these changes Mar 3, 2021 and left a comment:
LGTM. Thanks for adding this feature!!
