
Add a Status Endpoint and Gracefully Shutdown #2907

Merged: 13 commits, Mar 3, 2021
Conversation

kevinkreiser (Member) commented Mar 1, 2021:

This is a first stab at adding the skeleton of a status endpoint to the service. The idea is that we discuss over in #2736 what should actually be in there, and then a future PR can add that info.

This is targeting new prime_server functionality summarized in #2908.

First we hook up to prime_server's new SIGTERM handling, which allows our daemon processes to gracefully exit when they are asked to shut down. The shutdown happens in two phases: the drain phase and the shutdown phase. The drain phase is meant to allow traffic to drain off (finish outstanding requests), while the shutdown phase is meant for our worker threads to stop their event loops. Both phases have a configurable amount of time to allow for these operations to happen. In practice an upstream load balancer will hopefully stop sending new requests immediately, so really we just need to set the drain time based on the p99 response time we expect for an outstanding request. The shutdown time can be really small, so we set it to the smallest value, 1 second, by default.
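For reference, a minimal sketch of the relevant keys in valhalla.json (the key names match what the service reads; 28 and 1 are the defaults this PR adds to config generation, and all other keys are omitted here):

```json
{
  "httpd": {
    "service": {
      "drain_seconds": 28,
      "shutdown_seconds": 1
    }
  }
}
```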

We also add a /status endpoint (currently empty json) where we can put very basic status information in the future. This is useful to check that the workers have actually loaded and can take traffic. At the moment I haven't configured it to need to go from loki->thor->odin, but I probably should, just to prove the whole system is up. Consider that a TODO, and possibly also useful if we want to add information that only each of those modules is privy to. This is now done. The important part about the /status endpoint, though, is that if you opted into the SIGTERM handling above and you are either in a draining state or a shutting-down state, the endpoint will return HTTP 503. So basically what happens is: someone sends SIGTERM to our processes, and they begin to drain traffic (finish up outstanding requests). While that is happening, new status requests will return 503, i.e. "hey buddy, you told me to shut down, I don't want any new requests, I'm draining the ones I had when you told me to quit."
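As a hedged sketch of how a deployment might react to the endpoint's two responses (the /status path and the 200/503 semantics come from this PR; `status_to_state`, the port 8002, and the curl invocation are illustrative assumptions, not part of the change):

```shell
# Hypothetical helper mapping the /status HTTP code to a coarse state.
# 200/503 semantics come from this PR; everything else is illustrative.
status_to_state() {
  case "$1" in
    200) echo "healthy" ;;     # workers loaded and taking traffic
    503) echo "draining" ;;    # SIGTERM received, finishing outstanding requests
    *)   echo "unreachable" ;; # service not up yet (or mid-restart)
  esac
}

# In a real probe the code would come from something like:
#   code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8002/status)
status_to_state 200   # prints: healthy
status_to_state 503   # prints: draining
```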

Because the status check goes through all of the workers, it's harder to unit test. I'm going to make a new unit test that runs the whole service to confirm this behavior.

One common way to run valhalla is in multiprocess mode using a program like supervisord. Doing so has many benefits, as it protects you from crashes that wipe out a whole machine, for example. The changes here make it so that valhalla plays nice with supervisord when it requests that valhalla shut down. Here is a minimal example of running valhalla with supervisord. Pay attention to the inline comment in the file, as it's very important.

First make some environment variables that supervisor can use:

# we need these because we don't have a way for prime_server and supervisord to read valhalla configs directly
export CONFIG_FILE=../valhalla.json
export PRIME_LISTEN=$(jq -r ".httpd.service.listen" ${CONFIG_FILE})
export PRIME_PROXY=$(jq -r ".loki.service.proxy" ${CONFIG_FILE})_in
export PRIME_LOOPBACK=$(jq -r ".httpd.service.loopback" ${CONFIG_FILE})
export PRIME_INTERRUPT=$(jq -r ".httpd.service.interrupt" ${CONFIG_FILE})
export LOKI_PROXY_IN=$(jq -r ".loki.service.proxy" ${CONFIG_FILE})_in
export LOKI_PROXY_OUT=$(jq -r ".loki.service.proxy" ${CONFIG_FILE})_out
export ODIN_PROXY_IN=$(jq -r ".odin.service.proxy" ${CONFIG_FILE})_in
export ODIN_PROXY_OUT=$(jq -r ".odin.service.proxy" ${CONFIG_FILE})_out
export THOR_PROXY_IN=$(jq -r ".thor.service.proxy" ${CONFIG_FILE})_in
export THOR_PROXY_OUT=$(jq -r ".thor.service.proxy" ${CONFIG_FILE})_out
export DRAIN=$(jq -r ".httpd.service.drain_seconds" ${CONFIG_FILE})
export SHUTDOWN=$(jq -r ".httpd.service.shutdown_seconds" ${CONFIG_FILE})
export WORKER_PARALLELISM=2
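To sanity-check the jq extraction, here is a sketch against a tiny made-up config (requires jq; the endpoint values are invented, and only the key paths match the real exports above):

```shell
# Made-up minimal config exercising the same key paths the exports read.
cat > /tmp/valhalla_sample.json <<'EOF'
{
  "httpd": {"service": {"listen": "tcp://*:8002", "drain_seconds": 28, "shutdown_seconds": 1}},
  "loki": {"service": {"proxy": "ipc:///tmp/loki"}}
}
EOF

CONFIG_FILE=/tmp/valhalla_sample.json
PRIME_LISTEN=$(jq -r ".httpd.service.listen" ${CONFIG_FILE})
LOKI_PROXY_IN=$(jq -r ".loki.service.proxy" ${CONFIG_FILE})_in
DRAIN=$(jq -r ".httpd.service.drain_seconds" ${CONFIG_FILE})

echo "${PRIME_LISTEN} ${LOKI_PROXY_IN} ${DRAIN}"
# prints: tcp://*:8002 ipc:///tmp/loki_in 28
```

Note the `_in`/`_out` suffixing: the config stores one proxy endpoint name, and the suffix selects which side of that proxy a process binds to.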

Then make your supervisor config like so:

[supervisord]
nodaemon=true
loglevel=debug
pidfile=/tmp/supervisord.pid

[program:prime_httpd]
command=prime_httpd %(ENV_PRIME_LISTEN)s %(ENV_PRIME_PROXY)s %(ENV_PRIME_LOOPBACK)s %(ENV_PRIME_INTERRUPT)s true 10485760 '-1' %(ENV_DRAIN)s,%(ENV_SHUTDOWN)s /alive

[program:loki_proxy]
command=prime_proxyd %(ENV_LOKI_PROXY_IN)s %(ENV_LOKI_PROXY_OUT)s %(ENV_DRAIN)s,%(ENV_SHUTDOWN)s

[program:thor_proxy]
command=prime_proxyd %(ENV_THOR_PROXY_IN)s %(ENV_THOR_PROXY_OUT)s %(ENV_DRAIN)s,%(ENV_SHUTDOWN)s

[program:odin_proxy]
command=prime_proxyd %(ENV_ODIN_PROXY_IN)s %(ENV_ODIN_PROXY_OUT)s %(ENV_DRAIN)s,%(ENV_SHUTDOWN)s

[program:loki_worker]
environment=LOCPATH=/usr/local/locales
command=valhalla_loki_worker %(ENV_CONFIG_FILE)s
numprocs=%(ENV_WORKER_PARALLELISM)s
process_name=loki_worker_%(process_num)02d

[program:thor_worker]
environment=LOCPATH=/usr/local/locales
command=valhalla_thor_worker %(ENV_CONFIG_FILE)s
numprocs=%(ENV_WORKER_PARALLELISM)s
process_name=thor_worker_%(process_num)02d

[program:odin_worker]
environment=LOCPATH=/usr/local/locales
command=valhalla_odin_worker %(ENV_CONFIG_FILE)s
numprocs=%(ENV_WORKER_PARALLELISM)s
process_name=odin_worker_%(process_num)02d

# there is a feature/bug in supervisord: https://github.com/Supervisor/supervisor/issues/723#issuecomment-788980891
# when you send SIGTERM to supervisord, it forwards the signal to each program in a random order but not async:
# it sends it to one program, waits for it to exit, then does the next one, and so on. because of that, the
# /status endpoint won't start returning 503 unless you put all of the workers in a group, so that all of
# them get the signal at once and start shutting down.
[group:service]
programs=prime_httpd,loki_proxy,thor_proxy,odin_proxy,loki_worker,thor_worker,odin_worker

Then you can test your whole setup by running:

supervisord -nc supervisord.conf
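To see the drain idea in isolation, here is a toy shell sketch (not valhalla code, purely an illustration of the two-phase idea): a background job that, on SIGTERM, keeps "serving" for a drain period before exiting, the way the workers here finish outstanding requests before stopping their event loops:

```shell
# Toy illustration only (not valhalla code): a "worker" that keeps serving for a
# drain period after SIGTERM before exiting, mirroring the two-phase shutdown.
DRAIN=1   # seconds to drain after SIGTERM (cf. drain_seconds above)
(
  trap 'echo "draining for ${DRAIN}s"; sleep ${DRAIN}; echo "shut down"; exit 0' TERM
  while :; do sleep 0.1; done   # pretend to serve requests
) &
PID=$!
sleep 0.3          # give the "worker" time to start
kill -TERM ${PID}  # ask it to quit, as supervisord would
wait ${PID}
```

In the real setup the equivalent trigger would be sending SIGTERM to supervisord (whose pidfile is configured above), which then forwards the signal to the whole group.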

kevinkreiser marked this pull request as ready for review March 2, 2021 16:25
@kevinkreiser (Member Author):

The build is failing because I need to push a tag to update the docker build image. Oops 😄

@kevinkreiser (Member Author):

Actually it looks like I was wrong about that; it was just a timing thing. I ssh'd to one of the CI jobs, and on that run the new version of prime_server was there.

@kevinkreiser (Member Author):

OSX was failing because I hadn't updated its dependency yet. I've just done that and will restart it: valhalla/homebrew-valhalla@f5a0348


namespace valhalla {
namespace loki {
void loki_worker_t::status(Api&) const {
Contributor:

Does this not need to return something like odin::status? I guess it's not clear to me why loki & thor don't return (even a dummy) status.

Member Author:

This is just me adding the endpoint. As there is a lot of room for discussion as to what should be in the endpoint, I decided to just leave it blank. I suspect that what we want to do is change the proto to have a spot for a status object; then each piece of the pipeline can add information about itself. Loki can mention something about its status (maybe some statistics about requests it's handled?) and Thor and Odin can do the same. There was such a big question mark as to what should actually be in the status that I just wanted to add the hook here and then discuss further what should be in it elsewhere, in a separate PR. In fact, maybe I shouldn't mark that issue as fixed by this until we at least put some status-y info into the return.

Anyway the design, once we have something to fill out in the proto, will be that each of these will modify the Api object to add its status information and then at the very end Odin will serialize it to json. I hope this makes sense!

Comment on lines +157 to +158
'drain_seconds': 28,
'shutdown_seconds': 1
@kevinkreiser (Member Author) Mar 3, 2021:

Two new options added to the default config generation.

@@ -158,6 +158,7 @@ message Options {
transit_available = 9;
expansion = 10;
centroid = 11;
status = 12;
Member Author:

For now we only have the action. In a future PR we should make a status message and fill it with fields that we expect to populate in loki/thor/odin.

// should react by draining traffic (though they are likely doing this as they are usually the ones
// who sent us the request to shutdown)
if (prime_server::draining() || prime_server::shutting_down()) {
throw valhalla_exception_t{102};
Member Author:

If any process detects that it's in the process of shutting down, we throw this error, which turns into an HTTP 503 with a message about the server shutting down.

Comment on lines +119 to +121
auto* avoid_shortcut = co->add_avoid_edges();
avoid_shortcut->set_id(shortcut);
avoid_shortcut->set_percent_along(0);
Member Author:

This variable was shadowed, so I fixed that.

@@ -318,6 +322,10 @@ loki_worker_t::work(const std::list<zmq::message_t>& job,
}

void run_service(const boost::property_tree::ptree& config) {
// gracefully shutdown when asked via SIGTERM
prime_server::quiesce(config.get<unsigned int>("httpd.service.drain_seconds", 28),
Member Author:

The loki service will listen for SIGTERM and act accordingly.

return to_response(status(request), info, request);
}
default: {
// narrate them and serialize them along
Member Author:

This code cov is nuts! Literally all the tests would fail if it didn't get in here.

Member Author:

Oh, actually this is the service and not the actor, so scratch that; this code cov is correct 😄 It's because we don't actually run the service against actual data in unit tests. We totally could and should though, against Utrecht. I'll add an issue for that, and we can coerce the loki_service test into more of an integration test.

@@ -40,7 +40,7 @@ if(ENABLE_DATA_TOOLS)
endif()

if(ENABLE_SERVICES)
list(APPEND tests loki_service skadi_service thor_service)
Member Author:

The thor service test got disabled when we switched to completely using protobuf between the service layers.

@@ -14,13 +14,16 @@

Member Author:

I added tests for both valhalla and osrm formats for the status endpoint, and I copy-pasted the tests from thor service into here for posterity. At some point we can test those too, when we refactor this into an integration service test.

Comment on lines +448 to +451
// proxies and workers
STAGE(loki);
STAGE(thor);
STAGE(odin);
Member Author:

Since the status request goes all the way through the service instead of just loki, we have to actually run the other stages.

mandeepsandhu (Contributor) previously approved these changes Mar 3, 2021 and left a comment:
LGTM. Thanks for adding this feature!!
