Add a Status Endpoint and Gracefully Shutdown #2907
Conversation
Build is failing because I need to tag to update the docker build image. Oops 😄
Actually it looks like I was wrong about that, it was just a timing thing. I SSH'd to one of the CI jobs and on that run the new version of prime_server was there.
OSX was failing because I didn't update its dependency yet. I've just done that and will restart it: valhalla/homebrew-valhalla@f5a0348
namespace valhalla {
namespace loki {
void loki_worker_t::status(Api&) const {
Does this not need to return something like odin::status? I guess it's not clear to me why loki & thor don't return (even a dummy) status.
This is just me adding the endpoint. Since there is a lot of room for discussion about what should be in the endpoint, I decided to just leave it blank. I suspect that what we want to do is change the proto to have a spot for a status object; then each piece of the pipeline can add information about itself. Loki can mention something about its status (maybe some statistics about requests it has handled?) and Thor and Odin can do the same. What should actually be in the status felt like such a big question mark that I just wanted to add the hook here, and then discuss what belongs in the status elsewhere, in a separate PR. In fact, maybe I shouldn't mark that issue as fixed by this until we at least put some status-y info into the return.
Anyway, the design, once we have something to fill out in the proto, will be that each of these workers modifies the Api object to add its status information, and at the very end Odin serializes it to JSON. I hope this makes sense!
'drain_seconds': 28,
'shutdown_seconds': 1
Two new options added to the default config generation.
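For context, a sketch of where those two keys land in the generated config. The `httpd.service.drain_seconds` path appears in the quiesce call elsewhere in this PR; placing `shutdown_seconds` alongside it is an assumption by analogy:

```json
{
  "httpd": {
    "service": {
      "drain_seconds": 28,
      "shutdown_seconds": 1
    }
  }
}
```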
@@ -158,6 +158,7 @@ message Options {
transit_available = 9;
expansion = 10;
centroid = 11;
status = 12;
For now we only have the action. In a future PR we should make a status message and fill it with fields that we expect to populate in loki/thor/odin.
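A purely hypothetical sketch of what that future message might look like; none of these field names are in this PR, they just illustrate the kind of per-worker info discussed above:

```proto
// hypothetical -- not part of this PR; field names are illustrative only
message Status {
  optional bool has_tiles = 1;        // e.g. loki could report whether tiles are loaded
  optional uint64 requests_handled = 2; // e.g. per-worker request statistics
}
```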
// should react by draining traffic (though they are likely doing this as they are usually the ones
// who sent us the request to shutdown)
if (prime_server::draining() || prime_server::shutting_down()) {
  throw valhalla_exception_t{102};
If any process detects that it's in the process of shutting down, we throw this error, which turns into an HTTP 503 with a message about the server shutting down.
auto* avoid_shortcut = co->add_avoid_edges();
avoid_shortcut->set_id(shortcut);
avoid_shortcut->set_percent_along(0);
This variable was shadowed, so I fixed that.
@@ -318,6 +322,10 @@ loki_worker_t::work(const std::list<zmq::message_t>& job,
}

void run_service(const boost::property_tree::ptree& config) {
// gracefully shutdown when asked via SIGTERM
prime_server::quiesce(config.get<unsigned int>("httpd.service.drain_seconds", 28),
The loki service will listen for SIGTERM and act accordingly.
return to_response(status(request), info, request);
}
default: {
// narrate them and serialize them along
This code cov is nuts! Literally all the tests would fail if it didn't get in here.
Oh, actually this is the service and not the actor, so scratch that, this code cov is correct 😄 It's because we don't actually run the service against actual data in unit tests. We totally could and should, though, against Utrecht. I'll add an issue for that and we can coerce the loki_service test into more of an integration test.
@@ -40,7 +40,7 @@ if(ENABLE_DATA_TOOLS)
endif()

if(ENABLE_SERVICES)
list(APPEND tests loki_service skadi_service thor_service)
The thor service test got disabled when we switched to using protobuf exclusively between the service layers.
@@ -14,13 +14,16 @@
I added tests for both the valhalla and osrm formats for the status endpoint, and I copy-pasted the tests from the thor service in here for posterity. At some point we can test those too, when we refactor this into an integration service test.
// proxies and workers
STAGE(loki);
STAGE(thor);
STAGE(odin);
since the status request goes all the way through the service instead of just loki we have to actually run the other stages
LGTM. Thanks for adding this feature!!
This is a first stab at adding the skeleton of a status endpoint to the service. The idea is that we discuss over in #2736 what should actually be in there, and then a future PR can add that info.
This is targeting new prime_server functionality summarized in #2908.
First, we hook up to prime_server's new SIGTERM handling, which allows our daemon processes to gracefully exit when they are asked to shut down. The shutdown happens in two phases: the drain phase and the shutdown phase. The drain phase is meant to allow traffic to drain off (finish outstanding requests), while the shutdown phase is meant for our worker threads to stop their event loops. Both phases have a configurable amount of time to allow for these operations to happen. In practice, an upstream load balancer will hopefully stop new requests immediately, so really we just need to set the drain time based on how long we expect the p99 response time of an outstanding request to be. The shutdown time can be very small, so we set it to the smallest value of 1 second by default.
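As a toy model of the two-phase timing described above (plain Python, not Valhalla code), using the default drain_seconds=28 and shutdown_seconds=1:

```python
# Toy model (not Valhalla code) of the two-phase graceful shutdown.
def shutdown_phase(elapsed, drain_seconds=28, shutdown_seconds=1):
    """Which phase a worker is in `elapsed` seconds after receiving SIGTERM."""
    if elapsed < drain_seconds:
        # outstanding requests finish; new /status requests get HTTP 503
        return "draining"
    if elapsed < drain_seconds + shutdown_seconds:
        # worker threads stop their event loops
        return "shutting_down"
    return "exited"

print(shutdown_phase(10))  # a request arriving mid-drain sees "draining"
```

The point of the model is just that the drain window should cover your slowest in-flight request, after which only a tiny shutdown window is needed.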
We also add a /status endpoint (currently empty JSON) where we can put very basic status information in the future. This is useful to check that the workers have actually loaded and can take traffic. At the moment I haven't configured it to go from loki->thor->odin, but I probably should, just to prove the whole system is up. Consider that a TODO, and possibly also useful if we want to add information that only each of those modules is privy to. This is now done.
The important part about the /status endpoint, though, is that if you opted into the SIGTERM handling above and the service is either in a draining state or a shutting-down state, the endpoint will return HTTP 503. So basically what happens is: someone sends SIGTERM to our processes, and they begin to drain traffic (finish up outstanding requests). While that is happening, new status requests will return 503, i.e. "hey buddy, you told me to shut down, I don't want any new requests, I'm draining the ones I had when you told me to quit." Because the status check goes through all the workers, it's harder to unit test. I'm going to make a new unit test that runs the whole service to confirm this behavior.
One common way to run valhalla is in multiprocess mode using a program like supervisord. Doing so has many benefits, as it protects you from crashes that wipe out a whole machine, for example. The changes here make it so that valhalla plays nicely with supervisord when it requests that valhalla shut down. Here is a minimal example of running valhalla with supervisord. Pay attention to the inline comment in the file, as it's very important. First, make some environment variables that supervisor can use:
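As a purely hypothetical stand-in (the variable name and path here are invented for illustration, adjust them to your own layout):

```sh
# hypothetical -- point this at your own valhalla config file
export VALHALLA_CONFIG=/data/valhalla/valhalla.json
```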
Then make your supervisor config like so:
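A minimal hypothetical program section might look like the following. The program name, command, and paths are invented for illustration; the key idea, per the description above, is that supervisord's stop signal is SIGTERM and that supervisord waits longer than drain_seconds + shutdown_seconds before force-killing:

```ini
; hypothetical -- adapt the program name, command, and paths to your install
[program:valhalla]
command=valhalla_service /data/valhalla/valhalla.json 1
stopsignal=TERM      ; SIGTERM triggers prime_server's graceful quiesce
stopwaitsecs=30      ; must exceed drain_seconds + shutdown_seconds (28 + 1)
autorestart=true
```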
Then you can test your whole setup by doing: