Move index updates off the web server #1588
This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times, since our multiple web servers can race with each other), we instead queue the update to be run on another machine at some point in the future.
This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.
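To make that concrete, here is a minimal sketch (not this PR's actual code) of what enqueueing looks like when the queue lives in the same database as the rest of the publish, written against Diesel 1.x-style APIs; the `background_jobs` table, its columns, and the job name are illustrative assumptions:

```rust
// A minimal sketch, assuming a `background_jobs` table with `job_type` and
// `data` columns; none of these names are confirmed by this PR.
use diesel::prelude::*;
use diesel::sql_types::Text;

fn publish(conn: &PgConnection, krate: &str, version: &str) -> QueryResult<()> {
    conn.transaction(|| {
        // 1. Insert the crate/version metadata as before (elided here).

        // 2. Record a job for the background worker to run later. Because this
        //    INSERT is part of the same transaction, the job exists if and only
        //    if the rest of the publish committed.
        diesel::sql_query(
            "INSERT INTO background_jobs (job_type, data) \
             VALUES ('add_crate', json_build_object('crate', $1, 'version', $2))",
        )
        .bind::<Text, _>(krate)
        .bind::<Text, _>(version)
        .execute(conn)?;

        Ok(())
    })
}
```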
This setup assumes that all jobs must complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production which is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call with extremely low thresholds (it defaults to paging if a job has been in the queue for 15 minutes, configurable by env var). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku Scheduler for us).
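As a hypothetical sketch of the monitor's core check (the env var name and the helper names are illustrative; only the 15 minute default comes from the description above):

```rust
// Hypothetical sketch; MONITOR_MAX_QUEUE_TIME and the function names are
// assumptions, not this PR's actual configuration.
use std::env;
use std::time::Duration;

fn paging_threshold() -> Duration {
    let minutes: u64 = env::var("MONITOR_MAX_QUEUE_TIME")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(15);
    Duration::from_secs(minutes * 60)
}

fn should_page(oldest_job_age: Duration) -> bool {
    // The monitor looks up the oldest unfinished job in the queue and pages
    // whoever is on call if it has been waiting longer than the threshold.
    oldest_job_age >= paging_threshold()
}
```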
One side effect of this change is that
As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs something like RabbitMQ or Faktory. The first is operational. We still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and making it easy to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low level bindings to other queues, but the majority of the "job" infrastructure would still be needed.
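For a sense of what the Postgres-backed approach involves, a runner claiming a job via row locking might look roughly like the sketch below; it is written against Diesel 1.x-style raw SQL under assumed table and column names, not this PR's code:

```rust
// A sketch of claiming a job with PostgreSQL row locking; table and column
// names are assumptions.
use diesel::prelude::*;
use diesel::sql_types::{BigInt, Text};

#[derive(QueryableByName)]
struct Job {
    #[sql_type = "BigInt"]
    id: i64,
    #[sql_type = "Text"]
    job_type: String,
}

fn lock_next_job(conn: &PgConnection) -> QueryResult<Option<Job>> {
    // FOR UPDATE SKIP LOCKED lets several runners poll the same table without
    // blocking on or double-claiming each other's rows; the row stays locked
    // until the surrounding transaction commits (success) or rolls back (retry).
    diesel::sql_query(
        "SELECT id, job_type FROM background_jobs \
         ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED",
    )
    .get_result(conn)
    .optional()
}
```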
The queue code is intended to eventually be extracted to a library. This portion of the code is the
jtgeibel left a comment
You'll want to add
I see background_worker in
On each index operation, the background job clones the index from GitHub into a temporary directory. I'm a bit worried about this from two perspectives. GitHub might throttle us if we clone too frequently, and I'm not sure how much space Heroku provides within the temporary directory as we accumulate many checkouts in a 24 hour period. We could query the GitHub API (
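For reference, the clone-per-job pattern being discussed looks roughly like this sketch (assuming the `tempfile` and `git2` crates; the actual job code may differ):

```rust
// A rough sketch under the assumptions above, not this PR's implementation.
use git2::Repository;
use tempfile::TempDir;

fn checkout_index(index_url: &str) -> Result<(TempDir, Repository), Box<dyn std::error::Error>> {
    // The TempDir guard deletes the checkout when it is dropped, which is what
    // bounds disk usage between jobs.
    let dir = TempDir::new()?;
    let repo = Repository::clone(index_url, dir.path())?;
    Ok((dir, repo))
}
```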
Finally, the retry semantics allow for jobs to be executed out of order. The most common source of errors will be network connectivity to GitHub. After an outage, new requests will complete successfully while old jobs are waiting for their retry time to expire. If someone yanks a crate during an outage and then unyanks it a few minutes later after connectivity recovers, the index operations will complete out of order and the index state (yanked) will not match the database (unyanked). I think there needs to be some mechanism to guarantee that operations on the same
It gets run by any cron-like tool, which is Heroku scheduler for us. I figure we'll start at every 5 minutes, and tweak if needed. It is not configured in the repository.
I disagree that throttling from cloning is going to be a concern, but you make a good point on disk space (I'd assume that
This is intentional, but that's a good point that yanking/unyanking quickly can cause the index to not match the database. I need to give this some more thought. I don't think I agree that we need to guarantee in-order execution for operations affecting
@jtgeibel I've updated the yanking logic to handle database updates in the background job instead of on the web server.
I have not changed the clone into a temp dir behavior, and I don't think we need to in this PR. The concern about disk space is actually already handled. We're using
I do want to change this to reuse the same checkout over multiple runs for other reasons, but I think there are enough tradeoffs to discuss there that it should be its own PR. The concern about disk usage is resolved, so I don't think it needs to block this PR on its own.
So I believe this is good to go if you want to give it a last look. If it is good to merge, I'd appreciate a comment letting me know but holding off on merging for now. I consider this a pretty high risk PR so I want to make sure it's deployed by itself when I have plenty of time to do some final checks on staging and have overridden myself to be on call.
jtgeibel left a comment
@sgrif the updated yank logic looks good to me. Sorry I missed your ping on this last week.
I thought I looked into the drop behavior of
I'll let you merge this when you're ready to deploy.
The delay for publishes was too long when cloning the full index on every job, so I've set the runner up to maintain a local checkout. This eliminates the delay once the runner has booted, but there may still be a delay if a crate is published immediately after the server boots, since the web server now starts instantly rather than waiting for a clone to complete.
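Reusing a persistent checkout might look roughly like the sketch below (assuming the `git2` crate; the remote and branch names and the open-or-clone split are assumptions, not necessarily how this PR implements it):

```rust
// A sketch under the assumptions above, not this PR's actual code.
use std::path::Path;

use git2::{Repository, ResetType};

fn refresh(repo: &Repository) -> Result<(), git2::Error> {
    // Fetch the remote master branch and hard-reset the local checkout to it,
    // so the next job starts from the current index state.
    repo.find_remote("origin")?
        .fetch(&["refs/heads/master:refs/remotes/origin/master"], None, None)?;
    let head = repo.refname_to_id("refs/remotes/origin/master")?;
    let object = repo.find_object(head, None)?;
    repo.reset(&object, ResetType::Hard, None)
}

fn open_or_clone(path: &Path, url: &str) -> Result<Repository, git2::Error> {
    match Repository::open(path) {
        // Reuse the existing checkout; refreshing is much cheaper than cloning
        // the whole index again.
        Ok(repo) => {
            refresh(&repo)?;
            Ok(repo)
        }
        Err(_) => Repository::clone(url, path),
    }
}
```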