[DO NOT MERGE] WIP background job framework #1466
This PR exists to make it easier to review the "guts" of this code as it gets worked out, before we actually add the background jobs to fix #1384.
Since this was last reviewed I have fixed the race condition on failed jobs, and added some bare-bones tests around the queueing infrastructure. I think this code is ready to go, but I wanted to get some more eyes on it before I submit the final PR that moves the git logic over here.
Note: This is intended to possibly be extracted into a library, so the docs are written as if this were its own library. Cargo-specific code (besides the use of `CargoResult`) will go in its own module for the same reason.

This adds an MVP background queueing system intended to be used in place of the "try to commit 20 times" loop in `git.rs`. This is a fairly simple queue, intended to be "the easiest thing that fits our needs, with the least operational impact". There are a few reasons I've opted to go with our own queueing system here, rather than an existing solution like Faktory or Beanstalkd:

- We'd have to write the majority of this code ourselves no matter what.
- Client libraries for Beanstalkd don't deal with the actual job logic; only `storage.rs` would get replaced.
- The only Faktory client that exists made some really odd API choices, and Faktory itself hasn't seen a lot of use in the wild yet.
- I want to limit the number of services we have to manage. We have extremely limited ops bandwidth today, and every new part of the stack we have to manage is a huge cost. Right now we only have our server and PG, and I'd like to keep it that way for as long as possible.

This system takes advantage of the `SKIP LOCKED` feature introduced in PostgreSQL 9.5 to handle all of the hard stuff for us. We use PG's row locking to treat a row as "currently being processed", which means we don't have to worry about returning a job to the queue if the power goes out on one of our workers (see the sketch below).

This queue is intended only for jobs with "at least once" semantics, which means the entire job has to be idempotent. If the job completes successfully but the power goes out before we commit the transaction, we will run the whole thing again.

The code today also makes a few additional assumptions based on our current needs. We expect all jobs to complete successfully on the first attempt, and the most likely reason a job would fail is an incident happening at GitHub, hence the extremely high retry timeout. I'm also assuming that all jobs will eventually complete, and that any job failing N (likely 5) times is an event that should page whoever is on call. (Paging is not part of this PR.)

Finally, it's unlikely that this queue will be appropriate for high-throughput use cases, since it requires one PG connection per worker (a real connection; adding PgBouncer wouldn't help here). Right now our only background work comes in on average every 5 minutes, but if we start moving more code to run here we may want to revisit this in the future.
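For reviewers unfamiliar with `SKIP LOCKED`, here is a minimal sketch of the fetch-and-lock pattern the queue relies on, written against the `postgres` crate. The `background_jobs` table and its columns are illustrative placeholders, not necessarily the schema in this PR:

```rust
use postgres::{Client, Error};

/// Claim and run at most one job. The row lock taken by FOR UPDATE
/// doubles as the "currently being processed" marker: if the worker
/// dies, its transaction aborts and the row becomes claimable again.
fn run_next_job(client: &mut Client) -> Result<(), Error> {
    let mut tx = client.transaction()?;

    // SKIP LOCKED makes concurrent workers skip rows that another
    // worker has already locked, instead of blocking on them.
    let row = tx.query_opt(
        "SELECT id, job_type, data FROM background_jobs \
         ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED",
        &[],
    )?;

    if let Some(row) = row {
        let id: i64 = row.get("id");
        // ... deserialize `data` and run the job here ...

        // Deleting inside the same transaction is what gives the
        // "at least once" semantics described above: a crash before
        // commit leaves the row in place, so the job runs again.
        tx.execute("DELETE FROM background_jobs WHERE id = $1", &[&id])?;
    }

    tx.commit()
}
```

Because the lock lives exactly as long as the transaction, there is no separate "in progress" state to clean up after a crashed worker.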
I've updated this with what I think will be the final queueing code (I will be working on moving one of the git functions over to this tomorrow). The main change I've made here is not passing the runner's connection to the job, which means we no longer have to juggle two transactions. I've also set us up to catch panics from running jobs.
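As a sketch of what the panic handling can look like (the helper name and error type below are made up for illustration, not this PR's actual API), `std::panic::catch_unwind` converts a panicking job into an ordinary error the runner can record as a failure:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Run a job, converting a panic into an Err so one misbehaving job
/// can't take down the whole worker. Illustrative signature only.
fn run_job_guarded<F>(job: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    // AssertUnwindSafe is needed because arbitrary closures aren't
    // automatically UnwindSafe; since the runner discards the job's
    // state on panic, asserting unwind safety is sound here.
    catch_unwind(AssertUnwindSafe(job))
        .unwrap_or_else(|_| Err("job panicked".to_string()))
}
```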