[DO NOT MERGE] WIP background job framework #1466

Open · wants to merge 1 commit into base: master

Conversation

sgrif (Contributor) commented Jul 25, 2018

This PR is here to make it easier to review the "guts" of the code as it gets worked out, before we actually add the background jobs that will fix #1384.

Since this was last reviewed I have fixed the race condition on failed jobs and added some bare-bones tests around the queueing infrastructure. I think this code is ready to go, but I wanted to get some more eyes on it before I submit the final PR to move the git logic over here.

sgrif (Contributor) commented Jul 25, 2018

@carols10cents This was an absolute nightmare to get proper tests for, but I think this should address your concerns?

(Outdated review thread on src/job.rs)
Add a minimal background queueing framework
Note: this may eventually be extracted into a library, so the docs are
written as if it were its own library. Cargo-specific code (besides the
use of `CargoResult`) will go in its own module for the same reason.

This adds an MVP background queueing system intended to be used in place
of the "try to commit 20 times" loop in `git.rs`. It is a fairly simple
queue, intended to be "the easiest thing that fits our needs, with the
least operational impact".

There are a few reasons I've opted to go with our own queueing system here,
rather than an existing solution like Faktory or Beanstalkd.

- We'd have to write the majority of this code ourselves no matter
  what.
  - Client libraries for Beanstalkd don't deal with the actual job
    logic; only `storage.rs` would be replaced.
  - The only Faktory client that exists made some really odd API
    choices. Faktory also hasn't seen a lot of use in the wild yet.
- I want to limit the number of services we have to manage. We have
  extremely limited ops bandwidth today, and every new part of the stack
  we have to manage is a huge cost. Right now we only have our server
  and PG. I'd like to keep it that way for as long as possible.

This system takes advantage of the `SKIP LOCKED` feature in PostgreSQL
9.5 to handle all of the hard stuff for us. We use PG's row locking to
treat a row as "currently being processed", which means we don't have to
worry about returning it to the queue if the power goes out on one of
our workers.
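
To make the row-locking idea concrete, here is a rough sketch of what the
claim query can look like with Diesel. This is not the code in this PR; the
`background_jobs` table, its columns, and `next_unlocked_job` are all made
up for illustration.

```rust
// Hypothetical schema and claim query; assumes Diesel 1.3+ with
// `#[macro_use] extern crate diesel;` at the crate root.
use diesel::prelude::*;

table! {
    background_jobs (id) {
        id -> BigInt,
        job_type -> Text,
        data -> Binary,
    }
}

#[derive(Queryable)]
pub struct Job {
    pub id: i64,
    pub job_type: String,
    pub data: Vec<u8>,
}

/// Lock the next available job. `FOR UPDATE SKIP LOCKED` skips rows that
/// other workers already hold, so the row lock itself is the "currently
/// being processed" marker, and it disappears automatically if the worker
/// dies.
pub fn next_unlocked_job(conn: &PgConnection) -> QueryResult<Job> {
    background_jobs::table
        .order(background_jobs::id)
        .for_update()
        .skip_locked()
        .first(conn)
}
```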

This queue is intended only for jobs with "at least once" semantics.
That means the entire job has to be idempotent. If the entire job
completes successfully, but the power goes out before we commit the
transaction, we will run the whole thing again.
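
Continuing the sketch above (still illustrative, not this PR's code), the
at-least-once behaviour falls out of doing the work and the row deletion
inside one transaction: if the worker dies after the job's side effects but
before the commit, the row is still there and the job runs again.

```rust
/// Run a single job inside one transaction, using `next_unlocked_job` and
/// the hypothetical `background_jobs` table from the sketch above.
pub fn run_one_job(conn: &PgConnection) -> QueryResult<()> {
    conn.transaction(|| {
        let job = next_unlocked_job(conn)?; // row is now locked by us
        perform(&job); // must be idempotent, since it may run more than once
        // Only reached on success. The DELETE becomes visible, and the row
        // lock is released, at COMMIT; a crash before that leaves the row
        // in place for another worker to pick up.
        diesel::delete(background_jobs::table.find(job.id)).execute(conn)?;
        Ok(())
    })
}

/// Stand-in for the actual job body (eventually, the git work).
fn perform(_job: &Job) {}
```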

The code today also makes a few additional assumptions based on our
current needs. We expect all jobs to complete successfully the first
time, and the most likely reason for a job to fail is an incident at
GitHub; hence the extremely high retry timeout.
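
The retry gate could then be nothing more than an extra filter on the claim
query. The sketch below is purely hypothetical: it assumes the jobs table
also carries `retries` and `last_retry` columns, and the one-hour-per-failure
back-off is an arbitrary stand-in for the "extremely high" timeout.

```rust
use diesel::dsl::sql;
use diesel::sql_types::Bool;

/// Like `next_unlocked_job`, but only hands out jobs that have never failed
/// or whose last failure is old enough to be worth retrying.
pub fn next_retriable_job(conn: &PgConnection) -> QueryResult<Job> {
    background_jobs::table
        .filter(sql::<Bool>(
            "retries = 0 OR last_retry < NOW() - INTERVAL '1 hour' * retries",
        ))
        .order(background_jobs::id)
        .for_update()
        .skip_locked()
        .first(conn)
}
```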

I'm also assuming that all jobs will eventually complete, and that any
job failing N (likely 5) times is an event that should page whoever is
on call. (Paging is not part of this PR).

Finally, it's unlikely that this queue will be appropriate for high
throughput use cases, since it requires one PG connection per worker (a
real connection; adding PgBouncer wouldn't help here). Right now the only
background work we run is something that comes in on average every 5
minutes, but if we start moving more code to run here we may want to
revisit this in the future.
sgrif (Contributor) commented Oct 11, 2018

I've updated this with what I think will be the final queueing code (I will be working on moving one of the git functions over to this tomorrow). The main change I've made here is not passing the runner's connection to the job, which means we no longer have to juggle two transactions. I've also set us up to catch panics.
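
For the "catch panics" part, one common shape (whether or not it's exactly
what this branch does) is to wrap the job invocation in
`std::panic::catch_unwind`, so a panicking job is recorded as an ordinary
failure instead of taking the worker down:

```rust
use std::any::Any;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Run a job closure, converting a panic into an error the runner can treat
/// like any other job failure (bump the retry count, log, move on).
fn run_guarded<F: FnOnce()>(job: F) -> Result<(), Box<dyn Any + Send>> {
    // AssertUnwindSafe: in practice the closure captures things (like a DB
    // connection) that aren't formally UnwindSafe; since we abandon the job
    // on panic anyway, that's acceptable here.
    catch_unwind(AssertUnwindSafe(job))
}
```

The returned `Err` would then presumably flow into the same failure-handling
path as a job that returned an error normally.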
