
[DO NOT MERGE] WIP background job framework #1466

Closed
wants to merge 1 commit

Conversation

@sgrif (Contributor) commented Jul 25, 2018

This PR is to make it easier to review the "guts" code as it gets worked out before we actually add the background jobs to fix #1384.

Since this was last reviewed I have fixed the race condition on failed jobs, and added some bare bones tests around the queueing infrastructure. I think this code is ready to go, but wanted to get some more eyes on it before I submit the final PR to move git logic over here.

@sgrif (Contributor, Author) commented Jul 25, 2018

@carols10cents This was an absolute nightmare to get proper tests for, but I think this should address your concerns?

src/job.rs (Outdated)

if let Some(job) = job {
    conn.transaction(|| {
        delete(&job).execute(conn)?;
Member:

If I understand the locking strategy here, the first transaction above ensures that no 2 threads will fetch the same job. Once a job is fetched and the retry info is updated, the query will not fetch that row again for at least 1 minute.

Very shortly after the first transaction is completed (and that row lock released), the delete statement in this transaction will again lock the row for update such that even a job that runs for longer than a minute will not be fetched again by another thread until the row is unlocked.

Is my understanding here correct? Is there an advantage/requirement to do these steps in 2 separate transactions? I ask because it took me a little while to realize that this delete is re-locking the row and that it is important to issue this delete within the transaction before executing the job in the next line to ensure the row is locked while the job runs.

If it somehow took more than a minute for a thread to progress from line 72 to 76, and another thread fetched the same job, then I believe one of them would block on this delete (or maybe abort due to a detected deadlock). If the other transaction completed successfully (deleting the row), then this would return an error. If the other transaction failed, then the row would still exist and be unlocked, so the other thread would immediately retry the job. I don't think this is a realistic failure mode, but I think a single execution of the job would still be guaranteed under this extreme scenario.

sgrif (Contributor, Author):

Is my understanding here correct?

Yes

Is there an advantage/requirement to do these steps in 2 separate transactions?

It's a requirement. We cannot update the retry counter in the same transaction as the job itself, since if the job fails, the transaction may be aborted, meaning we cannot commit those changes.

If somehow it took more than a minute for a thread to progress from line 72 to 76

We will hit a TCP timeout before a minute passes. The only way it could take more than a minute to get from line 72 to 76 is if the OS has actually pre-empted our process and we receive no CPU resources for a full minute, which is not realistic.

But yes, you're correct that this behavior relies on less than a minute passing between the first transaction committing and the second transaction locking the row.

and another thread fetched the same job, then I believe one of them would block on this delete (or maybe abort due to a detected deadlock).

Yeah, it'd block on the delete. I think you're right that we should add a sanity check before running the job to ensure the row was actually deleted. At the point where the delete line returns, we can be sure that we have an exclusive lock on the row, so that would be the point to make sure "no, seriously, we are the only thing that has access to this, 100%". It's not a realistic failure scenario, but it doesn't hurt to handle it.
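
For illustration, a minimal sketch of what that check could look like with Diesel; `BackgroundJob`, `Error::JobAlreadyTaken`, and `perform` are placeholder names, not necessarily what this PR uses:

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;

// `BackgroundJob` derives `Identifiable`, and `Error` implements
// `From<diesel::result::Error>`; both are placeholder types for this sketch.
fn run_job(conn: &PgConnection, job: &BackgroundJob) -> Result<(), Error> {
    conn.transaction(|| {
        // `execute` returns the number of rows affected. Anything other than
        // exactly one row means another worker already deleted (i.e. completed)
        // this job, so bail out instead of running it a second time.
        let rows_deleted = diesel::delete(job).execute(conn)?;
        if rows_deleted != 1 {
            return Err(Error::JobAlreadyTaken);
        }
        // From here until the transaction ends we hold an exclusive lock on
        // the row, so no other worker can pick this job up.
        job.perform(conn)
    })
}
```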

Member:

It's a requirement. We cannot update the retry counter in the same transaction as the job itself, since if the job fails the transaction may be aborted meaning we cannot commit those changes.

Ah, great point! I knew I was missing some subtlety here.

Overall this looks really good to me. While we'll only need one consumer for git operations, this is well future-proofed for additional use cases.

sgrif (Contributor, Author):

Yes, I've written this code to be extracted into its own library at some point. The only thing that should be coupled to crates.io is the use of CargoResult. To make this generically applicable to us, I will specifically need to add a way to have a differing number of workers for different job types (having >1 worker for the git job makes no sense), but we don't have to deal with that for now, since the git job will be our only job.

Note: This is intended to possibly be extracted into a library, so the
docs are written as if this were its own library. Cargo-specific code
(besides the use of `CargoResult`) will go in its own module for the
same reason.

This adds an MVP background queueing system intended to be used in place
of the "try to commit 20 times" loop in `git.rs`. This is a fairly
simple queue, that is intended to be "the easiest thing that fits our
needs, with the least operational impact".

There are a few reasons I've opted to go with our own queuing system
here, rather than an existing solution like Faktory or Beanstalkd.

- We'd have to write the majority of this code ourselves no matter
  what.
  - Client libraries for beanstalkd don't deal with the actual job
    logic; only `storage.rs` would get replaced
  - The only client for Faktory that exists made some really odd API
    choices. Faktory also hasn't seen a lot of use in the wild yet.
- I want to limit the number of services we have to manage. We have
  extremely limited ops bandwidth today, and every new part of the stack
  we have to manage is a huge cost. Right now we only have our server
  and PG. I'd like to keep it that way for as long as possible.

This system takes advantage of the `SKIP LOCKED` feature in PostgreSQL
9.5 to handle all of the hard stuff for us. We use PG's row locking to
treat a row as "currently being processed", which means we don't have to
worry about returning it to the queue if the power goes out on one of
our workers.
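
A rough sketch of how such a query can be written with Diesel (the table
and column names here are illustrative, and this assumes a Diesel release
with `for_update`/`skip_locked` support; it is not the exact code in this
PR):

```rust
#[macro_use]
extern crate diesel;

use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::QueryResult;

// Illustrative schema; a real table would carry more columns (retry count,
// job data, timestamps, ...).
table! {
    background_jobs (id) {
        id -> BigInt,
        job_type -> Text,
    }
}

// Fetch one job that no other worker currently holds. `FOR UPDATE` takes a
// row-level lock that lasts until the surrounding transaction ends, and
// `SKIP LOCKED` makes other workers skip that row instead of blocking on it,
// so a job being processed is invisible to everyone else. If a worker loses
// power, its connection drops, the lock is released, and the row shows up
// again on the next poll.
fn next_unlocked_job(conn: &PgConnection) -> QueryResult<(i64, String)> {
    use self::background_jobs::dsl::*;

    background_jobs
        .select((id, job_type))
        .order(id)
        .for_update()
        .skip_locked()
        .first(conn)
}
```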

This queue is intended only for jobs with "at least once" semantics.
That means the entire job has to be idempotent. If the entire job
completes successfully, but the power goes out before we commit the
transaction, we will run the whole thing again.
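
As a purely hypothetical illustration of what that requires from a job
body (none of these helpers exist in this PR), the job checks for work
that a previous run may already have done:

```rust
use std::io;

// Hypothetical helpers standing in for real git-index operations.
fn index_contains(_krate: &str, _vers: &str) -> io::Result<bool> { Ok(false) }
fn add_to_index(_krate: &str, _vers: &str) -> io::Result<()> { Ok(()) }

// Because the queue only guarantees "at least once" execution, the job body
// first checks whether a previous run (whose completion was never committed)
// already did the work, making a second run a harmless no-op.
fn add_crate_to_index(krate: &str, vers: &str) -> io::Result<()> {
    if index_contains(krate, vers)? {
        return Ok(());
    }
    add_to_index(krate, vers)
}
```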

The code today also makes a few additional assumptions based on our
current needs. We expect all jobs to complete successfully the first
time, and the most likely reason a job would fail is due to an incident
happening at GitHub, hence the extremely high retry timeout.

I'm also assuming that all jobs will eventually complete, and that any
job failing N (likely 5) times is an event that should page whoever is
on call. (Paging is not part of this PR).

Finally, it's unlikely that this queue will be appropriate for
high-throughput use cases, since it requires one PG connection per
worker (a real connection; adding PgBouncer wouldn't help here). Right
now our only background work is something that comes in on average
every 5 minutes, but if we start moving more code to be run here we may
want to revisit this in the future.

@sgrif (Contributor, Author) commented Oct 11, 2018

I've updated this with what I think will be the final queuing code (I will be working on moving one of the git functions over to this tomorrow). The main change I've made here is not passing the runner's connection to the job, which means that we no longer have to juggle two transactions. I've also set us up to catch panics.
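
For illustration only, the panic handling can look roughly like the sketch below; the `run_job` closure and the logging are placeholders, not this PR's actual API:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Run a job, converting a panic into an ordinary failure so a single bad job
// can't take down the worker. A panicking job never commits its transaction,
// so its row stays in the queue and becomes eligible for retry.
fn run_catching_panics<F>(run_job: F)
where
    F: FnOnce() -> Result<(), Box<dyn std::error::Error>>,
{
    match catch_unwind(AssertUnwindSafe(run_job)) {
        Ok(Ok(())) => {}                              // job succeeded
        Ok(Err(e)) => eprintln!("job failed: {}", e), // job returned an error
        Err(_) => eprintln!("job panicked"),          // job panicked
    }
}
```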

@sgrif (Contributor, Author) commented Jan 3, 2019

Going to open a new PR with this and the git migration

@sgrif closed this Jan 3, 2019
Successfully merging this pull request may close these issues: Move all of our git logic off the web server