
[DO NOT MERGE] WIP background job framework #1466

Closed
wants to merge 1 commit

Conversation

@sgrif (Contributor) commented Jul 25, 2018

This PR is to make it easier to review the "guts" code as it gets worked out before we actually add the background jobs to fix #1384.

Since this was last reviewed I have fixed the race condition on failed jobs, and added some bare bones tests around the queueing infrastructure. I think this code is ready to go, but wanted to get some more eyes on it before I submit the final PR to move git logic over here.

@sgrif (Contributor, Author) commented Jul 25, 2018

@carols10cents This was an absolute nightmare to get proper tests for, but I think this should address your concerns?

src/job.rs (Outdated)

if let Some(job) = job {
    conn.transaction(|| {
        delete(&job).execute(conn)?;
Member:

If I understand the locking strategy here, the first transaction above ensures that no 2 threads will fetch the same job. Once a job is fetched and the retry info is updated, the query will not fetch that row again for at least 1 minute.

Very shortly after the first transaction is completed (and that row lock released), the delete statement in this transaction will again lock the row for update such that even a job that runs for longer than a minute will not be fetched again by another thread until the row is unlocked.

Is my understanding here correct? Is there an advantage/requirement to do these steps in 2 separate transactions? I ask because it took me a little while to realize that this delete is re-locking the row and that it is important to issue this delete within the transaction before executing the job in the next line to ensure the row is locked while the job runs.

If it somehow took more than a minute for a thread to progress from line 72 to 76, and another thread fetched the same job, then I believe one of them would block on this delete (or maybe abort due to a detected deadlock). If the other transaction completed successfully (deleting the row), then this would return an error. If the other transaction failed, then the row would still exist and be unlocked, so the other thread would immediately retry the job. I don't think this is a realistic failure mode, but I think a single execution of the job would still be guaranteed under this extreme scenario.

sgrif (Contributor, Author):

Is my understanding here correct?

Yes

Is there an advantage/requirement to do these steps in 2 separate transactions?

It's a requirement. We cannot update the retry counter in the same transaction as the job itself, since if the job fails, the transaction may be aborted, meaning we cannot commit those changes.

If somehow it took more than a minute for a thread to progress from line 72 to 76

We will hit a TCP timeout before a minute passes. The only way it could take more than a minute to get from line 72 to 76 is if the OS has actually pre-empted our process and we receive no CPU resources for a full minute, which is not realistic.

But yes, you're correct that this behavior relies on less than a minute passing between the first transaction committing and the second transaction locking the row.

and another thread fetched the same job, then I believe one of them would block on this delete (or maybe abort due to a detected deadlock).

Yeah, it'd block on the delete. I think you're right that we should add a sanity check before running the job to ensure the row was actually deleted. At the point where the delete line returns, we can be sure that we have an exclusive lock on the row, so that would be the point to make sure "no, seriously, we are the only thing that has access to this, 100%". It's not a realistic failure scenario, but it doesn't hurt to handle it.
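
For illustration, a minimal sketch of what that check could look like with Diesel; `BackgroundJob`, `Error::JobAlreadyTaken`, and `perform` are placeholder names, not necessarily what this PR uses:

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;

// `BackgroundJob` derives `Identifiable`, and `Error` implements
// `From<diesel::result::Error>`; both are placeholder types for this sketch.
fn run_job(conn: &PgConnection, job: &BackgroundJob) -> Result<(), Error> {
    conn.transaction(|| {
        // `execute` returns the number of rows affected. Anything other than
        // exactly one row means another worker already deleted (i.e. completed)
        // this job, so bail out instead of running it a second time.
        let rows_deleted = diesel::delete(job).execute(conn)?;
        if rows_deleted != 1 {
            return Err(Error::JobAlreadyTaken);
        }
        // From here until the transaction ends we hold an exclusive lock on
        // the row, so no other worker can pick this job up.
        job.perform(conn)
    })
}
```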

Member:

It's a requirement. We cannot update the retry counter in the same transaction as the job itself, since if the job fails the transaction may be aborted meaning we cannot commit those changes.

Ah, great point! I knew I was missing some subtlety here.

Overall this looks really good to me. While we'll only need one consumer for git operations, this is well future-proofed for additional use cases.

sgrif (Contributor, Author):

Yes, I've written this code to be extracted into its own library at some point. The only thing that should be coupled to crates.io is the use of CargoResult. To make this generically applicable to us, I will specifically need to add a way to have a differing number of workers for different job types (having >1 worker for the git job makes no sense), but we don't have to deal with that for now, since the git job will be our only job.

Note: This is intended to possibly be extracted into a library, so the
docs are written as if this were its own library. Cargo-specific code
(besides the use of `CargoResult`) will go in its own module for the
same reason.

This adds an MVP background queueing system intended to be used in place
of the "try to commit 20 times" loop in `git.rs`. This is a fairly
simple queue, that is intended to be "the easiest thing that fits our
needs, with the least operational impact".

There are a few reasons I've opted to go with our own queuing system
here, rather than an existing solution like Faktory or Beanstalkd.

- We'd have to write the majority of this code ourselves no matter
  what.
  - Client libraries for beanstalkd don't deal with the actual job
    logic; only `storage.rs` would get replaced
  - The only client for Faktory that exists made some really odd API
    choices. Faktory also hasn't seen a lot of use in the wild yet.
- I want to limit the number of services we have to manage. We have
  extremely limited ops bandwidth today, and every new part of the stack
  we have to manage is a huge cost. Right now we only have our server
  and PG. I'd like to keep it that way for as long as possible.

This system takes advantage of the `SKIP LOCKED` feature in PostgreSQL
9.5 to handle all of the hard stuff for us. We use PG's row locking to
treat a row as "currently being processed", which means we don't have to
worry about returning it to the queue if the power goes out on one of
our workers.
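
A rough sketch of how such a query can be written with Diesel (the table
and column names here are illustrative, and this assumes a Diesel release
with `for_update`/`skip_locked` support; it is not the exact code in this
PR):

```rust
#[macro_use]
extern crate diesel;

use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::QueryResult;

// Illustrative schema; a real table would carry more columns (retry count,
// job data, timestamps, ...).
table! {
    background_jobs (id) {
        id -> BigInt,
        job_type -> Text,
    }
}

// Fetch one job that no other worker currently holds. `FOR UPDATE` takes a
// row-level lock that lasts until the surrounding transaction ends, and
// `SKIP LOCKED` makes other workers skip that row instead of blocking on it,
// so a job being processed is invisible to everyone else. If a worker loses
// power, its connection drops, the lock is released, and the row shows up
// again on the next poll.
fn next_unlocked_job(conn: &PgConnection) -> QueryResult<(i64, String)> {
    use self::background_jobs::dsl::*;

    background_jobs
        .select((id, job_type))
        .order(id)
        .for_update()
        .skip_locked()
        .first(conn)
}
```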

This queue is intended only for jobs with "at least once" semantics.
That means the entire job has to be idempotent. If the entire job
completes successfully, but the power goes out before we commit the
transaction, we will run the whole thing again.
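
As a purely hypothetical illustration of what that requires from a job
body (none of these helpers exist in this PR), the job checks for work
that a previous run may already have done:

```rust
use std::io;

// Hypothetical helpers standing in for real git-index operations.
fn index_contains(_krate: &str, _vers: &str) -> io::Result<bool> { Ok(false) }
fn add_to_index(_krate: &str, _vers: &str) -> io::Result<()> { Ok(()) }

// Because the queue only guarantees "at least once" execution, the job body
// first checks whether a previous run (whose completion was never committed)
// already did the work, making a second run a harmless no-op.
fn add_crate_to_index(krate: &str, vers: &str) -> io::Result<()> {
    if index_contains(krate, vers)? {
        return Ok(());
    }
    add_to_index(krate, vers)
}
```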

The code today also makes a few additional assumptions based on our
current needs. We expect all jobs to complete successfully the first
time, and the most likely reason a job would fail is due to an incident
happening at GitHub, hence the extremely high retry timeout.

I'm also assuming that all jobs will eventually complete, and that any
job failing N (likely 5) times is an event that should page whoever is
on call. (Paging is not part of this PR).

Finally, it's unlikely that this queue will be appropriate for
high-throughput use cases, since it requires one PG connection per
worker (a real connection; adding PgBouncer wouldn't help here). Right
now our only background work is something that comes in on average
every 5 minutes, but if we start moving more code to be run here we may
want to revisit this in the future.

@sgrif (Contributor, Author) commented Oct 11, 2018

I've updated this with what I think will be the final queuing code (I will be working on moving one of the git functions over to this tomorrow). The main change I've made here is not passing the runner's connection to the job, which means that we no longer have to juggle two transactions. I've also set us up to catch panics.
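
For illustration only, the panic handling can look roughly like the sketch below; the `run_job` closure and the logging are placeholders, not this PR's actual API:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Run a job, converting a panic into an ordinary failure so a single bad job
// can't take down the worker. A panicking job never commits its transaction,
// so its row stays in the queue and becomes eligible for retry.
fn run_catching_panics<F>(run_job: F)
where
    F: FnOnce() -> Result<(), Box<dyn std::error::Error>>,
{
    match catch_unwind(AssertUnwindSafe(run_job)) {
        Ok(Ok(())) => {}                              // job succeeded
        Ok(Err(e)) => eprintln!("job failed: {}", e), // job returned an error
        Err(_) => eprintln!("job panicked"),          // job panicked
    }
}
```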

@sgrif (Contributor, Author) commented Jan 3, 2019

Going to open a new PR with this and the git migration

@sgrif closed this Jan 3, 2019
Successfully merging this pull request may close these issues: Move all of our git logic off the web server