Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move index updates off the web server #1588

Merged
merged 16 commits into from Mar 9, 2019

Conversation

Projects
None yet
3 participants
@sgrif
Copy link
Contributor

sgrif commented Jan 3, 2019

This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times since we have multiple web servers that can create a race condition), we instead queue the update to be run on another machine at some point in the future.

This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.

This setup assumes that all jobs must complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production which is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call with extremely low thresholds (defaults to paging if a job has been in the queue for 15 minutes, configurable by env var). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku scheduler for us)

One side effect of this change is that cargo publish returning with a 0 exit status does not mean that the crate can immediately be used. This has always technically been true, since S3 and GitHub both can have delays before they update as well, but it's going to consistently be a bit longer with this PR. It should only be a few seconds the majority of the time, and no more than a minute in the worst case. One enhancement I'd like to make, which is not included in this PR, is a UI to show the status of a publish. I did not include it here, as this PR is already huge, and I do not think that feature is strictly required to land this. In the common case, it will take longer to navigate to that UI than it will take for the job to complete. This enhancement will also go nicely with work on staging publishes if we want to add those (see #1503). There are also some low hanging fruit we can tackle to lower the job's running time if we feel it's necessary.

As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs something like RabbitMQ or Faktory. The first is operational. We still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and making it easy to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low level bindings to other queues, but the majority of the "job" infrastructure would still be needed.

The queue code is intended to eventually be extracted to a library. This portion of the code is the background module, and is why a lot of the code in that module is written a bit more generically than crates.io specifically needs. It's still a bit too coupled to crates.io to be extracted right now, though -- and I'd like to have it in the wild for a bit before extracting it. The background_jobs module is our code for interacting with this "library".

@sgrif sgrif referenced this pull request Jan 11, 2019

Closed

failed to yank #1591

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Jan 16, 2019

☔️ The latest upstream changes (presumably #1529) made this pull request unmergeable. Please resolve the merge conflicts.

@jtgeibel
Copy link
Member

jtgeibel left a comment

You'll want to add background-worker and monitor to bin_names in script/ci/prune-cache.sh. This will prune the final build artifacts for the binaries to keep our cache sizes down on CI.

I see background_worker in Procfile, but how (and how often) does monitor get run?

On each index operation, the background job clones the index from GitHub into a temporary directory. I'm a bit worried about this from two perspectives. GitHub might throttle us if we clone too frequently, and I'm not sure how much space Heroku provides within the temporary directory as we accumulate many checkouts in a 24 hour period. We could query the GitHub api (curl -H "Accept: application/vnd.github.v3.sha" https://api.github.com/repos/rust-lang/crates.io-index/commits/master returns the current SHA) to see if we already have the latest commit and do the equivalent of a fetch and reset if not.

Finally, the retry semantics allow for jobs to be executed out of order. The most common source of errors will be network connectivity to GitHub. After an outage new requestes will complete succesfully while old jobs are waiting for their retry time to expire. If someone yanks a crate during an outage and then unyanks it a few minutes later after connectivity recovers, then the index operations will complete out of order and the index state (yanked) will not match the database (unyanked). I think there needs to be some mechanism to guarantee that operations on the same (crate_name, version) complete in order.

Show resolved Hide resolved src/background/storage.rs
@sgrif

This comment has been minimized.

Copy link
Contributor Author

sgrif commented Jan 21, 2019

I see background_worker in Procfile, but how (and how often) does monitor get run?

It gets run by any cron-like tool, which is Heroku scheduler for us. I figure we'll start at every 5 minutes, and tweak if needed. It is not configured in the repository.

I disagree that throttling is going to be a concern from cloning, but you make a good point on disk space (I'd assume that /tmp is cleared if we're out of disk space but I'm not sure whether that happens and don't really want to find out). Ultimately the clone per worker was something I think is likely to cause more delay than we'd like, and I was planning on tackling it shortly after this was merged, so I'll fix it now.

Finally, the retry semantics allow for jobs to be executed out of order.

This is intentional, but that's a good point that yanking/unyanking quickly can cause the index to not match the database. I need to give this some more thought. I don't think I agree that we need to guarantee in-order execution for operations affecting (crate_name, version), but I do agree that the index not matching the DB would be a problem.

@sgrif

This comment has been minimized.

Copy link
Contributor Author

sgrif commented Jan 26, 2019

This is blocked on #1607

@sgrif sgrif added the S-blocked label Jan 26, 2019

bors added a commit that referenced this pull request Feb 1, 2019

Auto merge of #1610 - sgrif:sg-test-connection, r=jtgeibel
Explicitly share a single connection in tests

This is required to move forward on #1588, I have spun it off into a standalone PR for ease of review, since it's isolated and a change that makes sense on its own. This requires #1609 and includes those commits

Right now the way our test suite works is that we set the database
connection pool size to 1, and tell that single connection to start a
test transaction. This works, but has the caveat that only a single
reference to the connection can be live at any given time.

This becomes a problem when we add background jobs into the mix, which
will always want to grab two connections from the pool (and we still
need them to be the same connection). The code that we end up running
looks something like:

```
// In request
let conn = pool.get();
insert_background_job(&conn)?;

// In middleware that runs jobs
for _ in 0..job_count {
    thread::spawn(|| {
        conn = pool.get();
        lock_job(&conn);
        // in actual job
        conn = pool.get();
        do_stuff(&conn);
        unlock_or_delete_job(&conn);
    });
}
```

Note that the worker and the job itself both need a connection. In order
to make this work, we need to change our pool to actually just be a
single connection given out multiple times in test mode. I've chosen to
use a reentrant mutex for this for two reasons:

1. We need to satisfy the `Sync` bound on `App`, even though we're only
  running in a single thread in tests.
2. The background worker actually does end up running in another thread,
  so we do actually need to be thread safe.

The result of this is that we can have the connection checked out more
than once at the same time, but we still can't be more relaxed about
when it gets dropped in our test code, since we need it to be unlocked
if we try to get a connection from another thread as the result of
running a background job. Our tests shouldn't know or care about whether
background jobs were run async or not, so they should assume every
request might run one.

The mutex guard holds a reference to the mutex, so this introduces a
lifetime constraint that hasn't existed since the Diesel port. The
return type of `req.db_conn()` is now effectively `&PgConnection`
instead of `Box<PgConnection>`. Since the method exists on `Request`,
this means it gets treated as an immutable borrow of the whole request.
The fixes are all pretty simple, and boil down to either cloning the
app and getting the DB directly from there, taking immutable references
instead of mutable ones, or dropping the connection when we're done with
it.

@sgrif sgrif added S-waiting-on-author and removed S-blocked labels Feb 13, 2019

@sgrif sgrif force-pushed the sgrif:sg-background-jobs branch from 4b5b9c4 to 5400248 Feb 13, 2019

@sgrif

This comment has been minimized.

Copy link
Contributor Author

sgrif commented Feb 13, 2019

@jtgeibel I've updated the yanking logic to handle database updates in the background job instead of on the web server.

I have not changed the clone into a temp dir behavior, and I don't think we need to in this PR. The concern about disk space is actually already handled. We're using TempDir for this, which will already clean up the directory when it's dropped.

I do want to change this to reuse the same checkout over multiple runs for other reasons, but I think there's enough tradeoffs/discussions for that to be its own PR. The concern about disk usage is resolved, so I don't think it needs to block this on its own.

So I believe this is good to go if you want to give it a last look. If it is good to merge, I'd appreciate a comment letting me know but holding off on merging for now. I consider this a pretty high risk PR so I want to make sure it's deployed by itself when I have plenty of time to do some final checks on staging and have overridden myself to be on call.

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Feb 13, 2019

☔️ The latest upstream changes (presumably #1623) made this pull request unmergeable. Please resolve the merge conflicts.

@sgrif sgrif force-pushed the sgrif:sg-background-jobs branch from 09deccf to 8fc4b32 Feb 22, 2019

@sgrif sgrif requested a review from jtgeibel Feb 22, 2019

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Feb 22, 2019

☔️ The latest upstream changes (presumably #1618) made this pull request unmergeable. Please resolve the merge conflicts.

@jtgeibel
Copy link
Member

jtgeibel left a comment

@sgrif the updated yank logic looks good to me. Sorry I missed your ping on this last week.

I though I looked into the drop behavior of TempDir, but I guess not since it is right there in the documentation. This looks good to me as well.

I'll let you merge this when you're ready to deploy.

@sgrif

This comment has been minimized.

Copy link
Contributor Author

sgrif commented Mar 2, 2019

Going to do some final testing on staging and deploy this next week

@sgrif

This comment has been minimized.

Copy link
Contributor Author

sgrif commented Mar 8, 2019

Staging revealed a few issues which I've pushed fixes for. I'm confident that this actually works, I'm going to quickly test on staging with the full index instead of an empty one before merging.

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Mar 8, 2019

☔️ The latest upstream changes (presumably #1650) made this pull request unmergeable. Please resolve the merge conflicts.

sgrif added some commits May 3, 2018

Add a minimal background queueing framework
Note: This is intended to possibly be extracted into a library, so the
docs are written as if this were its own library. Cargo specific code
(besides the use of `CargoResult`) will go in its own module for the
same reason.

This adds an MVP background queueing system intended to be used in place
of the "try to commit 20 times" loop in `git.rs`. This is a fairly
simple queue, that is intended to be "the easiest thing that fits our
needs, with the least operational impact".

There's a few reasons I've opted to go with our own queuing system here,
rather than an existing solution like Faktory or Beanstalkd.

- We'd have to write the majority of this code ourselves no matter
  what.
  - Client libraries for beanstalkd don't deal with the actual job
    logic, only `storage.rs` would get replaced
  - The only client for faktory that exists made some really odd API
    choices. Faktory also hasn't seen a lot of use in the wild yet.
- I want to limit the number of services we have to manage. We have
  extremely limited ops bandwidth today, and every new part of the stack
  we have to manage is a huge cost. Right now we only have our server
  and PG. I'd like to keep it that way for as long as possible.

This system takes advantage of the `SKIP LOCKED` feature in PostgreSQL
9.5 to handle all of the hard stuff for us. We use PG's row locking to
treat a row as "currently being processed", which means we don't have to
worry about returning it to the queue if the power goes out on one of
our workers.

This queue is intended only for jobs with "at least once" semantics.
That means the entire job has to be idempotent. If the entire job
completes successfully, but the power goes out before we commit the
transaction, we will run the whole thing again.

The code today also makes a few additional assumptions based on our
current needs. We expect all jobs to complete successfully the first
time, and the most likely reason a job would fail is due to an incident
happening at GitHub, hence the extremely high retry timeout.

I'm also assuming that all jobs will eventually complete, and that any
job failing N (likely 5) times is an event that should page whoever is
on call. (Paging is not part of this PR).

Finally, it's unlikely that this queue will be appropriate for high
thoughput use cases, since it requires one PG connection per worker (a
real connection, adding pg bouncer wouldn't help here). Right now our
only background work that happens is something that comes in on average
every 5 minutes, but if we start moving more code to be run here we may
want to revisit this in the future.
Encapsulate some of our test behavior into a `Runner`
The `Runner` is basically the main interface into the job queue, and
stores the environment, thread pool, and database connection pool. We
technically don't need a thread pool since our jobs are so infrequent
that we don't need more than one thread. But our tests are simplified by
having it, and a general purpose lib will need it.

The other oddity is that we will want to panic instead of returning a
result if updating or fetching a job failed for some reason. This isn't
something we expect to occur unless there's an issue in the DB, so this
seems fine. A thread panicking won't kill the runner. We are also only
panicking if changes to the `background_jobs` table fails, not if the
job itself fails.
Move our index updates to be run in background jobs
This fundamentally changes the workflow for all operations we perform
involving git, so that they are not performed on the web server and do
not block the response. This will improve the response times of `cargo
publish`, and make the publish process more resilient, reducing the
liklihood of an inconsistency occurring such as the index getting
updated, but not our database.

Previously, our workflow looked something like this:

- When the server boots, do a full clone of the index into a known
  location
- Some request comes in that needs to update the index
- Database transaction is opened
- Local checkout is modified, we attempt to commit & push (note: This
  involves a mutex to avoid contention with another request to update
  the index on the same server)
- If push fails, we fetch, `reset --hard`, and try again up to 20 times
- Database transaction is committed
- We send a successful response

The reason for the retry logic is that we have more than one web server,
meaning no server can be sure that its local checkout is actually up to
date. There's also a major opportunity for an inconsistent state to be
reached here. If the power goes out, the server is restarted, something
crashes, etc, in between the index being updated and the database
transaction being committed, we will never retry it.

The new workflow looks like this:

- Some request comes in that needs to update the index
- A job is queued in the database to update the index at some point in
  the future.
- We send a successful response
- A separate process pulls the job out of the database
- A full clone of the index is performed into a temporary directory
- The new checkout is modified, committed, and pushed
- If push succeeds, job is removed from database
- If push fails, job is marked as failed and will be retried at some
  point in the future

While a background worker can be spread across multiple machines and/or
threads, we will be able to avoid the race conditions that were
previously possible by ensuring that we only have one worker with one
thread that handles index updates. Right now that's easy since index
updates are the only background job we have, but as we add more we will
need to add support for multiple queues to account for this.

I've opted to do a fresh checkout in every job, rather than relying on
some state that was setup when the machine booted. This is mostly for
simplicity's sake. It also means that if we need to scale to multiple
threads/processes for other jobs, we can punt the multi-queue
enhancement for a while if we wish. However, it does mean the job will
take a bit longer to run. If this turns out to be a problem, it's easy
to address.

This should eliminate the opportunity for the index to enter an
inconsistent state from our database -- or at least they should become
eventually consistent. If the power goes out before the job is committed
as done, it is assumed the job failed and it will be retried. The job
itself is idempotent, so even if the power goes out after the index is
updated, the retry should succeed.

One other side effect of this change is that when `cargo publish`
returns with an exit status of 0, that does not mean that your crate/new
version is immediately available for use -- if you try to point to it in
Cargo.toml seconds after publishing, you may get an error that it could
not find that version. This was technically already true, since neither
S3 nor GitHub guarantee that uploads/pushes are immediately visible.
However, this does increase the timescale beyond the delay we would have
seen there. In most cases it should be under 10 seconds, and at most a
minute.

One enhancement that will come as a followup, but is not included in
this PR is a UI to see the status of your upload. This is definitely
nice to have, but is not something I think is necessary for this feature
to land. The time it would take to navigate to that UI is going to be
longer than the time it takes the background job to run in most cases.
That enhancement is something I think can go hand in hand with #1503
(which incidentally becomes much easier to implement with this PR, since
a "staging" publish just skips queuing the background job, and the only
thing the button to full publish needs to do is queue the job).

This setup does assume that all background jobs *must* eventually
succeed. If any job fails, the index is in an inconsistent state with
our database, and we are having an outage of some kind. Due to the
nature of our background jobs, this likely means that GitHub is down, or
there is a bug in our code. Either way, we page whoever is on-call,
since it means publishing is broken. Since publishing crates is such an
infrequent event, I've set the thresholds to be extremely low.
Add a binary for running background jobs and monitoring their status
These binaries are pretty simple (and a bit spaghetti) for the time
being, since we have so little actually using the background job
framework. The runner is meant to be run on a worker dyno continuously,
while the monitor should be run by a cron-like tool roughly every 5
minutes.

We should move `update-downloads` to be scheduled as well, since it
spends so little time actually working.

sgrif added some commits Feb 12, 2019

Fix rebase failures
Now that we're wrapping the connection in our own smart pointer that
doesn't implement `Connection` we need some explicit derefs (adding the
manual `Connection` impl isn't worth avoiding these `*`s). We need to be
able to clone the connection pool (only in tests, but this also only
requires modifying the test variant), so we need the `Arc`.

Similarly, since in tests the connection is a re-entrant mutex, we can't
grab the connection before spawning the worker thread. The lock isn't
`Send` that's for a very good reason. So we instead need to clone a
handle to the pool and grab the connection on the thread we intend to
use it.
Yanking only updates the database after the index is updated
Previously, we'd immediately update the database and then queue the
index update to be run later. Jobs aren't guaranteed to run in the
order they were queued (though in practice they likely will). This means
that if someone yanked and unyanked a crate in rapid succession, the
database may end up permanently out of sync with the index.

With this change, the final result may still be out of order, but the
database will be in sync with the index whatever the outcome is. Since
rapidly yanking and unyanking isn't something we expect users to do, and
even if they did jobs should be running in order under normal
circumstances, I don't think we need to do anything to ensure more
consistency than this.
Fix pool size, don't crash if we can't get a connection
When testing in staging, the runner started crashing on boot. This was
because we only set the connection pool size to 1 when we meant to set
it to 2 (one for the runner, one for the worker thread). So in each
loop, if all jobs took greater than 1 second to run, the runner would
crash.

This fixes the pool size, and also does not return an error if no
database connection could be retrieved from the pool.
Check if a crate is already yanked/unyanked in the job
Right now we check if a crate is already yanked/unyanked before enqueing
the job at all. This means that if you try to undo a yank before the
yank finishes running, we won't enqueue the unyank at all. This isn't an
expected scenario, and we still might not do the unyank if the jobs are
run out of order, but this means that we should always end up in the
expected state under normal operation.

@sgrif sgrif force-pushed the sgrif:sg-background-jobs branch from aa3712b to 9d38d19 Mar 8, 2019

sgrif added some commits Mar 8, 2019

Maintain a persistent checkout of the git index
The time a full clone takes caused too long of a delay when publishing a
crate. This instead performs a full clone when the runner boots, and
then locks and does `git fetch && git reset --hard origin/master` at the
top of every job. The reset means we can (somewhat) scale to multiple
runners, and retains unwind safety in the environment.

Note that there can still be a delay measured in minutes if a publish is
performed right after the server boots, as the job runner starting up
does not block the web server from starting.
@sgrif

This comment has been minimized.

Copy link
Contributor Author

sgrif commented Mar 8, 2019

Delay for publishes was too long with the full index, so I've set it up to maintain a local checkout. This eliminates the delay once the runner has booted, but there may still be a delay if a publish is done immediately after the server boots, as the web server now boots instantly rather than waiting for the clone to complete.

@sgrif

This comment has been minimized.

Copy link
Contributor Author

sgrif commented Mar 8, 2019

Going to do one more test in staging and then deploy this.

@bors: r+

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Mar 8, 2019

📌 Commit 6777978 has been approved by sgrif

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Mar 8, 2019

⌛️ Testing commit 6777978 with merge d4289d2...

bors added a commit that referenced this pull request Mar 8, 2019

Auto merge of #1588 - sgrif:sg-background-jobs, r=sgrif
Move index updates off the web server

This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times since we have multiple web servers that can create a race condition), we instead queue the update to be run on another machine at some point in the future.

This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.

This setup assumes that all jobs *must* complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production which is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call with extremely low thresholds (defaults to paging if a job has been in the queue for 15 minutes, configurable by env var). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku scheduler for us)

One side effect of this change is that `cargo publish` returning with a 0 exit status does not mean that the crate can immediately be used. This has always technically been true, since S3 and GitHub both can have delays before they update as well, but it's going to consistently be a bit longer with this PR. It should only be a few seconds the majority of the time, and no more than a minute in the worst case. One enhancement I'd like to make, which is not included in this PR, is a UI to show the status of a publish. I did not include it here, as this PR is already huge, and I do not think that feature is strictly required to land this. In the common case, it will take longer to navigate to that UI than it will take for the job to complete. This enhancement will also go nicely with work on staging publishes if we want to add those (see #1503). There are also some low hanging fruit we can tackle to lower the job's running time if we feel it's necessary.

As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs something like RabbitMQ or Faktory. The first is operational. We still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and making it easy to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low level bindings to other queues, but the majority of the "job" infrastructure would still be needed.

The queue code is intended to eventually be extracted to a library. This portion of the code is the `background` module, and is why a lot of the code in that module is written a bit more generically than crates.io specifically needs. It's still a bit too coupled to crates.io to be extracted right now, though -- and I'd like to have it in the wild for a bit before extracting it. The `background_jobs` module is our code for interacting with this "library".
@bors

This comment has been minimized.

Copy link
Contributor

bors commented Mar 9, 2019

💔 Test failed - checks-travis

@sgrif

This comment has been minimized.

Copy link
Contributor Author

sgrif commented Mar 9, 2019

@bors r+

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Mar 9, 2019

📌 Commit 1ab2c08 has been approved by sgrif

@bors

This comment has been minimized.

Copy link
Contributor

bors commented Mar 9, 2019

⌛️ Testing commit 1ab2c08 with merge 7ca518d...

bors added a commit that referenced this pull request Mar 9, 2019

Auto merge of #1588 - sgrif:sg-background-jobs, r=sgrif
Move index updates off the web server

This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times since we have multiple web servers that can create a race condition), we instead queue the update to be run on another machine at some point in the future.

This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.

This setup assumes that all jobs *must* complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production which is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call with extremely low thresholds (defaults to paging if a job has been in the queue for 15 minutes, configurable by env var). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku scheduler for us)

One side effect of this change is that `cargo publish` returning with a 0 exit status does not mean that the crate can immediately be used. This has always technically been true, since S3 and GitHub both can have delays before they update as well, but it's going to consistently be a bit longer with this PR. It should only be a few seconds the majority of the time, and no more than a minute in the worst case. One enhancement I'd like to make, which is not included in this PR, is a UI to show the status of a publish. I did not include it here, as this PR is already huge, and I do not think that feature is strictly required to land this. In the common case, it will take longer to navigate to that UI than it will take for the job to complete. This enhancement will also go nicely with work on staging publishes if we want to add those (see #1503). There are also some low hanging fruit we can tackle to lower the job's running time if we feel it's necessary.

As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs something like RabbitMQ or Faktory. The first is operational. We still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and making it easy to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low level bindings to other queues, but the majority of the "job" infrastructure would still be needed.

The queue code is intended to eventually be extracted to a library. This portion of the code is the `background` module, and is why a lot of the code in that module is written a bit more generically than crates.io specifically needs. It's still a bit too coupled to crates.io to be extracted right now, though -- and I'd like to have it in the wild for a bit before extracting it. The `background_jobs` module is our code for interacting with this "library".
@bors

This comment has been minimized.

Copy link
Contributor

bors commented Mar 9, 2019

☀️ Test successful - checks-travis
Approved by: sgrif
Pushing 7ca518d to master...

@bors bors merged commit 1ab2c08 into rust-lang:master Mar 9, 2019

2 checks passed

homu Test successful
Details
percy/crates.io Visual review automatically approved, no visual changes found.
Details

@sgrif sgrif deleted the sgrif:sg-background-jobs branch Mar 9, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.