Merge pull request #538 from pietroalbini/cratesio-db-maintenance
Add crates.io database maintenance checklist
pietroalbini committed Jun 8, 2021
2 parents 53abab6 + 449dfc9 commit 4f797b8
Showing 2 changed files with 137 additions and 0 deletions.
1 change: 1 addition & 0 deletions src/SUMMARY.md
@@ -30,6 +30,7 @@
- [How to run a design meeting](./compiler/steering-meeting/how-to-run-design.md)
- [crates.io](./crates-io/README.md)
- [Crate removal](./crates-io/crate-removal.md)
- [Database maintenance](./crates-io/db-maintenance.md)
- [docs.rs](./docs-rs/README.md)
- [Adding dependencies to the build environment](./docs-rs/add-dependencies.md)
- [Developing without docker-compose](./docs-rs/no-docker-compose.md)
136 changes: 136 additions & 0 deletions src/crates-io/db-maintenance.md
@@ -0,0 +1,136 @@
# Database maintenance

There are times when Heroku needs to perform maintenance on our database
instances, for example to apply system updates or upgrade to a newer database
server.

We must **not** let Heroku run the maintenance on its own during the scheduled
maintenance window, as that would disrupt production users (move the
maintenance window if necessary). This page contains the instructions on how to
perform the maintenance ourselves with the minimum amount of disruption.
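
As a rough sketch of how the window could be inspected and moved from the command line: the two functions below wrap the Heroku CLI's `pg:maintenance` and `pg:maintenance:window` subcommands. The function names, the `APP` variable and the example window value are assumptions made for illustration, and the exact argument order should be checked against `heroku help pg:maintenance:window` before relying on it.

```shell
# Sketch only: APP and the function names are hypothetical, not part of the
# crates.io tooling. DATABASE is the attachment name used elsewhere on this page.
APP="${APP:-crates-io}"

# Show whether a maintenance is pending and which window is configured.
show_maintenance() {
    heroku pg:maintenance -a "$APP" DATABASE
}

# Move the weekly window. The argument has the form "<weekday> <HH:MM>",
# for example: move_maintenance_window "Sunday 03:00"
move_maintenance_window() {
    heroku pg:maintenance:window -a "$APP" DATABASE "$1"
}
```

Moving the window away from the planned date keeps Heroku from starting the maintenance before the checklist below has been followed.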

# Primary database

Performing maintenance on the primary database requires us to temporarily put
the application in read-only mode. Heroku performs maintenance by creating a
hidden database follower and switching over to it, so we need to prevent writes
on the primary to let the follower catch up.

Maintenance should take less than 5 minutes of read-only time, but we should
still announce it ahead of time on our status page. This is a sample message we
can use:

> The crates.io team will perform database maintenance on YYYY-MM-DD from
> hh:mm to hh:mm UTC.
>
> We expect this to take less than 5 minutes to complete. During maintenance
> crates.io will only be available in read-only mode: downloading crates and
> visiting the website will still work, but logging in, publishing crates,
> yanking crates or changing owners will not work.

## Primary Database Checklist

**1 hour before the maintenance**

1. Go into the Heroku Scheduler and disable the job enqueueing the downloads
count updater. You can "disable" it by changing its schedule not to run
during the maintenance window. The job uses a lot of database resources, and
we should not run it during maintenance.

**5 minutes before the maintenance**

2. Scale the background worker to 0 instances:

```
heroku ps:scale -a crates-io background_worker=0
```

**At the start of the maintenance**

3. Update the status page with this message:

> Scheduled maintenance on our database is starting.
>
> We expect this to take less than 5 minutes to complete. During maintenance
> crates.io will only be available in read-only mode: downloading crates and
> visiting the website will still work, but logging in, publishing crates,
> yanking crates or changing owners will not work.

3. Configure the application to be in read-only mode without the follower:

```
heroku config:set -a crates-io READ_ONLY_MODE=1 DB_OFFLINE=follower
```

The follower is taken out of rotation because, while Heroku tries to prevent
connections to the primary database from failing during maintenance, we
observed that the same does not apply to the follower database: there could be
brief periods during which the follower is not available.

3. Confirm the application is in read-only mode by trying to publish a crate
and to log in: both should fail.

3. Run the database maintenance:

```
heroku pg:maintenance:run --force -a crates-io
```

3. Confirm all the databases are online:

```
heroku pg:info -a crates-io
```

3. Confirm the primary database has fully recovered (the query should return
false, which `psql` prints as `f`):

```
echo "SELECT pg_is_in_recovery();" | heroku pg:psql -a crates-io DATABASE
```

3. Switch off read-only mode:

```
heroku config:unset -a crates-io READ_ONLY_MODE
```

**WARNING:** the Heroku Dashboard's UI is misleading when removing an
environment variable. A red badge with a "-" (minus) in it means the
variable was *successfully removed*; it does not mean that removing the
variable failed. Failures are indicated by a red badge with an "x" (cross) in
it.

3. Confirm the application is working by trying to publish a crate and logging
in.

3. Update the status page and mark the maintenance as completed with this
message:

> Scheduled maintenance finished successfully.

The message is posted at this point, and not at the end of the checklist,
because from here on production users are no longer impacted by the
maintenance.

3. Scale the background worker up again:

```
heroku ps:scale -a crates-io background_worker=1
```

3. Confirm the follower database is available:

```
echo "SELECT 1;" | heroku pg:psql -a crates-io READ_ONLY_REPLICA
```

3. Enable connections to the follower:

```
heroku config:unset -a crates-io DB_OFFLINE
```

3. Re-enable the background job disabled during step 1.
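
The recovery check in the checklist above can also be scripted rather than read by eye. The following is a minimal sketch under stated assumptions, not part of the team's tooling: `primary_recovered` and `wait_for_primary` are hypothetical helper names, and the retry count and sleep interval are arbitrary. It relies on `psql` printing booleans as `t` or `f`, and otherwise reuses the exact command from the checklist.

```shell
# Hypothetical helper: reads psql output on stdin and succeeds only if the
# result row says the server is NOT in recovery.
primary_recovered() {
    # Trim whitespace around each line, then look for a line that is exactly
    # "f" or "false". The column header and "(1 row)" footer never match.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | grep -Eqx 'f|false'
}

# Hypothetical helper: poll the primary until it has left recovery.
# $1 is the number of attempts (default 30, an arbitrary choice).
wait_for_primary() {
    attempts="${1:-30}"
    while [ "$attempts" -gt 0 ]; do
        if echo "SELECT pg_is_in_recovery();" \
            | heroku pg:psql -a crates-io DATABASE | primary_recovered; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 10
    done
    return 1
}
```

A possible use is `wait_for_primary && heroku config:unset -a crates-io READ_ONLY_MODE`, so that read-only mode is only switched off once the check passes.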

# Follower database

Instructions and checklists for follower database maintenance aren't written
yet.
