Merge pull request #538 from pietroalbini/cratesio-db-maintenance
Add crates.io database maintenance checklist
# Database maintenance

There are times when Heroku needs to perform maintenance on our database
instances, for example to apply system updates or upgrade to a newer database
server.

We must **not** let Heroku run the maintenance on its own during the scheduled
maintenance window, to avoid disrupting production users (move the maintenance
window if necessary). This page contains instructions on how to perform the
maintenance manually with the minimum amount of disruption.

# Primary database

Performing maintenance on the primary database requires us to temporarily put
the application in read-only mode. Heroku performs maintenances by creating a
hidden database follower and switching over to it, so we need to prevent writes
on the primary to let the follower catch up.

Maintenance should take less than 5 minutes of read-only time, but we should
still announce it ahead of time on our status page. This is a sample message we
can use:

> The crates.io team will perform a database maintenance on YYYY-MM-DD from
> hh:mm to hh:mm UTC.
>
> We expect this to take less than 5 minutes to complete. During maintenance
> crates.io will only be available in read-only mode: downloading crates and
> visiting the website will still work, but logging in, publishing crates,
> yanking crates or changing owners will not work.

## Primary Database Checklist

**1 hour before the maintenance**

1. Go into the Heroku Scheduler and disable the job enqueueing the downloads
   count updater. You can "disable" it by changing its schedule not to run
   during the maintenance window. The job uses a lot of database resources, and
   we should not run it during maintenance.

**5 minutes before the maintenance**

2. Scale the background worker to 0 instances:

   ```
   heroku ps:scale -a crates-io background_worker=0
   ```
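
To double-check that the worker is really stopped, one option is to read back the formation that `heroku ps:scale` prints when it is called without arguments. The sketch below is only illustrative: the `worker_count` helper is hypothetical, and the sample formation string is an assumption about the output format, not captured output.

```shell
#!/bin/sh
# Sketch: extract the dyno count for one process type from a formation
# string such as "web=2:Standard-1X background_worker=0:Standard-1X".
# The string format is an assumption about what `heroku ps:scale` prints.
worker_count() {
    # $1: process type to look up; the formation string is read from stdin.
    tr ' ' '\n' | sed -n "s/^$1=\([0-9][0-9]*\).*/\1/p"
}

# Real usage (requires the Heroku CLI and access to the app):
#   heroku ps:scale -a crates-io | worker_count background_worker

# Offline demonstration with a made-up formation string:
echo "web=2:Standard-1X background_worker=0:Standard-1X" | worker_count background_worker
```

If the helper prints anything other than `0` for `background_worker`, the worker is still running and the maintenance should not start yet.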

**At the start of the maintenance**

3. Update the status page with this message:

   > Scheduled maintenance on our database is starting.
   >
   > We expect this to take less than 5 minutes to complete. During maintenance
   > crates.io will only be available in read-only mode: downloading crates and
   > visiting the website will still work, but logging in, publishing crates,
   > yanking crates or changing owners will not work.

3. Configure the application to be in read-only mode without the follower:

   ```
   heroku config:set -a crates-io READ_ONLY_MODE=1 DB_OFFLINE=follower
   ```

   The follower is removed because, while Heroku tries to prevent connections
   to the primary database from failing during maintenance, we observed that
   the same does not apply to the follower database, and there could be brief
   periods during which the follower is not available.

3. Confirm the application is in read-only mode by trying to publish a crate
   and to log in; both should fail.

3. Run the database maintenance:

   ```
   heroku pg:maintenance:run --force -a crates-io
   ```

3. Confirm all the databases are online:

   ```
   heroku pg:info -a crates-io
   ```

3. Confirm the primary database fully recovered (the query should return `f`,
   that is, false):

   ```
   echo "SELECT pg_is_in_recovery();" | heroku pg:psql -a crates-io DATABASE
   ```

3. Switch off read-only mode:

   ```
   heroku config:unset -a crates-io READ_ONLY_MODE
   ```

   **WARNING:** the Heroku Dashboard's UI is misleading when removing an
   environment variable. A red badge with a "-" (minus) in it means the
   variable was *successfully removed*; it does not mean that removing the
   variable failed. Failures are indicated by a red badge with an "x" (cross)
   in it.

3. Confirm the application is working by trying to publish a crate and logging
   in.

3. Update the status page and mark the maintenance as completed with this
   message:

   > Scheduled maintenance finished successfully.

   The message is posted at this point, rather than at the end of the
   checklist, because production users are no longer impacted by the
   maintenance.

3. Scale the background worker up again:

   ```
   heroku ps:scale -a crates-io background_worker=1
   ```

3. Confirm the follower database is available:

   ```
   echo "SELECT 1;" | heroku pg:psql -a crates-io READ_ONLY_REPLICA
   ```

3. Enable connections to the follower:

   ```
   heroku config:unset -a crates-io DB_OFFLINE
   ```

3. Re-enable the background job disabled during step 1.

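The order of the commands in the read-only part of the checklist can be rehearsed with a small dry-run script. This is a sketch under stated assumptions, not a replacement for the checklist: the `run`, `not_in_recovery` and `read_only_window` helpers and the `DRY_RUN` variable are hypothetical, the manual checks (status page updates, trying to publish and log in) are left out, and the script does not wait for the maintenance to finish. It assumes `psql` prints booleans as `t` or `f`.

```shell
#!/bin/sh
# Sketch of the read-only window, from enabling read-only mode to disabling
# it again. Commands are only printed unless DRY_RUN=0 is set.
set -eu
APP="crates-io"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Succeeds when the psql output read from stdin contains a row that is
# exactly "f", meaning the server is not in recovery.
not_in_recovery() {
    grep -Eq '^[[:space:]]*f[[:space:]]*$'
}

read_only_window() {
    run heroku config:set -a "$APP" READ_ONLY_MODE=1 DB_OFFLINE=follower
    run heroku pg:maintenance:run --force -a "$APP"
    run heroku pg:info -a "$APP"
    # The recovery check needs real database output, so it is skipped
    # in dry-run mode.
    if [ "${DRY_RUN:-1}" != "1" ]; then
        echo "SELECT pg_is_in_recovery();" \
            | heroku pg:psql -a "$APP" DATABASE \
            | not_in_recovery \
            || { echo "primary still in recovery, keeping read-only mode on" >&2; return 1; }
    fi
    run heroku config:unset -a "$APP" READ_ONLY_MODE
}

read_only_window
```

By default the script only prints the four `heroku` commands prefixed with `+`. Keeping `READ_ONLY_MODE` set when the recovery check fails mirrors the checklist, where read-only mode is switched off only after the primary is confirmed to have recovered.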
# Follower database

Instructions and checklists for follower database maintenance aren't written
yet.