diff --git a/documentation/error-codes.md b/documentation/error-codes.md
new file mode 100644
index 00000000..07ef20e7
--- /dev/null
+++ b/documentation/error-codes.md
@@ -0,0 +1,41 @@
+---
+title: QuestDB error codes
+description: Errors encountered in log files
+---
+
+Some of QuestDB's errors have associated codes. We document them here to give
+more context and a starting point for troubleshooting.
+
+# Replication Error Codes
+
+## ER001
+
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+
+## ER002
+
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+
+## ER003
+
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+
+## ER004
+
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
+TODO! TODO! TODO! TODO! TODO!
diff --git a/documentation/guides/replication-disaster-recovery.md b/documentation/guides/replication-disaster-recovery.md
new file mode 100644
index 00000000..8ce87250
--- /dev/null
+++ b/documentation/guides/replication-disaster-recovery.md
@@ -0,0 +1,288 @@
+---
+title: Database replication disaster recovery
+sidebar_label: Replication disaster recovery
+description:
+  Explains the workflows to recover from the failure of a primary
+  or replica instance.
+---
+
+Before we dig in, consider:
+
+- Performing regular [backups](/docs/operations/backup/)
+- How to [enable and set up replication](/docs/operations/replication/)
+- Learning [replication concepts](/docs/concept/replication/)
+- The full
+  [replication configuration options](/docs/configuration/#database-replication)
+
+## Failure Scenarios
+
+Things go wrong. It's a fact of software. This section helps you plan to
+mitigate data loss, and gives you the commands to run to recover from a
+failed database instance and minimise downtime.
+
+Note that the replication features in QuestDB rely on regular scheduled
+[backups](/docs/operations/backup/). We recommend daily backups.
+
+These are the core things that can go wrong and the workflows to handle them:
+
+* **Instance degradation**: Follow the [debug workflow](#debug-workflow) to
+investigate and fix issues such as low disk space, high CPU or memory usage,
+or network issues.
+
+* **Primary instance failure**: Perform a [new primary election recovery](#new-primary-election-recovery)
+to elect a new primary instance and resume availability after a primary
+instance has failed or is otherwise unreachable.
+
+* **Replica instance rebuild**: Perform a [replica instance rebuild](#replica-instance-rebuild)
+to rebuild a replica instance from backup.
+
+* **Data or database corruption**: Restore from backup and perform a
+[point-in-time recovery](#point-in-time-recovery) to partially restore a
+"primary" database up to a specific timestamp.
+
+* **Object store corruption**: Perform a [new object store election](#new-object-store-election)
+to bring a new object store online and resume availability.
+
+## Workflows
+
+### Debug workflow
+
+TODO! TODO! TODO! TODO! TODO!
+
+TODO! TODO! TODO! TODO! TODO!
+
+TODO! TODO! TODO! TODO! TODO!
+
+TODO! TODO! TODO! TODO! TODO!
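+
+While this workflow is still being written up, a sensible first pass is to
+check disk space, memory, and the server log for error-level entries and for
+the `ER` codes documented on the error codes page. A minimal sketch, assuming
+a Linux host, a database root at `/var/lib/questdb`, and a log file at
+`/var/lib/questdb/log/questdb.log` (all of these paths are assumptions;
+adjust them for your install):
+
+```bash
+# Disk space on the volume holding the database root (path is an assumption)
+df -h /var/lib/questdb
+
+# Memory pressure and CPU load at a glance
+free -h
+uptime
+
+# Recent error-level log lines and replication error codes (ER001-ER004).
+# The " E " level marker mirrors the " I " INFO marker shown in the log
+# excerpts below; treat the marker and the log path as assumptions.
+tail -n 1000 /var/lib/questdb/log/questdb.log | grep -E ' E |ER00[0-9]'
+```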
+
+### New primary election recovery
+
+Should a primary instance fail, you can elect an existing replica instance as
+the new primary, or bring up a new primary instance from backup. Either way,
+the process is similar.
+
+The next steps depend on whether you still have access to the existing
+primary instance and can perform a controlled election, or whether the
+primary instance is unavailable and needs to be recovered in an emergency.
+
+#### Non-emergency controlled election
+
+At times, you may need to elect a new primary even if the current primary is
+still available.
+
+Examples of when you might want to do this include:
+
+* Performing a careful QuestDB version upgrade, while keeping the old version
+  available as a fallback.
+
+* Upgrading to new hardware.
+
+* Moving to a different datacenter that is closer to your ingestion clients,
+  for better ingestion throughput and lower latency.
+
+In such cases, proceed with the following steps.
+
+##### Step 1: Stop the primary instance
+
+```bash
+$ ssh existing-primary.example.com
+$ questdb.sh stop
+```
+
+##### Step 2: Complete writes to the object store
+
+The primary writes its WAL data to the configured object store
+asynchronously. You now need to start the database in a special mode that
+completes committing (uploading) the WAL data to the object store. You can do
+this by setting the `QDB_REPLICATION_ROLE` environment variable to
+`primary-catchup-uploads` and then running a regular start.
+The process should exit with a status code of `0` if successful.
+
+```bash
+$ ssh existing-primary.example.com
+$ export QDB_REPLICATION_ROLE="primary-catchup-uploads"
+$ questdb.sh start
+$ echo $? # checking the exit code
+0
+```
+
+In addition to checking the exit code, you should also inspect the logs.
+These should include an INFO (` I `) message indicating that the upload is
+complete:
+
+```log
+2025-01-20T17:45:38.069686Z I qdb_ent::wal::uploader L1633 completed all WAL uploads to the object store as requested by QDB_REPLICATION_ROLE=primary-catchup-uploads.
+```
+
+**Note**: _This step is idempotent. If you're unsure, just start the database
+again with the same `QDB_REPLICATION_ROLE` environment variable set._
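+
+If you script this step, it helps to fail loudly when either success signal
+is missing. A minimal sketch, assuming the same commands as above and a log
+file at `/var/lib/questdb/log/questdb.log` (the log path is an assumption;
+adjust it for your install):
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Run the catch-up upload; with `set -e`, a non-zero exit aborts the script.
+export QDB_REPLICATION_ROLE="primary-catchup-uploads"
+questdb.sh start
+
+# Double-check the log for the completion message quoted above.
+if grep -q "completed all WAL uploads to the object store" \
+    /var/lib/questdb/log/questdb.log
+then
+  echo "WAL upload catch-up complete"
+else
+  echo "completion message not found in log; do not proceed" >&2
+  exit 1
+fi
+```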
+
+#### Emergency lossy election
+
+If instead the primary instance is unavailable (offline, irrecoverably
+crashed or otherwise unreachable), you should be aware that there is a risk
+of data loss.
+
+Primary instances upload data to the object store asynchronously, after first
+committing the WAL data to disk. You can continue electing a new primary, but
+any data that was not uploaded to the object store will be lost and cannot be
+recovered at a later point in time.
+
+It is difficult to predict how much data will be lost, but it is likely to be
+on the order of seconds to minutes, depending on the degradation of the
+primary instance and its network access at the time of failure.
+
+You now need to decide if you want to bring up a new primary database
+instance from backup, or elect an existing replica as the new primary.
+
+#### Restoring a new primary from backup
+
+If you would rather bring up a new primary instance from backup, first
+restore a [backup](/docs/operations/backup/) onto new hardware.
+
+The backup should be as fresh as possible to minimise start-up time.
+
+At this stage, keep the database instance offline and follow the same steps
+as if you were electing an existing replica as the new primary. In other
+words, continue with "Step 2: Reconfigure as primary".
+
+#### Electing an existing replica as the new primary
+
+The fastest way to resume availability is to elect an existing replica as the
+new primary. This operation ensures that the new primary is up to date with
+the object store, so there is no need to wait.
+
+##### Step 1: Stop the replica instance
+
+Stop the replica instance you want to promote to primary:
+
+```bash
+$ ssh existing-db.example.com
+$ questdb.sh stop
+```
+
+##### Step 2: Reconfigure as primary
+
+Reconfigure the database instance as primary:
+
+```bash
+$ ssh existing-db.example.com
+$ vim path/to/db/conf/server.conf
+```
+
+Edit the config, changing `replication.role` so it reads:
+
+```
+replication.role=primary
+```
+
+Save and exit the config file.
+
+##### Step 3: Start with the new primary election recovery mode
+
+Restart the database in the "new primary election recovery" mode:
+
+```bash
+$ ssh existing-db.example.com
+$ export QDB_RECOVERY_TIMESTAMP="latest"
+$ questdb.sh start
+```
+
+This process performs the following steps:
+
+* Ensures that the instance is up to date with the object store.
+
+* Waits a period of time to check whether other primary instances are still
+running.
+
+* If no other primary instances are found, the instance is elected as the new
+primary, adopting the state in the object store.
+
+* The instance then continues running as a "primary" instance; no restart is
+required.
+
+#### Future restarts
+
+It is important that the `QDB_RECOVERY_TIMESTAMP` environment variable is
+only ever used during a new primary election recovery. It should not be used
+during normal restarts.
+
+After a successful re-election, restart as normal:
+
+```bash
+$ ssh primary.example.com
+$ questdb.sh stop
+$ questdb.sh start # no `QDB_RECOVERY_TIMESTAMP` environment variable
+```
+
+### Replica instance rebuild
+
+If a replica instance fails:
+
+* Delete all its state.
+* Rebuild it from the most recent [backup](/docs/operations/backup/) of the
+  primary.
+* Edit `server.conf` to set `replication.role=replica`.
+* Start it as normal.
+
+So long as the `replication.object.store` config is the same as the
+primary's, the instance should catch up to the primary and resume
+replication, as sketched below.
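+
+A minimal sketch of the rebuild, assuming the `questdb.sh` commands used
+elsewhere in this guide, a database root at `/var/lib/questdb`, and a
+restored backup staged at `/path/to/backup` (both paths are assumptions; use
+your own layout and backup restore tool):
+
+```bash
+$ ssh replica.example.com
+$ questdb.sh stop                          # stop the failed replica
+$ rm -rf /var/lib/questdb                  # delete all of its state
+$ cp -r /path/to/backup /var/lib/questdb   # restore the latest primary backup
+$ vim /var/lib/questdb/conf/server.conf    # set replication.role=replica
+$ questdb.sh start                         # catches up via the object store
+```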
+
+### Point-in-time recovery
+
+To avoid things going very wrong, you should perform regular
+[backups](/docs/operations/backup/).
+
+A typical schedule is to back up daily, for example at midnight.
+
+This leaves a window of time in which data could be lost. If the instance
+you're recovering is additionally set up as a "primary" instance
+(`replication.role=primary` in `server.conf`), then you can perform a
+point-in-time recovery immediately after recovering from a backup.
+
+A point-in-time recovery will not recover the database to its latest state;
+for that, you should perform a [new primary election recovery](#new-primary-election-recovery).
+Instead, it recovers up to a specified timestamp. This is useful if you need
+to recover to a last known good state, or to a state before a specific event.
+
+#### Step 1: Restore the backup
+
+Find the backup that is immediately before the point in time you wish to
+restore. For example, if you need to restore to 5pm on the 20th of January
+2025 and you perform daily backups at 00:01am, you should restore the backup
+from the 20th of January 2025.
+
+```bash
+$ ssh primary.example.com
+$ cp -r /path/to/backup /path/to/primary # or your backup restore tool
+```
+
+#### Step 2: Reconfigure replication
+
+At this point, you need to decide how to reconfigure the replication setup.
+
+This depends on why you are performing a point-in-time recovery.
+
+* If you want to recover a database and make it a primary:
+  * Edit `server.conf` and set `replication.role=primary`.
+  * Edit the `replication.object.store` to point to a new, empty object store
+    location. It needs to be different from the original object store
+    location, otherwise the database would attempt to overwrite it before its
+    failsafe checks are triggered.
+
+* If you are recovering for testing purposes and want to disable replication,
+simply remove all the `replication.*` configuration from `server.conf`.
+
+#### Step 3: Start with point-in-time recovery
+
+Start the database with the original object store location and the recovery
+timestamp set:
+
+```bash
+$ export QDB_RECOVERY_OBJECT_STORE="s3://my-bucket/path/to/primary"
+$ export QDB_RECOVERY_TIMESTAMP="2025-01-20T17:45:38.069686Z"
+$ questdb.sh start
+```
+
+The `QDB_RECOVERY_TIMESTAMP` environment variable is a UTC timestamp in the
+format `YYYY-MM-DDTHH:MM:SS.ssssssZ`. It will recover the database from the
+state present after the backup restore, applying transactions from the WAL
+data taken from the specified recovery object store for all WAL tables.
+
+Note that it will only work for WAL tables.
+
+Do not use the `QDB_RECOVERY_TIMESTAMP` environment variable during normal
+restarts.
+
+Limitations:
+*
\ No newline at end of file
diff --git a/documentation/operations/backup.md b/documentation/operations/backup.md
index 2ed557ae..71bf8bb9 100644
--- a/documentation/operations/backup.md
+++ b/documentation/operations/backup.md
@@ -7,7 +7,7 @@ description:
 
 You should back up QuestDB to be prepared for the case where your original
 database or data is lost, or if your database or table is corrupted. The backup
-& restore process also speeds up the creation of
+& restore process is also necessary to create
 [replica instances](/docs/operations/replication/) in QuestDB Enterprise.
 
 ## Overview
diff --git a/documentation/sidebars.js b/documentation/sidebars.js
index b45c1845..ae9fa0a2 100644
--- a/documentation/sidebars.js
+++ b/documentation/sidebars.js
@@ -400,6 +400,7 @@ module.exports = {
         "guides/import-csv",
         "guides/modifying-data",
         "guides/replication-tuning",
+        "guides/replication-disaster-recovery",
         "guides/working-with-timestamps-timezones",
         "web-console",
         {
@@ -483,6 +484,7 @@
       items: [
         "troubleshooting/faq",
         "troubleshooting/os-error-codes",
+        "error-codes",
       ],
     },
   ].filter(Boolean),