Commit

add broken index backup doc
ykadowak committed Jul 3, 2023
1 parent 3d39dc5 commit e38a1aa
Showing 1 changed file with 62 additions and 0 deletions.
62 changes: 62 additions & 0 deletions docs/user-guides/backup-configuration.md
@@ -247,3 +247,65 @@ Agent Sidecar tries to get the backup file from S3, unpacks it, and starts index

When both the PV and S3 are used, restoration prioritizes the backup file on the PV.
If no backup file exists on the PV, the Vald Agent Sidecar retrieves the backup file from S3 and restores from it.

## Broken index backup

If the backup file of an index is corrupted, the Vald agent fails to load the index file and identifies it as a broken index.

> Possible causes of a broken index include an agent crash during a save-index operation and partial storage corruption.

By default, a broken index is discarded and the Pod continues running. This saves storage space, but you may later need to inspect the contents of a broken index. When the `broken index backup` feature is enabled, the broken index is backed up instead of being deleted before the Pod runs, so you can investigate the cause of the corruption afterwards.

### Settings

To enable this feature, set `agent.ngt.broken_index_history_limit` to 1 or greater (default: 0). The system keeps backups of broken indexes up to the number of generations specified by this setting. When a new backup would exceed that number, the system deletes the oldest backup.

```
agent:
  ngt:
    ...
    broken_index_history_limit: 3
    ...
```
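The rotation described above can be modelled in a few lines. The following is a minimal Python sketch, not the agent's implementation: `rotate_generations` is a hypothetical helper, generations are represented as plain integers (larger means newer), and the behavior for a limit of 0 is an assumption based on the default "discard" behavior.

```python
def rotate_generations(
    existing: list[int], new: int, limit: int
) -> tuple[list[int], list[int]]:
    """Model a keep-at-most-`limit` rotation of broken index backups.

    Returns (kept, removed), both sorted from oldest to newest.
    Hypothetical helper for illustration only.
    """
    if limit <= 0:
        # Assumption: with the feature disabled, the new broken index
        # is discarded and existing backups are left untouched.
        return sorted(existing), [new]

    generations = sorted(existing + [new])
    # Everything older than the newest `limit` generations is removed.
    removed = generations[:-limit] if len(generations) > limit else []
    kept = generations[len(removed):]
    return kept, removed


if __name__ == "__main__":
    # With a limit of 3 and three stored generations, a fourth broken
    # index pushes out the oldest one.
    kept, removed = rotate_generations([1, 2, 3], new=4, limit=3)
    print("kept:", kept)
    print("removed:", removed)
```

A higher limit keeps more history for investigation at the cost of storage, since each generation is a full copy of the broken index.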

### Backup location

The backup is stored under `${index_path}/broken`. Each directory is named with the Unix time, in nanoseconds, at which the agent attempted to read the broken index.

```
${index_path}/
  origin/
    ngt-meta.kvsdb
    ngt-timestamp.kvsdb
    metadata.json
    prf
    grp
    tre
    obj
  broken/
    1611271735938403848/
      ngt-meta.kvsdb
      ...
    1611271749583028942/
      ngt-meta.kvsdb
      ...
    1611271759849304593/
      ngt-meta.kvsdb
      ...
```
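When inspecting these backups, it can help to turn the directory names into readable timestamps. The sketch below is one way to do that in Python, assuming the layout shown above; `generation_time` and `broken_generations` are hypothetical helper names, not part of Vald.

```python
from datetime import datetime, timezone
from pathlib import Path

NANOS_PER_SECOND = 1_000_000_000


def generation_time(dir_name: str) -> datetime:
    """Convert a broken-index directory name (Unix nanoseconds) to UTC.

    Sub-second precision is dropped to keep the conversion exact.
    """
    return datetime.fromtimestamp(int(dir_name) // NANOS_PER_SECOND, tz=timezone.utc)


def broken_generations(index_path: str) -> list[tuple[str, datetime]]:
    """List (directory name, read-attempt time) pairs, oldest first.

    Assumes backups live under `<index_path>/broken` in directories
    whose names consist only of digits, as in the tree above.
    """
    broken = Path(index_path) / "broken"
    if not broken.is_dir():
        return []
    names = sorted(
        (p.name for p in broken.iterdir() if p.is_dir() and p.name.isdigit()),
        key=int,
    )
    return [(name, generation_time(name)) for name in names]


if __name__ == "__main__":
    # "/var/lib/vald-index" is a placeholder; use your own ${index_path}.
    for name, when in broken_generations("/var/lib/vald-index"):
        print(name, when.isoformat())
```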

### Restore

#### CoW: disabled

If an index file exists under `${index_path}/origin`, restore is attempted based on that index file. If the restore fails, the index file is backed up as a broken index and the agent starts in its initial state.

#### CoW: enabled

If an index file exists under `${index_path}/origin`, restore is attempted based on that index file. If the restore fails, `${index_path}/origin` is backed up as a broken index at that point. Then, restore is attempted based on the index file in `${index_path}/backup` (the index file that is one generation older). If that restore also fails, the agent starts in its initial state.
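The two restore paths can be summarized as a small decision function. This is a Python sketch of the flow as described in this section, not the agent's code: `restore_plan` and the `load` callback are hypothetical, the sketch assumes an index file exists under `origin`, and it assumes that a failed restore from `backup` is not itself stored as a broken index (the text above does not say either way).

```python
from typing import Callable


def restore_plan(
    load: Callable[[str], bool], cow_enabled: bool
) -> tuple[str, list[str]]:
    """Return (source the agent ends up with, directories backed up as broken).

    `load(name)` is a hypothetical callback that tries to restore from
    `${index_path}/<name>` and reports whether it succeeded.
    """
    backed_up: list[str] = []

    # Both modes first try the current index under origin/.
    if load("origin"):
        return "origin", backed_up

    # A failed restore from origin/ is kept as a broken index backup.
    backed_up.append("origin")

    # Only with CoW enabled is there an older generation to fall back to.
    if cow_enabled and load("backup"):
        return "backup", backed_up

    # Otherwise the agent starts in its initial state.
    return "initial", backed_up


if __name__ == "__main__":
    # origin/ is corrupted but backup/ is intact, with CoW enabled.
    print(restore_plan(lambda name: name == "backup", cow_enabled=True))
```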

### Metrics

The number of broken index generations currently stored is exposed as the metric `agent_core_ngt_broken_index_store_count`.

Reference: [vald/k8s/metrics/grafana/dashboards/01-vald-agent.yaml](https://github.com/vdaas/vald/blob/main/k8s/metrics/grafana/dashboards/01-vald-agent.yaml)
