Auto-restore occurring even with existing data #1561

Closed
jtackaberry opened this issue Jan 4, 2024 · 19 comments · Fixed by #1564

Comments

@jtackaberry
Contributor

What version are you running?

8.14.1

Are you using Docker or Kubernetes to run your system?

K8s

Are you running a single node or a cluster?

Single node

What did you do?

Configured both auto-backup (aggressive 15-second interval for testing purposes) and auto-restore. I observed that when restarting a pod with the underlying storage intact between restarts, rqlite is auto-restoring from cloud backup:

[rqlited] 2024/01/04 03:01:50 preexisting node state detected in /rqlite
[rqlited] 2024/01/04 03:01:50 auto-restore requested, initiating download
[rqlited] 2024/01/04 03:01:50 auto-restore file downloaded in 192.145381ms

Here it detects preexisting node state but auto-restores anyway, despite the fact that the docs say:

Note that rqlite will only do this if the node has no pre-existing data, and is not already part of a cluster. If either of these conditions is true, any request to automatically restore will be ignored.

I confirmed this by inserting a row into a table, restarting the pod, and then selecting from that table: the row I had inserted was missing.

rqlite logs attached:

rqlite-0.log

What did you expect to happen?

Not to auto-restore from cloud backup given that existing data is present.

Please include the Status, Nodes, and Expvar output from each node (or at least the Leader!)

127.0.0.1:4001> .status
disco:
  last_addresses: [[2001:470:b0d6:fd:ce70:bd4e:b908:3b65]:4002 10.3.59.101:4002]
  last_contact: 2024-01-04T03:01:50.797492401Z
  mode: dns-srv
  name: rqlite-headless
  service: raft
  dns_name:: _raft._tcp.rqlite-headless.
http:
  auth: enabled
  bind_addr: [::]:4001
  cluster:
    conn_pool_stats:
      10.3.59.101:4002:
        open_connections: 4
        idle: 4
        max_open_connections: 64
      [2001:470:b0d6:fd:ce70:bd4e:b908:3b65]:4002:
        idle: 4
        max_open_connections: 64
        open_connections: 4
    local_node_addr: rqlite-0.rqlite-headless.data.svc.cluster.local:4002
    timeout: 30s
  queue:
    _default:
      batch_size: 128
      max_size: 1024
      sequence_number: 0
      timeout: 50ms
  tls:
    enabled: false
node:
  current_time: 2024-01-04T03:06:48.398252944Z
  start_time: 2024-01-04T03:01:50.725667481Z
  uptime: 4m57.672586155s
auto_backups:
  upload_interval: 15s
  compress: true
  last_upload_duration: 184.870449ms
  last_upload_sum: 11c71630b4f55453ed241aa0e81c83f0e10b9587ef383dd8ea8e124493d81129
  last_upload_time: 2024-01-04T03:02:08Z
  upload_destination: s3://rqlite-kkx5q9/backups/test.sqlite.gz
build:
  compiler: gc
  version: v8.14.1
  branch: master
  build_time: 2023-12-31T15:32:09-0500
  commit: 4b27607127c6e799fb14a1cc71c81116239595d6
cluster:
  addr: rqlite-0.rqlite-headless.data.svc.cluster.local:4002
  api_addr: rqlite-0.rqlite-headless.data.svc.cluster.local:4001
  https: false
network:
  interfaces:
    eth0:
      flags: up|broadcast|multicast|running
      hardware_address: be:50:2f:5e:9c:c2
      addresses: [map[address:10.3.59.101/32] map[address:2001:470:b0d6:fd:ce70:bd4e:b908:3b65/128] map[address:fe80::bc50:2fff:fe5e:9cc2/64]]
    lo:
      addresses: [map[address:127.0.0.1/8] map[address:::1/128]]
      flags: up|loopback|running
      hardware_address:
os:
  executable: /bin/rqlited
  hostname: rqlite-0
  page_size: 4096
  pid: 1
  ppid: 0
runtime:
  version: go1.21.5
  GOARCH: amd64
  GOMAXPROCS: 10
  GOOS: linux
  num_cpu: 10
  num_goroutine: 31
store:
  raft:
    log_size: 65536
    protocol_version_max: 3
    snapshot_version_min: 0
    term: 34
    last_snapshot_index: 2
    commit_index: 23
    last_log_index: 23
    last_snapshot_term: 5
    latest_configuration_index: 0
    state: Leader
    voter: true
    applied_index: 23
    num_peers: 0
    fsm_pending: 0
    last_contact: 0
    last_log_term: 34
    latest_configuration: [{Suffrage:Voter ID:rqlite-0 Address:rqlite-0.rqlite-headless.data.svc.cluster.local:4002}]
    protocol_version: 3
    protocol_version_min: 0
    snapshot_version_max: 1
    bolt:
      TxStats:
        Rebalance: 0
        Split: 0
        Spill: 9
        PageCount: 17
        NodeCount: 14
        NodeDeref: 0
        SpillTime: 86423
        Write: 25
        WriteTime: 87875630
        PageAlloc: 69632
        CursorCount: 273
        RebalanceTime: 0
      FreePageN: 1
      PendingPageN: 2
      FreeAlloc: 12288
      FreelistInuse: 40
      TxN: 124
      OpenTxN: 0
  dir_size: 110824
  no_freelist_sync: false
  node_id: rqlite-0
  last_applied_index: 23
  snapshot_store:
    snapshots: [5-2-1704334267679]
    db_path: /rqlite/rsnapshots/5-2-1704334267679.db
    dir: /rqlite/rsnapshots
  dir: /rqlite
  fsm_index: 23
  heartbeat_timeout: 1s
  open: true
  ready: true
  reap_timeout: 0s
  election_timeout: 1s
  nodes: [map[addr:rqlite-0.rqlite-headless.data.svc.cluster.local:4002 id:rqlite-0 suffrage:Voter]]
  reap_read_only_timeout: 0s
  dir_size_friendly: 111 kB
  snapshot_interval: 10s
  trailing_logs: 10240
  addr: rqlite-0.rqlite-headless.data.svc.cluster.local:4002
  apply_timeout: 10s
  leader:
    node_id: rqlite-0
    addr: rqlite-0.rqlite-headless.data.svc.cluster.local:4002
  observer:
    dropped: 0
    observed: 1
  snapshot_threshold: 8192
  sqlite3:
    rw_dsn: file:/rqlite/db.sqlite?_fk=false
    conn_pool_stats:
      ro:
        max_open_connections: 0
        wait_count: 0
        max_idle_closed: 0
        max_lifetime_closed: 0
        open_connections: 1
        in_use: 0
        idle: 1
        wait_duration: 0
        max_idle_time_closed: 0
      rw:
        open_connections: 1
        wait_count: 0
        wait_duration: 0
        max_idle_closed: 0
        max_lifetime_closed: 0
        max_open_connections: 1
        in_use: 0
        idle: 1
        max_idle_time_closed: 0
    db_size_friendly: 8.2 kB
    mem_stats:
      cache_size: -2000
      freelist_count: 0
      hard_heap_limit: 0
      max_page_count: 1073741823
      page_count: 2
      page_size: 4096
      soft_heap_limit: 0
    pragmas:
      ro:
        foreign_keys: 0
        journal_mode: wal
        synchronous: 1
        wal_autocheckpoint: 1000
      rw:
        foreign_keys: 0
        journal_mode: wal
        synchronous: 0
        wal_autocheckpoint: 0
    size: 8192
    version: 3.44.0
    wal_size: 0
    compile_options: [ATOMIC_INTRINSICS=1 COMPILER=gcc-9.4.0 DEFAULT_AUTOVACUUM DEFAULT_CACHE_SIZE=-2000 DEFAULT_FILE_FORMAT=4 DEFAULT_JOURNAL_SIZE_LIMIT=-1 DEFAULT_MMAP_SIZE=0 DEFAULT_PAGE_SIZE=4096 DEFAULT_PCACHE_INITSZ=20 DEFAULT_RECURSIVE_TRIGGERS DEFAULT_SECTOR_SIZE=4096 DEFAULT_SYNCHRONOUS=2 DEFAULT_WAL_AUTOCHECKPOINT=1000 DEFAULT_WAL_SYNCHRONOUS=1 DEFAULT_WORKER_THREADS=0 ENABLE_DBSTAT_VTAB ENABLE_FTS3 ENABLE_FTS3_PARENTHESIS ENABLE_FTS5 ENABLE_RTREE ENABLE_UPDATE_DELETE_LIMIT MALLOC_SOFT_LIMIT=1024 MAX_ATTACHED=10 MAX_COLUMN=2000 MAX_COMPOUND_SELECT=500 MAX_DEFAULT_PAGE_SIZE=8192 MAX_EXPR_DEPTH=1000 MAX_FUNCTION_ARG=127 MAX_LENGTH=1000000000 MAX_LIKE_PATTERN_LENGTH=50000 MAX_MMAP_SIZE=0x7fff0000 MAX_PAGE_COUNT=1073741823 MAX_PAGE_SIZE=65536 MAX_SQL_LENGTH=1000000000 MAX_TRIGGER_DEPTH=1000 MAX_VARIABLE_NUMBER=32766 MAX_VDBE_OP=250000000 MAX_WORKER_THREADS=8 MUTEX_PTHREADS OMIT_DEPRECATED OMIT_LOAD_EXTENSION OMIT_SHARED_CACHE SYSTEM_MALLOC TEMP_STORE=1 THREADSAFE=1]
    db_size: 8192
    path: /rqlite/db.sqlite
    ro_dsn: file:/rqlite/db.sqlite?mode=ro&_fk=false
  db_applied_index: 23
  db_conf:
    fk_constraints: false
  request_marshaler:
    compression_batch: 50
    compression_size: 1024
    force_compression: false
127.0.0.1:4001> .nodes
rqlite-0:
  id: rqlite-0
  api_addr: http://rqlite-0.rqlite-headless.data.svc.cluster.local:4001
  addr: rqlite-0.rqlite-headless.data.svc.cluster.local:4002
  voter: true
  reachable: true
  leader: true
  time: 7.71e-7
  time_s: 1.252µs
127.0.0.1:4001> .expvar
memstats:
  TotalAlloc: 32002512
  Frees: 42461
  HeapSys: 20119552
  MCacheInuse: 12000
  BuckHashSys: 1450752
  Sys: 28677384
  PauseEnd: [1704337310529900477 1704337310608595122 1704337310800953468 1704337327886187030 1704337327889970609 1704337448369794381 1704337492886789392 1704337552888033678 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
  EnableGC: true
  DebugGC: false
  HeapIdle: 10756096
  HeapObjects: 27607
  GCSys: 4345728
  Lookups: 0
  Mallocs: 70068
  MSpanInuse: 198576
  GCCPUFraction: 0.000016244763177819158
  Alloc: 7703472
  HeapAlloc: 7703472
  HeapInuse: 9363456
  HeapReleased: 9904128
  NextGC: 9202448
  StackSys: 851968
  PauseTotalNs: 883336
  PauseNs: [67427 101692 227382 100341 61476 130157 72347 122514 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
  NumGC: 8
  NumForcedGC: 0
  StackInuse: 851968
  MSpanSys: 211848
  OtherSys: 1681936
  MCacheSys: 15600
  LastGC: 1704337552888033678
  BySize: [map[Frees:0 Mallocs:0 Size:0] map[Frees:702 Mallocs:2119 Size:8] map[Frees:9861 Mallocs:16529 Size:16] map[Frees:3974 Mallocs:5527 Size:24] map[Frees:6595 Mallocs:7252 Size:32] map[Frees:6315 Mallocs:8817 Size:48] map[Frees:1937 Mallocs:2850 Size:64] map[Frees:2371 Mallocs:2985 Size:80] map[Frees:1797 Mallocs:2434 Size:96] map[Frees:845 Mallocs:1031 Size:112] map[Frees:395 Mallocs:614 Size:128] map[Frees:272 Mallocs:9532 Size:144] map[Frees:685 Mallocs:816 Size:160] map[Frees:95 Mallocs:278 Size:176] map[Frees:240 Mallocs:302 Size:192] map[Frees:276 Mallocs:660 Size:208] map[Frees:455 Mallocs:505 Size:224] map[Frees:23 Mallocs:31 Size:240] map[Frees:115 Mallocs:210 Size:256] map[Frees:45 Mallocs:719 Size:288] map[Frees:353 Mallocs:396 Size:320] map[Frees:176 Mallocs:371 Size:352] map[Frees:19 Mallocs:100 Size:384] map[Frees:25 Mallocs:172 Size:416] map[Frees:31 Mallocs:67 Size:448] map[Frees:24 Mallocs:30 Size:480] map[Frees:36 Mallocs:87 Size:512] map[Frees:182 Mallocs:271 Size:576] map[Frees:23 Mallocs:54 Size:640] map[Frees:26 Mallocs:49 Size:704] map[Frees:5 Mallocs:27 Size:768] map[Frees:48 Mallocs:87 Size:896] map[Frees:25 Mallocs:88 Size:1024] map[Frees:59 Mallocs:189 Size:1152] map[Frees:87 Mallocs:126 Size:1280] map[Frees:319 Mallocs:348 Size:1408] map[Frees:31 Mallocs:64 Size:1536] map[Frees:16 Mallocs:48 Size:1792] map[Frees:15 Mallocs:23 Size:2048] map[Frees:51 Mallocs:197 Size:2304] map[Frees:21 Mallocs:27 Size:2688] map[Frees:7 Mallocs:14 Size:3072] map[Frees:0 Mallocs:1 Size:3200] map[Frees:0 Mallocs:5 Size:3456] map[Frees:41 Mallocs:70 Size:4096] map[Frees:19 Mallocs:43 Size:4864] map[Frees:10 Mallocs:19 Size:5376] map[Frees:3 Mallocs:11 Size:6144] map[Frees:0 Mallocs:0 Size:6528] map[Frees:0 Mallocs:2 Size:6784] map[Frees:6 Mallocs:6 Size:6912] map[Frees:7 Mallocs:23 Size:8192] map[Frees:10 Mallocs:21 Size:9472] map[Frees:0 Mallocs:0 Size:9728] map[Frees:0 Mallocs:0 Size:10240] map[Frees:0 Mallocs:1 Size:10880] map[Frees:2 Mallocs:3 Size:12288] map[Frees:0 Mallocs:1 Size:13568] map[Frees:0 Mallocs:0 Size:14336] map[Frees:0 Mallocs:4 Size:16384] map[Frees:1 Mallocs:1 Size:18432]]
proto:
  num_compressed_requests: 0
  num_compression_misses: 0
  num_precompressed_bytes: 0
  num_requests: 0
  num_uncompressed_bytes: 0
  num_uncompressed_requests: 0
  num_compressed_bytes: 0
snapshot:
  latest_persist_duration: 0
  latest_persist_size: 0
  snapshots_reaped: 0
  upgrade_fail: 0
  upgrade_ok: 0
cluster:
  num_get_node_api_req_local: 62
  num_notify_req: 2
  num_query_req: 0
  num_remove_node_req: 0
  num_client_retries: 0
  num_execute_req: 0
  num_get_node_api_req: 0
  num_load_req: 0
  num_request_req: 0
  num_backup_req: 0
  num_get_node_api_resp: 0
  num_join_req: 2
cmdline: [/bin/rqlited -node-id rqlite-0 -http-addr 0.0.0.0:4001 -http-adv-addr rqlite-0.rqlite-headless.data.svc.cluster.local:4001 -raft-addr 0.0.0.0:4002 -raft-adv-addr rqlite-0.rqlite-headless.data.svc.cluster.local:4002 -disco-mode=dns-srv -disco-config={"name":"rqlite-headless","service":"raft"} -bootstrap-expect=1 -auto-backup=/config/sensitive/backup.json -auto-restore=/config/sensitive/restore.json -join-interval=1s -join-attempts=120 -raft-shutdown-stepdown /rqlite]
db:
  checkpointed_pages: 0
  queries: 28
  query_transactions: 0
  requests: 0
  checkpoint_duration_ns: 0
  checkpointed_moves: 0
  execute_transactions: 0
  executions: 28
  open_duration_ms: 0
  checkpoint_errors: 0
  execution_errors: 1
  query_errors: 0
  request_transactions: 0
  checkpoints: 0
queue:
  num_flush: 0
  num_timeout: 0
  statements_rx: 0
  statements_tx: 0
store:
  leader_changes_dropped: 0
  nodes_reaped_failed: 0
  nodes_reaped_ok: 0
  num_restores: 1
  failed_heartbeat_observed: 0
  num_auto_restores_skipped: 0
  num_boots: 0
  num_provides: 19
  num_snapshots: 0
  snapshot_create_duration: 0
  num_snapshots_full: 0
  leader_changes_observed: 0
  num_compressed_commands: 0
  num_ignored_joins: 0
  num_joins: 0
  num_loads: 0
  num_removed_before_joins: 0
  num_snapshots_failed: 0
  num_auto_restores: 1
  num_db_stats_errors: 0
  num_recoveries: 0
  num_restores_failed: 0
  num_uncompressed_commands: 0
  num_backups: 0
  num_user_snapshots_failed: 0
  num_wal_snapshots: 0
  num_user_snapshots: 0
  num_wal_snapshots_failed: 0
  snapshot_persist_duration: 0
  snapshot_precompact_wal_size: 0
  num_auto_restores_failed: 0
  num_snapshots_incremental: 0
  snapshot_wal_size: 0
uploader:
  total_upload_bytes: 233
  last_upload_bytes: 233
  num_uploads_fail: 0
  num_uploads_ok: 1
  num_uploads_skipped: 18
downloader:
  download_bytes: 233
  num_downloads_fail: 0
  num_downloads_ok: 1
http:
  remote_requests: 0
  remote_requests_failed: 0
  requests: 0
  num_status: 3
  execute_stmts_rx: 0
  query_stmts_rx: 1
  queued_executions_num_stmts_rx: 0
  queued_executions_wait: 0
  remote_executions: 0
  remote_loads: 0
  authOK: 68
  queued_executions: 0
  queued_executions_failed: 0
  queued_executions_no_leader: 0
  remote_executions_failed: 0
  remote_remove_node: 0
  authFail: 0
  executions: 0
  loads_aborted: 0
  queued_executions_not_leader: 0
  queued_executions_ok: 0
  backups: 0
  boot: 0
  num_readyz: 60
  queued_executions_num_stmts_tx: 0
  remote_queries: 0
  remote_queries_failed: 0
  request_stmts_rx: 0
  leader_not_found: 0
  loads: 0
  queries: 1
  queued_executions_leadership_lost: 0
  queued_executions_unknown_error: 0
  remote_backups: 0
mux:
  num_connections_handled: 8
  num_unregistered_handlers: 0
@otoolep
Member

otoolep commented Jan 4, 2024

Thanks for the report -- it may be a documentation bug; I'll need to think about the right behaviour here. I remember discussing this feature at the time I implemented it and making a specific choice, for a specific reason. I just can't remember why the code works the way it does. :-)

@jtackaberry
Contributor Author

Ack, thanks @otoolep. I'd be interested in the use case for this behavior if it's intentional. It eludes me :)

@otoolep
Member

otoolep commented Jan 4, 2024

Agreed, it does seem like a bug. However, the docs are not right either, as this feature is meant to work with clusters. There should be nothing stopping a 3-node cluster from doing an auto-restore (the code as it exists today specifically handles clustered systems correctly). But the question is whether this flag should be ignored if at least one write has been sent to the system at any point in the past. I think the answer is yes, but I need to think about it.

@otoolep
Member

otoolep commented Jan 4, 2024

Did some digging and connected with the person who requested this feature. This looks like a bug. I'll fix it within the next day or so and release 8.15. The docs will need updating too.

I do not consider this a breaking change, it's a bug fix.

@otoolep
Member

otoolep commented Jan 4, 2024

The docs will need fixing too.

@jtackaberry
Contributor Author

The docs will need fixing too.

So the intended behavior is different from both the current code and the docs? How should it work? Just wondering how this will fit in the context of the Helm chart (where I am working on adding direct support for auto backup/restore).

@otoolep
Member

otoolep commented Jan 4, 2024

The original implementation should have worked like this:

  1. Node is started. It may, or may not, be part of a cluster. That doesn't matter.

  2. If autorestore is enabled on a given node, the data is downloaded.

  3. When a node detects a leadership change for the first time since start-up, it does the following:

    1. If the node is the leader, check if there is any existing SQLite data. If there is, do not install the downloaded data and terminate the auto-restore process.
    2. If the node is not the leader, don't do anything. Terminate the autorestore process.

However, as part of fixing this bug, I will look into modifying these steps so the download only happens at stage 3 if it's actually needed. The goal of downloading at start-up was to make installation robust -- we want the data installed very quickly after the system becomes the Leader. That said, it's pretty wasteful of network resources, since installation is a one-off.
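
To make step 3 concrete, here is a minimal Go sketch of that decision. The `node` interface and its methods are invented purely for illustration; they are not rqlite's actual types.

```go
package restore

// node is a hypothetical, simplified view of an rqlite node, defined only
// for this sketch; rqlite's real Store type looks different.
type node interface {
	IsLeader() bool
	HasExistingData() (bool, error)
	InstallRestore(path string) error
}

// maybeAutoRestore captures step 3 above: do nothing on followers, refuse
// to overwrite pre-existing data on the leader, and otherwise install the
// downloaded database.
func maybeAutoRestore(n node, downloadedDB string) error {
	if !n.IsLeader() {
		// Step 3.2: not the leader -- abandon the auto-restore.
		return nil
	}
	hasData, err := n.HasExistingData()
	if err != nil {
		return err
	}
	if hasData {
		// Step 3.1: existing SQLite data -- never install over it.
		return nil
	}
	return n.InstallRestore(downloadedDB)
}
```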

@jtackaberry
Contributor Author

Yeah, deferring the download until it's known to be necessary makes sense to me, especially since, as you say, the data size could be large, and elections usually complete fairly quickly.

So with this change, it sounds like it should be feasible to always run with auto-restore enabled as part of normal operation.

Will the nodes be returning 200 on /readyz before the backup is restored, or is the download+restore considered in-band with the bootstrap process, such that the node isn't reporting ready at that point? And if the former, will rqlite reject queries with HTTP 503 until the restore completes?

Just wondering what the client experience would be with the Helm chart at this point: whether they'll get connection timeouts (because no pods in the rqlite cluster are passing readiness probes yet), or HTTP errors, or (worse) whether queries/executes will actually succeed before the restore completes.

@otoolep
Member

otoolep commented Jan 4, 2024

So with this change, it sounds like it should be feasible to always run with auto-restore enabled as part of normal operation.

Yes, the goal is that the flag is safe to leave set, knowing it will be ignored if it would stomp on existing data (which is the problem today).

Will the nodes be returning 200 on /readyz before the backup is restored, or is the download+restore considered in-band with the bootstrap process, such that the node isn't reporting ready at that point?

Auto-restore is performed using the normal "write" operations, so the system must be ready before it can proceed. I think, therefore, it will report "ready" before the restore completes. Writing data to rqlite while a restore is in progress currently results in undefined behavior. I can probably change that if it makes sense, i.e. redefine "ready".

@jtackaberry
Contributor Author

jtackaberry commented Jan 4, 2024

Writing data to rqlite while a restore is in progress currently results in undefined behavior. I can probably change that if it makes sense, i.e. redefine "ready".

Which means that clients waiting for the cluster to come back up, retrying their transactions, could actually break the restoration process, if e.g. the restore tries to insert into a table that now violates a (say) unique constraint because the client slipped something in first.

Yeah, I'd revise "ready" to not pass until restoration completes. This will insulate rqlite from client access (at least for K8s or other deployments that use a reverse proxy that checks /readyz) until the data is restored.
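
As a rough illustration of that suggestion, /readyz could simply gate on a restore-complete flag. Everything below (the flag, the handler, the response body) is a made-up sketch, not rqlite's actual readiness handler.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// restoreDone would be set by the restore path once the downloaded
// database has been applied. The flag and handler are illustrative only.
var restoreDone atomic.Bool

// readyzHandler refuses to report ready while an auto-restore is still in
// flight; a real handler would keep its existing checks as well.
func readyzHandler(w http.ResponseWriter, r *http.Request) {
	if !restoreDone.Load() {
		http.Error(w, "auto-restore in progress", http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/readyz", readyzHandler)
	log.Fatal(http.ListenAndServe(":4001", nil))
}
```

With something like that in place, a Kubernetes readiness probe pointed at /readyz would keep client traffic away from the node until the restore has been applied.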

@otoolep
Member

otoolep commented Jan 4, 2024

the restore tries to insert into a table that now violates a (say) unique constraint because the client slipped something in first.

You'd probably be interested in knowing that that couldn't happen. That's because the restore just stomps on whatever SQLite file it finds there, completely replacing it with the file it downloaded from S3. But yes, the experience is not good either way and enhancing the definition of /ready makes sense.

@jtackaberry
Contributor Author

That's because the restore just stomps on whatever SQLite file it finds there, completely replacing it with the file it downloaded from S3.

Ah, that's an interesting tidbit, yeah. I'd misunderstood what you meant when you said restore uses normal write operations.

@otoolep
Member

otoolep commented Jan 4, 2024

By "normal" I mean it sends the entire SQLite database (compressed) through the Raft log (as a single log entry), so it gets distributed to every node in the cluster, but when applied to the database it doesn't "execute" any SQL commands, it just decompresses the data and writes it over any existing SQLite file.

@jtackaberry
Contributor Author

jtackaberry commented Jan 5, 2024

Only loosely related, but I'm noticing rqlite always uploads on startup even if the data hasn't changed since the last upload, despite the docs saying:

and will not upload a backup if the SQLite database hasn't changed since the last upload took place

So there's another optimization opportunity here, but it may be tricky. It'd be easy enough to update StorageClient (and the S3 implementation of that interface) to include a function GetObjectTimestamp() or some such to get the mtime from S3, but I guess the challenging part is getting the mtime from SQLite to have something to compare it with?

If that were possible -- i.e. if SQLite could tell us when the last mutation occurred -- we could skip both the upload on startup and the full database dump + compress needed to compute the hash every interval.
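
To sketch what that interface change might look like -- the method name and surrounding types below are just the suggestion, not an existing rqlite API -- the S3 side could use HeadObject, which returns the object's Last-Modified time without downloading anything:

```go
package backup

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// TimestampedStorageClient is the suggested extension; hypothetical, not
// an interface that exists in rqlite today.
type TimestampedStorageClient interface {
	GetObjectTimestamp(ctx context.Context) (time.Time, error)
}

// S3Client is a hypothetical S3-backed implementation.
type S3Client struct {
	client *s3.Client
	bucket string
	key    string
}

// GetObjectTimestamp returns the Last-Modified time of the backup object
// without transferring its contents.
func (c *S3Client) GetObjectTimestamp(ctx context.Context) (time.Time, error) {
	out, err := c.client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket: aws.String(c.bucket),
		Key:    aws.String(c.key),
	})
	if err != nil {
		return time.Time{}, err
	}
	return aws.ToTime(out.LastModified), nil
}
```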

@otoolep
Member

otoolep commented Jan 5, 2024

Only loosely related, but I'm noticing rqlite always uploads on startup even if the data hasn't changed since the last upload, despite the docs saying:

Good point, but that's because it stores the hash of the upload in RAM, which is lost, of course, on restart. I could store it on disk perhaps, best effort. Hadn't occurred to me!

https://github.com/rqlite/rqlite/blob/master/auto/backup/uploader.go#L170
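
A best-effort, on-disk version of that could be as small as the sketch below; the file name and helper functions are made up for illustration.

```go
package uploader

import (
	"encoding/hex"
	"os"
	"path/filepath"
)

// sumFile is a made-up file name used only for this sketch.
const sumFile = "last_upload_sum"

// saveLastUploadSum persists the sum of the last upload, best effort:
// errors are ignored because losing the sum only costs one redundant
// upload after the next restart.
func saveLastUploadSum(dataDir string, sum []byte) {
	_ = os.WriteFile(filepath.Join(dataDir, sumFile),
		[]byte(hex.EncodeToString(sum)), 0o600)
}

// loadLastUploadSum reads the persisted sum back, reporting whether a
// usable value was found.
func loadLastUploadSum(dataDir string) ([]byte, bool) {
	b, err := os.ReadFile(filepath.Join(dataDir, sumFile))
	if err != nil {
		return nil, false
	}
	sum, err := hex.DecodeString(string(b))
	if err != nil || len(sum) == 0 {
		return nil, false
	}
	return sum, true
}
```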

@jtackaberry
Contributor Author

I could store it on disk perhaps, best effort. Hadn't occurred to me!

The main drawback with that is that the next restart will probably have a different leader which won't have the stored hash. Most robust option would be to lean on the storage provider to hold that state.

@otoolep otoolep linked a pull request Jan 5, 2024 that will close this issue
@otoolep
Member

otoolep commented Jan 5, 2024

The main drawback with that is that the next restart will probably have a different leader which won't have the stored hash.

True, that's probably why I didn't bother.

Most robust option would be to lean on the storage provider to hold that state.

It also introduces a dependency on an external system which I'm not too keen on. Not ruling it out though; it could be a best-effort thing.

@jtackaberry
Contributor Author

It also introduces a dependency on an external system which I'm not too keen on.

The dependency already exists for cloud storage though.

Barring that, is it possible the hash could be rqlite state, shared with the other nodes over the Raft log? Maybe that's getting more complicated than it's worth.

@otoolep
Member

otoolep commented Jan 5, 2024

Yeah, you're right, it's there already, it's just a little bit complicated. Gotta pull down the sum before the first upload, compare, etc etc.

TBH, I built it before I thought rqlite would support very large data sets, and the price of a single upload at the start wasn't going to be huge (and you get simpler code in return). But with 8.x being explicitly about larger data sets, it might be time to add it. A, say, 2GB upload at the start for no good reason is definitely wasteful.
