Hello,

Background

I was testing out the disaster recovery procedure. I noticed that after setting the clusterid file and restarting the app, LiteFS will start my application before the db is fully restored.
In litefs.yml, I have skip-sync turned off.
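For context, a simplified sketch of the relevant config (paths and ports here are taken from the logs below; the proxy target and the exact placement of skip-sync are my approximation, not a verbatim copy of the file):

```yaml
# Simplified litefs.yml matching the environment in the logs.
fuse:
  dir: "/data/db"          # SQLite databases are exposed here

data:
  dir: "/data/litefs"      # internal LiteFS data (LTX files, clusterid)

skip-sync: false           # fsync is NOT skipped (placement approximate)

lease:
  type: "consul"           # "Using Consul to determine primary"
  candidate: true
  promote: true

proxy:
  addr: ":3003"            # the health check in the logs hits this port
  target: "localhost:3001" # assumed: the application's HTTP server
  db: "server.db"

exec:
  - cmd: "/app"            # "starting background subprocess: /app []"
```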
These logs are from just after setting the cluster id and restarting on a new machine/volume:
```
[info] INFO Starting init (commit: 15238e9)...
[info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
[info] INFO Resized /data to 1069547520 bytes
[info] INFO Preparing to run: `/bin/sh -c litefs mount` as root
[info] INFO [fly api proxy] listening at /.fly/api
[info]2023/11/14 21:30:02 listening on <snip>
[info]config file read from /litefs.yml
[info]LiteFS v0.5.8, commit=fd5808d0250cd17ec420225a010ec273971a471b
[info]level=INFO msg="host environment detected" type=fly.io
[info]level=INFO msg="litefs cloud backup client configured: https://litefs.fly.io"
[info]level=INFO msg="Using Consul to determine primary"
[info]level=INFO msg="initializing consul: <snip>
[info]level=INFO msg="using existing cluster id: \"<snip>\""
[info]level=INFO msg="LiteFS mounted to: /data/db/"
[info]level=INFO msg="http server listening on: http://localhost:20202"
[info]level=INFO msg="waiting to connect to cluster"
[info]level=INFO msg="<snip>: primary lease acquired, advertising as <snip>"
[info]level=INFO msg="connected to cluster, ready"
[info]level=INFO msg="node is a candidate, automatically promoting to primary"
[info]level=INFO msg="node is already primary, skipping promotion"
[info]level=INFO msg="proxy server listening on: http://localhost:3003"
[info]level=INFO msg="starting background subprocess: /app []"
[info]level=INFO msg="begin primary backup stream: url=https://litefs.fly.io"
[info]waiting for signal or subprocess to exit
[info]level=INFO msg="connecting to litefs event stream"
[info]level=INFO msg="--> GET http://localhost:20202/events"
[info]level=INFO msg="<-- 200 http://localhost:20202/events (2.12ms)"
[info]level=INFO msg="connected to litefs"
[info]level=INFO msg="database file is zero length on initialization: /data/litefs/dbs/server.db/database"
[info]level=INFO msg="pending migration found" id=<snip>
[info]level=INFO msg="pending migration found" id=<snip>
[info]level=INFO msg="pending migration found" id=<snip>
[info]level=INFO msg="pending migration found" id=<snip>
[info]level=INFO msg="pending migration found" id=<snip>
[info]level=INFO msg="migration status" applied=0 pending=5
[info]level=INFO msg="applying migrations..."
[info]level=INFO msg="migration log" msg="Creating: file:/data/db/server.db"
[info]level=INFO msg="migration log" msg="Applying: <snip>"
[info]level=INFO msg="migration log" msg="Applying: <snip>"
[info]level=INFO msg="migration log" msg="Applying: <snip>"
[info]level=INFO msg="migration log" msg="Applying: <snip>"
[info]level=INFO msg="migration log" msg="Applying: <snip>"
[info]level=INFO msg="starting server" host=0.0.0.0:3001
[info]level=INFO msg="starting metrics server" host=0.0.0.0:3002
[info]level=INFO msg="begin streaming backup" full-sync-interval=10s
[info]level=WARN msg="restoring from backup" name=server.db pos.local=0000000000000006/f58e8f661774b269 pos.remote=0000000000000072/f9ef589ae11992e6 reason=remote-ahead
[info]level=INFO msg="backup of database \"server.db\" is ahead of local copy, restoring from backup"
[info]level=INFO msg="snapshot received for \"server.db\", removing other ltx files except: 0000000000000001-0000000000000072.ltx.tmp"
[info]level=WARN msg="database restore complete" name=server.db pos=0000000000000072/f9ef589ae11992e6 elapsed=277.539349ms
[info]Health check on port 3003 is now passing.
```
The issue:
My application attempts to run any pending migrations if it's the LiteFS primary. Since LiteFS hasn't yet fully restored the db by the time the startup migrator runs, it runs all of the migrations on the empty db. A second later, LiteFS begins catching the db up and then completes restoring the data.
This seems surprising to me, even though I do eventually get the correct data in the db by the end of this log snippet.
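The startup gate is essentially the usual ".primary"-file check in the LiteFS mount directory (a minimal sketch, not my exact code; the mount path and helper names are illustrative):

```go
// Minimal sketch of a primary-gated startup migrator (illustrative only).
// LiteFS exposes a ".primary" file in its FUSE mount: the file exists on
// replicas (and names the primary) but is absent on the primary node itself.
package main

import (
	"errors"
	"log"
	"os"
	"path/filepath"
)

const litefsMountDir = "/data/db" // assumption: matches fuse.dir in litefs.yml

// isPrimary reports whether this node is currently the LiteFS primary.
func isPrimary() (bool, error) {
	_, err := os.Stat(filepath.Join(litefsMountDir, ".primary"))
	if errors.Is(err, os.ErrNotExist) {
		return true, nil // no .primary file => this node is the primary
	}
	if err != nil {
		return false, err
	}
	return false, nil // .primary exists => another node is the primary
}

func main() {
	primary, err := isPrimary()
	if err != nil {
		log.Fatalf("checking primary status: %v", err)
	}
	if primary {
		// This is where the problem bites: at this point LiteFS may still be
		// restoring server.db from backup, so the migrations run against an
		// empty database.
		log.Println("node is primary, running pending migrations...")
		// runMigrations() // hypothetical helper
	}
}
```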
Some questions/thoughts:
Is there a way to make LiteFS wait on the restore in this scenario, or a signal my application can read so it knows to wait?
What happens to the writes that occur before the restore is complete? Are they discarded? If they aren't discarded, is the restore "replayed" on top of those writes, and am I just lucky that the schema is the same so it works out?
EDIT: the answer is that the db is discarded and fully restored from backup (see litefs/store.go, lines 1287 to 1290 at d861e91):
We should update the LiteFS code to check for backup state before marking the node as ready. I'll mark this issue as a bug. Thanks for reporting it.
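To restate that outcome as code, using the positions from the log above (a rough illustration of the "remote-ahead" path; these types are stand-ins, not the actual store.go implementation):

```go
// Illustration only: models the "remote-ahead" outcome from the logs, where
// the local writes (the migrations at txid 0x06) are discarded and the
// database is replaced by the backup snapshot (txid 0x72).
package main

import "fmt"

// localDB is a stand-in for the local copy of server.db.
type localDB struct {
	name string
	txID uint64
}

// restoreFromBackup replaces the local copy when the backup is ahead;
// the earlier local transactions are not replayed on top of the snapshot.
func restoreFromBackup(db *localDB, remoteTXID uint64) {
	if remoteTXID > db.txID {
		fmt.Printf("restoring %s from backup: local txid %#x discarded, now at %#x\n",
			db.name, db.txID, remoteTXID)
		db.txID = remoteTXID
	}
}

func main() {
	db := &localDB{name: "server.db", txID: 0x06} // position after the migrations ran
	restoreFromBackup(db, 0x72)                   // backup position from the logs
}
```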
benbjohnson changed the title from "When running disaster recovery from 0 nodes, primary starts application before db is fully restored" to "Primary should check backup server before marking ready" on Nov 16, 2023.