Getting non-matching LTX checksum on fresh volume #134

Closed
kentcdodds opened this issue Oct 24, 2022 · 19 comments · Fixed by #144
Labels
bug Something isn't working
Milestone

Comments

@kentcdodds

kentcdodds commented Oct 24, 2022

https://github.com/kentcdodds/kentcdodds.com/actions/runs/3316512422/jobs/5478478215

cannot open store: open databases: open database("sqlite.db"): verify database file: database checksum (e3d3906d74cc0273) does not match latest LTX checksum (0000000000000000)

This volume is brand new and completely empty. @benbjohnson said this is a bug that needs fixing and asked me to open this issue. More context at https://www.youtube.com/watch?v=vTNPJGKqsYQ

Thanks!

@benbjohnson
Collaborator

@kentcdodds Thanks for writing this up. I realized I had an old version (pr-109) still on the litefs-example which it looks like you have in your Dockerfile as well. Sorry about that. Can you try changing this line here to:

FROM flyio/litefs:0.2 AS litefs

I'm surprised to see that error on a brand new volume as it happens when LiteFS is validating the existing database state. Can you retry with the new litefs version and let me know if you still have the same issue?

@benbjohnson benbjohnson added the bug Something isn't working label Oct 25, 2022
@benbjohnson benbjohnson added this to the v0.3.0 milestone Oct 25, 2022
@kentcdodds
Author

@kentcdodds
Author

I'm a bit stuck on deploying LiteFS until this is resolved. Any ideas?

@benbjohnson
Collaborator

benbjohnson commented Oct 25, 2022

@kentcdodds The error is strange because it is essentially saying that the database state on disk exists (checksum e3d3906d74cc0273) but there's no associated replication data (checksum 0000000000000000). However, you're seeing that error even when you deploy with a clean volume so there shouldn't be any database state.

2022-10-25T15:35:44Z   [info]cannot open store: open databases: open database("sqlite.db"): verify database file: database checksum (e3d3906d74cc0273) does not match latest LTX checksum (0000000000000000)

Can you try removing the volumes on your staging setup, redeploying, and checking whether you still get the same error?

@kentcdodds
Author

I think I've figured out what's going on. When I create the new volume, my old (pre-LiteFS) app restarts and applies migrations to the new database in the volume, which is what causes this issue.

What I'm trying now is to deploy a version of my app that does not do anything to the database so then I can have that one running when I recreate the volume, and then deploy the litefs version. Will let you know what happens.

@benbjohnson
Collaborator

Ok, cool. Thanks for digging into it more. I also created an issue for keeping litefs running on error so it's easier to ssh in and debug the state. #136

@benbjohnson
Collaborator

I pushed up a PR for it so it's available at pr-137 in Docker now. That'll keep litefs running even if it hits some kind of error on startup so the fly instance will be accessible via ssh.

@kentcdodds
Author

Good news! It's running now.

Now I'm going to try to create more regions. It just occurred to me that I'll need to create volumes for the regions first, right? If I try to deploy my app to a region without a persistent volume, things will break, right?

@kentcdodds
Author

Interestingly, I added a volume to maa, and then tried to add a region there and got this error message:

Error App 'kcd-staging' uses volumes to control regions. Add or remove volumes to change region placement.

So I just scaled up to a count of 2 and maa started right up! Wahoo! Thanks a ton for the help!

Now I just need to figure out how to determine the primary region via that .primary file and then I think I should be ready to go with this to prod!
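
For reference, here is a minimal sketch of reading that file. It assumes that `.primary` sits in the LiteFS mount directory, exists only on replica nodes, and contains the primary's hostname; that behavior is worth verifying against the docs for the LiteFS version in use. The function name `read_primary` and the `/litefs` mount path mentioned below are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_primary(mount_dir: str) -> Optional[str]:
    """Return the primary's hostname, or None if no .primary file is present.

    Assumption: the file is absent on the primary node itself, so None is
    treated by the caller as "this node is the primary".
    """
    primary_file = Path(mount_dir) / ".primary"
    try:
        # Strip the trailing newline, if any, from the stored hostname.
        return primary_file.read_text().strip()
    except FileNotFoundError:
        return None
```

Calling `read_primary("/litefs")` on each request would then let the app decide whether to handle a write locally or forward it to the returned host.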

@kentcdodds
Author

I got this again on a new deploy of the app:

2022-10-25T22:33:59.223 app[18d0f7c2] den [info] ERROR: cannot open store: open databases: open database("sqlite.db"): verify database file: database checksum (f13013272ddb586c) does not match latest LTX checksum (da9624ecbb43ad42)

I'm not sure what I'm doing wrong :(

@kentcdodds
Author

Here's the failed build, not sure how useful it'll be: https://github.com/kentcdodds/kentcdodds.com/actions/runs/3324709637/jobs/5496681848

@benbjohnson
Collaborator

@kentcdodds This is a known bug that can occur on restart with the rollback journal. I have a fix for this one. We should have a v0.3.0 release coming early next week that will have WAL support and stability fixes in it.

@kentcdodds
Author

@benbjohnson
Collaborator

@kentcdodds Thanks for trying it. Is this running on a clean volume or the existing one?

@benbjohnson benbjohnson reopened this Oct 29, 2022
@kentcdodds
Author

Existing one

@benbjohnson
Collaborator

I added a possible fix for this with #157. Although, depending on the exact nature of the issue, #158 could help too. It's hard to say without looking at the data files in the LiteFS directory.

This may resolve the issue on the existing volume but if it's a bug that was resolved by #158 then you'll need to wipe the volume and start with a clean database.

I'm going to close this for now but please reopen if you hit the issue again. Thanks, @kentcdodds!

@AlexBlokh

Well, my production instance is now dead after being up for 7 months.
It seems to be due to this issue.

I have 1 container and 1 volume

@AlexBlokh

I'm also unable to connect to the instance; it's in a pending state.

@benbjohnson
Collaborator

@AlexBlokh I'm sorry to hear that. Do you know what version of LiteFS you were running?

Also, you may be able to recover your underlying database. If you copy the database file and WAL file out to a different directory using SQLite's standard names, you can open the copy in SQLite and run an integrity check to ensure it's valid:

# Replace LITEFS_DATA_DIR & DBNAME with your appropriate values.
# You only need to copy the "wal" file if you're using WAL mode and if the file exists.
$ cp $LITEFS_DATA_DIR/dbs/$DBNAME/database /tmp/db
$ cp $LITEFS_DATA_DIR/dbs/$DBNAME/wal /tmp/db-wal

# Open using SQLite & run an integrity check.
$ sqlite3 /tmp/db
sqlite> PRAGMA integrity_check;

If it returns ok then the underlying database is valid. If it returns errors then you'll need to recover from a backup.
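
The same check can be scripted. Below is a minimal sketch using Python's standard `sqlite3` module. It assumes the database file (and the WAL file, if any) has already been copied out as described above. `integrity_errors` is a hypothetical helper name and is not part of LiteFS.

```python
import os
import sqlite3


def integrity_errors(db_path: str) -> list:
    """Run PRAGMA integrity_check and return the reported problems.

    An empty list means SQLite reported "ok" for the database.
    """
    # sqlite3.connect() would silently create a missing file, so check first.
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)

    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()

    # A healthy database yields a single row containing the string "ok".
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

For example, `integrity_errors("/tmp/db")` returning a non-empty list would correspond to the "recover from a backup" case.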

3 participants