Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve reliability, simplicity and recoverability of BGSAVE #153

Merged
merged 2 commits into from
May 16, 2021

Conversation

ohsayan
Copy link
Member

@ohsayan ohsayan commented May 16, 2021

This is a complement to #152 that further improves the reliability of BGSAVE. Refer to the commit messages for more information. BGSAVE can now self recover which it was supposed to be able to but due to implementation inconsistencies it wasn't able to.

This fix is a very important one in two ways. Say we have an user A.
They go ahead and launch skyd. skyd creates a data.bin file. Now A just
deletes the data.bin file for fun. Funny enough, this never causes flock
to error!
Why? Well because the descriptor/handle is still valid and was just
unlinked from the current directory. But this might seem silly since
the user exits with a 'successfully saved notice' only to find that the
file never existed and all of their data was lost. That's bad.
There's a hidden problem in our current approach too, apart from this.
Our writing process begins by truncating the old file and then writing
to it by placing the cursor at 0. Nice, but what if this operation just
crashes. So we lost the current data AND the old data. Not good.

This commit does a better thing: it creates a new temporary file, locks
it before writing and then flushes the current data to the temporary
file. Once that succeeds, it replaces the old data.bin file with the
newly created file.

This solves both the problems mentioned here for us:
1. No more of the silly error
2. If BGSAVE crashes in between, we can be sure that at least the last
data.bin file is in proper shape and not half truncated or so.

This commit further moves the background services into their
own module(s) for easy management.
Fixes:
1. Our custom runner (drone/.ci.yml) was modified to kill the skyd
process once done since this pipeline is not ephemeral.
2. GHA for some reason ignores any error in the test step and proceeds
to kill the skyd process without erroring. Since GHA runners are
ephemeral, we don't need to do this manually.
@ohsayan ohsayan added C-bug Something isn't working D-server Related to the server C-performance This relates to performance P-high High priority C-storage Relating to storage C-tests This is related to the test suite A-independent Architecture independent issue/PR C-reliability This issue/PR relates to reliability C-Stability labels May 16, 2021
@ohsayan ohsayan added this to the v0.6.0 milestone May 16, 2021
@glydr glydr added the D-workflow Related to workflow files label May 16, 2021
@ohsayan
Copy link
Member Author

ohsayan commented May 16, 2021

@stellaroid r=me

@ohsayan ohsayan merged commit 790558d into next May 16, 2021
@ohsayan ohsayan deleted the simplefwrite branch October 10, 2021 02:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-independent Architecture independent issue/PR C-bug Something isn't working C-performance This relates to performance C-reliability This issue/PR relates to reliability C-Stability C-storage Relating to storage C-tests This is related to the test suite D-server Related to the server D-workflow Related to workflow files P-high High priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants