Fix move_and_async_delete_path #32020
@@ -478,37 +479,47 @@ pub fn move_and_async_delete_path_contents(path: impl AsRef<Path>) {
}

/// Delete directories/files asynchronously to avoid blocking on it.
/// First, in sync context, rename the original path to *_deleted,
/// then spawn a thread to delete the renamed path.
/// If the process is killed and the deleting process is not done,
Deleted this part of the comment. There's nothing in this function itself that would do that in the next process life.
return;
}

path_delete.push("_to_be_deleted");
if let Err(err) = std::fs::rename(&path, &path_delete) {
if we can't rename it, there are a few possible reasons:
- permission issues (can't do anything about that 😦)
- `path_delete` exists
- `path` does not exist (already confirmed it does, while holding the lock)
if `path_delete` exists, then we have both `path` and `path_delete`, potentially because of a crash mid-delete on the last validator run.
another case is that we have `move_and_async_delete_path_contents`, which re-creates the original directory!
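For readers following the rename discussion, here is a minimal sketch of the rename-then-background-delete pattern with a synchronous fallback. It is not the code from this PR: `to_be_deleted_path` and `rename_then_delete` are hypothetical names, it uses only the standard library, and it leaves out the locking the PR adds.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread::{self, JoinHandle};

/// Hypothetical helper: build the sibling path `<path>_to_be_deleted`.
fn to_be_deleted_path(path: &Path) -> PathBuf {
    let mut os = path.as_os_str().to_os_string();
    os.push("_to_be_deleted");
    PathBuf::from(os)
}

/// Rename `path` out of the way, then delete the renamed directory on a
/// background thread. Returns the thread handle when the delete runs in the
/// background, or `None` when the path was absent or was removed synchronously.
fn rename_then_delete(path: &Path) -> io::Result<Option<JoinHandle<io::Result<()>>>> {
    if !path.exists() {
        return Ok(None);
    }
    let path_delete = to_be_deleted_path(path);
    match fs::rename(path, &path_delete) {
        // The original path is free immediately; the slow delete happens off-thread.
        Ok(()) => Ok(Some(thread::spawn(move || fs::remove_dir_all(&path_delete)))),
        Err(err) => {
            // Possible causes include permissions or a leftover `path_delete`
            // from an interrupted run. Fall back to a blocking delete.
            eprintln!("rename failed: {err}; falling back to sync delete");
            fs::remove_dir_all(path)?;
            Ok(None)
        }
    }
}
```

A caller that needs the directory back (as `move_and_async_delete_path_contents` does) would re-create it after the rename succeeds, which is exactly why a second call can find both paths present.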
if let Err(err) = std::fs::rename(&path, &path_delete) {
warn!(
"Path renaming failed: {}. Falling back to rm_dir in sync mode",
err.to_string()
);
// Although the delete here is synchronous, we want to prevent another thread
issues were not explicitly from multiple threads, but w/ the mutex lock it's easy enough to make this thread-safe (I think?)
Codecov Report
@@ Coverage Diff @@
## master #32020 +/- ##
=========================================
- Coverage 81.8% 81.8% -0.1%
=========================================
Files 763 763
Lines 207671 207676 +5
=========================================
- Hits 170073 170068 -5
- Misses 37598 37608 +10
Looks good to me, I'll test this version out.
Also, any reason this PR is still in draft?
Ran this branch on a validator that I tried to put into the same error cases (i.e. orphaned accounts/snapshot/ dirs that were renamed but not deleted yet), and it worked/didn't crash. Yay!
I had left for the day before CI finished 😴
From logs:
We begin background deleting The check on line 501 prevented us from renaming, and the call to
lgtm
sry apparently i queued this up as a review message instead of sending a single as intended... wouldn't it be cleaner to make the delete thread long living and pipe the paths to be deleted over to it with a channel? delete thread could inspect the directory on startup to snag any remnants from previous, interrupted work before beginning to consume the channel
Heh, I suggested this offline with Andrew as well. We can impl the channel approach in master/v1.17. We need a fix for v1.16, so probably a smaller change is safer. Wdyt?
heh, i'm not sure this is actually "smaller" in terms of complexity (or complete, for that matter, due to how branchy that function is). also seems like the number of bg delete threads is effectively unbounded, which probably isn't great for io
I'd like to add the v1.16 backport label so that we'll have it ready in case we need it for a release. Don't need to merge it unless needed. Is that ok?
Yeah, there can be multiple delete threads, but we know all the spots it's called. This is mostly solving an issue at startup, and I think it's ok if there's an io spike then. We save a decent amount of time with this async-delete code.
Yeah, totally agree. Mostly not sure how high to prioritize an alternative.
We could do it either way. I tried to keep things effectively the same but with protection. Not too strong of an opinion on it, just tried to keep things as simple as possible for the fix.
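The channel-based alternative raised in this thread could look roughly like the sketch below. It is only an illustration of the idea, not code from the repository: `spawn_delete_worker` is a hypothetical name, and it assumes every leftover lives directly under one parent directory and ends in `_to_be_deleted`. A single worker also caps the number of concurrent background deletes at one.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical: spawn one long-lived delete thread for directories under `parent`.
/// Callers send paths over the returned channel instead of spawning a thread per delete.
fn spawn_delete_worker(parent: &Path) -> (Sender<PathBuf>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<PathBuf>();
    let parent = parent.to_path_buf();
    let handle = thread::spawn(move || {
        // On startup, sweep remnants left behind by a previous, interrupted run.
        if let Ok(entries) = fs::read_dir(&parent) {
            for entry in entries.flatten() {
                let is_leftover = entry
                    .file_name()
                    .to_string_lossy()
                    .ends_with("_to_be_deleted");
                if is_leftover {
                    let _ = fs::remove_dir_all(entry.path());
                }
            }
        }
        // Then delete paths as they arrive. The loop ends once every sender is dropped.
        for path in rx {
            if let Err(err) = fs::remove_dir_all(&path) {
                eprintln!("background delete of {} failed: {err}", path.display());
            }
        }
    });
    (tx, handle)
}
```

The trade-off discussed above still applies: this is a larger structural change than guarding the existing function, so it fits a master-only follow-up better than a backport.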
(cherry picked from commit 3ba05d9)
Problem
`purge_old_bank_snapshots_at_startup` async deletes account snapshot paths. The rename breaks symlinks, and the `_to_be_deleted` paths are now considered orphaned. Then `clean_orphaned_account_snapshot_dirs` tries to delete them again, via rename & delete. This rugs the original threads which were doing the delete.
Summary of Changes
Use a static set to track which paths are currently being deleted by `move_and_async_delete_path` (asynchronously or synchronously). This prevents double calls, as well as trying to delete an in-progress `*_to_be_deleted`.
Fixes #
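As a rough illustration of the static-set idea in the summary, a process-wide registry of in-flight deletes might be sketched as follows. The names `in_progress_deletes`, `try_begin_delete`, and `end_delete` are hypothetical and the actual change may be structured differently; the sketch uses only the standard library (`OnceLock` needs Rust 1.70 or newer).

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};
use std::sync::{Mutex, OnceLock};

/// Hypothetical process-wide set of paths whose deletion is currently in flight.
fn in_progress_deletes() -> &'static Mutex<HashSet<PathBuf>> {
    static SET: OnceLock<Mutex<HashSet<PathBuf>>> = OnceLock::new();
    SET.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Try to claim `path` for deletion. Returns `false` if another caller already
/// holds the claim, in which case the caller should skip the delete.
fn try_begin_delete(path: &Path) -> bool {
    in_progress_deletes()
        .lock()
        .unwrap()
        .insert(path.to_path_buf())
}

/// Release the claim once the delete (sync or async) has finished.
fn end_delete(path: &Path) {
    in_progress_deletes().lock().unwrap().remove(path);
}
```

A deleter would call `try_begin_delete` on both the original and the renamed path before touching the filesystem, and call `end_delete` from the thread that finishes the removal, so a second caller that sees an in-progress `*_to_be_deleted` directory leaves it alone.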