Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix move_and_async_delete_path #32020

Merged
merged 3 commits into from
Jun 8, 2023

Conversation

apfitzge
Copy link
Contributor

@apfitzge apfitzge commented Jun 7, 2023

Problem

purge_old_bank_snapshots_at_startup async deletes account snapshot paths.
The rename breaks symlinks, and the _to_be_deleted paths are now considered orphaned.
Then clean_orphaned_account_snapshot_dirs tries to delete them again, via rename & delete.
This rugs the original threads which were doing the delete.

Summary of Changes

Use a static set to track which paths are currently being deleted by move_and_async_delete_path (asynchronously or synchronously).
This prevents double calls, as well as trying to delete an in-progress *_to_be_deleted.

Fixes #

@apfitzge apfitzge changed the title static set to track in-progress async deletes Fix move_and_async_delete_path Jun 7, 2023
@apfitzge apfitzge requested a review from brooksprumo June 7, 2023 20:56
@@ -478,37 +479,47 @@ pub fn move_and_async_delete_path_contents(path: impl AsRef<Path>) {
}

/// Delete directories/files asynchronously to avoid blocking on it.
/// First, in sync context, rename the original path to *_deleted,
/// then spawn a thread to delete the renamed path.
/// If the process is killed and the deleting process is not done,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted this part of the comment. There's nothing about this function itself which would do it the next time the process life.

return;
}

path_delete.push("_to_be_deleted");
if let Err(err) = std::fs::rename(&path, &path_delete) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we can't rename it - there's a few reasons:

  1. permission issues (can't do anything about that 😦 )
  2. path_delete exists
  3. path does not exist (already confirmed it does, while holding the lock)

if path_delete exists, then we have both path and path_exists, potentially because of a crash mid-delete on the last validator run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another case is that we have move_and_async_delete_path_contents which re-creates the original directory!

if let Err(err) = std::fs::rename(&path, &path_delete) {
warn!(
"Path renaming failed: {}. Falling back to rm_dir in sync mode",
err.to_string()
);
// Although the delete here is synchronous, we want to prevent another thread
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issues were not explicitly from multiple threads, but w/ the mutex lock it's easy enough to make this thread-safe (I think?)

@codecov
Copy link

codecov bot commented Jun 7, 2023

Codecov Report

Merging #32020 (97ce0c2) into master (8596e00) will decrease coverage by 0.1%.
The diff coverage is 66.6%.

@@            Coverage Diff            @@
##           master   #32020     +/-   ##
=========================================
- Coverage    81.8%    81.8%   -0.1%     
=========================================
  Files         763      763             
  Lines      207671   207676      +5     
=========================================
- Hits       170073   170068      -5     
- Misses      37598    37608     +10     

Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I'll test this version out.

Also, any reason this PR is still in draft?

@brooksprumo
Copy link
Contributor

I'll test this version out.

Ran this branch on a validator that I tried to put into the same error cases (i.e. orphaned accounts/snapshot/ dirs that were renamed but not deleted yet), and it worked/didn't crash. Yay!

@apfitzge apfitzge marked this pull request as ready for review June 8, 2023 17:19
@apfitzge
Copy link
Contributor Author

apfitzge commented Jun 8, 2023

Also, any reason this PR is still in draft?

I had left for the day before CI finished 😴

@apfitzge
Copy link
Contributor Author

apfitzge commented Jun 8, 2023

From logs:

[2023-06-08T14:00:12.550304957Z TRACE solana_runtime::snapshot_utils] background deleting /home/sol/ledger/accounts/snapshot/198280062_to_be_deleted...
[2023-06-08T14:00:12.550347197Z INFO  solana_runtime::snapshot_utils] Purged incomplete snapshot dir: /home/sol/ledger/snapshot/198280062
[2023-06-08T14:00:12.550540425Z TRACE solana_runtime::snapshot_utils] background deleting /home/sol/ledger/accounts/snapshot/198279943_to_be_deleted...
[2023-06-08T14:00:12.633155292Z DEBUG solana_runtime::snapshot_utils] Retained bank snapshot for slot 198279838, and purged the rest.
[2023-06-08T14:00:12.633181301Z INFO  solana_core::validator] Cleaning orphaned account snapshot directories..
[2023-06-08T14:00:12.633229062Z INFO  solana_runtime::snapshot_utils] Removing orphaned account snapshot hardlink directory: /home/sol/ledger/accounts/snapshot/198280062_to_be_deleted
[2023-06-08T14:00:12.633238470Z INFO  solana_runtime::snapshot_utils] Removing orphaned account snapshot hardlink directory: /home/sol/ledger/accounts/snapshot/198279943_to_be_deleted
...
[2023-06-08T14:00:15.075610449Z TRACE solana_runtime::snapshot_utils] background deleting /home/sol/ledger/accounts/run_to_be_deleted... Done, and took 2.5s
[2023-06-08T14:00:15.139898521Z TRACE solana_runtime::snapshot_utils] background deleting /home/sol/ledger/accounts/snapshot/198279943_to_be_deleted... Done, and took 2.6s
[2023-06-08T14:00:15.292164501Z TRACE solana_runtime::snapshot_utils] background deleting /home/sol/ledger/accounts/snapshot/198280062_to_be_deleted... Done, and took 2.7s
[2023-06-08T14:00:16.985215590Z INFO  solana_ledger::bank_forks_utils] Initializing bank snapshot path: /home/sol/ledger/snapshot
[2023-06-08T14:00:16.985682749Z TRACE solana_runtime::snapshot_utils] background deleting /home/sol/ledger/accounts/snapshot/198279838_to_be_deleted...

We begin background deleting 198279943 (now 198279943_to_be_deleted), then we see the orphan clean up also try to clean it up at 14:00:12.633238470Z. This would have previously renamed the dir to 198279943_to_be_deleted_to_be_deleted and rugged the original background delete task.

The check on line 501 prevented us from renaming, and the call to move_and_async_delete_path effectively does nothing.

Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@apfitzge apfitzge merged commit 3ba05d9 into solana-labs:master Jun 8, 2023
@apfitzge apfitzge deleted the async_delete_tracking branch June 8, 2023 17:55
@t-nelson
Copy link
Contributor

t-nelson commented Jun 8, 2023

sry apparently i queued this up as a review message instead of sending a single as intended...

wouldn't it be cleaner to make the delete thread long living and pipe the paths to be deleted over to it with a channel. delete thread could inspect the directory on startup to snag any remnants from previous, interrupted work before beginning to consume the channel

@brooksprumo
Copy link
Contributor

Heh, I suggested this offline with Andrew as well. We can impl the channel approach in master/v1.17. We need a fix for v1.16, so probably a smaller change is safer. Wdyt?

@t-nelson
Copy link
Contributor

t-nelson commented Jun 8, 2023

heh, i'm not sure this is actually "smaller" in terms of complexity (or complete, for that matter due to how branchy that function is). also seems like the number bg delete threads is effectively unbounded, which probably isn't great for io

@brooksprumo
Copy link
Contributor

I'd like to add the v1.16 backport label so that we'll have it ready in case we need it for a release. Don't need to merge it unless needed. Is that ok?

also seems like the number bg delete threads is effectively unbounded, which probably isn't great for io

Yeah there can be multiple delete threads, but we know all the spots its called. This is mostly solving an issue at startup, and I think it's ok if there's an io spike then. We save a decent amount of time with this async-delete code.

heh, i'm not sure this is actually "smaller" in terms of complexity (or complete, for that matter due to how branchy that function is).

Yeah, totally agree. Mostly not sure how high to prioritize an alternative.

@apfitzge
Copy link
Contributor Author

apfitzge commented Jun 8, 2023

wouldn't it be cleaner to make the delete thread long living and pipe the paths to be deleted over to it with a channel. delete thread could inspect the directory on startup to snag any remnants from previous, interrupted work before beginning to consume the channel

We could do it either way. I tried to keep things effectively the same but with protection.
I didn't want to add an extra thread waiting around the entire app lifetime, even if it is effectively sleeping. also can delete all account paths in parallel instead of sequentially in a "delete thread".

Not too strong of an opinion on it, just tried to keep things as simple as possible for the fix.

@brooksprumo brooksprumo added the v1.16 PRs that should be backported to v1.16 label Jun 9, 2023
mergify bot pushed a commit that referenced this pull request Jun 9, 2023
mergify bot added a commit that referenced this pull request Jun 9, 2023
Fix move_and_async_delete_path (#32020)

(cherry picked from commit 3ba05d9)

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.16 PRs that should be backported to v1.16
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants