
fs::remove_dir_all rarely succeeds for large directories on windows #29497

Open
Diggsey opened this issue Nov 1, 2015 · 25 comments

@Diggsey (Contributor) commented Nov 1, 2015

I've been trying to track this one down for a while with little success. So far I've been able to confirm that no external programs are holding long-standing locks on any of the files in the directory, and I've confirmed that it happens with trivial rust programs such as the following:

use std::fs;

fn main() {
    println!("{:?}", fs::remove_dir_all("<path>"));
}

I've also confirmed that deleting the folder from Explorer (equivalent to using SHFileOperation), or using rmdir on the command line, will always succeed.

Currently my best guess is that either Windows or other programs are holding transient locks on some of the files in the directory, possibly as part of the indexing Windows performs to enable efficient searching.

Several factors led me to think this:

  • fs::remove_dir_all will pretty much always succeed on the second or third invocation.
  • it seems to happen most commonly when dealing with large numbers of text files
  • unlocker and other tools show no active handles to any files in the directory

Maybe there should be some sort of automated retry system, such that temporary locks do not hinder progress?
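For illustration, such a retry wrapper (a hypothetical helper, not anything std provides) could be as simple as this sketch:

```rust
use std::{fs, io, path::Path, thread, time::Duration};

/// Hypothetical helper: retry `fs::remove_dir_all` a few times with a short
/// back-off, to ride out transient locks held by indexers or AV scanners.
fn remove_dir_all_retry(path: &Path, attempts: u32) -> io::Result<()> {
    let mut last_err = None;
    for _ in 0..attempts {
        match fs::remove_dir_all(path) {
            Ok(()) => return Ok(()),
            // Already gone (possibly finished by a previous pass): treat as success.
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
            Err(e) => {
                last_err = Some(e);
                thread::sleep(Duration::from_millis(50));
            }
        }
    }
    Err(last_err.unwrap())
}
```

The back-off duration and attempt count are arbitrary here; the point is only that a bounded retry hides short-lived locks without looping forever.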

edit:
The errors returned seem somewhat random: sometimes it's a "directory is not empty" error, other times it shows up as "access denied".

@retep998 (Member) commented Nov 1, 2015

I think this is just race conditions at work. When you delete a file, another program might get a notification, go check out what happened, and briefly hold a lock on something in the process. Or it might be that the previous deletions hadn't actually completed by the time you tried to delete the directory.

@petrochenkov (Contributor) commented Nov 1, 2015

Hm, https://github.com/CppCon/CppCon2015/blob/master/Tutorials/Racing%20the%20Filesystem/Racing%20the%20Filesystem%20-%20Niall%20Douglas%20-%20CppCon%202015.pdf slide "3. Deleting a directory tree" may be related; I happened to watch this presentation yesterday. If I understood correctly, the DeleteFileW used in remove_dir_all doesn't actually delete a file, but only marks it for deletion. The file can still be alive when we try to delete the parent directory, causing the "directory is not empty" error.

@Diggsey (Contributor, Author) commented Nov 1, 2015

@petrochenkov Great link - that seems like it's exactly the problem I'm experiencing.

It would be great to see some of these portable race-free idioms implemented in std.

@alexcrichton (Member) commented Nov 2, 2015

I'd be interested to dig a bit more into this to see what's going on. According to the DeleteFile documentation the only reason the file would be flagged for deletion would be if there are active open handles, which seems like a legitimate race condition? Note though that the same docs also recommend using SHFileOperation for recursively deleting a tree.

Those slides are certainly an interesting read though! I'd be a little wary of putting it into the standard library as the "transactional semantics" are a little weird. For example if you successfully rename a file to a temporary directory and then fail to delete it, what happens?

It may be the case that the best implementation for this function on Windows is to just use SHFileOperation, and that may end up just taking care of these sorts of issues on its own.

@pitdicker (Contributor) commented Feb 19, 2016

I will give fixing this a try.

On Windows XP we could use SHFileOperation, and on Vista and up IFileOperation. For Windows store applications there is no easy function. But I would still like to try implementing it by walking the directories and deleting recursively.

Current problems with remove_dir_all:

  • cannot remove contents if the path becomes longer than MAX_PATH
  • files may not be deleted immediately, causing remove_dir to fail
  • unable to remove read-only files

And what I hope will fix it:

  • use canonicalize() on the base dir to get support for long paths
  • generate a hash (to build unique temporary names for the rename step)
  • read-only files (possibly with hard links) TODO: can the read-only flag be stale?
    • open file
    • remove read-only flag
    • ReOpenFile with FILE_FLAG_DELETE_ON_CLOSE
    • set read-only flag
    • continue as with normal files
  • normal files/dirs
    • open with FILE_FLAG_BACKUP_SEMANTICS (for opening dirs)
      FILE_FLAG_OPEN_REPARSE_POINT (for opening reparse points)
      FILE_FLAG_DELETE_ON_CLOSE
    • atomically rename with SetFileInformationByHandle and FILE_RENAME_INFO
      (normal rename on xp)
    • move all files and subdirectories to the parent of the dir to remove
      (not to %TEMP%, may not be on the same drive)
      rename them to rm-[hash]-[nr]

I think we will already know that deleting is going to fail at the point where opening the file with FILE_FLAG_DELETE_ON_CLOSE fails, i.e. before anything is moved.
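As a rough sketch of the "open for delete-on-close" step above (the helper name is made up, the flag values are the documented Win32 constants written out by hand, and this is not std's actual implementation), using std's `OpenOptionsExt`, with the Unix equivalent shown for contrast:

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Hypothetical helper sketching the proposal: open with DELETE access and
/// FILE_FLAG_DELETE_ON_CLOSE, so the file is scheduled for deletion as soon
/// as the handle closes; the live handle could then be used to rename the
/// file out of the way before closing it.
#[cfg(windows)]
fn open_for_delete(path: &Path) -> io::Result<File> {
    use std::fs::OpenOptions;
    use std::os::windows::fs::OpenOptionsExt;

    // Win32 constants, written out here for the sketch.
    const DELETE: u32 = 0x0001_0000;
    const FILE_FLAG_BACKUP_SEMANTICS: u32 = 0x0200_0000; // needed to open directories
    const FILE_FLAG_OPEN_REPARSE_POINT: u32 = 0x0020_0000; // don't follow reparse points
    const FILE_FLAG_DELETE_ON_CLOSE: u32 = 0x0400_0000;

    OpenOptions::new()
        .access_mode(DELETE)
        .share_mode(0) // no sharing: fails up front if someone else has it open
        .custom_flags(
            FILE_FLAG_BACKUP_SEMANTICS
                | FILE_FLAG_OPEN_REPARSE_POINT
                | FILE_FLAG_DELETE_ON_CLOSE,
        )
        .open(path)
}

/// On Unix the same effect needs no special flag: unlinking after opening
/// removes the name immediately while the handle keeps the data alive.
#[cfg(not(windows))]
fn open_for_delete(path: &Path) -> io::Result<File> {
    let f = File::open(path)?;
    std::fs::remove_file(path)?;
    Ok(f)
}
```

The rename step itself (SetFileInformationByHandle with FILE_RENAME_INFO) has no std wrapper, so it is omitted here.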

@pitdicker (Contributor) commented Feb 19, 2016

Oh, and I ran into this bug myself when rustbuild (is that the current name?) failed after `make clean` because of it.

@alexcrichton (Member) commented Feb 19, 2016

@pitdicker thanks for taking this on! We may want to discuss this a bit before moving much more on a solution. One question is how principled should something like fs::remove_dir_all be. For example if fs::remove_dir_all automatically handled read-only files, should fs::remove_file also do the same? Currently we have this "duality" where some fs functions are straight counterparts to the underlying system operations, but some are somewhat fancier like create_dir_all and remove_dir_all. We'd just want to make sure to have an explicitly drawn line here. Also note I'm not really thinking of an RFC here, just some more discussion before jumping to a PR.

I personally feel that remove_dir_all is fine to be fancy and do lots of implicit operations. I would continue to want, however, that remove_file and remove_dir are as straightforward as possible (e.g. don't mirror what's happening here unless it's a similarly small set of operations as to what happens today). Your proposed strategy seems reasonable to me at first glance, and I wouldn't mind reviewing more over a PR.

Curious what others think, though? cc @rust-lang/libs

@Diggsey (Contributor, Author) commented Feb 19, 2016

@alexcrichton Hmm, I would prefer remove_file and remove_dir to give the same guarantees they have on Linux if possible: specifically, that if they succeed, the file or directory is completely gone.

It's inevitable that people will build their own abstractions on top of these primitive operations, and I'd very much like to avoid this kind of bug, where failures are sufficiently rare that it's almost impossible to track down the true cause of the problem.

@retep998 (Member) commented Feb 19, 2016

Specifically that if they succeed, the file or directory should be completely gone.

This wouldn't really work because files don't necessarily get deleted right away. You can't just move the file somewhere else and then delete it because there isn't always a reasonable place on the same volume to move the file to. Sitting around and indefinitely waiting for the file to vanish isn't ideal either. I'd personally just use IFileOperation for remove_dir_all and call it a day.

@Diggsey (Contributor, Author) commented Feb 19, 2016

What if the file is opened with FILE_FLAG_DELETE_ON_CLOSE, no share flags specified, and then is immediately closed? Successfully opening the file means that you have exclusive access, and so closing it should delete the file.

@retep998 (Member) commented Feb 19, 2016

No share flags means nobody else can open it to read/write/delete it. However another handle can still be opened without read/write/delete permissions. Also what happens if you fail to open the file with no share flags? Does it error or fallback to the normal deletion method that provides no guarantees?

@pitdicker (Contributor) commented Feb 19, 2016

Great to see so many comments. Just what I hoped for!

@alexcrichton I completely agree to keep remove_file and remove_dir simple, and only let remove_dir_all do magic. Actually removing read-only files is pretty difficult when you also don't want to break hard links. http://stackoverflow.com/questions/3055668/delete-link-to-file-without-clearing-readonly-bit is an example, but I am sure I found some better way somewhere.

@Diggsey I would also like cross-platform consistency, but sometimes Windows is just too different... I think the standard library should at least expose primitives that only need one system call; otherwise we lose performance and maybe become more vulnerable to race conditions. Maybe we can add a remove_no_matter_what in the future :)

@retep998 I have not yet written the code, so I don't know if this will really solve the problem... But if this doesn't work out I am all for IFileOperation. Meanwhile it is a nice way for me and maybe the standard library to get better at avoiding filesystem races.

I would have to test what happens with FILE_FLAG_DELETE_ON_CLOSE, but in theory if the file opens successfully we only have to move it to avoid a race of a couple of milliseconds. Directly after opening the file no one else can open the file anymore (it is not a share flag).

What is difficult is where to move a file / dir to. I think the parent of the dir we are removing is good enough. It should be on the same volume, and I think we also have write permission to it because otherwise deleting a dir is not possible (wild speculation here, it is late :))

@brson (Contributor) commented Feb 20, 2016

I feel ok about going to extra effort to make remove_dir_all work as expected.

@alexcrichton (Member) commented Feb 22, 2016

@pitdicker perhaps you could draw up a list of the possible implementation strategies for these functions? That'd also help in terms of weighing the pros/cons of each strategy, and we could get a good idea of which tradeoffs are associated with what. So far it sounds like IFileOperation may be the best option here, but I'm personally at least not following what FILE_FLAG_DELETE_ON_CLOSE would mean, especially in the case where you can't open the file because someone else has it open.

@pitdicker (Contributor) commented Feb 22, 2016

To be fair, IFileOperation is a bit too difficult for me to implement. It should "just work" when removing directories, but as far as I can see it is an entirely different kind of API: for example, it does not work on handles but on IShellItems, and it is mostly made for GUI applications.

The steps I proposed are uglier. They use the same simpler but more low-level APIs as the rest of the standard library, but there is a little window where files may temporarily show up in the parent directory of the one we removed. These would otherwise be exactly the files that currently cause remove_dir_all to fail, which is not all that often.

On Windows, deleting a file is not guaranteed to be complete before the function returns. Instead the file is scheduled for deletion, to happen once no one has it open anymore. When Unix deletes a file, on the other hand, it immediately looks deleted; deletion of the inode is merely deferred until everyone is done with it.

FILE_FLAG_DELETE_ON_CLOSE is a flag we can set when opening a file, together with DELETE permission as access mode. It will schedule the file (or directory) for deletion. When the flag is set (i.e. as soon as open succeeds), it is not possible for others to open the file anymore. The only question is what happens if others already have the file open. If they did not have the file open with FILE_SHARE_DELETE we should fail to open it (as we do not have DELETE permission).

Using this flag should be identical to using DeleteFile. The only reason I want to use it is that it keeps a handle open, and with this handle the file can be moved to a temporary location so it does not block removing the parent dir.

But I am not sure of anything yet. Somehow I just can't get the renaming part to work, so all of this is theory, not tested :(.

@Diggsey (Contributor, Author) commented Feb 22, 2016

Having looked at IFileOperation, I agree with @pitdicker that it doesn't seem like the right kind of API for low-level filesystem operations: it's very much oriented towards interactive use. (The .NET framework doesn't use it either, but then Directory.Delete suffers from the exact same problem!)

I'm starting to think that maybe just using the current implementation and retrying on failure is the best option. http://stackoverflow.com/a/1703799 (The second code example there has the advantage that it only retries again if some progress was made, so it retries more often for more deeply nested directories.)
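Translated to Rust, that progress-based retry (a hypothetical helper mirroring the linked C# example, not any existing std behaviour) might look like:

```rust
use std::{fs, io, path::Path};

/// Count how many filesystem entries are left under `path` (0 once it is gone).
fn entries_left(path: &Path) -> u64 {
    let mut n = 0;
    if let Ok(rd) = fs::read_dir(path) {
        for entry in rd.flatten() {
            n += 1;
            if entry.path().is_dir() {
                n += entries_left(&entry.path());
            }
        }
    }
    n
}

/// Hypothetical helper: keep calling `remove_dir_all` as long as each failed
/// pass still made progress (fewer entries remain than before). Deeper trees
/// therefore get more retries, as in the linked C# example.
fn remove_dir_all_progress(path: &Path) -> io::Result<()> {
    let mut before = u64::MAX;
    loop {
        match fs::remove_dir_all(path) {
            Ok(()) => return Ok(()),
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
            Err(e) => {
                let left = entries_left(path);
                if left >= before {
                    return Err(e); // no progress since the last pass: give up
                }
                before = left;
            }
        }
    }
}
```

Unlike a fixed retry count, this loop terminates as soon as a pass fails to delete anything new, so a genuinely stuck file doesn't cause unbounded retries.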

@pitdicker (Contributor) commented Feb 25, 2016

YES! I've got the manual method working.

Now to clean up my messy code, and do a lot of testing...

@pitdicker (Contributor) commented Feb 25, 2016

@Diggsey Thanks for looking up how .NET handles this!

And I like the simplicity of just retrying on failure. But I don't like that it needs a little luck. So I prefer moving files the instant before they are deleted.

@timvisee (Contributor) commented Nov 15, 2016

Has a fix already been applied to std, or do we have a working fix available yet? I'm currently bumping into the same issue on Windows, and this issue seems to be open for quite a while. Sadly, the problem seems to be quite inconsistent on my machine.

Also, I don't think that retrying file deletion a few times sounds like a good solution to put into std.

@orvly commented Jul 22, 2017

I just ran into a related problem myself, where after calling remove_dir_all on a directory which had just a single empty sub-directory, it completed successfully, but then trying to re-create this directory hierarchy immediately failed with Access Denied.
(I imagine this should be a rather common scenario with testing code which relies on the filesystem, first cleaning up any artifacts from the previous run and then re-creating them).

The problem is basically not just with remove_dir_all, but with any attempt to delete a directory - it might not be executed immediately, so any action done right after it, which assumes it doesn't exist, might fail.

But after going to the issue referenced right above this comment (995 in rustup), I found out about the new "remove_dir_all" crate from @aaronepower (hosting just this single function), which uses a trick: it moves the directory first, then deletes the moved directory (as documented extensively there).
That solved my problem as well as the more general problem of deleting a directory on Windows.
remove_dir_all crate

Could its implementation replace the one in std::fs?
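The core of that trick can be sketched portably (names and the temporary-name scheme here are illustrative, not the crate's actual API or implementation):

```rust
use std::{fs, io, path::Path};

/// Sketch of the rename-then-delete trick: move the directory to a sibling
/// temporary name first, so the original path is free immediately, then
/// delete the moved tree. Even if the actual deletion lags behind (as it can
/// on Windows), re-creating the original path no longer races with it.
fn remove_dir_all_moved(path: &Path) -> io::Result<()> {
    let parent = path.parent().unwrap_or(Path::new("."));
    let name = path
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("dir");
    // Illustrative temporary name; a real implementation needs something
    // guaranteed unique (the crate uses its own scheme).
    let tmp = parent.join(format!(".rm-{}-{}", std::process::id(), name));
    fs::rename(path, &tmp)?; // frees the original name at once
    fs::remove_dir_all(&tmp) // the actual deletion can complete later
}
```

The rename stays on the same volume (a sibling of the target), which is what makes it cheap and atomic; moving to %TEMP% would not work across drives, as noted earlier in this thread.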

Malkaviel pushed a commit to Maskerad-rs/Maskerad_GameEngine that referenced this issue Nov 25, 2017
…r_all anymore, which can be buggy on Windows. RemoveDirectory marks files in the directory for deletion; it doesn't actually remove them. That means we can try to delete the directory while its files haven't yet been deleted, causing the error "The directory isn't empty".

See this issue: rust-lang/rust#29497

@steveklabnik (Member) commented Mar 7, 2019

Triage: I'm not aware of any movement on this issue.

@hntd187 commented Mar 23, 2019

I can confirm this still happens @steveklabnik

Error: Os { code: 145, kind: Other, message: "The directory is not empty." }

Just a folder with some plain-text JSON files in it. It's not a large directory, nor does it have any nested structure.

bors added a commit to rust-lang/cargo that referenced this issue Jun 19, 2019
Revert test directory cleaning change.

#6900 changed it so that the entire `cit` directory was cleaned once when tests started. Previously, each `t#` directory was deleted just before each test ran. This restores the old behavior due to problems on Windows.

The problem is that the call to `rm_rf` would fail with various errors ("Not found", "directory not empty", etc.) if you run `cargo test` twice. The first panic would poison the lazy static initializer, causing all subsequent tests to fail.

There are a variety of reasons deleting a file on Windows is difficult. My hypothesis in this case is that services like the indexing service and Defender swoop in and temporarily hold handles to files. This seems to be worse on slower systems, where presumably these services take longer to process all the files created by the test suite. It may also be related to how files are "marked for deletion" but are not immediately deleted.

The solution here is to spread out the deletion over time, giving Windows more of an opportunity to release its handles. This is a poor solution, and should only help reduce the frequency, but not entirely fix it.

I believe that this cannot be solved using `DeleteFileW`. There are more details at rust-lang/rust#29497, which is a long-standing problem that there are no good Rust implementations for recursively deleting a directory.

An example of something that implements a "safe" delete is [Cygwin's unlink implementation](https://github.com/cygwin/cygwin/blob/ad101bcb0f55f0eb1a9f60187f949c3decd855e4/winsup/cygwin/syscalls.cc#L675-L1064). As you can see, it is quite complex. Of course our use case does not need to handle quite as many edge cases, but I think any implementation is going to be nontrivial, and require Windows-specific APIs not available in std.

Note: Even before #6900 I still get a lot of errors on a slow VM (particularly "directory not empty"), with Defender and Indexing off. I'm not sure why. This PR should make it more bearable, though.

@XAMPPRocky (Member) commented Jan 11, 2020

Hey, so there is a solution to this. Someone needs to merge the remove_dir_all crate's code into the standard library. The crate was created as a quick workaround, based on an implementation from #31944 that never got merged. It's been pretty stable over the past few years, and is used in cargo and rustup.

Since creating the crate, I no longer have a Windows machine, so I do not intend to maintain it. It would be great if someone could port the code into std so that everyone can have a reliable implementation on Windows. :)

@ehuss (Contributor) commented Jan 30, 2020

I would not consider the current remove_dir_all crate as a complete solution to this. It is better, but it still frequently fails in some situations. See rust-lang/cargo#7042 as an example where clearing Cargo's test suite doesn't succeed. See that PR for a link to cygwin's implementation which always works for me, and some commentary from a former cygwin developer.

I'm not opposed to merging it, just saying it isn't a complete fix.

@programmerjake commented Jul 21, 2020

Some potentially useful reference code: Wine's implementation of SHFileOperation
