New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs::remove_dir_all rarely succeeds for large directories on windows #29497

Open
Diggsey opened this Issue Nov 1, 2015 · 20 comments

Comments

Projects
None yet
10 participants
@Diggsey
Copy link
Contributor

Diggsey commented Nov 1, 2015

I've been trying to track this one down for a while with little success. So far I've been able to confirm that no external programs are holding long-standing locks on any of the files in the directory, and I've confirmed that it happens with trivial rust programs such as the following:

use std::fs;

fn main() {
    println!("{:?}", fs::remove_dir_all("<path>"));
}

I've also confirmed that deleting the folder from explorer (equivalent to using SHFileOperation), or using rmdir on the command line will always succeed.

Currently my best guess is that either windows or other programs are holding transient locks on some of the files in the directory, possibly as part of the directory indexing windows performs to enable efficient searching.

Several factors led me to think this:

  • fs::remove_dir_all will pretty much always succeed on the second or third invocations.
  • it seems to happen most commonly when dealing with large numbers of text files
  • unlocker and other tools show no active handles to any files in the directory

Maybe there should be some sort of automated retry system, such that temporary locks do not hinder progress?

edit:
The errors returned seem somewhat random: sometimes it gives a "directory is not empty" error, while sometimes it will show up as an "access denied" error.

@sfackler sfackler added the O-windows label Nov 1, 2015

@retep998

This comment has been minimized.

Copy link
Member

retep998 commented Nov 1, 2015

I think this is just race conditions at work. When you delete a file another program might get a notification and check out what happened and that might hold a temporary lock on something. Or it might be that the previous deletions weren't actually done by the time you tried to delete the directory.

@petrochenkov

This comment has been minimized.

Copy link
Contributor

petrochenkov commented Nov 1, 2015

Hm, https://github.com/CppCon/CppCon2015/blob/master/Tutorials/Racing%20the%20Filesystem/Racing%20the%20Filesystem%20-%20Niall%20Douglas%20-%20CppCon%202015.pdf slide "3. Deleting a directory tree" may be related. I accidentally watched this presentation yesterday. If I understood correctly, DeleteFileW used in remove_dir_all doesn't actually delete a file, but only marks it for deletion, it still can be alive when we try to delete the parent directory causing the "directory is not empty" error.

@Diggsey

This comment has been minimized.

Copy link
Contributor Author

Diggsey commented Nov 1, 2015

@petrochenkov Great link - that seems like it's exactly the problem I'm experiencing.

It would be great to see some of these portable race-free idioms implemented in std.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Nov 2, 2015

I'd be interested to dig a bit more into this to see what's going on. According to the DeleteFile documentation the only reason the file would be flagged for deletion would be if there are active open handles, which seems like a legitimate race condition? Note though that the same docs also recommend using SHFileOperation for recursively deleting a tree.

Those slides are certainly an interesting read though! I'd be a little wary of putting it into the standard library as the "transactional semantics" are a little weird. For example if you successfully rename a file to a temporary directory and then fail to delete it, what happens?

It may be the case that the best implementation for this function on Windows is to just use SHFileOperation, and that may end up just taking care of these sorts of issues on its own.

@pitdicker

This comment has been minimized.

Copy link
Contributor

pitdicker commented Feb 19, 2016

I will give fixing this a try.

On Windows XP we could use SHFileOperation, and on Vista and up IFileOperation. For Windows store applications there is no easy function. But I would still like to try implementing it by walking the directories and deleting recursively.

Current problems with remove_dir_all:

  • can not remove contents if the path becomes longer than MAX_PATH
  • files may not be deleted immediately, causing remove_dir to fail
  • unable to remove read-only files

And what I hope will fix it:

  • use canonicalise() on the base dir to get support for long paths
  • generate hash
  • read-only files (possibly with hard links) TODO: can the read-only flag be stale?
    • open file
    • remove read-only flag
    • ReOpenFile with FILE_FLAG_DELETE_ON_CLOSE
    • set read-only flag
    • continue as with normal files
  • normal files/dirs
    • open with FILE_FLAG_BACKUP_SEMANTICS (for opening dirs)
      FILE_FLAG_OPEN_REPARSE_POINT (for opening reparse points)
      FILE_FLAG_DELETE_ON_CLOSE
    • atomically rename with SetFileInformationByHandle and FILE_RENAME_INFO
      (normal rename on xp)
    • move all files and subdirectories to the parent of the dir to remove
      (not to %TEMP%, may not be on the same drive)
      rename them to rm-[hash]-[nr]

I think we know deleting will fail when opening the file with FILE_FLAG_DELETE_ON_CLOSE fails, so before moving.

@pitdicker

This comment has been minimized.

Copy link
Contributor

pitdicker commented Feb 19, 2016

O, and I walked against this bug when rustbuild (is that the current name) failed after make clean because of this.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Feb 19, 2016

@pitdicker thanks for taking this on! We may want to discuss this a bit before moving much more on a solution. One question is how principled should something like fs::remove_dir_all be. For example if fs::remove_dir_all automatically handled read-only files, should fs::remove_file also do the same? Currently we have this "duality" where some fs functions are straight counterparts to the underlying system operations, but some are somewhat fancier like create_dir_all and remove_dir_all. We'd just want to make sure to have an explicitly drawn line here. Also note I'm not really thinking of an RFC here, just some more discussion before jumping to a PR.

I personally feel that remove_dir_all is fine to be fancy and do lots of implicit operations. I would continue to want, however, that remove_file and remove_dir are as straightforward as possible (e.g. don't mirror what's happening here unless it's a similarly small set of operations as to what happens today). Your proposed strategy seems reasonable to me at first glance, and I wouldn't mind reviewing more over a PR.

Curious what others think, though? cc @rust-lang/libs

@Diggsey

This comment has been minimized.

Copy link
Contributor Author

Diggsey commented Feb 19, 2016

@alexcrichton Hmm, I would prefer remove_file and remove_dir to try to give the same guarantees they have on linux if possible. Specifically that if they succeed, the file or directory should be completely gone.

It's inevitable that people will build their own abstractions over the top of these primitive operations, and I'd very much like to avoid these kind of bugs where it's sufficiently rare that it's almost impossible to track down the true cause of the problem.

@retep998

This comment has been minimized.

Copy link
Member

retep998 commented Feb 19, 2016

Specifically that if they succeed, the file or directory should be completely gone.

This wouldn't really work because files don't necessarily get deleted right away. You can't just move the file somewhere else and then delete it because there isn't always a reasonable place on the same volume to move the file to. Sitting around and indefinitely waiting for the file to vanish isn't ideal either. I'd personally just use IFileOperation for remove_dir_all and call it a day.

@Diggsey

This comment has been minimized.

Copy link
Contributor Author

Diggsey commented Feb 19, 2016

What if the file is opened with FILE_FLAG_DELETE_ON_CLOSE, no share flags specified, and then is immediately closed? Successfully opening the file means that you have exclusive access, and so closing it should delete the file.

@retep998

This comment has been minimized.

Copy link
Member

retep998 commented Feb 19, 2016

No share flags means nobody else can open it to read/write/delete it. However another handle can still be opened without read/write/delete permissions. Also what happens if you fail to open the file with no share flags? Does it error or fallback to the normal deletion method that provides no guarantees?

@pitdicker

This comment has been minimized.

Copy link
Contributor

pitdicker commented Feb 19, 2016

Great to see so many comments. Just what I hoped for!

@alexcrichton I completely agree to keep remove_file and remove_dir simple, and only let remove_dir_all do magic. Actually removing read-only files is pretty difficult when you also don't want to break hard links. http://stackoverflow.com/questions/3055668/delete-link-to-file-without-clearing-readonly-bit is an example, but I am sure I found some better way somewhere.

@Diggsey I would also like cross-platform consistency, but sometimes Windows is just to different... I think the standard library should at least expose primitives that only need one system call. Otherwise we lose performance and maybe get more vulnerable to race conditions. Maybe we can add a remove_no_matter_what in the future :)

@retep998 I have not yet written the code, so I don't know if this will really solve the problem... But if this doesn't work out I am all for IFileOperation. Meanwhile it is a nice way for me and maybe the standard library to get better at avoiding filesystem races.

I would have to test what happens with FILE_FLAG_DELETE_ON_CLOSE, but in theory if the file opens successfully we only have to move it to avoid a race of a couple of milliseconds. Directly after opening the file no one else can open the file anymore (it is not a share flag).

What is difficult is where to move a file / dir to. I think the parent of the dir we are removing is good enough. It should be on the same volume, and I think we also have write permission to it because otherwise deleting a dir is not possible (wild speculation here, it is late :))

@brson

This comment has been minimized.

Copy link
Contributor

brson commented Feb 20, 2016

I feel ok about going to extra effort to make remove_dir_all work as expected.

@alexcrichton

This comment has been minimized.

Copy link
Member

alexcrichton commented Feb 22, 2016

@pitdicker perhaps you could draw up a list of the possible implementation strategies for these functions? That'd also help in terms of weighing the pros/cons of each strategy and we could get a good idea of which terminology is associated with what. So far it sounds like IFileOperation may be the best option here, but I'm personally at least not following what FILE_FLAG_DELETE_ON_CLOSE would mean, especially in the case that you can't open it as someone else does.

@pitdicker

This comment has been minimized.

Copy link
Contributor

pitdicker commented Feb 22, 2016

To be fair IFileOperation is a bit to diffcult for me to implement. It should "just work" when removing directories. But as far is I can see it is an entirally different kind of API. For example it does not work on handles but IShellItems, and is mostly made for gui applications.

The steps I proposed are more ugly. They use the same simpler but more low-level APIs as the rest of the standard library. But there is a little window where files may temporarily show up in the parent directory of the one we removed. But these files would otherwise be the ones that cause remove_dir_all to fail currently, which is not all that often.

On Windows deleting a file is not guaranteed to be complete before the function returns. Instead a file is scheduled for deletion, when no one else has it open anymore. When Unix on the other hand deletes the file, it looks really deleted. But deletion of the inode is scheduled until everyone is done with it.

FILE_FLAG_DELETE_ON_CLOSE is a flag we can set when opening a file, together with DELETE permission as access mode. It will schedule the file (or directory) for deletion. When the flag is set (i.e. as soon as open succeeds), it is not possible for others to open the file anymore. The only question is what happens if others already have the file open. If they did not have the file open with FILE_SHARE_DELETE we should fail to open it (as we do not have DELETE permission).

Using this flag should be identical to using DeleteFile. The only reason I want to use it, is that it keeps a hanle open. And with this handle the file can be moved to a temporary location to not block removing the parent dir.

But I am not sure of anything yet. Somehow I just can't get it to work the renaming part yet, so all this is theory, not tested :(.

@Diggsey

This comment has been minimized.

Copy link
Contributor Author

Diggsey commented Feb 22, 2016

Having looked at the IFileOperation, I agree with @pitdicker that it doesn't seem like the right kind of API for low-level filesystem operations: it's very much oriented towards interactive use. (The .net framework doesn't use it either, but then Directory.Delete suffers from the exact same problem!)

I'm starting to think that maybe just using the current implementation and retrying on failure is the best option. http://stackoverflow.com/a/1703799 (The second code example has the advantage that it only retries more than once if some progress is made, and retries more often, for more deeply nested directories)

@pitdicker

This comment has been minimized.

Copy link
Contributor

pitdicker commented Feb 25, 2016

YES! I've got the manual method working.

Now to clean up my messy code, and do a lot of testing...

@pitdicker

This comment has been minimized.

Copy link
Contributor

pitdicker commented Feb 25, 2016

@Diggsey Thanks for looking up how .NET handles this!

And I like the simplicity of just retrying on failure. But I don't like that it needs a little luck. So I prefer moving files the instant before they are deleted.

@timvisee

This comment has been minimized.

Copy link
Contributor

timvisee commented Nov 15, 2016

Has a fix already been applied to std, or do we have a working fix available yet? I'm currently bumping into the same issue on Windows, and this issue seems to be open for quite a while. Sadly, the problem seems to be quite inconsistent on my machine.

Also, I don't think that retrying to delete a file a few times sound like a good solution to put into std.

@orvly

This comment has been minimized.

Copy link

orvly commented Jul 22, 2017

I just ran into a related problem myself, where after calling remove_dir_all on a directory which had just a single empty sub-directory, it completed successfully, but then trying to re-create this directory hierarchy immediately failed with Access Denied.
(I imagine this should be a rather common scenario with testing code which relies on the filesystem, first cleaning up any artifacts from the previous run and then re-creating them).

The problem is basically not just with remove_dir_all, but with any attempt to delete a directory - it might not be executed immediately, so any action done right after it, which assumes it doesn't exist, might fail.

But after going to the issue referenced right above this comment (995 in rustup) I found out about this new "remove_dir_all" crate from @Aaronepower (hosting just this single function), which uses a trick - it moves the directory first, then deletes the moved directory (as documented extensively there).
That solved my problem as well as the more general problem of deleting a directory on Windows.
remove_dir_all crate

Could its implementation replace the one in std::fs?

Malkaviel pushed a commit to Maskerad-rs/Maskerad_GameEngine that referenced this issue Nov 25, 2017

Maxime Capuano
Modified the rmrf filesystem function -> it doesn't use fs::remove_di…
…r_all anymore, which can be buggy on Windows. RemoveDirectory mark files in the directory for deletion, it doesn't remove them actually. It means we can try to delete the directory, but the files might not be deleted at time, causing an error "The directory isn't empty".

See this issue : rust-lang/rust#29497
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment