
Racing Directory Creation is OK So Long As the Directory Exists #2810

Merged (19 commits) on Feb 4, 2021

Conversation

kevinkreiser
Member

@kevinkreiser commented Jan 27, 2021

Through @ktatterso's tireless effort we have finally found the underlying issue (at least I think so) causing tile building to sometimes fail. Let me explain.

When we are in the first stage of tile building we spawn a bunch of threads to build tiles at the same time. These threads all create the directory structure as defined in GraphTile.h/cc. If there are enough tiles to be created and enough threads, it's possible for two of them to compete to create the same directories at the same time. Let's look at the structure of the filename pattern:

level/3_digits/.../3_digits.gph

Now the level determines how many subdirectories you need to make. In practice there is at least one 3-digit subdirectory (for levels 0 and 1) and up to two 3-digit subdirectories (for levels 2 and 3). @ktatterso and I theorized that something was clashing when trying to make directories for these tiles, and we were right, but at first it didn't seem like it should matter.

Assume two threads are trying to write tiles in the same directory, i.e. they both race to create the same directory tree. The thing about the directory creation is that it doesn't happen all in one go; each subdirectory is created one at a time. If any one of those creations fails, the function returns false, signaling it couldn't make the directory. At first this seems OK: surely the other thread succeeded, which is why this thread failed. But you can have two threads trying to write tiles that look like this:

Thread 1: 2/452/423/987.gph
Thread 2: 2/452/999/234.gph

Here we have the case where thread 1 races thread 2 for the creation of the subdirectory 452 and wins. Thread 2 then gives up completely and says, well, I couldn't do it. But no one else tries to make the next subdirectory for thread 2, the 999 part. Because of this, when thread 2 tries to open the file for writing it can't, because the deepest subdirectory doesn't exist.

The good news is the fix should be easy. We can simply ignore a directory-creation failure when the directory already exists and move on with the rest of the directory creation as normal. I've forgone writing a test for this as it's super hard to do given the threading concerns.

…existence just before you try to make it but after you've checked and found it wasn't there
@merkispavel
Contributor

wow! Kevin 1 : 0 bug

ktatterso
ktatterso previously approved these changes Jan 27, 2021
Contributor

@ktatterso left a comment


I merged these changes and can confirm I no longer see any issues when running valhalla_build_tiles.

valhalla/filesystem.h (outdated; resolved)
```diff
 }

-inline bool remove_all(const path& p) {
+inline std::uintmax_t remove_all(const path& p) {
```
Contributor


Modified the prototype and implementation to match C++17 std::filesystem::remove_all().

test/filesystem.cc (outdated; resolved)
test/filesystem.cc (outdated; resolved)
test/filesystem.cc (outdated; resolved)
valhalla/filesystem.h (outdated; resolved)
@mookerji
Contributor

mookerji commented Feb 2, 2021

I made a few comments, annoyingly largely around syntax and formatting. I would have expected a few of these to be caught by clang-format and rejected during CI, so that was a bit surprising.

@danpat
Member

danpat commented Feb 2, 2021

@mookerji I think the things you're flagging are only caught by clang-tidy (for example, braces around statements), and we only have a limited number of checks enabled, IIRC mostly because clang-tidy is historically slow.

test/filesystem.cc (outdated; resolved)
test/filesystem.cc (outdated; resolved)
test/filesystem.cc (outdated; resolved)
test/filesystem.cc (outdated; resolved)
Comment on lines +427 to +435
// it can occur that another thread is removing the same file
// object at the same time. Occasionally it will occur that
// while one thread is deleting (thread 1), the other thread
// (thread 2) will receive -1 from ::remove() and (errno==ENOENT
// or errno==EINVAL). And yet, the file still exists until thread 1
// is finished deleting. In this exact moment, if thread 2 calls
// stat() on the file object in question it will get 0 which means
// "file object exists". We kinda want thread 2 to wait for the
// file object to truly go out of existence. Hence, this spin loop.
Member Author


While I get that this is a nice feature, I really don't think it should be the filesystem library's concern if you write code that races to delete the same paths. That synchronization should be handled in the calling application. Indeed, I had a look at both the boost implementation as well as another filesystem implementation (https://github.com/gulrak/filesystem) and neither of them implements such a tactic:
https://github.com/gulrak/filesystem/blob/de57485877865156370f5e9878592293645b207e/include/ghc/filesystem.hpp#L4396
https://github.com/boostorg/filesystem/blob/develop/src/operations.cpp#L229

@ktatterso what was the issue that triggered you looking into this anyway? I've kind of lost the plot at this point 😄 I can't think of any multithreaded calls to remove. In mjolnir, at the beginning of the process, we remove_all existing tile directories in the main thread... what am I missing here?

Member Author


Yeah, it seems like it's not something we really need to worry about at this time. The only call was the one that I remembered off the top of my head; the rest are all in tests, which definitely shouldn't race because they definitely shouldn't be using the same tile directories:

(screenshot of the remove_all call sites omitted)

Now, I get that you could be racing at the process level, but again, should filesystem really be the one protecting against that? I guess since it has such a small impact it's probably OK, but it's again likely a deviation from the standard implementation. I'm OK with whichever direction you want to go. Want to just ship what we've got?

Contributor


Multithreaded folder creation motivated the creation of the unit test. Deciding to test multithreaded directory removal was simply my idea. I figured if we want to be robust in multithreaded directory creation, we would also want to be robust in multithreaded directory removal. In a sense, valhalla's filesystem::remove_all() is the "calling application", so IMO we should take responsibility for it being robust.

In my mind, I wrote a test that should work (threads concurrently calling filesystem::remove_all()). It exposed an issue and I've resolved the issue. No need to wait for a customer to find this issue, we can fix it now.

@kevinkreiser
Member Author

@ktatterso can you put a ship on this so we can merge it when the build finishes?

@ktatterso
Contributor

As requested I switched the EXPECT_* macros to ASSERT_*. Then I merged with master. And although the pipeline is still running I've approved the PR.
