-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Fix some race conditions in the build system #752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix some race conditions in the build system #752
Conversation
5c9cf52 to
23c5a14
Compare
|
Just for information, this was tested on KNL with 200 threads without a crash. More testing is needed, but I think that this is definitely a step in the right direction. Thanks @Teemperor for worknig on this! |
4e228e1 to
896503a
Compare
|
|
||
| std::string getTmpFileName(const std::string &filename) { | ||
| return filename + "_tmp"; | ||
| return filename + "_tmp_" + std::to_string(getpid()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this problematic - this will mean we'll never fix the underlying build system issue that triggers concurrent invocations on the same dictionary. Could you add a note to the relevant Jira ticket https://sft.its.cern.ch/jira/browse/ROOT-8244 that this commit should be reverted once the issue is fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part can actually go away now, as far as I can tell. It was needed before because more than one process was trying to make the dictionaries, and one process would move the files of the other, causing a crash as seen here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the CMake bug tracker and the CMake mailing list, the underlying issue is a design flaw in Make (without GNU extensions) which prevents from coordinating the generation of custom_command targets between multiple jobs (this is also why Pere doesn't see this in Ninja). See the man page of "add_custom_command" which states that we should never depend on the output files directly, but instead on a custom target. If we don't do this, then the make jobs can't communicate who is responsible for this target and then produce it multiple times (usually twice from what I saw).
We already have this custom target in the current CMakeLists.txt, we just had the wrong dependencies based on it (because of the inconsistent naming). Those missing links are fixed in the commit about fixing the dependencies, which should prevent this issue in the way the CMake manual recommends (even though it seems they still make no guarantee that this works as intended on every platform).
I also added the commented out debug code which helps figuring out if dependencies are missing for rootcling invocations (the developer warnings are enabled by default in CMake, so there isn't a better way of doing this without spamming the compile log). This should fix the problem of "How do we debug this in the future" in a systematic (and better) way than hoping to find race conditions by trial-and-error and trying to increase the error-chance by using always the same file name.
Also, we should behave like every other generator/compiler which tries to further mitigate this problem by choosing process unique tmp names (see clang etc. with their FILE.o-UNIQUE names). We don't know if someone generates build files for something really obscure (e.g. Visual Studio or an old Make version) which chooses the target order in a different way (e.g. on the CMake mailing list people complained that VS seems to have yet another behavior for building these things1 ), so let's at least give those people at least some relief in the future by making it less likely that things blow up when building :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Teemperor Can you drop this commit from the PR? I think you agree it's not needed anymore, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amadio I think I ninja'd you there :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know if someone generates build files for something really obscure (e.g. Visual Studio or an old Make version) which chooses the target order in a different way
That's not the issue.
What you're saying is that build systems might trigger concurrent commands to generate the same target. That's horrendous. It's a bug. It would mean e.g. that the same library is linked twice in parallel. I buy your argument that this is a consequence of your CMake-based configuration to use add_custom_command.
I'd hope that we can not use add_custom_command and instead use something that allows CMake to construct a proper dependency tree.
Can we keep the original tmp name, issue an warning message if that file exists and then go use a unique name? That would allows us to detect whether the build system is generating the same dictionary in multiple concurrent processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Axel-Naumann nope, no way around add_custom_command and add_custom_target here. LLVM is in the same situation with TableGen (even though I'm not sure if they do unique temp files). But we can easily check from CMake if someone forget to link to the custom_target as we have quite obvious naming conventions (and we further prevent this when we refactor most stuff to use the ROOT_STANDARD_WHATEVER function from RootNewMacros).
I prefer the first suggestion with reverting the "_tmp" commit before we start turning rootcling into a tool for checking our CMake build system files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @Teemperor . Anything that prevents CMake from spawning two rootcling processes to build the same product is majorly appreciated by anyone building with LLVM_BUILD_TYPE=Debug! :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Teemperor ping - can this be reverted so we see build system issues again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Teemperor and/or should be it be replaced by a 'if already exist' fail?
core/clingutils/CMakeLists.txt
Outdated
| string(REPLACE "multi" "" header ${header}) | ||
| ROOT_GENERATE_DICTIONARY(G__std_${dict} ${header} STAGE1 MODULE ${dict}Dict LINKDEF src/${dict}Linkdef.h DEPENDENCIES Core) | ||
| ROOT_LINKER_LIBRARY(${dict}Dict G__std_${dict}.cxx DEPENDENCIES Core) | ||
| ROOT_LINKER_LIBRARY(std_${dict} G__std_${dict}.cxx DEPENDENCIES Core) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this maybe the source of https://sft.its.cern.ch/jira/browse/ROOT-8244 ? How does this renaming from ${dict}Dict to std_${dict} relate to ROOT_GENERATE_DICTIONARY's MODULE ${dict}Dict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, see the commit about fixing the naming which states that this prevents dependency linking between compiling and generating (and my comment above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is the cause. I added a comment in ROOT-8244 explaining the problem.
|
I also added the dependencies for rootcling regarding the BUILTINS. This should prevent the unlikely case that rootcling starts before the BUILTINS are ready (which is more likely once we don't have to build LLVM anymore). |
d0a2a2a to
330ecf2
Compare
|
Ok, seems like touching the module names is making everything collapse. Let's just change the G__*.cxx filenames... |
6d5e4ff to
2ec7efc
Compare
|
Ok, all tests pass now on all platforms. I don't see any traces for missing dependencies anymore (e.g. multiple |
ROOT_GENERATE_DICTIONARY allows to specify dependencies, but we currently don't actually specify those. As we need this for getting the module dependencies right (as we can't build missing modules on demand), we should add those dependencies here. Then those dependencies propagate to the rootcling invocation which will in the future also generate the C++ module for the selected dictionary.
In case CMake spawns multiple instances of a rootcling invocation (which can happen as CMake doesn't have a good mechanism for ensuring that we only spawn one when using the make build system), we now choose process-unique file names for the temporaries to prevent the build errors when multiple rootcling invocations try to work on the same temporary file. This isn't a real fix, but as renaming files from temporary to the real name is usually FS atomic, this will mitigate any future race conditions we have in this problematic part of the building process on obscure build systems. Those systems possible don't strictly follow the order in which they are supposed to build those targets, so this hopefully makes the consequences of any race conditions less intrusive (e.g. Visual Studio also seems to suffer from this problem according to the mailing list).
2ec7efc to
d25a03d
Compare
|
Starting build on |
The naming of the library in the two function calls ROOT_GENERATE_DICTIONARY and ROOT_LINKER_LIBRARY was often not matching. This causes that ROOT_LINKER_LIBRARY couldn't correctly set its dependencies to the output file of ROOT_GENERATE_DICTIONARY and therefore causes race conditions in the build system. Note: We're attached all dependencies to a custom target, and NOT to the output files generated by rootcling. This should mitigate the race conditions we experience when multiple targets in different Make jobs request the same output file as a dependency and then suffer from this Make problem that is described here: https://cmake.org/Bug/view.php?id=10082 This patch also adds a (commented out) piece of code that can print warnings if we don't have a fitting G__*.cxx file for a ROOT_LINKER_LIBRARY call. This is sometimes intended, so I didn't enable this warning by default. We should enable this code by default in the future once we have a way to express if we intentionally don't provide the G__*.cxx file when calling ROOT_LINKER_LIBRARY.
d25a03d to
8dd8abc
Compare
|
Starting build on |
Right now rootcling doesn't know if the headers depend on a target that is provided by a BUILTIN. This patch adds a new parameter that allows expressing if thie rootcling invocation depends on a certain BUILTIN in a way that is identical to the ROOT_LINKER_LIBRARY way of handling this.
|
@Axel-Naumann Could you please take a final look and merge? I believe this PR fixes ROOT-8244. I tested with @Teemperor Thanks for fixing this! |
graf3d/x3d/CMakeLists.txt
Outdated
|
|
||
| include_directories(${X11_INCLUDE_DIR}) | ||
|
|
||
| ROOT_GENERATE_DICTIONARY(G__X3D TViewerX3D.h TX3DFrame.h MODULE ${libname} LINKDEF LinkDef.h DEPENDENCIES Graf3d Gui) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should be using G__${libname} all over the place to keep things consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but that's out of the scope of this particular PR. It can be addressed in a new PR later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two single calls will anyway disappear when we can replace them with the ROOT_STANDARD_LIBRARY_PACKAGE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amadio, I feel fairly strongly about this. If we aim to improve something we have to go the last mile.
@Teemperor, then let's replace this with ROOT_STANDARD_LIBRARY_PACKAGE. I think we can afford to wait for a few days and fix this once and for all.
|
Starting build on |
So far GENERATE_DICTIONARY depends on all the targets passed to its
DEPENDENCIES argument. However, this means for some targets not
only the generation of PCMs, rootmaps etc, but also the linking of
this target (as for example the target `Hist` generates Hist.pcm
and then also links libHist.so). We only care about the files
generated by rootcling when we specify the dependencies here,
so we can improve build performance here.
This patch creates a new target for each dictionary generation
command called ROOTCLING_{MODULE} which can be refernced by other
rootcling invocations to state that they depend on the
rootcling files (PCMs, rootmap, root-PCMs) of this module, but
not on a fully built module. We then start checking for each
dependency passed to GENERATE_DICTIONARY if there is a
ROOTCLING_MODULE target and depend on this if possible.
This should cause that all the rootcling invocations can be
started earlier and the linking of modules and rootcling
invocations of its dependencies now run in parallel.
|
@amadio, I'd prefer @Axel-Naumann to have given his green light and my comments to be addressed. Why did you find this so urgent to merge? |
|
We need to test thoroughly, and I wanted to give it a go at the nightlies. |
|
Let's take this discussion offline. |
With this patch we can finally do
make -j200! (And I no longer have to explain to the summer students why they have to typemake Cling; maketo compile root :) ). We also get the module dependencies for rootcling right.See the specific commits for the individual fixes.