[WIP] Load libCling with RTLD_DEEPBIND to avoid collisions of LLVM symbols #4668
Can one of the admins verify this patch? |
@phsft-bot build |
Starting build on |
Build failed on mac1014/cxx17. Errors:
|
@davidrohr We have at least Mac which does not have RTLD_DEEPBIND and several platforms where things are not working properly .. @Axel-Naumann might be able to give more information on why that is the case. |
@pcanal : Thx, I have seen that. For MacOS, as I said in the OP, I was expecting problems.. |
core/base/src/TROOT.cxx
Outdated
@@ -2069,7 +2073,7 @@ void TROOT::InitInterpreter()
 }

 char *libcling = gSystem->DynamicPathName("libCling");
-gInterpreterLib = dlopen(libcling, RTLD_LAZY|RTLD_LOCAL);
+gInterpreterLib = dlopen(libcling, RTLD_LAZY|RTLD_LOCAL|RTLD_DEEPBIND);
In order to test the proposed approach we should remove this line https://github.com/root-project/root/blob/master/interpreter/CMakeLists.txt#L117
Why would you remove that line?
|
I missed the part where your intent is to lift the requirement that other LLVM libraries be compiled with hidden visibility. Could you describe your setup in a bit more detail? dlopen-ing libCling is one side of the problem; the other is the JIT symbol resolution (https://github.com/root-project/root/blob/master/interpreter/cling/lib/Interpreter/IncrementalJIT.cpp#L299). I suspect the latter is the issue. PS: Can you paste the issue you have, along with a particular code snippet and the error message?
I have trouble compiling the ALICE O2 with ROOT and some other libraries, which come with LLVM.
I was getting the error
I fully agree that the problem is most likely due to just-in-time resolving of symbols. But I am wondering why my patch would break something in the ROOT ctests. Before my patch, the check would make sure that there are no other LLVM symbols present. But when there are no other symbols present, my patch shouldn't change anything. One could try to open libCling with RTLD_NOW instead of RTLD_LAZY, but I am not sure whether that would change anything.
Example:
with the following command (using a system installation of apache-arrow with gandiva):
will show the
The problem with the OpenCL runtime is analogous.
For reference: I just tried the ctests also locally with and without RTLD_DEEPBIND, and I can confirm that the DEEPBIND option makes some of them crash also for me. So unfortunately, this patch is not working as intended. |
So, would the error still be there if you change the example to something like:
why is |
Well, in that case it depends on what is loaded first, but there could be other static objects loading symbols from the other LLVM, so even if this would work, it would be only by chance. It just depends on the order. |
The reason is that we link to libgandiva (we do not dlopen it). ROOT does not link to libCling, but InitInterpreter() is called after the main(), so it will always be after libgandiva was opened. I agree, that could be avoided if we would dlopen libgandiva, and make sure to do gROOT->GetInterpreter() beforehand, but this would require some changes to our software. And also this problem is not specific to libgandiva only, but it would affect any library that uses LLVM. |
In the meantime, I have found the root cause of why my patch fails:
The problem can be avoided if executables are compiled with -fPIC as well. |
You should be able to guarantee what gets initialized first, either in your codebase or via the linker. I wish there was a better or even feasible-to-implement way to solve this more elegantly. The underlying issue is that whenever there is a symbol unknown to the interpreter, it will ask the JIT to resolve it. The JIT tries to resolve the symbol via the usual dynamic linker rules and, as a last resort, gives control to ROOT. ROOT, in turn, uses dlsym and dladdr (which have platform-specific bugs) to find the unknown symbol (https://github.com/root-project/root/blob/master/core/metacling/src/TCling.cxx#L6418). Unfortunately, we do not have enough information at that point to tell whether a symbol is supposed to come from libCling or not. Thus we follow a conservative strategy: resolve as much as we can from libCling and, if something slips through, use later-dlopened libraries. I presume a somewhat better fix would be to make a dlsym and a dlsym in libCling and always return the version of the symbol in libCling. This would be a major change which should happen after the upcoming release...
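As an illustration of the dlsym/dladdr fallback described above, here is a minimal sketch (not ROOT's actual code). It looks a symbol up through the default search order and asks dladdr which loaded object provides it. The helper name `ProvidingLibrary` is hypothetical, and the sketch assumes a platform where `RTLD_DEFAULT` and `dladdr` are available (glibc with `_GNU_SOURCE`, which g++ defines by default, or macOS); older glibc versions may also need `-ldl` at link time.

```cpp
#include <dlfcn.h>
#include <string>

// Hypothetical helper (not part of ROOT): resolve `symbol` through the default
// dynamic-linker search order and report which loaded object provides it.
// Returns an empty string if the symbol is not found or cannot be attributed.
std::string ProvidingLibrary(const char *symbol)
{
   // RTLD_DEFAULT: search the global scope in load order, so the first
   // library that exports the name wins.
   void *addr = dlsym(RTLD_DEFAULT, symbol);
   if (!addr)
      return "";

   // dladdr maps an address back to the object that contains it.
   Dl_info info;
   if (dladdr(addr, &info) == 0 || !info.dli_fname)
      return "";
   return info.dli_fname;
}
```

Calling `ProvidingLibrary("malloc")` would typically return the path of the C library (or of whatever object interposes `malloc`). The lookup alone cannot say which copy of a duplicated symbol the caller intended, which is the ambiguity discussed in this thread.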
@vgvassilev : I do not see how I could control this. |
Unfortunately, those should be carefully attended because of ROOT. I would feel more comfortable if my stack knows which libraries depend on LLVM to avoid pain debugging ROOT.
I am not a huge fan of the
IIUC,
This check is to protect the subsequent root/interpreter/cling/lib/Interpreter/IncrementalJIT.cpp Lines 299 to 302 in 39630b7
The challenge is to come up with a consistent symbol resolution :) |
Well, the problem is that this is not so easy to control. LLVM can come in from a dependency chain via many libraries like OpenCL / Vulkan / arrow. And I am afraid this will become more complicated in the future. Instead of messing with each of them, I thought it might be better to fix the issue in a single place on the ROOT side.
I agree, me neither. If we can find a better and cleaner way, I am absolutely in favor of that.
All shared libraries must be compiled with -fPIC by definition, so libgandiva is already compiled with -fPIC. The change would only be for executables, which usually are not compiled with -fPIC by default. But then other libraries actually have similar requirements, e.g. Qt5 (with the -reduce-relocation flag, which is the default) requires all executables that link against Qt to be compiled with -fPIC. But again, if there is a better way, I am also in favor of avoiding -fPIC.
|
This likely would not fix the global statics.
We are in the versioning hell, as the version of the system LLVM might differ from the ones the |
I do not see how that would break with global statics. All root-builtin llvm/clang statics would just go to the namespace as well.
|
… instead of forcing hidden symbols in other LLVM versions
@vgvassilev : Could you run the CI again? I pushed a new version, that might work at least for ALICE, using the fPIC fix I described above. |
@phsft-bot build! |
Starting build on |
Build failed on windows10/cxx14. |
Do I understand correctly, that only the windows build failed? From the log, I actually do not understand what was the problem. |
How's #4689 doing for you, @davidrohr ? |
It's a bug in our CI infra that forces everyone to also fork roottest... |
This patch will not work out as it is for ROOT, since it has too many side effects. Closing. |
After commit 03790ac ("cmake: remove dynamic-list linker option"), the following test failure first appeared:

[001] box/push.test.lua
[001]
[001] [Instance "box" returns with non-zero exit code: 1]
[001]
[001] Last 15 lines of Tarantool Log file [Instance "box"][test/var/001_box/box.log]:
[001] ==25624==ERROR: AddressSanitizer: odr-violation (0x000001123b60):
[001] [1] size=1024 'mp_type_hint' src/lib/msgpuck/hints.c:39:20
[001] [2] size=1024 'mp_type_hint' src/lib/msgpuck/hints.c:39:20
[001] These globals were registered at these points:
[001] [1]:
[001] #0 0x478b8e in __asan_register_globals (src/tarantool+0x478b8e)
[001] #1 0x7ff7a9bc9d0b in asan.module_ctor (function1.so+0x6d0b)
[001]
[001] [2]:
[001] #0 0x478b8e in __asan_register_globals (src/tarantool+0x478b8e)
[001] #1 0xab990b in asan.module_ctor (src/tarantool+0xab990b)
[001]
[001] ==25624==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
[001] SUMMARY: AddressSanitizer: odr-violation: global 'mp_type_hint' at src/lib/msgpuck/hints.c:39:20
[001] ==25624==ABORTING
[001] [ fail ]

Issue #5001 was created for it. The failure was described there by Vladislav Shpilevoy:

"""
I see why ASAN complains about mp_type_hint. This is because the symbol is defined both in Tarantool executable, and in the shared library function1.so. I think this is fine, and should be ignored. But it definitely has nothing to do with the current ticket. The problem existed always, but asan noticed it only now somewhy. And it is not a problem actually.
"""

He also suggested trying RTLD_DEEPBIND, but in reality it is not supported on OSX; a discussion of the same issue can be found here: root-project/root#4668

The initial issue was closed and a new one was created specifically for the test. The fix adds src/lib/msgpuck/hints.c to the ASAN suppression list to disable the ASAN check for that file.

Closes #5023
This is an alternative approach to solve the problem of colliding LLVM symbols, if other libraries bring in their own LLVM.
The original approach was to force other LLVM libraries to be compiled with -fvisibility=hidden or being opened with dlopen after TROOT::InitInterpreter().
This patch solves the issue on the ROOT side, which seems to me the much cleaner approach because we do not pose any limitations on 3rd party libraries.
I tried it locally and it works for me.
Marked as "Work in Progress", since this might need some more thought, in particular for other OS, and for old glibc versions that do not support RTLD_DEEPBIND.
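For the portability concern above, one possible shape is to add RTLD_DEEPBIND only where the platform headers define it. This is a minimal sketch under that assumption, not the code in this patch; the helper name `InterpreterDlopenFlags` is hypothetical.

```cpp
#include <dlfcn.h>

// Hypothetical helper (not part of ROOT): returns the dlopen flags used in
// this patch, adding RTLD_DEEPBIND only if <dlfcn.h> defines it. On platforms
// without it (e.g. macOS), the flags fall back to the previous behaviour.
int InterpreterDlopenFlags()
{
   int flags = RTLD_LAZY | RTLD_LOCAL;
#ifdef RTLD_DEEPBIND
   flags |= RTLD_DEEPBIND;
#endif
   return flags;
}
```

The call in TROOT::InitInterpreter() could then read `dlopen(libcling, InterpreterDlopenFlags())`. A compile-time check only covers the headers used for the build; whether the runtime glibc honours the flag, and the side effects discussed in this thread, would still need to be handled separately.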