Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Load libCling with RTLD_DEEPBIND to avoid collissions of LLVM symbols #4668

Closed
wants to merge 1 commit into from

Conversation

davidrohr
Copy link

This is an alternative approach to solve the problem of colliding LLVM symbols, if other libraries bring in their own LLVM.
The original approach was to force other LLVM libraries to be compiled with -fvisibility=hidden or being opened with dlopen after TROOT::InitInterpreter().

This patch solves the issue on the ROOT side, which seems to me the much cleaner approach because we do not pose any limitations on 3rd party libraries.

I tried it locally and it works for me.

Marked as "Work in Progress", since this might need some more thought, in particular for other OS, and for old glibc versions that do not support RTLD_DEEPBIND.

@phsft-bot
Copy link
Collaborator

Can one of the admins verify this patch?

@davidrohr
Copy link
Author

@ktf @aalkin: ping: this is an alternative solution that works for me to solve the LLVM symbol issue.

@pcanal
Copy link
Member

pcanal commented Dec 3, 2019

@phsft-bot build

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-performance-centos7-multicore/default, ROOT-fedora27/noimt, ROOT-fedora29/python3, ROOT-fedora30/cxx14, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu18.04-i386/cxx14, mac1014/cxx17, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on mac1014/cxx17.
See console output.

Errors:

  • FAILED: core/CMakeFiles/BaseTROOT.dir/base/src/TROOT.cxx.o
  • /build/jenkins/workspace/root-pullrequests-build/root/core/base/src/TROOT.cxx:2076:63: error: use of undeclared identifier 'RTLD_DEEPBIND'

@pcanal
Copy link
Member

pcanal commented Dec 3, 2019

@davidrohr We have at least Mac which does not have RTLD_DEEPBIND and several platforms where things are not working properly .. @Axel-Naumann might be able to give more information on why that is the case.

@davidrohr
Copy link
Author

@pcanal : Thx, I have seen that. For MacOS, as I said in the OP, I was expecting problems..
I am really puzzled by the segfaults though. I mean, if it was working before, I do not see how the RTLD_DEEPBIND should break something.
That basically means, that there is already a global symbol of something loaded, that should have precedence over symbols in libCling and libCling's dependencies. But OK, I am not that much an expert of the ROOT internals.

@@ -2069,7 +2073,7 @@ void TROOT::InitInterpreter()
}

char *libcling = gSystem->DynamicPathName("libCling");
gInterpreterLib = dlopen(libcling, RTLD_LAZY|RTLD_LOCAL);
gInterpreterLib = dlopen(libcling, RTLD_LAZY|RTLD_LOCAL|RTLD_DEEPBIND);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to test the proposed approach we should remove this line https://github.com/root-project/root/blob/master/interpreter/CMakeLists.txt#L117

@davidrohr
Copy link
Author

Why would you remove that line?

  • This lines hides ROOT's LLVM 5 symbols from other libraries, so this should be fine is certainly what we want.
  • What my patch should do is is kind of the other way around, if there are other libraries exposing LLVM symbols, ROOT should still take its own LLVM 5 symbols. In this way, other libraries do not need to hide their symbols.

@vgvassilev
Copy link
Member

vgvassilev commented Dec 4, 2019

Why would you remove that line?

  • This lines hides ROOT's LLVM 5 symbols from other libraries, so this should be fine is certainly what we want.
  • What my patch should do is is kind of the other way around, if there are other libraries exposing LLVM symbols, ROOT should still take its own LLVM 5 symbols. In this way, other libraries do not need to hide their symbols.

I’ve missed a bit where your intent is to lift the requirements on other llvm libraries being compiled with hidden visibility. Could you describe your setup in a bit more detail? dlopen-ing libCling is one side of the problem the other is the jit symbol resolution (https://github.com/root-project/root/blob/master/interpreter/cling/lib/Interpreter/IncrementalJIT.cpp#L299) I suspect the latter is the issue.

PS: Can you paste the issue you have and a particular code snippet and error message.

@davidrohr
Copy link
Author

I have trouble compiling the ALICE O2 with ROOT and some other libraries, which come with LLVM.
For me, this is in particular:

  • arrow with gandiva
  • glfw with vulkan support
  • several OpenCL runtimes, which use LLVM

I was getting the error ...please hide them or dlopen() them after the call to TROOT::InitInterpreter() and a segmentation fault thereafter. Now, it is a bit complicated to require from all 3rd party libraries that they do not export LLVM symbols, so I was trying to solve the problem on the ROOT side.

I fully agree that the problem is most likely due to just in time resolving of symbols. But I am wondering why my patch would break something in the ROOT ctests. Before my patch, the check would make sure that there are no other LLVM symbols present. But when there are no other symbols present, my patch shouldn't change anything.

One could try to open libCling with RTLD_NOW instead of RTLD_LAZY, but I am not sure whether that would change anything.

@davidrohr
Copy link
Author

Example:

#include <TROOT.h>
int main(int argc, char** argv) {
  gROOT->GetInterpreter();
  return 0;
}

with the following command (using a system-installation of apache-arrow width gandiva):

c++ -o test -O0 -ggdb `root-config --libs` -I`root-config --incdir` -std=c++17 test.cpp /usr/lib64/libgandiva.so.14.1.0 && ./test

will show the ...please hide them or dlopen() them after the call to TROOT::InitInterpreter() error.

The problem with the opencl runtime is analogous.

@davidrohr
Copy link
Author

For reference: I just tried the ctests also locally with and without RTLD_DEEPBIND, and I can confirm that the DEEPBIND option makes some of them crash also for me.
For some reason, none of the tests in the ALICE software with ROOT have triggered this problem.
I tried also with RTLD_NOW instead of RTLD_LAZY but that doesn't change anything.

So unfortunately, this patch is not working as intended.
I'd still like to understand what is the problem with it, since I still think this is the much better solution compared to requiring that all 3rd party libraries hide LLVM symbols.

@vgvassilev
Copy link
Member

Example:

#include <TROOT.h>
int main(int argc, char** argv) {
  gROOT->GetInterpreter();
  return 0;
}

with the following command (using a system-installation of apache-arrow width gandiva):

c++ -o test -O0 -ggdb `root-config --libs` -I`root-config --incdir` -std=c++17 test.cpp /usr/lib64/libgandiva.so.14.1.0 && ./test

will show the ...please hide them or dlopen() them after the call to TROOT::InitInterpreter() error.

The problem with the opencl runtime is analogous.

So, would the error still be there if you change the example to something like:

#include <TROOT.h>
static auto force_init =  gROOT->GetInterpreter();
int main(int argc, char** argv) {
  return 0;
}

why is /usr/lib64/libgandiva.so.14.1.0 initialized/dlopened first?

@davidrohr
Copy link
Author

So, would the error still be there if you change the example to something like:

#include <TROOT.h>
static auto force_init =  gROOT->GetInterpreter();
int main(int argc, char** argv) {
  return 0;
}

Well, in that case it depends on what is loaded first, but there could be other static objects loading symbols from the other LLVM, so even if this would work, it would be only by chance. It just depends on the order.

@davidrohr
Copy link
Author

why is /usr/lib64/libgandiva.so.14.1.0 initialized/dlopened first?

The reason is that we link to libgandiva (we do not dlopen it). ROOT does not link to libCling, but InitInterpreter() is called after the main(), so it will always be after libgandiva was opened.

I agree, that could be avoided if we would dlopen libgandiva, and make sure to do gROOT->GetInterpreter() beforehand, but this would require some changes to our software. And also this problem is not specific to libgandiva only, but it would affect any library that uses LLVM.

@davidrohr
Copy link
Author

I have in the mean time found the root cause why my patch fails:

  • It is due to copy relocations of symbols from libstdc++
  • Crashes happen whenever cling wants to output text via cout / cerr / etc.
  • The problem is that there are 2 instances of cout / cerr: The first created by the executable itself through copy relocation, when the executable prints something via cout. The second created by libCling loaded with RTLD_DEEPBIND. (By design, RTLD_DEEPBIND links to libstdc++ first before checking the global namespace, so it does not see the copy-relocated symbol.)

The problem can be avoided if executables are compiled with -fPIC as well.
I just tried locally to compile the failing ROOT ctests with -fPIC and this fixed the segmentation faults.
The question is whether this is a proper solution, since it would require to compile all executables with -fPIC. On the other hand, I don't know how to fix this in another way. Perhaps this could be enabled via a special CMake option for ROOT, which enables RTLD_DEEPBIND for libCling and -fPIC for all executables.

@vgvassilev
Copy link
Member

You should be able to guarantee what gets initialized first, either in your codebase or via the linker. I wish there was a better or even feasible-to-implement way to solve this more elegantly.

The underlying issue is that whenever there in unknown (to the interpreter) symbol it will ask the JIT to resolve it. It tries to resolve the symbol via the usual dynamic linker rules and as a last resort it gives the control to ROOT. ROOT, in turn, uses dlsym and dladdr (which have platform-specific bugs) to find the unknown symbol (https://github.com/root-project/root/blob/master/core/metacling/src/TCling.cxx#L6418).

Unfortunately, we do not have enough information at that point to be able to distinguish between which symbol is supposed to come from libCling or not. Thus we have a conservative strategy in resolving as much as we can from libCling and if something slips through use later-dlopened libraries.

I presume a somewhat better fix would be to make a dlsym and a dlsym in libCling and always return the version of the symbol in libCling. This would be a major change which should happen after the upcoming release...

@davidrohr
Copy link
Author

@vgvassilev : I do not see how I could control this.
https://github.com/root-project/root/blob/master/core/base/src/TROOT.cxx#L2037 states that InitInterpreter() should be called after main(). Therefore, as we link to libgandiva, which links to llvm libs, the system LLVM symbols will always be in the global namespace by design. They are loaded before main(), while InitInterpreter() loads libCling only after main().

@vgvassilev
Copy link
Member

why is /usr/lib64/libgandiva.so.14.1.0 initialized/dlopened first?

The reason is that we link to libgandiva (we do not dlopen it). ROOT does not link to libCling, but InitInterpreter() is called after the main(), so it will always be after libgandiva was opened.

I agree, that could be avoided if we would dlopen libgandiva, and make sure to do gROOT->GetInterpreter() beforehand, but this would require some changes to our software. And also this problem is not specific to libgandiva only, but it would affect any library that uses LLVM.

Unfortunately, those should be carefully attended because of ROOT. I would feel more comfortable if my stack knows which libraries depend on LLVM to avoid pain debugging ROOT.

I have in the mean time found the root cause why my patch fails:

* It is due to copy relocations of  symbols from libstdc++

* Crashes happen whenever cling wants to output text via cout / cerr / etc.

* The problem is that there are 2 instances of cout / cerr: The first created by the executable itself through copy relocation, when the executable prints something via cout. The second created by libCling loaded with RTLD_DEEPBIND. (By design, RTLD_DEEPBIND links to libstdc++ first before checking the global namespace, so it does not see the copy-relocated symbol.)

I am not a huge fan of the RTLD_DEEPBIND approach because it is not portable and it essentially overrides the dynamic linker behavior yet one more time. I am not sure how this would translate to Windows for example.

The problem can be avoided if executables are compiled with -fPIC as well.
I just tried locally to compile the failing ROOT ctests with -fPIC and this fixed the segmentation faults.
The question is whether this is a proper solution, since it would require to compile all executables with -fPIC. On the other hand, I don't know how to fix this in another way. Perhaps this could be enabled via a special CMake option for ROOT, which enables RTLD_DEEPBIND for libCling and -fPIC for all executables.

IIUC, RTLD_DEEPBIND alters the search order of the dynamic linker and using -fPIC intends to deduplicate. That seems as bad as preloading a set of llvm dependent libraries, IMO. Do we need libgandiva.so to be compiled with -fPIC or just the executables which 'use' ROOT?

@vgvassilev : I do not see how I could control this.
https://github.com/root-project/root/blob/master/core/base/src/TROOT.cxx#L2037 states that InitInterpreter() should be called after main(). Therefore, as we link to libgandiva, which links to llvm libs, the system LLVM symbols will always be in the global namespace by design. They are loaded before main(), while InitInterpreter() loads libCling only after main().

This check is to protect the subsequent dlsyms in TCling to resolve symbols from the wrong place. We can fix the check and play with the JIT resolution logic by playing with the search order here:

llvm::sys::DynamicLibrary::SearchOrder
= llvm::sys::DynamicLibrary::SO_LoadedFirst;
// Enable JIT symbol resolution from the binary.
llvm::sys::DynamicLibrary::LoadLibraryPermanently(0, 0);

The challenge is to come up with a consistent symbol resolution :)

@davidrohr
Copy link
Author

The reason is that we link to libgandiva (we do not dlopen it). ROOT does not link to libCling, but InitInterpreter() is called after the main(), so it will always be after libgandiva was opened.
I agree, that could be avoided if we would dlopen libgandiva, and make sure to do gROOT->GetInterpreter() beforehand, but this would require some changes to our software. And also this problem is not specific to libgandiva only, but it would affect any library that uses LLVM.

Unfortunately, those should be carefully attended because of ROOT. I would feel more comfortable if my stack knows which libraries depend on LLVM to avoid pain debugging ROOT.

Well, the problem is that this is no so easy to control. LLVM can come in from a dependency chain via many libraries like OpenCL / Vulkan / arrow. And I am afraid this will become more complicated in the future. Instead of messing with each of them, I thought it might be better to fix the issue in a single place on the ROOT side.

I am not a huge fan of the RTLD_DEEPBIND approach because it is not portable and it essentially overrides the dynamic linker behavior yet one more time. I am not sure how this would translate to Windows for example.

I agree, me neither. If we can find a better and cleaner way, I am absolutely in favor of that.

IIUC, RTLD_DEEPBIND alters the search order of the dynamic linker and using -fPIC intends to deduplicate. That seems as bad as preloading a set of llvm dependent libraries, IMO. Do we need libgandiva.so to be compiled with -fPIC or just the executables which 'use' ROOT?

All shared libraries must be compiled with -fPIC by definition, so libgandiva is already compiler with -fPIC. The change would only be for exectuables, which usually do not have -fPIC by default. But then actually other libraries have similar requirements, e.g. Qt5 (with -reduce-relocation flag, which is the default) requires all executables to link against Qt to be compiled with -fPIC. But again, if there is a better way, I am also in favor of avoiding -fPIC-

This check is to protect the subsequent dlsyms in TCling to resolve symbols from the wrong place. We can fix the check and play with the JIT resolution logic by playing with the search order here:

llvm::sys::DynamicLibrary::SearchOrder
= llvm::sys::DynamicLibrary::SO_LoadedFirst;
// Enable JIT symbol resolution from the binary.
llvm::sys::DynamicLibrary::LoadLibraryPermanently(0, 0);

The challenge is to come up with a consistent symbol resolution :)
Just as a comment, two other possibilities would be to:

  • put the llvm / clang which are embedded in ROOT into a namespace, so the symbols do not collide with system LLVM symbols.
  • If the clang changes are upstreamed eventually, this shouldn't be necessary anymore, since cling could use system LLVM / Clang.

@vgvassilev
Copy link
Member

The reason is that we link to libgandiva (we do not dlopen it). ROOT does not link to libCling, but InitInterpreter() is called after the main(), so it will always be after libgandiva was opened.
I agree, that could be avoided if we would dlopen libgandiva, and make sure to do gROOT->GetInterpreter() beforehand, but this would require some changes to our software. And also this problem is not specific to libgandiva only, but it would affect any library that uses LLVM.

Unfortunately, those should be carefully attended because of ROOT. I would feel more comfortable if my stack knows which libraries depend on LLVM to avoid pain debugging ROOT.

Well, the problem is that this is no so easy to control. LLVM can come in from a dependency chain via many libraries like OpenCL / Vulkan / arrow. And I am afraid this will become more complicated in the future. Instead of messing with each of them, I thought it might be better to fix the issue in a single place on the ROOT side.

I am not a huge fan of the RTLD_DEEPBIND approach because it is not portable and it essentially overrides the dynamic linker behavior yet one more time. I am not sure how this would translate to Windows for example.

I agree, me neither. If we can find a better and cleaner way, I am absolutely in favor of that.

IIUC, RTLD_DEEPBIND alters the search order of the dynamic linker and using -fPIC intends to deduplicate. That seems as bad as preloading a set of llvm dependent libraries, IMO. Do we need libgandiva.so to be compiled with -fPIC or just the executables which 'use' ROOT?

All shared libraries must be compiled with -fPIC by definition, so libgandiva is already compiler with -fPIC. The change would only be for exectuables, which usually do not have -fPIC by default. But then actually other libraries have similar requirements, e.g. Qt5 (with -reduce-relocation flag, which is the default) requires all executables to link against Qt to be compiled with -fPIC. But again, if there is a better way, I am also in favor of avoiding -fPIC-

This check is to protect the subsequent dlsyms in TCling to resolve symbols from the wrong place. We can fix the check and play with the JIT resolution logic by playing with the search order here:

llvm::sys::DynamicLibrary::SearchOrder
= llvm::sys::DynamicLibrary::SO_LoadedFirst;
// Enable JIT symbol resolution from the binary.
llvm::sys::DynamicLibrary::LoadLibraryPermanently(0, 0);

The challenge is to come up with a consistent symbol resolution :)
Just as a comment, two other possibilities would be to:

* put the llvm / clang which are embedded in ROOT into a namespace, so the symbols do not collide with system LLVM symbols.

This likely would not fix the global statics.

* If the clang changes are upstreamed eventually, this shouldn't be necessary anymore, since cling could use system LLVM / Clang.

We are in the versioning hell, as the version of the system LLVM might differ from the ones the libgandiva ships with. There we are facing a similar issue -- to make sure the JIT does not pick the wrong symbol.

@davidrohr
Copy link
Author

* put the llvm / clang which are embedded in ROOT into a namespace, so the symbols do not collide with system LLVM symbols.

This likely would not fix the global statics.

I do not see how that would break with global statics. All root-builtin llvm/clang statics would just go to the namespace as well.

* If the clang changes are upstreamed eventually, this shouldn't be necessary anymore, since cling could use system LLVM / Clang.

We are in the versioning hell, as the version of the system LLVM might differ from the ones the libgandiva ships with. There we are facing a similar issue -- to make sure the JIT does not pick the wrong symbol.
Well, if gandiva ships their own llvm, then I would consider it their duty to hide their symbols as well in order to avoid a collision with the system llvm. What I am trying to fix mainly is linking to libraries that use the system/default llvm.

… instead of forcing hidden symbols in other LLVM versions
@davidrohr
Copy link
Author

@vgvassilev : Could you run the CI again? I pushed a new version, that might work at least for ALICE, using the fPIC fix I described above.
PLEASE DO NOT MERGE THIS!
I would just like to see if it passes your CI.

@oshadura
Copy link
Contributor

@phsft-bot build!

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-performance-centos7-multicore/default, ROOT-fedora27/noimt, ROOT-fedora29/python3, ROOT-fedora30/cxx14, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu18.04-i386/cxx14, mac1014/cxx17, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on windows10/cxx14.
See console output.

@davidrohr
Copy link
Author

Do I understand correctly, that only the windows build failed? From the log, I actually do not understand what was the problem.

@Axel-Naumann
Copy link
Member

How's #4689 doing for you, @davidrohr ?

@Axel-Naumann
Copy link
Member

I actually do not understand what was the problem.

It's a bug in our CI infra that forces everyone to also fork roottest...

@davidrohr
Copy link
Author

This patch will not work out as it is for ROOT, since it has too many side effects. Closing.

@davidrohr davidrohr closed this Jan 2, 2020
avtikhon added a commit to tarantool/tarantool that referenced this pull request May 27, 2020
After commit:
03790ac ("cmake: remove dynamic-list linker option")

the issue with test initialy appeared:

 [001] box/push.test.lua
 [001]
 [001] [Instance "box" returns with non-zero exit code: 1]
 [001]
 [001] Last 15 lines of Tarantool Log file [Instance "box"][test/var/001_box/box.log]:
 [001] ==25624==ERROR: AddressSanitizer: odr-violation (0x000001123b60):
 [001]   [1] size=1024 'mp_type_hint' src/lib/msgpuck/hints.c:39:20
 [001]   [2] size=1024 'mp_type_hint' src/lib/msgpuck/hints.c:39:20
 [001] These globals were registered at these points:
 [001]   [1]:
 [001]     #0 0x478b8e in __asan_register_globals (src/tarantool+0x478b8e)
 [001]     #1 0x7ff7a9bc9d0b in asan.module_ctor (function1.so+0x6d0b)
 [001]
 [001]   [2]:
 [001]     #0 0x478b8e in __asan_register_globals (src/tarantool+0x478b8e)
 [001]     #1 0xab990b in asan.module_ctor (src/tarantool+0xab990b)
 [001]
 [001] ==25624==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
 [001] SUMMARY: AddressSanitizer: odr-violation: global 'mp_type_hint' at src/lib/msgpuck/hints.c:39:20
 [001] ==25624==ABORTING
 [001] [ fail ]

the following issue was created:

  #5001

The fail was described there by Vladislav Shpilevoy:
"""
  I see why ASAN complains about mp_type_hint. This is because the
  symbol is defined both in Tarantool executable, and in the shared
  library function1.so. I think this is fine, and should be ignored.
  But it definitely has nothing to do with the current ticket. The
  problem existed always, but asan noticed it only now somewhy. And
  it is not a problem actually.
"""

He added suggestion to try RTLD_DEEPBIND, but in real it is not
supported on OSX and the same issue with discussion can be found here:

  root-project/root#4668

The initial issue closed and the new one created especialy for the
test. The fix was made in ASAN suppresion list to block the ASAN
check for file:

  src/lib/msgpuck/hints.c

Closes #5023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants