cuDF: Need to canonicalize dlopen'd library names #12708

Closed
leofang opened this issue Feb 6, 2023 · 13 comments · Fixed by #13210

@leofang (Member) commented Feb 6, 2023

Documenting an offline discussion with @jakirkham in preparation for the CUDA 12 bring-up on conda-forge.

Currently, there are some places where libXXXXX.so (with no SONAME version suffix) is dlopen'd, for example:

cf_lib = dlopen("libcufile.so", RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);

This will become problematic because, by the standard practice of Linux distros, the libXXXXX.so symlink is supposed to exist only in the libXXXXX-dev package; the libXXXXX package itself only contains libXXXXX.so.{major}. The dlopen'd names need to be canonicalized.

@vyasr (Contributor) commented Feb 8, 2023

What is the right approach to do this when we want to build libcudf artifacts for both CUDA 11 and 12? Will we need to guard the dlopens with preprocessor checks of CUDART_VERSION or something similar? This assumes the APIs haven't changed, of course; where the APIs differ, we will need conditional handling anyway.

@leofang (Member, Author) commented Feb 8, 2023

The right thing to do is to dlopen libXXXXX.so.major. Keeping libXXXXX.so (no SONAME) as a fallback is probably OK. Going forward this will be compliant with the new conda-forge packaging scheme.
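
For illustration, a minimal sketch of that pattern, reusing the flags from the snippet above (the .so.12 suffix is only a placeholder for whatever version suffix actually applies):

lib = dlopen("libXXXXX.so.12", RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);
if (lib == NULL) {
    // Fall back to the unversioned symlink, which may only exist when
    // the -dev package is installed.
    lib = dlopen("libXXXXX.so", RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);
}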

@vyasr (Contributor) commented Feb 8, 2023

Right, but say I am compiling libcudf twice, once for CUDA 11 and once for CUDA 12. Is the expectation that we do something like

#if CUDART_VERSION >= 12000
dlopen("libXXX.so.12", RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);
#else
dlopen("libXXX.so.11", RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);
#endif

in the code?

@davidwendt (Contributor)

This seems to defeat the purpose of dlopen, where we want to couple loosely to the shared library at build time.
Perhaps the version should be determined somehow at runtime? There is a good chance I'm misunderstanding this.

@vyasr (Contributor) commented Feb 9, 2023

It's a bit tricky because we want to couple loosely in some ways (maybe it's OK for the library not to exist, or maybe we're OK with using any available minor version) but still need to couple tightly in others (if our main binary is compiled against CUDA 12, then we probably have to use the CUDA 12 versions of the dlopen'd libraries to guarantee ABI compatibility). I'm not sure whether my snippet is the best solution, whether we should still include additional runtime checks for different versions to provide more useful error messages, or something else.

@jakirkham (Member)

cc @vuule (who may have more thoughts on this)

@vuule (Contributor) commented Feb 9, 2023

Pretty much on the same page as Vyas here. If we generate the .so name at compile time based on the CUDA runtime version, we can use it in dlopen. With something like https://github.com/rapidsai/kvikio/blob/branch-23.04/cpp/include/kvikio/shim/utils.hpp#L49 we can fall back to the old name.
It would be nice to generate the .so name in a way that automatically supports future CUDA versions. With hardcoded supported versions, we would silently fail the dlopen call if we forget to update the code at the next CUDA release.
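
A rough sketch of that idea, assuming the SONAME suffix tracks the CUDA major version we compile against (an assumption the next comment pushes back on); load_versioned is a hypothetical helper:

#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime_api.h>  // defines CUDART_VERSION, e.g. 12000

static void* load_versioned(const char* stem) {
    char name[64];
    // Build e.g. "libcufile.so.12" from the toolkit we compiled against,
    // so the name tracks future CUDA versions automatically.
    snprintf(name, sizeof(name), "%s.so.%d", stem, CUDART_VERSION / 1000);
    void* handle = dlopen(name, RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);
    if (handle == NULL) {
        // Fall back to the old, unversioned name.
        snprintf(name, sizeof(name), "%s.so", stem);
        handle = dlopen(name, RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);
    }
    return handle;
}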

@robertmaynard (Contributor) commented Feb 16, 2023

> The right thing to do is to dlopen libXXXXX.so.major. Keeping libXXXXX.so (no SONAME) as a fallback is probably OK. Going forward this will be compliant with the new conda-forge packaging scheme.

This isn't the approach the CTK takes. The SOVERSION values of CTK libraries are not tightly coupled to the CTK major version. For example, CTK 12.0 ships libcufile.so.0, since libcufile has not revved its ABI since release. The same is true for curand and cusolver.

If we had to encode this, we would need to parse the versions.json that is packaged with the CTK to determine the mapping from CTK version to SOVERSION.

@jakirkham (Member)

Suppose we created a shim library that links against the right libcufile.so.X at build time and then dlopen'd that shim library at runtime.

This might strike the balance Vyas described above, namely linking to the version that we built and tested against (so we have some confidence it has the feature set we need) while simultaneously handling the case where libcufile.so.X is not present for whatever reason (not installed by the user, missing from that CTK version, not available for the user's architecture, etc.).

It should also handle linking to the right SOVERSION, as Rob mentioned (hopefully without needing to bake too much logic about CTK versioning into cuDF).

Thoughts? 🙂
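
For concreteness, a hedged sketch of what such a shim might look like (the shim name and wrapper symbol are hypothetical; cuFileDriverOpen is a real cuFile entry point):

// cufile_shim.c -- compiled into libcudf_cufile_shim.so and linked against
// libcufile at build time, e.g.:
//   gcc -shared -fPIC cufile_shim.c -lcufile -o libcudf_cufile_shim.so
// The shim carries the DT_NEEDED entry for the exact libcufile.so.X we
// built against; cuDF only ever dlopens the shim, so a missing libcufile
// makes the shim fail to load cleanly instead of breaking cuDF itself.
#include <cufile.h>

CUfileError_t shim_driver_open(void) { return cuFileDriverOpen(); }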

@bdice (Contributor) commented Apr 7, 2023

I suppose a shim library could work. I kind of dislike the idea of having a bunch of shim libraries for all the dlopens in RAPIDS… but it does seem that it would solve the problem. I think I’d rather handle the problem directly with hardcoded versions, even if that requires manual updates and fallback logic. Would it be possible to overcome the problem that @vuule mentioned (quoted below) by explicitly testing that the dlopen succeeds? I don’t think we can guarantee that new versions would work out of the box, so having manual/hardcoded logic to enable new versions doesn’t sound too awful to me. The net LOC should be lower than the shim approach, and it would be more self-contained in the libcudf source code.

> With hardcoded supported versions, we would silently fail the dlopen call if we forget to update the code at the next CUDA release.

@robertmaynard (Contributor)

> Suppose we created a shim library that links against the right libcufile.so.X at build time and then dlopen'd that shim library at runtime.

I think we would always have to dlopen/dlsym so that we don't get a DT_NEEDED entry for libcufile.so.X.
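
That is, cuDF would resolve every entry point through dlsym after dlopen, so no DT_NEEDED entry is ever generated. Roughly, continuing the hypothetical shim sketch above (the stand-in signature is an assumption):

#include <dlfcn.h>

typedef int (*shim_driver_open_fn)(void);  // stand-in signature for the sketch

void* shim = dlopen("libcudf_cufile_shim.so", RTLD_LAZY | RTLD_LOCAL);
shim_driver_open_fn driver_open =
    shim ? (shim_driver_open_fn)dlsym(shim, "shim_driver_open") : NULL;
if (driver_open != NULL) {
    driver_open();  // cuFile is available: enable that code path
}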

On Apr 10, 2023, @jakirkham changed the title from "Need to canonicalize dlopen'd library names" to "cuDF: Need to canonicalize dlopen'd library names".
@wence- (Contributor) commented Apr 11, 2023

> I don’t think we can guarantee that new versions would work out of the box, so having manual/hardcoded logic to enable new versions doesn’t sound too awful to me.

If the libraries are not lying about their ABI compatibility, then hard-coded versions "just work", no?

For example, in CUDA 11.8 libcufft is major-versioned as .10, whereas in 12.0 it is versioned as .11. This says to me that the two libraries are not ABI compatible, and so it would be unsafe to dlopen the unversioned .so as a "loosely coupled" thing (even if it existed). So if we have code that dlopens libcufft.so.10 and it fails, I don't think we can fall back to trying libcufft.so.11 unless there is further conditional handling of the potential ABI differences.

In cuDF, the dlopen of libcufile appears to be the only one, and it happens to be safe only by accident, because libcufile has never revved its ABI version.

@jakirkham (Member)

Yeah we discussed this today and the consensus was to just hard-code for now. There is only one version to track atm. So that seems like the simplest fix.

We can revisit when there is a SOVERSION bump, which will likely correspond to a new CUDA 12.x that we try to support.
