Does not build on CURRENT #1

Closed
arrowd opened this issue Nov 23, 2019 · 47 comments


arrowd commented Nov 23, 2019

I ran into some problems when trying to build nvshim on FreeBSD CURRENT.

First of all, there is no sys/sysinfo.h file anymore, so I had to remove that include directive from src/libc/sys/sysinfo.c. Second, I changed clang60 to clang80. Now I get:

error: multiple symbol versions defined for shim__sys_errlist
error: multiple symbol versions defined for shim__sys_errlist
error: multiple symbol versions defined for shim__sys_errlist
error: multiple symbol versions defined for shim__sys_nerr
error: multiple symbol versions defined for shim__sys_nerr
error: multiple symbol versions defined for shim__sys_nerr
error: multiple symbol versions defined for shim__sys_siglist
error: multiple symbol versions defined for shim_clock_getcpuclockid
error: multiple symbol versions defined for shim_clock_getres
error: multiple symbol versions defined for shim_clock_gettime
error: multiple symbol versions defined for shim_clock_nanosleep

Any idea how to fix this?

Owner

shkhln commented Nov 23, 2019

The error is due to a (relatively) recently introduced sanity check, see https://reviews.llvm.org/D45845. At the moment I have no idea how to work around that.

Author

arrowd commented Nov 23, 2019

Can you give an overview of how things work? Maybe I can come up with something.

Owner

shkhln commented Nov 23, 2019

What am I doing with symver directives in the first place? They are used to export shim_sym symbols as sym@GLIBC_ver while specifically avoiding exporting so-called default symbols (sym@@GLIBC_ver). That way FreeBSD rtld will link Linux libraries to this shim and everything else to the normal libc. I wasn't able to get the same result with version scripts.

Why multiple versions? Glibc might have multiple different implementations of something, but we, obviously, do not. So it kind of makes sense to export some symbols multiple times, once per version.
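
As a rough sketch of the idea (not code from nvshim): the function body, the choice of gnu_get_libc_version as the example symbol, the GLIBC_2.2.5 version node, and the SHIM_SHARED_BUILD macro are all illustrative assumptions. The directive is only enabled for a shared-library build, which would presumably also need a linker version script that declares the GLIBC_2.2.5 node.

```c
/* Illustrative stand-in for a shim function; not taken from nvshim. */
const char *shim_gnu_get_libc_version(void) {
    return "2.17"; /* assumed value, for illustration only */
}

#ifdef SHIM_SHARED_BUILD /* hypothetical macro: define it only when building the .so */
/*
 * Export shim_gnu_get_libc_version under the name gnu_get_libc_version
 * with version GLIBC_2.2.5. A single '@' marks the exported version as
 * non-default; '@@' would mark it as the default version instead.
 */
__asm__(".symver shim_gnu_get_libc_version, gnu_get_libc_version@GLIBC_2.2.5");
#endif
```

If this is built as a shared object, readelf -s on the result should list gnu_get_libc_version@GLIBC_2.2.5 (one '@') rather than gnu_get_libc_version@@GLIBC_2.2.5.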

Owner

shkhln commented Nov 23, 2019

> multiple versions

For example:

% readelf -s /compat/linux/lib/libc-2.17.so | grep environ | grep GLIBC
   308: 00000000001c8de0     4 OBJECT  WEAK   DEFAULT   35 _environ@@GLIBC_2.0 (2)
  1039: 00000000001c8de0     4 OBJECT  WEAK   DEFAULT   35 environ@@GLIBC_2.0 (2)
  1395: 00000000001c8de0     4 OBJECT  GLOBAL DEFAULT   35 __environ@@GLIBC_2.0 (2)

Author

arrowd commented Nov 23, 2019

So, this shim is used to fulfill the dependency on Linux libc? Or does it intercept calls to Linux libc functions and re-route them to FreeBSD ones?

Owner

shkhln commented Nov 23, 2019

> So, this shim is used to fulfill the dependency on Linux libc?

Yeah, the "nv" part in the name is a bit of a misnomer. It's rather a reasonably generic, albeit very incomplete, glibc shim with Nvidia-specific run & setup scripts.

> Or does it intercept calls to Linux libc functions and re-route them to FreeBSD ones?

There is no interception per se. If you export symbols as described above (that is, no defaults), rtld will happily route everything for you. It's really simple. Almost unbelievably so.

Author

arrowd commented Nov 23, 2019

I see, thanks. But is that safe? I bet many FreeBSD libc functions behave differently than the Linux ones. This might cause problems, no?

Owner

shkhln commented Nov 23, 2019

Other than the bugs in the implementation itself (there are plenty of those, of course), this should be reasonably safe as long as the loaded Linux libraries:

  1. do not use direct syscalls;
  2. do not pass libc data structures and/or constant values to FreeBSD libraries through their own API, in which case they would skip our conversions.

So, proceed with caution.
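
To illustrate the second point, here is a minimal sketch of the kind of constant conversion a shim has to perform. The function name linux_to_freebsd_open_flags and the LINUX_/FREEBSD_ constant names are made up for this example and are not taken from nvshim. The numeric values are, to the best of my knowledge, the usual x86-64 Linux and FreeBSD open(2) flag values, and only a handful of flags are covered.

```c
#include <stddef.h>

/* open(2) flag values on x86-64 Linux (hypothetical LINUX_ names). */
enum {
    LINUX_O_ACCMODE  = 03,
    LINUX_O_CREAT    = 0100,
    LINUX_O_EXCL     = 0200,
    LINUX_O_TRUNC    = 01000,
    LINUX_O_APPEND   = 02000,
    LINUX_O_NONBLOCK = 04000
};

/* The same flags as FreeBSD defines them (hypothetical FREEBSD_ names). */
enum {
    FREEBSD_O_NONBLOCK = 0x0004,
    FREEBSD_O_APPEND   = 0x0008,
    FREEBSD_O_CREAT    = 0x0200,
    FREEBSD_O_TRUNC    = 0x0400,
    FREEBSD_O_EXCL     = 0x0800
};

/* Translate Linux open flags to FreeBSD ones; returns -1 on unknown bits. */
int linux_to_freebsd_open_flags(int linux_flags) {
    static const struct { int from; int to; } map[] = {
        { LINUX_O_CREAT,    FREEBSD_O_CREAT    },
        { LINUX_O_EXCL,     FREEBSD_O_EXCL     },
        { LINUX_O_TRUNC,    FREEBSD_O_TRUNC    },
        { LINUX_O_APPEND,   FREEBSD_O_APPEND   },
        { LINUX_O_NONBLOCK, FREEBSD_O_NONBLOCK }
    };
    /* O_RDONLY/O_WRONLY/O_RDWR are 0/1/2 on both systems. */
    int out = linux_flags & LINUX_O_ACCMODE;
    int rest = linux_flags & ~LINUX_O_ACCMODE;

    for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (rest & map[i].from) {
            out |= map[i].to;
            rest &= ~map[i].from;
        }
    }
    /* Any bit left over is a flag this sketch does not know about. */
    return rest == 0 ? out : -1;
}
```

Note that Linux O_NONBLOCK (04000, i.e. 0x800) has the same numeric value as FreeBSD O_EXCL, so a flag word that bypasses the conversion would be silently misread rather than rejected.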

Author

arrowd commented Nov 24, 2019

If GAS doesn't support multiple .symver directives, how does the original Linux libc end up with duplicate symbols?

Owner

shkhln commented Nov 24, 2019

Hmm… Glibc seems to use __attribute__((alias(...))). Let's see whether that's suitable for us…
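
For reference, the alias approach could look roughly like the sketch below. The simplified function body, the _2_2_5/_2_17 alias names, the chosen version nodes, and the SHIM_SHARED_BUILD macro are assumptions made for illustration, not the code from the actual commit. Each alias is a distinct symbol name that receives exactly one .symver directive, so no single symbol carries more than one version, which appears to be what the sanity check objects to.

```c
#include <time.h>

/* Single implementation (body simplified for illustration: reports 1 ns). */
int shim_clock_getres(int clock_id, struct timespec *res) {
    (void)clock_id;
    if (res) {
        res->tv_sec = 0;
        res->tv_nsec = 1;
    }
    return 0;
}

/* Two extra names for the same code; the alias names are hypothetical. */
extern __typeof__(shim_clock_getres) shim_clock_getres_2_2_5
    __attribute__((alias("shim_clock_getres")));
extern __typeof__(shim_clock_getres) shim_clock_getres_2_17
    __attribute__((alias("shim_clock_getres")));

#ifdef SHIM_SHARED_BUILD /* hypothetical macro: define it only when building the .so */
/* One .symver per alias instead of several on shim_clock_getres itself. */
__asm__(".symver shim_clock_getres_2_2_5, clock_getres@GLIBC_2.2.5");
__asm__(".symver shim_clock_getres_2_17, clock_getres@GLIBC_2.17");
#endif
```

Both aliases resolve to the same code, so calling either one behaves identically; only the exported name@version pairs differ.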

Owner

shkhln commented Nov 24, 2019

Ok, try the latest commit.

> while specifically avoiding exporting so-called default symbols (sym@@GLIBC_ver)

I rechecked that part. It turns out default versions do not matter either way; I just dislike them for some reason I can't remember :/

Author

arrowd commented Nov 24, 2019

Yep, it builds! I had to remove all mentions of sys/sysinfo.h, though.

Now I'm trying to use sglrun to execute a CUDA binary. Here's my attempt:

env LD_LIBRARY_PATH=/usr/home/arr/cuda101/var/cuda-repo-10-1-local-10.1.243-418.87.00/usr/local/cuda-10.1/lib64 ./sglrun ~/axpy
ld-elf.so.1: Shared object "ld-linux-x86-64.so.2" not found, required by "libcudart.so.10.1"

I added the path to ld-linux-x86-64.so.2:

env LD_LIBRARY_PATH=/usr/home/arr/cuda101/var/cuda-repo-10-1-local-10.1.243-418.87.00/usr/local/cuda-10.1/lib64:/compat/linux/usr/lib64/ ./sglrun ~/axpy
ld-elf.so.1: /compat/linux/usr/lib64//librt.so.1: version FBSD_1.0 required by /usr/local/lib/libruby26.so.26 not found

The error message looks strange. Any idea what it means?

Owner

shkhln commented Nov 24, 2019

You'll probably want to know that for CUDA there are some blocking issues on the kernel driver side.

> The error message looks strange. Any idea what it means?

You are setting LD_LIBRARY_PATH a bit too early and it is getting picked up by a FreeBSD executable trying to run the script itself.

Author

arrowd commented Nov 26, 2019

> You'll probably want to know that for CUDA there are some blocking issues on the kernel driver side.

I haven't even reached that stage yet. Have you checked if things have improved so far?

Stupid me. Now the error is:

ld-elf.so.1: /usr/home/arr/projects/nvshim/build/lib64/nvshim.so: version GLIBC_PRIVATE required by /compat/linux/usr/lib64//librt.so.1 not found

Owner

shkhln commented Nov 26, 2019

> Have you checked if things have improved so far?

They didn't.

> Now the error is

You are not supposed to load /compat/linux/usr/lib64/librt.so.1. Unfortunately, since there is also /usr/lib/librt.so.1 I had to resort to binary patching to avoid conflicts and the corresponding LD_LIBMAP override is differently named. I admit this is a bit confusing.

Author

arrowd commented Nov 26, 2019

> They didn't.

If I read it right, the problem is that the os_lock_user_pages Linux syscall is not implemented in the Linuxulator? Maybe a separate PR should be opened to track this?
As for bug 224358, what should be done to close it?

> You are not supposed to load /compat/linux/usr/lib64/librt.so.1. Unfortunately, since there is also /usr/lib/librt.so.1 I had to resort to binary patching to avoid conflicts and the corresponding LD_LIBMAP override is differently named.

Ouch. That's too many hacks, IMO. I think I'll try something else for my problem.

Owner

shkhln commented Nov 26, 2019

> If I read it right, the problem is that the os_lock_user_pages Linux syscall is not implemented in the Linuxulator?

Eh, it's a function in the nvidia.ko kernel module, see nvidia/nvidia_os.c.

> As for bug 224358, what should be done to close it?

Not my bug, so no opinion.

> Ouch. That's too many hacks, IMO.

Is it? In any case, the kernel part is both more important and more difficult here. It makes sense to concentrate on that.

Author

arrowd commented Nov 26, 2019

I see, thanks for clearing this up. Let's close this issue, as nvshim now builds on CURRENT.

arrowd closed this as completed Nov 26, 2019

Owner

shkhln commented Feb 2, 2020

> You are not supposed to load /compat/linux/usr/lib64/librt.so.1. Unfortunately, since there is also /usr/lib/librt.so.1 I had to resort to binary patching to avoid conflicts and the corresponding LD_LIBMAP override is differently named.

> Ouch.

FYI, I committed a workaround for this particular annoyance in c1633b6.

Owner

shkhln commented Mar 18, 2020

@arrowd You might be interested in shkhln/revird-aidivn@077197f. Note that this is meant to be used in combination with either a dummy nvidia-uvm kernel module or an equivalent LD_PRELOAD trick. Seems to pass a simple sanity check so far:

% env LD_PRELOAD=$PWD/dummy-uvm.so ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 1660" with compute capability 7.5

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 542.34 GFlop/s, Time= 0.242 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Author

arrowd commented Mar 18, 2020

This looks promising! Unfortunately, I don't have time to look at it right now.

Do you plan to upstream this patch into the FreeBSD ports tree?

Owner

shkhln commented Mar 18, 2020

> Do you plan to upstream this patch into the FreeBSD ports tree?

I don't have the patience. Plus it's a relatively quick and dirty patch job, so it's not necessarily appropriate for submission as is.

Owner

shkhln commented Jun 12, 2020

(lldb) bt
* thread #1, name = 'linux_oceanFFT', stop reason = signal SIGBUS
  * frame #0: 0x00000008007c2a18 libcxxrt.so.1`vtable for __cxxabiv1::__si_class_type_info + 16
    frame #1: 0x0000000809cc5026 libstdc++.so.6`__dynamic_cast + 102
    frame #2: 0x0000000809d434c0 libstdc++.so.6`bool std::has_facet<std::ctype<char> >(std::locale const&) + 64
    frame #3: 0x0000000809d36ba4 libstdc++.so.6`std::basic_ios<char, std::char_traits<char> >::_M_cache_locale(std::locale const&) + 20
    frame #4: 0x0000000809d37020 libstdc++.so.6`std::basic_ios<char, std::char_traits<char> >::init(std::basic_streambuf<char, std::char_traits<char> >*) + 32
    frame #5: 0x0000000809cd8ab3 libstdc++.so.6`std::ios_base::Init::Init() + 595
    frame #6: 0x000000080083dba4 libcufft.so.8.0`___lldb_unnamed_symbol390$$libcufft.so.8.0 + 36
    frame #7: 0x0000000800a2b4e6 libcufft.so.8.0`___lldb_unnamed_symbol11991$$libcufft.so.8.0 + 70
    frame #8: 0x0000000800825ae3 libcufft.so.8.0
    frame #9: 0x00000008006a734c ld-elf.so.1
    frame #10: 0x00000008006a61d2 ld-elf.so.1

Something is wrong here.

Author

arrowd commented Jun 12, 2020

Does libcxxrt.so.1 come from base? It might be that libstdc++ should use libsupc++ or something like that.

Owner

shkhln commented Jun 12, 2020

Yes, native libGLU.so.1 brings libcxxrt.so.1, which conflicts with libcufft.so.8.0. The program (oceanFFT from the CUDA demo suite) works with Linux libGLU.so.1, but that is not what we are interested in :)

Author

arrowd commented Jun 12, 2020

Compiling libGLU.so.1 with USE_GCC=yes should fix this problem, but that is a workaround rather than a real solution. No idea how to handle this properly.

Owner

shkhln commented Jun 12, 2020

As far as I understand, native FreeBSD gcc- and clang-compiled C++ libraries are pretty safe to mix. I don't see why that should be different for a Linux C++ library in principle, considering it's the same ABI.

> Compiling libGLU.so.1 with USE_GCC=yes should fix this problem, but that is a workaround rather than a real solution.

That works.

Author

arrowd commented Jun 12, 2020

> As far as I understand, native FreeBSD gcc- and clang-compiled C++ libraries are pretty safe to mix.

In my experience, it has never been like that.

> That works.

... and here's another proof of that.

Owner

shkhln commented Jun 12, 2020

The funny thing is that this working libGLU.so.1 is compiled against libc++.so.1/libcxxrt.so.1 as well. I'll try to test this more thoroughly.

Owner

shkhln commented Jun 12, 2020

> working libGLU.so.1 is compiled against libc++.so.1/libcxxrt.so.1

Ok, I didn't pay attention and passed a wrong path to ldd. It's libstdc++.so.6, as it should be.

Owner

shkhln commented Jun 12, 2020

Aha, the most relevant comment here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221288#c4. Looks like in practice ports are using this solution instead: https://markmail.org/message/dwgafctuoywpuhhr. Not applicable to our case, unfortunately.

Author

arrowd commented Jun 13, 2020

Just to clear things up - libstdc++ is pulled in by the Linux CUDA library and everything else is FreeBSD native code, right?

Owner

shkhln commented Jun 13, 2020

Specifically, libcufft.so.8.0 is a Linux binary, as is the oceanFFT executable itself (I occasionally run Linux programs with the shim; it's convenient for testing). Everything else, including libstdc++, is native code.

Author

arrowd commented Jun 13, 2020

If the executable itself is a Linux one, where do the FreeBSD libraries come from? It should use everything from /compat/linux.

Owner

shkhln commented Jun 13, 2020

Something like sglrun /libexec/ld-elf.so.1 /compat/linux/bin/glxgears. Changing the interpreter path (and "GNU" strings placed right after it) with a hex editor also works.

Author

arrowd commented Jun 13, 2020

Then how about making sglrun map native libcxxrt to Linux libsupc++?

Owner

shkhln commented Jun 13, 2020

To be honest, I didn't even understand what is written here. Simply mapping libcxxrt to libsupc++ won't work: libstdc++.so requires "version CXXRT_1.0", and if you get rid of libstdc++.so, errors like Undefined symbol "_ZNSt3__15ctypeIcE2idE" pop up, and so on. The Linux version of the library wouldn't improve anything here either. Rebuilding libstdc++ without libsupc++ looks like the most sensible idea so far.

Owner

shkhln commented Jun 13, 2020

Although… We could map libcxxrt to our pile of hacks, put CXXRT_1.0 {} into the exports and link the whole thing against libstdc++. Once the check for the presence of the exported version passes, rtld doesn't care at all which library it finds the symbols in. That's roughly what was done for librt.so.1.

Author

arrowd commented Jun 13, 2020

As I understand it, the problem here is that libcufft.so.8.0 was built against libstdc++, which in turn used libsupc++. On FreeBSD we have libcxxrt instead of libsupc++, but it is not binary compatible.

We have no influence over libcufft.so.8.0, so we either have to switch everything from libc++ to libstdc++ (read: build everything with USE_GCC=yes) or somehow slip in libsupc++ in place of libcxxrt.

Owner

shkhln commented Jun 14, 2020

It comes out something like this: 09fa162.

Owner

shkhln commented Jun 23, 2020

On closer inspection, a whole bunch of libraries depend on CXXRT_1.0 from libcxxrt, i.e. this is actively used code that can't simply be replaced with a stub. That could probably be hacked around too, but I can't be bothered… (I have no idea whether the alternative hack, building a libstdc++ tied to libcxxrt, would be simpler. Quite possibly it would.)

In any case, things are more or less clear here as far as the driver and the libraries go. Is it worth doing anything further with this? I don't really see any interest in CUDA from the FreeBSD side, to be honest.

Author

arrowd commented Jun 23, 2020

> I don't really see any interest in CUDA from the FreeBSD side, to be honest.

Well, there's nothing to see there. Once CUDA is available, the interest will come, and ports that use it will appear.

But stubbing out the C++ runtime is a bad idea, of course.

Is this libcufft.so.8.0 part of the CUDA distribution, or some third-party library that uses CUDA?

Owner

shkhln commented Jun 23, 2020

Hmm… I don't mean user interest. Is there any chance of talking someone into getting the driver patch into decent shape? That is really not my skill at all. (The license is also somewhat involved here: it simultaneously permits and forbids us to copy code from the Linux driver. That is interesting in its own way too, although there is no great danger here.)

> Is this libcufft.so.8.0 part of the CUDA distribution, or some third-party library that uses CUDA?

Part of the CUDA distribution, of course. The libcuda.so shipped with the driver is (presumably) a plain C library, whereas the various layers built on top of it are not.

Author

arrowd commented Jun 23, 2020

Which patch? Is this the one you mean? https://reviews.freebsd.org/D22521

> Part of the CUDA distribution, of course. The libcuda.so shipped with the driver is (presumably) a plain C library, whereas the various layers built on top of it are not.

That leaves either running CUDA applications entirely in Linux or building with USE_GCC=yes.

Owner

shkhln commented Jun 23, 2020

No, this patch: https://github.com/shkhln/revird-aidivn/compare/master...afdiuxc.patch. It has already been mentioned here.

> That leaves either running CUDA applications entirely in Linux or building with USE_GCC=yes.

I think someone will build a "proper" libstdc++ if it is really needed. We can ignore this for now.

Author

arrowd commented Jun 23, 2020

> No, this patch: https://github.com/shkhln/revird-aidivn/compare/master...afdiuxc.patch. It has already been mentioned here.

Oh, I missed it. Well, I don't think we need to worry much about the license, since these are patches. What kind of problem could there be?

Prodding danfe into action will be harder, but I can take that on.

> I think someone will build a "proper" libstdc++ if it is really needed. We can ignore this for now.

I'm not sure about that. It's libc++ that can use different runtimes; I don't know about libstdc++.

Owner

shkhln commented Jun 23, 2020

> Well, I don't think we need to worry much about the license, since these are patches. What kind of problem could there be?

In places Nvidia has stuck a header into files that forbids any use of the code. The root of the driver distribution contains a more or less normal license. As far as I understand, neither overrides the other. For the most part it's a very theoretical problem.

> Prodding danfe into action will be harder, but I can take that on.

Yeah, indeed.

> I'm not sure about that. It's libc++ that can use different runtimes; I don't know about libstdc++.

I didn't pull this idea out of thin air; it comes from comments like this one: http://lists.llvm.org/pipermail/cfe-dev/2016-August/050278.html. Quote: "We solved this in FreeBSD by linking both libstdc++ and libc++ against libcxxrt." I don't know what happened to that solution.

There is also a bit about this here: https://wiki.freebsd.org/NewC++Stack.
