Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust backend #105

Merged
merged 65 commits into from May 23, 2023
Merged

Rust backend #105

merged 65 commits into from May 23, 2023

Conversation

aseyboldt
Copy link
Contributor

@aseyboldt aseyboldt commented Apr 14, 2023

This is based on #88 and @WardBrian's branch.

replaces #102

It is still not decided if this should live in nutpie or in bridgestan, either would be fine for me.

I made a couple of changes compared to Brians branch:

The shared library is stored as a Borrow<StanLibrary>, so that client code can decide what kind of reference should be used to store the Library (ie Rc, Arc, &lib or direct ownership). That avoids a whole range of headache when using this in multithreaded code.
open_library checks that the version of the library has the same minor and major version as the rust library.
Model is now Send and Sync
STAN_THREADS=true is required
Some more functions are wrapped
If there is non-utf8 in an error message, it will now still return the message, but replace the invalid characters by placeholders.

I'm not 100% sure about the handling of string/bytes data in the interface. It might be easier to return &CStr in a couple of places, and leave worrying about the encoding to users.

TODO

  • Remove the nuptie code. This doesn't need to live here and can be moved to nutpie itself.
  • Add tests for the example models and github action entry that runs them
  • Add some more docstrings, and a usage example for the main module
  • Possibly remove parameter name parsing?
  • Add the missing functions from bridgestan.h (The hessian related functions)
  • Add a rust page under docs/languages (can hold off until we have a docs.rs page to link to)
  • Add a new heading in CONTRIBUTING.md describing the Rust build/dependencies.
  • Add github action to publish on release (and add API token for crates.io)

@aseyboldt
Copy link
Contributor Author

aseyboldt commented Apr 14, 2023

I guess the main question now is if this should end up here.
If so, I think the next steps from my end would be:

  • Remove the nuptie code? This will be in nutpie anyway, and I don't think it belongs in the bindings.
  • Add tests and run them in the github action
  • Add a usage example in the docstrings of the main module (should then show up in the online docs)
  • Publish it on crates.io (Or one of the owners of the repo should probably do that)

And if you'd rather not have it in the bridgrestan repo I can do those on the nutpie repo.

@WardBrian
Copy link
Collaborator

I don't speak for @roualdes and @bob-carpenter, but personally I would be happy to have this merged here.

My thoughts on next steps, mirroring yours

  • Doesn't the impl CpuLogpFunc need to live either here or in nuts-rs due to the orphan rule (e.g., not in nutpie?). I'd prefer it not live here, just checking to make sure it's reasonable.
  • Not all the functions are currently exposed (e.g., log_density_hessian)
  • We'd have to decide if we want to keep the additional helper features like the params function here. Currently we expose a pretty unified set of functionality across languages.
  • We'd need to figure out the best thing to do for our docs re: the Rust docs. If we can get cargo doc to output markdown we could do something like what we do with Julia, otherwise I suppose a link to docs.rs would be fine.
  • It would be really nice if releasing to crates.io could be done in our unified release workflow (I assume this is possible)

A more general question @aseyboldt - we currently have unified version numbers across all our APIs. This is nice for several reasons, but it is also one of the main burdens adding a new interface creates - a breaking change in the Rust interface would mean a major version bump for all the interfaces. Could you see a reason we'd need to do such a thing (for any reason other than us changing our C-API?)

@aseyboldt
Copy link
Contributor Author

Doesn't the impl CpuLogpFunc need to live either here or in nuts-rs due to the orphan rule (e.g., not in nutpie?). I'd prefer it not live here, just checking to make sure it's reasonable.

The orphan rule really shouldn't be an issue whichever way we do this, nutpie could just have a

struct StanModel(bridgestan::Model);

and implement CpuLogpFunc for that. It adds a tiny bit of boilerplate code, but not much and the compile will be able to optimize it way. I think I'll need that anyway, so that I can store the generated quantities.

Not all the functions are currently exposed (e.g., log_density_hessian)

Yes, I'll add a todo list in the description, and add that.

We'd have to decide if we want to keep the additional helper features like the params function here. Currently we expose a pretty unified set of functionality across languages.

I guess pretty much anything that uses bridgestan for something serious probably will need to parse the variable names, so I though it would make sense to just put it in here, but I'm also fine with moving it to make things more consistent. If consistency is the goal, maybe adding something like it to the other interfaces might be nice as well though. :-)

We'd need to figure out the best thing to do for our docs re: the Rust docs. If we can get cargo doc to output markdown we could do something like what we do with Julia, otherwise I suppose a link to docs.rs would be fine.

If it is published on crates.io, it will automatically get a page on docs.rs, so just linking to that would probably be the easiest way?

It would be really nice if releasing to crates.io could be done in our unified release workflow (I assume this is possible)

I've never done this, but should be pretty straight forward. I think a repo owner has to create a crates.io account and generate an API token. A github action can then use that token to pretty much just execute cargo publish.

A more general question @aseyboldt - we currently have unified version numbers across all our APIs. This is nice for several reasons, but it is also one of the main burdens adding a new interface creates - a breaking change in the Rust interface would mean a major version bump for all the interfaces. Could you see a reason we'd need to do such a thing (for any reason other than us changing our C-API?)

Nothing immediately comes to mind, and I have a running version with nutpie locally now, so I think there's at least not anything broken or missing that would make writing a sampler impossible. But of course I could have missed something. I can see how a unified version approach gets more difficult the more language bindings you have...

@WardBrian
Copy link
Collaborator

That all sounds good to me.

If consistency is the goal, maybe adding something like it to the other interfaces might be nice as well though

I agree it would be nice to think of something like this for the other interfaces, but for the initial inclusion of a Rust interface I think it should be omitted. We can then do follow-on work on all the languages at once.
In particular, this kind of generic parameter representation will need to get more complicated to be able to support non-rectangular types in Stan (e.g. tuples, ragged arrays, etc), since a Vec<usize> is no longer appropriate for the shape. I've been dealing with this a fair amount in the Stan C++ codebase as part of the effort to support tuples, and it's especially painful if you have existing code which implicitly assumes rectangularity.

@WardBrian WardBrian mentioned this pull request Apr 18, 2023
@aseyboldt
Copy link
Contributor Author

@WardBrian I pushed a change to the github workflow to also run the rust test, could you approve that workflow run?

I also made a couple other smaller changes to the rust interface:

  • model.info() returns a cstr instead of a str, which avoids possible utf8 errors, and is I think more honest about what could be in there.
  • Functions for the parameter names now return a str and panic if it is not valid utf8. Since stan specifies that identifiers must be ascii (here) I think that is safe to assume.
  • Some renaming in the error type

@roualdes
Copy link
Owner

@aseyboldt, I went to approve the workflow but it looks like there's an error on line 225 of the file .github/workflows/main.yaml. Maybe the word runs was used instead of run?

@aseyboldt
Copy link
Contributor Author

Yeah, I'm running it locally first, then try to push again. Sorry.

@roualdes
Copy link
Owner

No worries at all.

@aseyboldt
Copy link
Contributor Author

I don't think I've quite gotten it to work yet, but at least the clang deps shouldn't be missing anymore.
I'll have another look tomorrow.

@aseyboldt
Copy link
Contributor Author

I'm running into a strange timeout issue in the tests on windows:
https://github.com/roualdes/bridgestan/actions/runs/4756662815/jobs/8452607650?pr=105#step:13:173

I'm not really sure what might be causing this. Is that anything that has come up in some way before?
I guess one idea is that it might be related to loading libraries multiple times, was that tested in other backends?
Or are there any other mutexes/locks or so in the stan code that I missed?

@WardBrian
Copy link
Collaborator

I guess one idea is that it might be related to loading libraries multiple times, was that tested in other backends?

Most of the other backend's test load the same library more than once

Windows is our finnickiest CI by far. Despite the fact that I can run them all locally on Windows, I still couldn't get the github actions environment right to run the Python tests to completion. The default runners they provide for Windows have a lot going on (I think I counted like 3 different installations of gcc before R was even installed).

Might be worth trying the mingw Rust toolchain? Our experience with other things has been such that the libraries load fine even with a mix of compilers, but maybe libloading is more sensitive?

@aseyboldt
Copy link
Contributor Author

The tests run in parallel by default, so parallel loading could also be the issue here (there even is a note about that in the libloading docs, so maybe we need to add a mutex for that). I change the tests to run serially, but added an explicit test for loading the lib in parallel, hopefully that helps to narrow it down. And that test is probably a good idea anyway.

Might be worth trying the mingw Rust toolchain? Our experience with other things has been such that the libraries load fine even with a mix of compilers, but maybe libloading is more sensitive?

I'll try that if this doesn't work. It would be good though if this worked on the default toolchain though, and from what I found online mixing mingw and msvc should actually work for pure C interfaces...

@aseyboldt
Copy link
Contributor Author

I guess I made some progress. If I run the original tests serially, everything passes.
So I made a separate test that loads libraries in parallel, and that test on it's own times out (I guess deadlocks).

https://github.com/roualdes/bridgestan/actions/runs/4759489941/jobs/8458862422?pr=105#step:13:154
The code for that test is here:
d10ffa2#diff-4a581ae8c770c9276b64991f1a1235d3fe606742217e6f9d25a0dcfcbd06ac0cR62

I doubt that this is an issue in the rust interface, so I guess we could work around it by simply running the tests in serial for now, but it would be good anyway to find the problem.

More a blind guess than anything else, but I could imaging that the issue is related to the global variable for a thread pool in each library?
https://github.com/stan-dev/stan/blob/5e02a35df1d25c66a26956f0b2f1c6441f3c0afb/src/test/unit/services/sample/hmc_nuts_dense_e_adapt_parallel_match_test.cpp#L11

I'll run it again with TBB_INTERFACE_NEW=true, so that it uses the more recent TBB interface...

@WardBrian
Copy link
Collaborator

TBB_INTERFACE_NEW just changes the math library to be compatible with newer TBBs, it doesn’t actually use them (you need to supply it yourself).

I’d be curious if the issue could be recreated in e.g. Julia on windows

@aseyboldt
Copy link
Contributor Author

I experimented a bit more, and I think I have a pretty good idea of what's happening, but really know clue as to why...

If we load and close the shared libraries in different threads, loading will deadlock.

Let's say we have 3 shared libraries (A, B, and C), and three corresponding threads. If we then load A, then B, close A and then load C (nicely one after the other, the threads wait for a signal before they do anything), the last load will get stuck.

This does not happen if A, B and C are being loaded in the same thread, and it will also not happen if we do not unload A.
It also only happens on windows.

We could work around this by simply never closing the libraries (as the julia and I think python backends do).

I can only speculate about the cause. Constructors and destructors of global variables should run when we open or close the libraries, so I guess those could be at fault. Or thread local storage somehow comes in the way?

Code for that failing example is here:
https://github.com/aseyboldt/bridgestan/blob/main/rust/tests/create_models.rs#L375

I'd be really surprised if this didn't happen with other backends, if they close the libraries, but I haven't tried.

@WardBrian
Copy link
Collaborator

I can only speculate about the cause. Constructors and destructors of global variables should run when we open or close the libraries, so I guess those could be at fault. Or thread local storage somehow comes in the way?

This could just be a property of the Windows LoadLibrary and FreeLibrary functions, especially since it doesn't seem to happen on MacOS or Linux. There are some really suspect warnings in the WinAPI docs:

The FreeLibraryAndExitThread function allows threads that are executing within a DLL to safely free the DLL in which they are executing and terminate themselves. If they were to call FreeLibrary and ExitThread separately, a race condition would exist. The library could be unloaded before ExitThread is called.

Though, if this was a well-known issue, I suspect libloading would have indiciated it by not marking Library as Send. We could figure out if this is "just Windows" or something specific to bridgestan by trying the same thing with a basic hello-world DLL.

@aseyboldt
Copy link
Contributor Author

There is also quite a bit of talk about deadlocks here: https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices
Mostly about DLLMain, but I think the same things would apply to constructors for global variables (those are executed before dllmain).
The reproducer doesn't need Send or Sync for Library, each library only interacts with its own thread (on the rust side).
Trying it with a different DLL is a good idea though...

I still have a hunch that it might be related to something like this:
https://github.com/stan-dev/stan/blob/dbc40e016c092209bd2262bfa02626ad4930fda6/src/test/unit/services/sample/hmc_nuts_diag_e_adapt_parallel_match_test.cpp#L11
Creating a thread pool while initializing a global variable sounds to me like it might be asking for trouble. And since the tbb library is shared between the different stan libraries that might explain why they don't work independently in this context?

@WardBrian
Copy link
Collaborator

It appears the only place that specific init function is ever called in the Stan codebase is in test code which we would not be including.

The autodiff stack does use a global (or thread local, if STAN_THREADS is true) singleton.

I was able to recreate the deadlock locally on windows. I can try some basic DLLs and play around with things. Based on the existing print statements, the deadlock occurs somewhere I wouldn’t really have expected, which is interesting

@WardBrian
Copy link
Collaborator

I managed to attach a debugger to the hung test. Didn't get too far, but did get

* thread #6, stop reason = Exception 0xc0000005 encountered at address 0x70c16d10: User-mode data execution prevention (DEP) violation at location 0x70c16d10
  * frame #0: 0x0000000070c16d10
    frame #1: 0x0000000064944759 libwinpthread-1.dll`__pth_gpointer_locked + 249
    frame #2: 0x0000000064944965 libwinpthread-1.dll`__pth_gpointer_locked + 773
    frame #3: 0x00007ffcf38cba7a ntdll.dll`RtlGetCurrentDirectory_U + 1178
    frame #4: 0x00007ffcf387868f ntdll.dll`RtlActivateActivationContextUnsafeFast + 303
    frame #5: 0x00007ffcf3879319 ntdll.dll`LdrShutdownThread + 1465
    frame #6: 0x00007ffcf3878eb9 ntdll.dll`LdrShutdownThread + 345
    frame #7: 0x00007ffcf38aaa6e ntdll.dll`RtlExitUserThread + 62
    frame #8: 0x00007ffcf1ae26a6 kernel32.dll`BaseThreadInitThunk + 38
    frame #9: 0x00007ffcf38aa9f8 ntdll.dll`RtlUserThreadStart + 40

I can run some more precise experiments if you have suggestions. I'm once again wondering if there is some bad interaction between the mingw compiler (where libwinpthread comes from) and rust's MSVC abi

@WardBrian
Copy link
Collaborator

Okay, tried on stable-x86_64-pc-windows-gnu toolchain and the same freeze (and backtrace)

@aseyboldt
Copy link
Contributor Author

Hm, that doesn't really tell me much unfortunately. Do you have any idea why it would report an exception when we observe a deadlock? Why wouldn't that error get forwarded somewhere?

Debugging wise this is quickly moving outside my comfort zone, not sure where it makes sense to look. But some random ideas:

  • We could try to get more debugging output. I think rust-lldb should be available if rust is installed, so we could build the stan library and the rust code with debugging symbols, and try to get some info about what's at the address that causes the DEP violation.
  • Investigate if this is related to rust in any way after all. libloading is really just a thin wrapper around the windows api, so I don't think it is a likely culprit, but it did change its underlying windows api wrapper just a week or two ago, so we could try an older version?
  • If it is related to how rust manages threads, then it would be possible that a different language (julia or python) wouldn't show this? We'd have to unload the lib there is well I guess.
  • Try it with different DLLs as you suggested earlier. And then maybe link to the same libs that stan is linking to, so that we can see if it is related to libs that are loaded for the stan libs?
  • Try to find something in stan itself that might cause an issue. I totally missed that the global tbb vars are only in the test code, but are there other places where tbb or openmp or so is used? Anything that creates threads or thread local storage might be interesting I guess? If TBB isn't directly used, can we build a stan library without linking to it? Or link to a more recent version?
  • Try other compilers for the stan lib. Does this also happen if we compile with clang or msvc?

I can set up a VM next week and try some of that. If you have other, preferably better ideas, please share :-)

For the time being, would you be fine with just not unloading the libs on windows, so that this doesn't block the PR?

@WardBrian
Copy link
Collaborator

I think the reason we see a deadlock is that another thread is in a spin lock waiting on thread 6, so when that thread has a failure it just hangs.

I tried really basic DLLs and wasn’t able to recreate the behavior. I didn’t do anything like link in TBB, which could easily be the culprit here if something is hitting a race condition with a library being loaded and unloaded in the Windows API.

One small observation: changing the list so that the libraries which are loaded/unloaded are in fact the same library makes the issue go away. Not sure what that means, yet

@aseyboldt
Copy link
Contributor Author

aseyboldt commented Apr 23, 2023

I think the reason we see a deadlock is that another thread is in a spin lock waiting on thread 6, so when that thread has a failure it just hangs.

good point... But which one is the waiting thread, and which one fails?

One small observation: changing the list so that the libraries which are loaded/unloaded are in fact the same library makes the issue go away. Not sure what that means, yet

I tried that as well. I think the reason is that in that case the library is simply never actually unloaded. If you load an already loaded library, windows just increments a refcount, and only unloads it if the refcount goes to zero.

@WardBrian
Copy link
Collaborator

Tried an older libloading, no difference. I believe it's thread 1 that is parked:

Details
* thread #1
  * frame #0: 0x00007ffcf38f2844 ntdll.dll`NtWaitForAlertByThreadId + 20
    frame #1: 0x00007ffcf38839c5 ntdll.dll`TpWorkOnBehalfClearTicket + 149
    frame #2: 0x00007ffcf389f9bc ntdll.dll`RtlDllShutdownInProgress + 652
    frame #3: 0x00007ffcf389ba83 ntdll.dll`RtlWaitOnAddress + 19
    frame #4: 0x00007ffcf10b62e3 KernelBase.dll`WaitOnAddress + 51
    frame #5: 0x00000000004a4725 create_models-b0050a5c3b0ab44f.exe`std::thread::park_timeout [inlined] std::sys::windows::thread_parking::Parker::park_timeout at thread_parking.rs:149:13
    frame #6: 0x00000000004a46bf create_models-b0050a5c3b0ab44f.exe`std::thread::park_timeout at mod.rs:1076:9
    frame #7: 0x00000000004347d5 create_models-b0050a5c3b0ab44f.exe`std::sync::mpmc::context::Context::with::{{closure}} [inlined] std::sync::mpmc::context::Context::wait_until at context.rs:130:21
    frame #8: 0x00000000004347c0 create_models-b0050a5c3b0ab44f.exe`std::sync::mpmc::context::Context::with::{{closure}} [inlined] std::sync::mpmc::list::Channel<T>::recv::{{closure}} at list.rs:444:27
    frame #9: 0x00000000004346f8 create_models-b0050a5c3b0ab44f.exe`std::sync::mpmc::context::Context::with::{{closure}} at context.rs:50:13
    frame #10: 0x0000000000432da3 create_models-b0050a5c3b0ab44f.exe`std::sync::mpmc::list::Channel<T>::recv [inlined] std::sync::mpmc::context::Context::with::{{closure}} at context.rs:58:31
    frame #11: 0x0000000000432d67 create_models-b0050a5c3b0ab44f.exe`std::sync::mpmc::list::Channel<T>::recv [inlined] std::thread::local::LocalKey<T>::try_with at local.rs:446:16
    frame #12: 0x0000000000432d58 create_models-b0050a5c3b0ab44f.exe`std::sync::mpmc::list::Channel<T>::recv [inlined] std::sync::mpmc::context::Context::with at context.rs:53:9
    frame #13: 0x0000000000432d33 create_models-b0050a5c3b0ab44f.exe`std::sync::mpmc::list::Channel<T>::recv at list.rs:434:13
    frame #14: 0x000000000044abb3 create_models-b0050a5c3b0ab44f.exe`test::console::run_tests_console [inlined] std::sync::mpmc::Receiver<T>::recv_deadline at mod.rs:340:43
    frame #15: 0x000000000044ab9e create_models-b0050a5c3b0ab44f.exe`test::console::run_tests_console [inlined] std::sync::mpmc::Receiver<T>::recv_timeout at mod.rs:323:31
    frame #16: 0x000000000044ab3d create_models-b0050a5c3b0ab44f.exe`test::console::run_tests_console [inlined] std::sync::mpsc::Receiver<T>::recv_timeout at mod.rs:920:9
    frame #17: 0x000000000044ab3d create_models-b0050a5c3b0ab44f.exe`test::console::run_tests_console [inlined] test::run_tests at lib.rs:417:27
    frame #18: 0x0000000000449a9a create_models-b0050a5c3b0ab44f.exe`test::console::run_tests_console at console.rs:298:5
    frame #19: 0x00000000004634b7 create_models-b0050a5c3b0ab44f.exe`test::test_main at lib.rs:140:15
    frame #20: 0x000000000046449f create_models-b0050a5c3b0ab44f.exe`test::test_main_static at lib.rs:159:5
    frame #21: 0x000000000041d1a5 create_models-b0050a5c3b0ab44f.exe`create_models::main at create_models.rs:1:1
    frame #22: 0x0000000000401c5b create_models-b0050a5c3b0ab44f.exe`core::ops::function::FnOnce::call_once((null)=0x000000000041d190, (null)=()) at function.rs:250:5
    frame #23: 0x00000000004196ae create_models-b0050a5c3b0ab44f.exe`std::sys_common::backtrace::__rust_begin_short_backtrace(f=0x000000000041d190) at backtrace.rs:134:18
    frame #24: 0x0000000000405351 create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start::{{closure}} at rt.rs:166:18
    frame #25: 0x00000000004a3315 create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal [inlined] core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once at function.rs:287:13
    frame #26: 0x00000000004a330e create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal [inlined] std::panicking::try::do_call at panicking.rs:487:40
    frame #27: 0x00000000004a330e create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal [inlined] std::panicking::try at panicking.rs:451:19
    frame #28: 0x00000000004a330e create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal [inlined] std::panic::catch_unwind at panic.rs:140:14
    frame #29: 0x00000000004a330e create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal [inlined] std::rt::lang_start_internal::{{closure}} at rt.rs:148:48
    frame #30: 0x00000000004a330e create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal [inlined] std::panicking::try::do_call at panicking.rs:487:40
    frame #31: 0x00000000004a330e create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal [inlined] std::panicking::try at panicking.rs:451:19
    frame #32: 0x00000000004a330e create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal [inlined] std::panic::catch_unwind at panic.rs:140:14
    frame #33: 0x00000000004a330e create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start_internal at rt.rs:148:20
    frame #34: 0x000000000040532a create_models-b0050a5c3b0ab44f.exe`std::rt::lang_start(main=0x000000000041d190, argc=1, argv=0x0000000001042650, sigpipe=0) at rt.rs:165:17
    frame #35: 0x000000000041d1df create_models-b0050a5c3b0ab44f.exe`main + 47
    frame #36: 0x00000000004013ed create_models-b0050a5c3b0ab44f.exe`__tmainCRTStartup at crtexe.c:334
    frame #37: 0x000000000040152b create_models-b0050a5c3b0ab44f.exe`mainCRTStartup at crtexe.c:212
    frame #38: 0x00007ffcf1ae269d kernel32.dll`BaseThreadInitThunk + 29
    frame #39: 0x00007ffcf38aa9f8 ntdll.dll`RtlUserThreadStart + 40

Potentially related issues:
https://stackoverflow.com/questions/68536333/rust-program-with-rust-dll-panics-when-spawning-a-thread
nagisa/rust_libloading#82

The second points towards how libloading calls FreeLibrary in it's own drop, so I don't think we need to?

https://github.com/nagisa/rust_libloading/blob/83b1037f21850e7da0cb9c4be17ed05da63894b2/src/os/windows/mod.rs#L302

I'm personally ok with leaking the library so it doesn't get closed (I confirmed all the tests pass if we do this) but I'd still like to get to the bottom of this if possible simply because it's concerning. I'm going to try building bridgestan with the MSVC-abi clang against TBB2021 next

@aseyboldt
Copy link
Contributor Author

aseyboldt commented Apr 23, 2023

The original version didn't manually unload the lib but just relied on drop, I only changed that in the tests to that I could see if unloading was returning an error, and so that I could put loading and unloading of the libs behind a mutex to avoid any possible thread unsafetly there (I think there really shouldn't be any actually).

The traceback for thread 1 looks like it is just waiting for a value on the channel, which would be fine I think. I'm wondering why the thread that loads the library is stuck.

I also just tried to delay exiting the worker threads that load the library until all other threads are finished, and in that case I don't see the issue anymore (aseyboldt@fffb1e0). Not sure what that tells us though :-)

@WardBrian
Copy link
Collaborator

That again sounds eerily similar to the warning provided in the docs for FreeLibraryAndExitThread

As one further wrinkle, compiling the bridgestan models with MSVC/clang and a newer TBB resolved the issue.

This suggests to me one of two things:

  1. We found a bug (or at least some poor behavior) in MinGW's libwinpthread.dll.
  2. TBB updating fixed a bug (I'm seeing now if I can combine clang and TBB 2020.3).

At the very least, this suggests to me that it is not a bug in BridgeStan/Stan, which was my primary fear before moving forward.

@aseyboldt
Copy link
Contributor Author

Makes sense. I'd also be curious if MinGW + Recent TBB works.

@WardBrian
Copy link
Collaborator

  • I can't get clang+old TBB to work happily together, at least not with clang 15, because it complains about not having the /MD flag. This seems to be a hardcoded check in the old version. Clang 17 has a -fms-runtime-lib= arg, which claims to be equivalent to /MD, but I don't have access to it yet.
  • MingW + new TBB is also tricky, since it requires me to build TBB from source with MinGW, and TBB re-wrote their build system to use cmake in the 2020->2021 update, so the old process we used in Stan isn't available. Not impossible, just now how I plan on spending a Sunday 🙂

I did try both stable-msvc and stable-gnu for the Rust toolchain with the clang+tbb2021 bridgestan libraries, both worked.

Just to doc the rest of my versions, I've been using mingw-gcc version 5.3.0, and visual studio build tools 17.4.4

@aseyboldt
Copy link
Contributor Author

aseyboldt commented Apr 26, 2023

Turns out I also see very rare segfaults on linux if we unload a library and then reload it later. That one is at least a bit easier to debug, and I could pin it down to invalid data in the static variable here.

Maybe it just was never a particularly good idea to unload the libraries. It seems to be a far more dangerous thing to do than I realized. (At least this made me realize that I don't have a good understanding of what the dynamic linker is actually doing in detail...).

I'll disable the library unload code on linux as well, and hopefully that makes those issues disappear for good.

@aseyboldt
Copy link
Contributor Author

If that's ok with you I think I would prefer to move the doc updates and release action to a separate PR, this one is I think long enough as it is :-)

Copy link
Collaborator

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments on the code here.

For docs, release, etc, I am happy to see those done separately.
I still think (even with the delay for 2.32.1), we should release 2.0 before this functionality is merged to give it some time to sit on the main branch and flush out anything else we want to tweak before a 2.1 puts it out in to the world. I don't think this will be a particularly long wait, just a week or so seems like a nice spacer between implementation and release.

rust/src/bs_safe.rs Outdated Show resolved Hide resolved
rust/src/bs_safe.rs Show resolved Hide resolved
rust/src/bs_safe.rs Show resolved Hide resolved
rust/tests/create_models.rs Outdated Show resolved Hide resolved
rust/tests/create_models.rs Outdated Show resolved Hide resolved
@aseyboldt
Copy link
Contributor Author

I still think (even with the delay for 2.32.1), we should release 2.0 before this functionality is merged to give it some time to sit on the main branch and flush out anything else we want to tweak before a 2.1 puts it out in to the world. I don't think this will be a particularly long wait, just a week or so seems like a nice spacer between implementation and release.

Makes sense. :-)

@WardBrian
Copy link
Collaborator

WardBrian commented May 5, 2023

@aseyboldt seems like the Windows tests are hung again (been spinning for 1 hour+). Could be related to the issue the other platforms are having (I think it's because #115 changed the cache key?)

@aseyboldt
Copy link
Contributor Author

Seems like I was missing the Makefile in the hash for the cache.
This explains why the linux and mac tests fail, but it does not explain why the windows test is stuck again...
I'll let it finish to have a look at where it was failing.

@aseyboldt
Copy link
Contributor Author

I fixed the missing hash in the workflow, this should fix the unix and mac issue.
I'm a bit lost as to why there would be a timeout in the windows test again, nothing should be unloading any libraries now, so if this does have in fact another source that'll be some more work I guess.
I'll be traveling for a while, so it might take a bit until I can get back to this.

Copy link
Collaborator

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @aseyboldt - just got around to this before I also will be traveling (starting next week).

Unless there are further changes you'd like to make to this, I'm happy to merge and then start working on the docs and release pipeline for this.

@aseyboldt
Copy link
Contributor Author

Looks good to me. Thanks for merging the tests back in.

I'll have time to work on the docs next week.

@WardBrian WardBrian merged commit 9973a26 into roualdes:main May 23, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants