Rust backend #105
I guess the main question now is if this should end up here.
And if you'd rather not have it in the bridgestan repo, I can do those on the nutpie repo. |
I don't speak for @roualdes and @bob-carpenter, but personally I would be happy to have this merged here. My thoughts on next steps, mirroring yours
A more general question @aseyboldt - we currently have unified version numbers across all our APIs. This is nice for several reasons, but it is also one of the main burdens adding a new interface creates - a breaking change in the Rust interface would mean a major version bump for all the interfaces. Could you see a reason we'd need to do such a thing (for any reason other than us changing our C-API?) |
The orphan rule really shouldn't be an issue whichever way we do this, nutpie could just have a
and implement
Yes, I'll add a todo list in the description, and add that.
I guess pretty much anything that uses bridgestan for something serious will probably need to parse the variable names, so I thought it would make sense to just put it in here, but I'm also fine with moving it to make things more consistent. If consistency is the goal, maybe adding something like it to the other interfaces might be nice as well though. :-)
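For readers wondering what "parsing the variable names" could involve, here is a minimal sketch. It assumes the names arrive as one comma-separated string with dot-separated integer indices (for example `mu,theta.1,theta.2`); `ParsedName` and `parse_param_names` are hypothetical names for illustration, not the helper that was in this PR.

```rust
/// A parsed variable name: base name plus any trailing integer indices.
#[derive(Debug, PartialEq)]
struct ParsedName {
    base: String,
    indices: Vec<usize>,
}

/// Hypothetical helper: split a comma-separated list such as
/// "mu,theta.1,theta.2" into base names and indices.
fn parse_param_names(csv: &str) -> Vec<ParsedName> {
    csv.split(',')
        .filter(|s| !s.is_empty())
        .map(|name| {
            let mut parts = name.split('.');
            let base = parts.next().unwrap_or("").to_string();
            // Non-numeric suffixes are silently skipped in this sketch.
            let indices: Vec<usize> = parts.filter_map(|p| p.parse().ok()).collect();
            ParsedName { base, indices }
        })
        .collect()
}

fn main() {
    for p in parse_param_names("mu,theta.1,theta.2,Sigma.2.3") {
        println!("{} {:?}", p.base, p.indices);
    }
}
```

A sampler would typically group entries by `base` to recover the shape of each variable.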
If it is published on crates.io, it will automatically get a page on docs.rs, so just linking to that would probably be the easiest way?
I've never done this, but it should be pretty straightforward. I think a repo owner has to create a crates.io account and generate an API token. A github action can then use that token to pretty much just execute
Nothing immediately comes to mind, and I have a running version with nutpie locally now, so I think there's at least not anything broken or missing that would make writing a sampler impossible. But of course I could have missed something. I can see how a unified version approach gets more difficult the more language bindings you have... |
That all sounds good to me.
I agree it would be nice to think of something like this for the other interfaces, but for the initial inclusion of a Rust interface I think it should be omitted. We can then do follow-on work on all the languages at once. |
@WardBrian I pushed a change to the github workflow to also run the rust test, could you approve that workflow run? I also made a couple other smaller changes to the rust interface:
|
@aseyboldt, I went to approve the workflow but it looks like there's an error on line 225 of the file |
Yeah, I'm running it locally first, then try to push again. Sorry. |
No worries at all. |
I don't think I've quite gotten it to work yet, but at least the clang deps shouldn't be missing anymore. |
I'm running into a strange timeout issue in the tests on windows: I'm not really sure what might be causing this. Is that anything that has come up in some way before? |
Most of the other backends' tests load the same library more than once. Windows is our finickiest CI by far. Despite the fact that I can run them all locally on Windows, I still couldn't get the github actions environment right to run the Python tests to completion. The default runners they provide for Windows have a lot going on (I think I counted like 3 different installations of gcc before R was even installed). Might be worth trying the mingw Rust toolchain? Our experience with other things has been that the libraries load fine even with a mix of compilers, but maybe libloading is more sensitive? |
The tests run in parallel by default, so parallel loading could also be the issue here (there is even a note about that in the libloading docs, so maybe we need to add a mutex for that). I changed the tests to run serially, but added an explicit test for loading the lib in parallel; hopefully that helps to narrow it down. And that test is probably a good idea anyway.
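A rough sketch of the "mutex around loading" idea, using only the standard library. `load_library` is a stand-in that returns a string instead of calling into libloading, and `LOAD_LOCK` and `load_in_parallel` are made-up names for illustration.

```rust
use std::sync::Mutex;
use std::thread;

// Global lock taken by every (stand-in) load call.
static LOAD_LOCK: Mutex<()> = Mutex::new(());

/// Stand-in for opening a shared library; a real version would call the
/// platform loader while the guard is held.
fn load_library(path: &str) -> String {
    let _guard = LOAD_LOCK.lock().unwrap();
    format!("handle:{path}")
}

/// Load several libraries from separate threads and collect the handles
/// in the order the paths were given.
fn load_in_parallel(paths: &[&str]) -> Vec<String> {
    thread::scope(|s| {
        let workers: Vec<_> = paths
            .iter()
            .map(|&p| s.spawn(move || load_library(p)))
            .collect();
        workers
            .into_iter()
            .map(|w| w.join().unwrap())
            .collect::<Vec<String>>()
    })
}

fn main() {
    let handles = load_in_parallel(&["a.so", "b.so", "c.so"]);
    println!("{handles:?}");
}
```

The threads still start concurrently; only the load call itself is serialized, which is the part the explicit parallel-load test is meant to exercise.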
I'll try that if this doesn't work. It would be good if this worked on the default toolchain though, and from what I found online mixing mingw and msvc should actually work for pure C interfaces... |
I guess I made some progress. If I run the original tests serially, everything passes. https://github.com/roualdes/bridgestan/actions/runs/4759489941/jobs/8458862422?pr=105#step:13:154 I doubt that this is an issue in the rust interface, so I guess we could work around it by simply running the tests in serial for now, but it would be good anyway to find the problem. More a blind guess than anything else, but I could imagine that the issue is related to the global variable for a thread pool in each library? I'll run it again with |
TBB_INTERFACE_NEW just changes the math library to be compatible with newer TBBs, it doesn’t actually use them (you need to supply it yourself). I’d be curious if the issue could be recreated in e.g. Julia on windows |
I experimented a bit more, and I think I have a pretty good idea of what's happening, but really no clue as to why... If we load and close the shared libraries in different threads, loading will deadlock. Let's say we have 3 shared libraries (A, B, and C), and three corresponding threads. If we then load A, then B, close A and then load C (nicely one after the other, the threads wait for a signal before they do anything), the last load will get stuck. This does not happen if A, B and C are being loaded in the same thread, and it will also not happen if we do not unload A. We could work around this by simply never closing the libraries (as the julia and I think python backends do). I can only speculate about the cause. Constructors and destructors of global variables should run when we open or close the libraries, so I guess those could be at fault. Or thread local storage somehow gets in the way? Code for that failing example is here: I'd be really surprised if this didn't happen with other backends, if they close the libraries, but I haven't tried. |
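The shape of that experiment can be sketched as below. This is not the failing example referred to above: the workers only log what they would do instead of opening real shared libraries, and `Cmd`, `spawn_worker` and `run_sequence` are hypothetical names. It only illustrates the "one thread per library, each waits for a signal" orchestration.

```rust
use std::sync::mpsc;
use std::thread;

enum Cmd {
    Load,
    Close,
}

/// Spawn a worker that owns one (stand-in) library and acts only when told to.
fn spawn_worker(name: &'static str, log: mpsc::Sender<String>) -> mpsc::Sender<Cmd> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for cmd in rx {
            // A real reproduction would open or close the shared library here.
            let event = match cmd {
                Cmd::Load => format!("load {name}"),
                Cmd::Close => format!("close {name}"),
            };
            log.send(event).unwrap();
        }
    });
    tx
}

/// Drive the sequence from the comment: load A, load B, close A, load C.
fn run_sequence() -> Vec<String> {
    let (log_tx, log_rx) = mpsc::channel();
    let a = spawn_worker("A", log_tx.clone());
    let b = spawn_worker("B", log_tx.clone());
    let c = spawn_worker("C", log_tx);
    let steps = [(&a, Cmd::Load), (&b, Cmd::Load), (&a, Cmd::Close), (&c, Cmd::Load)];
    let mut events = Vec::new();
    for (worker, cmd) in steps {
        worker.send(cmd).unwrap();
        // Wait for this step to finish before signalling the next thread.
        events.push(log_rx.recv().unwrap());
    }
    events
}

fn main() {
    for event in run_sequence() {
        println!("{event}");
    }
}
```

With real libraries, the hang described above would show up as the final `recv` never returning.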
This could just be a property of the Windows
Though, if this was a well-known issue, I suspect |
There is also quite a bit of talk about deadlocks here: https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices I still have a hunch that it might be related to something like this: |
It appears the only place that specific init function is ever called in the Stan codebase is in test code which we would not be including. The autodiff stack does use a global (or thread local, if STAN_THREADS is true) singleton. I was able to recreate the deadlock locally on windows. I can try some basic DLLs and play around with things. Based on the existing print statements, the deadlock occurs somewhere I wouldn’t really have expected, which is interesting |
I managed to attach a debugger to the hung test. Didn't get too far, but did get
I can run some more precise experiments if you have suggestions. I'm once again wondering if there is some bad interaction between the mingw compiler (where |
Okay, tried on |
Hm, that doesn't really tell me much unfortunately. Do you have any idea why it would report an exception when we observe a deadlock? Why wouldn't that error get forwarded somewhere? Debugging wise this is quickly moving outside my comfort zone, not sure where it makes sense to look. But some random ideas:
I can set up a VM next week and try some of that. If you have other, preferably better ideas, please share :-) For the time being, would you be fine with just not unloading the libs on windows, so that this doesn't block the PR? |
I think the reason we see a deadlock is that another thread is in a spin lock waiting on thread 6, so when that thread has a failure it just hangs. I tried really basic DLLs and wasn’t able to recreate the behavior. I didn’t do anything like link in TBB, which could easily be the culprit here if something is hitting a race condition with a library being loaded and unloaded in the Windows API. One small observation: changing the list so that the libraries which are loaded/unloaded are in fact the same library makes the issue go away. Not sure what that means, yet |
good point... But which one is the waiting thread, and which one fails?
I tried that as well. I think the reason is that in that case the library is simply never actually unloaded. If you load an already loaded library, windows just increments a refcount, and only unloads it if the refcount goes to zero. |
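A toy model of that refcounting behaviour, to make the point concrete. `Loader` is a made-up type that only mimics the counting described above; it does not call any Windows API.

```rust
use std::collections::HashMap;

/// Toy model of a loader that reference-counts libraries by path.
#[derive(Default)]
struct Loader {
    refcounts: HashMap<String, usize>,
}

impl Loader {
    /// Loading an already loaded library only bumps its count.
    fn load(&mut self, path: &str) -> usize {
        let count = self.refcounts.entry(path.to_string()).or_insert(0);
        *count += 1;
        *count
    }

    /// Returns true only when the last reference is released, i.e. when the
    /// library would actually be unloaded.
    fn unload(&mut self, path: &str) -> bool {
        let remaining = match self.refcounts.get_mut(path) {
            Some(count) => {
                *count -= 1;
                *count
            }
            None => return false,
        };
        if remaining == 0 {
            self.refcounts.remove(path);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut loader = Loader::default();
    loader.load("model.dll");
    loader.load("model.dll");
    println!("really unloaded: {}", loader.unload("model.dll"));
    println!("really unloaded: {}", loader.unload("model.dll"));
}
```

In this model, "load A, load A, close A" never reaches a real unload, which would be consistent with the issue disappearing when the same library is used throughout.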
Tried an older libloading, no difference. I believe it's thread 1 that is parked: Details
Potentially related issues: The second points towards how I'm personally ok with leaking the library so it doesn't get closed (I confirmed all the tests pass if we do this) but I'd still like to get to the bottom of this if possible simply because it's concerning. I'm going to try building bridgestan with the MSVC-abi clang against TBB2021 next |
The original version didn't manually unload the lib but just relied on drop. I only changed that in the tests so that I could see if unloading was returning an error, and so that I could put loading and unloading of the libs behind a mutex to avoid any possible thread unsafety there (I think there really shouldn't be any actually). The traceback for thread 1 looks like it is just waiting for a value on the channel, which would be fine I think. I'm wondering why the thread that loads the library is stuck. I also just tried to delay exiting the worker threads that load the library until all other threads are finished, and in that case I don't see the issue anymore (aseyboldt@fffb1e0). Not sure what that tells us though :-) |
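One way to express "don't let a worker exit until all the others are done" is a barrier at the end of each worker. This is only a sketch of that idea with a stand-in workload; it is not the code from the linked commit, and `run_workers` is a hypothetical name.

```rust
use std::sync::{Arc, Barrier};
use std::thread;

/// Run `n` workers that all wait for each other before exiting.
fn run_workers(n: usize) -> Vec<usize> {
    let barrier = Arc::new(Barrier::new(n));
    let handles: Vec<_> = (0..n)
        .map(|id| {
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                // Stand-in for loading and using a library on this thread.
                let result = id * 2;
                // Do not exit until every worker has reached this point.
                barrier.wait();
                result
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    println!("{:?}", run_workers(3));
}
```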
That again sounds eerily similar to the warning provided in the docs for As one further wrinkle, compiling the bridgestan models with MSVC/clang and a newer TBB resolved the issue. This suggests to me one of two things:
At the very least, this suggests to me that it is not a bug in BridgeStan/Stan, which was my primary fear before moving forward. |
Makes sense. I'd also be curious if MinGW + Recent TBB works. |
I did try both Just to doc the rest of my versions, I've been using mingw-gcc version 5.3.0, and visual studio build tools 17.4.4 |
Turns out I also see very rare segfaults on linux if we unload a library and then reload it later. That one is at least a bit easier to debug, and I could pin it down to invalid data in the static variable here. Maybe it just was never a particularly good idea to unload the libraries. It seems to be a far more dangerous thing to do than I realized. (At least this made me realize that I don't have a good understanding of what the dynamic linker is actually doing in detail...). I'll disable the library unload code on linux as well, and hopefully that makes those issues disappear for good. |
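For illustration, "never unload" can be done in Rust by keeping the handle in a `ManuallyDrop` so its destructor never runs. The sketch below uses a fake handle type that counts drops instead of a real library handle; `FakeLibrary`, `LeakedLibrary` and `open_leaked` are assumptions made for the example, not names from the PR.

```rust
use std::mem::ManuallyDrop;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how often the stand-in "unload" ran.
static UNLOADS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a library handle whose Drop would unload the library.
struct FakeLibrary;

impl Drop for FakeLibrary {
    fn drop(&mut self) {
        UNLOADS.fetch_add(1, Ordering::SeqCst);
    }
}

/// Wrapper that never runs the handle's destructor, so the library stays loaded.
struct LeakedLibrary {
    _lib: ManuallyDrop<FakeLibrary>,
}

fn open_leaked() -> LeakedLibrary {
    LeakedLibrary {
        _lib: ManuallyDrop::new(FakeLibrary),
    }
}

fn main() {
    drop(FakeLibrary); // runs the stand-in unload
    drop(open_leaked()); // destructor of the inner handle is skipped
    println!("unloads: {}", UNLOADS.load(Ordering::SeqCst));
}
```

The cost is that the library's memory is only reclaimed at process exit, which seems an acceptable trade given the deadlocks and segfaults described above.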
If that's ok with you I think I would prefer to move the doc updates and release action to a separate PR, this one is I think long enough as it is :-) |
A few minor comments on the code here.
For docs, release, etc, I am happy to see those done separately.
I still think (even with the delay for 2.32.1), we should release 2.0 before this functionality is merged to give it some time to sit on the main branch and flush out anything else we want to tweak before a 2.1 puts it out into the world. I don't think this will be a particularly long wait, just a week or so seems like a nice spacer between implementation and release.
Makes sense. :-) |
@aseyboldt seems like the Windows tests are hung again (been spinning for 1 hour+). Could be related to the issue the other platforms are having (I think it's because #115 changed the cache key?) |
Seems like I was missing the |
I fixed the missing hash in the workflow, this should fix the unix and mac issue. |
Sorry for the delay @aseyboldt - just got around to this before I also will be traveling (starting next week).
Unless there are further changes you'd like to make to this, I'm happy to merge and then start working on the docs and release pipeline for this.
Looks good to me. Thanks for merging the tests back in. I'll have time to work on the docs next week. |
This is based on #88 and @WardBrian's branch.
replaces #102
It is still not decided if this should live in nutpie or in bridgestan, either would be fine for me.
I made a couple of changes compared to Brian's branch:
I'm not 100% sure about the handling of string/bytes data in the interface. It might be easier to return &CStr in a couple of places, and leave worrying about the encoding to users.
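To make the trade-off concrete, here is a small sketch of the two options using only `std::ffi`. The function names `name_raw` and `name_lossy` are hypothetical and are not part of the interface in this PR.

```rust
use std::ffi::{CStr, CString};

/// Option 1: hand back the raw C string and let the caller decide on encoding.
fn name_raw(buf: &CString) -> &CStr {
    buf.as_c_str()
}

/// Option 2: decode inside the interface, replacing invalid UTF-8 sequences.
fn name_lossy(buf: &CString) -> String {
    buf.to_string_lossy().into_owned()
}

fn main() {
    let good = CString::new("my_model").unwrap();
    // The caller chooses strict decoding and handles the error case itself.
    match name_raw(&good).to_str() {
        Ok(s) => println!("valid UTF-8: {s}"),
        Err(e) => println!("not UTF-8: {e}"),
    }
    println!("lossy: {}", name_lossy(&good));
}
```

Returning `&CStr` keeps the interface allocation-free and honest about the data, at the price of every caller having to pick a decoding strategy.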
TODO