Segfaults in get_nprocs #68
Comments
Here's my current best theory for what might be causing this: our overrides in libparla_context may not be getting correctly preloaded. The resulting shared object has libc as a dependency in its ELF header. Given that to "preload" it into each VEC we just dlmopen libparla_context, this probably means libc's stuff actually gets loaded before our overrides. The only overrides we've actually observed being called successfully from within a VEC are the ones involving pthreads routines. I think the fix is to build libparla_context with undefined symbols so that it doesn't explicitly list libc as a dependency. That'll let us do the equivalent of LD_PRELOAD but within a linker namespace.
All that said, that theory doesn't necessarily mean that there couldn't also be something wrong with our thread affinity wrappers.
The error happens on most runs; eventually a run works. I think it is a per-thread issue, since if I run with fewer cores the error happens less frequently. With more cores I might see the error one or more times. This seems really close to what we're seeing: https://sourceware.org/legacy-ml/libc-help/2019-06/msg00026.html
Probably related: #12
@sestephens73 mentioned on slack that this showed up in the matmul demo as well. The backtrace there was
I don't remember what the exact conditions to reproduce it for that app were. @sestephens73 feel free to add more details if you have them.
Gist reproducing the above trace: https://gist.github.com/sestephens73/9f8c744d5c56bc81283cf8f6d88046cd
Here's an alternate theory as to what could cause this: the current VEC is a thread-local. Spawned threads don't automatically inherit the values of the thread-local variables of the thread that spawned them. Maybe somehow a newly created thread is resolving some thread affinity related stuff in VEC 0 since its thread-local data will be zero-initialized. That could result in some kind of weird failure when shuttling affinity information back and forth.
I tried to handle this by hooking into thread creation. But I might have done it wrong, or not hooked in deeply enough.
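To make the thread-local theory concrete, here is a small Python analogue. It is only a sketch: the real "current VEC" value lives in a native thread-local inside the runtime, and `current_vec`, `set_current_vec`, `spawn_plain`, and `spawn_inheriting` are hypothetical names made up for this illustration. It shows a spawned thread seeing the default value (0, standing in for zero-initialized storage) unless the spawning code captures the parent's value and restores it in the child's entry point, which is the general shape of a thread-creation hook.

```python
import threading

# Hypothetical stand-in for the per-thread "current VEC" id.
# A thread that never set it sees 0, mimicking zero-initialized TLS.
_state = threading.local()


def current_vec():
    return getattr(_state, "vec", 0)


def set_current_vec(vec_id):
    _state.vec = vec_id


def spawn_plain(results):
    # No hook: the child starts with fresh thread-local storage.
    t = threading.Thread(target=lambda: results.append(current_vec()))
    t.start()
    t.join()


def spawn_inheriting(results):
    # Hooked creation: capture the value in the parent thread...
    parent_vec = current_vec()

    def entry():
        # ...and restore it in the child before doing any other work.
        set_current_vec(parent_vec)
        results.append(current_vec())

    t = threading.Thread(target=entry)
    t.start()
    t.join()


set_current_vec(3)  # the parent thread is "inside VEC 3"
plain, inherited = [], []
spawn_plain(plain)
spawn_inheriting(inherited)
print(plain, inherited)  # [0] [3]
```

If a thread is created through a path that bypasses the hook, it behaves like `spawn_plain` and resolves everything against the default VEC, which would be consistent with "not hooked in deeply enough".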
We've been seeing mysterious segfaults in get_nprocs when threads are used together with VECs. The exact conditions that trigger this aren't known since lots of things still appear to work fine.
With the ARPACK demo this does show up, but only if many copies are used (e.g. one ARPACK copy per core, so increase the limit then run 24 copies or something). I most recently saw it there when massively oversubscribed, though, since I wasn't setting `OMP_NUM_THREADS` there yet. I wasn't able to get an informative backtrace beyond seeing `get_nprocs` at the bottom of it.
@hfingler saw segfaults like this several times when debugging the Galois/VECs demo. Here are two backtraces that we saw:
Another one:
@sestephens73 at one point saw this one as well when working on the matmul demo (I'm not sure what the workaround to avoid this there was):