Deadlock due to too many fake leave timeouts #26
Comments
That's a good point. You can just let all SYS_futex calls skip the fake leave mechanism; the futex wait/wake matching code will ensure they stay synchronous. The reason that's not done right now is that the two mechanisms were developed independently.
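A minimal sketch of what such a skip could look like, written as a toy model rather than zsim's scheduler code. The function name `should_fake_leave` and the `(pid, pc)` blacklist layout are hypothetical; the only fixed fact used is that the futex syscall number on x86-64 Linux is 202.

```python
SYS_FUTEX = 202  # futex syscall number on x86-64 Linux


def should_fake_leave(syscall_no, site, blacklist):
    """Decide whether a blocking syscall gets a fake leave (hypothetical policy).

    site is an illustrative (pid, pc) pair identifying the call site;
    blacklist is the set of sites that already timed out once.
    """
    if syscall_no == SYS_FUTEX:
        # Futex calls bypass the fake leave mechanism entirely and
        # always take a real leave.
        return False
    # Any other syscall gets a fake leave unless its site is blacklisted.
    return site not in blacklist


blacklist = {(1, 0x400ABC)}
print(should_fake_leave(SYS_FUTEX, (1, 0x400DEF), blacklist))  # False
print(should_fake_leave(0, (1, 0x400DEF), blacklist))          # True
print(should_fake_leave(0, (1, 0x400ABC), blacklist))          # False
```

With a policy like this, no timeout is ever spent on a futex wait, at the cost of treating every futex call as potentially long-blocking.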
I would like to put together a pull request for this issue. My current solution: when a timeout is detected and we decide to blacklist a fake leave, instead of doing a real leave only for the current thread, do a real leave for all fake-leave threads that are waiting at the same PC in the same process. This could significantly speed up progress when many threads are used, since they will likely be waiting on the same sync primitive, and we no longer need to wait for a separate timeout for each of them. The modification requires only minimal changes to the current code in schedule.cpp/h, so it should be an easy fix. The most correct approach, however, would be to skip fake leaves for futex altogether. If you think this is useful, let me know and I will make a pull request.
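A rough model of the batching idea, not the actual change to schedule.cpp/h: fake leaves are indexed by (process, PC), so a single timeout can hand back the whole group for a real leave. All names here (`FakeLeave`, `FakeLeaveTable`, `on_timeout`) and the example addresses are hypothetical and only illustrate the grouping.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeLeave:
    """Hypothetical record of one thread currently in a fake leave."""
    pid: int  # process id
    tid: int  # thread id
    pc: int   # program counter of the blocking syscall


class FakeLeaveTable:
    """Toy model: fake leaves grouped by (pid, pc) call site."""

    def __init__(self):
        self._by_site = defaultdict(list)  # (pid, pc) -> [FakeLeave, ...]
        self.blacklist = set()             # sites that should get real leaves from now on

    def add(self, fl):
        self._by_site[(fl.pid, fl.pc)].append(fl)

    def on_timeout(self, fl):
        """Blacklist the site and return every fake leave registered there.

        The caller would then perform a real leave for each returned entry,
        instead of waiting for a separate timeout per thread.
        """
        site = (fl.pid, fl.pc)
        self.blacklist.add(site)
        return self._by_site.pop(site, [])


table = FakeLeaveTable()
for tid in range(1024):
    table.add(FakeLeave(pid=1, tid=tid, pc=0x400ABC))
table.add(FakeLeave(pid=1, tid=5000, pc=0x400DEF))  # different call site

released = table.on_timeout(FakeLeave(pid=1, tid=0, pc=0x400ABC))
print(len(released))  # 1024
```

The thread at the other call site stays in its fake leave, so only waiters that share the timed-out PC are affected.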
I think it will be very useful, as the runtime can be shortened. Could you share your patch? I want to shorten the runtime of a large application. Thank you.
Mingyu, go ahead and submit the pull request.
I am running a workload in which the main thread and 1024 workers are synchronized with condition variables. At some point, all the workers call wait(). zsim does a fake leave for this SYS_futex syscall and, after some timeout period, does a real leave, as explained in #8.
The problem is that, since there are 1024 wait() calls, zsim needs to detect 1024 timeouts before it can blacklist this call, when there is only 1 left. The time to do so exceeds the 120-second deadlock detection threshold (it resolves about 5 fake leaves per second, so 1024 fake leaves need about 200 seconds), which causes the simulation to terminate.
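The arithmetic behind that estimate, as a quick check. The numbers are the ones quoted above (an approximate rate of 5 fake leaves per second and a 120-second threshold); the variable names are only illustrative.

```python
# Back-of-the-envelope check using the figures reported in this issue.
waiting_threads = 1024      # workers blocked in wait()
resolved_per_second = 5     # approximate observed rate of resolved fake leaves
deadlock_threshold_s = 120  # deadlock detection threshold in seconds

time_to_resolve_s = waiting_threads / resolved_per_second
print(f"time to resolve all fake leaves: {time_to_resolve_s:.1f} s")
print("exceeds deadlock threshold:", time_to_resolve_s > deadlock_threshold_s)
```

At that rate, anything above roughly 600 waiting threads would already cross the 120-second limit.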
There are multiple solutions. First, I can increase the deadlock threshold or disable it, but that still wastes time on those fake leaves. Another solution is to avoid fake leaves for SYS_futex in the first place and unconditionally do real leaves for all SYS_futex calls. This seems like a good solution to me, but I am curious whether there is any catch, and why you did not do this.
Thanks!