Deadlock due to too many fake leave timeouts #26

Closed
gaomy3832 opened this issue Jan 23, 2015 · 4 comments

Comments

@gaomy3832
Contributor

I am running a workload in which the main thread and 1024 workers are synchronized by condition variables. At some point, all the workers call wait(). zsim performs a fake leave for this SYS_futex syscall and, after a timeout period, performs a real leave, as explained in #8.

The problem is that, with 1024 threads in wait(), zsim needs to detect 1024 timeouts before it can blacklist this call, when there is only 1 left. Doing so takes longer than the 120-second deadlock detection threshold (it resolves about 5 fake leaves per second, so 1024 fake leaves need about 200 seconds), which causes the simulation to terminate.
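
To make the arithmetic above concrete, here is a minimal C++ sketch. The function names are hypothetical and not part of zsim; the rate of about 5 fake leaves per second and the 120-second threshold are the figures quoted above.

```cpp
#include <cstdint>

// Estimated seconds needed to resolve `waiters` fake leaves one timeout at a
// time, given an approximate resolution rate (hypothetical helper, not zsim code).
constexpr double drainSeconds(uint32_t waiters, double leavesPerSecond) {
    return waiters / leavesPerSecond;
}

// True if draining all fake leaves would take longer than the deadlock
// detection threshold, so the simulation would be terminated first.
constexpr bool tripsDeadlockDetector(uint32_t waiters, double leavesPerSecond,
                                     double thresholdSeconds) {
    return drainSeconds(waiters, leavesPerSecond) > thresholdSeconds;
}
```

With 1024 waiters at 5 leaves per second, `drainSeconds` gives 204.8 seconds, which exceeds the 120-second threshold; at the same rate, roughly 600 waiters is the point where the threshold is crossed.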

There are multiple solutions. First, I can increase the deadlock threshold or disable it, but that still wastes time on those fake leaves. Another solution is to avoid fake leaves for SYS_futex in the first place and always do real leaves for all SYS_futex calls. This seems like a good solution to me, but I am curious whether there is any catch, and why you did not do this.

Thanks!

@s5z
Owner

s5z commented Feb 3, 2015

That's a good point. You can just let all the SYS_futex calls skip the fake leave mechanism, and the futex wait/wake matching code will ensure they stay synchronous. The reason that's not done right now is that both mechanisms were developed independently.
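
A minimal sketch of that suggestion, assuming a hypothetical predicate that decides whether a blocking syscall goes through the fake leave path. This is not zsim's actual code; the function name and the `otherwiseEligible` parameter are illustrative assumptions. `SYS_futex` is the Linux syscall number from `<sys/syscall.h>`.

```cpp
#include <sys/syscall.h>  // SYS_futex, SYS_read (Linux)

// Hypothetical policy check: returns true if the syscall should use the fake
// leave mechanism. `otherwiseEligible` stands in for whatever criteria the
// simulator already applies; futex calls are always excluded so that they do
// a real leave immediately.
inline bool eligibleForFakeLeave(long syscallNumber, bool otherwiseEligible) {
    if (syscallNumber == SYS_futex) {
        return false;  // rely on futex wait/wake matching instead
    }
    return otherwiseEligible;
}
```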

@gaomy3832
Contributor Author

I would like to prepare a pull request for this issue. My current solution is this: when a timeout is detected and we decide to blacklist a fake leave, instead of doing a real leave only on the current thread, do a real leave on all fake-leave threads that wait on the same PC in the same process. This could significantly speed up progress when many threads are used, as they will likely wait on the same sync primitive. We then no longer need to wait for a timeout on each of them.

This modification requires only minimal changes to the current code in schedule.cpp/h, so it will be an easy fix. The most correct approach, however, would be to skip the fake leave for futex altogether.

If you think this is useful, let me know and I will make a pull request.
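
The batching idea described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `FakeLeaveInfo` and `takeMatchingFakeLeaves` are hypothetical names, and the real scheduler data structures may look quite different.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record for a thread that is currently in a fake leave.
struct FakeLeaveInfo {
    uint32_t pid;  // process id
    uint32_t tid;  // thread id
    uint64_t pc;   // PC of the blocking syscall
};

// Called once a timeout is detected for a fake leave at (pid, pc).
// Removes every pending fake leave with the same pid and PC and returns them,
// so the caller can do a real leave on all of them in one pass instead of
// waiting for a separate timeout per thread.
inline std::vector<FakeLeaveInfo> takeMatchingFakeLeaves(
        std::vector<FakeLeaveInfo>& pending, uint32_t pid, uint64_t pc) {
    std::vector<FakeLeaveInfo> toLeave;
    std::vector<FakeLeaveInfo> remaining;
    for (const FakeLeaveInfo& fl : pending) {
        if (fl.pid == pid && fl.pc == pc) {
            toLeave.push_back(fl);    // same sync point: leave now
        } else {
            remaining.push_back(fl);  // unrelated: keep waiting for its own timeout
        }
    }
    pending.swap(remaining);
    return toLeave;
}
```

Matching on both process and PC keeps the batch limited to threads blocked at the same call site, which is the case the comment expects to dominate when many workers wait on one sync primitive.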

@Joyguo

Joyguo commented Jun 4, 2015

I think it will be very useful as the runtime can be shortened. Could you share your patch? I want to shorten the runtime of a large application. Thank you.

@s5z
Owner

s5z commented Jun 4, 2015

Mingyu, go ahead & submit the pull request
