
idle processes during execution #1056

Closed · BoPeng opened this issue Sep 19, 2018 · 15 comments

BoPeng (Contributor) commented Sep 19, 2018

sos generates a number of idle processes when steps are waiting for the completion of tasks or subworkflows. Basically, these processes are pushed aside (not counted towards -j) so that new processes can be created to process the nested workflow, etc. It would be helpful to reuse these idle processes for the processing of new steps.

BoPeng (Contributor, Author) commented Sep 20, 2018

This is a bigger problem now with zmq, because the new branch counts the exact number of processes, so the idle processes will be counted towards -j and fewer processes will actually be running when there are a number of pending steps and workflows. We need to find a way to reuse idle processes.

BoPeng (Contributor, Author) commented Sep 20, 2018

It looks like the easiest solution is to run steps with tasks and/or the sos_run action in the master sos process. Because steps that involve tasks and sos_run are usually "controller" steps without much computation of their own, it seems to make sense to do this. It also avoids the trouble of sending tasks and nested workflows to the master workflow, which simplifies the execution logic.

BoPeng (Contributor, Author) commented Sep 21, 2018

The proposed solution will not work.

  1. Such steps block while waiting, so they cannot be executed in the master thread of the master process.
  2. Because steps are executed in env.sos_dict, all threads in the master process would share it, so at most one such thread (besides the master thread) is allowed. The entire workflow would then stall whenever two waiting steps are needed (e.g., a nested workflow and a step with a task).

Right now I have implemented the zmq version of the original pipe solution. That is to say, idle processes will be present but not counted towards -j, so -j 4 might result in more than 4 sos processes. It would be possible to suspend and wake up idle processes, but I am not sure that is the right thing to do.
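
To make the accounting concrete, it can be pictured as a slot semaphore that a worker releases just before it blocks. This is a minimal illustrative sketch, not SoS internals; slots, execute_body, and wait_for_subworkflow are all made-up names.

import multiprocessing as mp

slots = mp.BoundedSemaphore(4)     # the -j 4 budget

def run_step(execute_body, wait_for_subworkflow):
    with slots:                    # counted towards -j while computing
        execute_body()
    # The process now only waits, so it hands its slot back: it stays
    # alive (hence the extra sos processes) but is no longer counted.
    result = wait_for_subworkflow()
    with slots:                    # take a slot again for post-processing
        return result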

gaow (Member) commented Sep 21, 2018

I think we initially had issues with this solution: nested workflows triggered many more processes than -j. That is why we ended up using slotmanager, I think. Isn't that a problem now?

BoPeng (Contributor, Author) commented Sep 22, 2018

No, the problem is not resolved. The idle processes are still there, just not counted towards -j, so that steps can be executed at full speed. The only feasible solution seems to be pushing the context onto a stack and allowing idle processes to accept new jobs, but that is too much trouble and I do not want to go there before thinking harder about better solutions.

The problem is mostly caused by env.sos_dict, which is a singleton process-level dictionary. We could move it to a local dictionary and pass it around so that multiple steps could be executed side by side (in threads), each with its own sos_dict... A rough sketch of the stack idea is below.
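
Here, env, the stack, and the function names are hypothetical stand-ins for illustration, not actual SoS code:

from types import SimpleNamespace

env = SimpleNamespace(sos_dict={})   # stand-in for the process-level singleton
_saved_contexts = []

def suspend_step():
    # shelve the waiting step's dictionary so this process can take new work
    _saved_contexts.append(env.sos_dict)
    env.sos_dict = {}                # fresh dictionary for the next step

def resume_step():
    # the suspended step resumes with exactly the context it left behind
    env.sos_dict = _saved_contexts.pop()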

gaow mentioned this issue Sep 22, 2018
gaow (Member) commented Sep 24, 2018

Is the -j switch also broken? I tried to run a workflow with -j 8, but I got the following:

[screenshot: 2018-09-23-20-34-18_scrot]

and it kind of freezes my computer now...

BoPeng (Contributor, Author) commented Sep 24, 2018

Needs an example to duplicate this...

gaow (Member) commented Sep 24, 2018 via email

BoPeng (Contributor, Author) commented Sep 24, 2018

Still working on tests. Windows is giving me lots of headaches. Billiard failed (now I see why I gave up on it before), and there are problems with passing sockets around. I will work on -j after all tests pass.

BoPeng (Contributor, Author) commented Sep 28, 2018

@gaow The problem you saw was likely caused by zombie processes resulting from incomplete disposal of zmq resources. The current trunk has addressed this particular problem, so perhaps you can try again and let me know if the problem persists. Note that the idle process problem still exists, so you will see more processes than -j if your workflow contains subworkflows.
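
For anyone hitting something similar: with pyzmq, a socket that is never closed can keep the context, and thus the process, from shutting down. A hedged sketch of the kind of cleanup involved (the endpoint and socket type are arbitrary here):

import zmq

context = zmq.Context()
socket = context.socket(zmq.PULL)
socket.bind('tcp://127.0.0.1:5555')
try:
    pass  # ... receive and process messages ...
finally:
    socket.close(linger=0)  # do not linger on unsent messages
    context.term()          # term() blocks indefinitely if a socket stays open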

gaow (Member) commented Sep 28, 2018

Unfortunately, I do not think it works. Here is an MWE:

[1]
n = [x for x in range(20)]
input: for_each = 'n', concurrent=True
bash:
  stress --cpu 1

and

[GW] sos run test.sos -j2
INFO: Running default_1: 
stress: info: [19469] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19472] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19481] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19487] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19493] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19499] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19505] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19511] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19520] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19526] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19535] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19541] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19547] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19554] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19560] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19566] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19572] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19578] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19584] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19590] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
^CKeyboardInterrupt

I killed it before it got crazier.

BoPeng (Contributor, Author) commented Oct 3, 2018

I still have not figured out a good way to solve the idle process problem, but the concurrent worker problem should be solved. Basically, substeps are now sent to a controller, where workers are created (and destroyed) to handle substeps from all steps. You should usually see j+1 processes, because the one extra process is the step that is waiting for the completion of its substeps.
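
The layout can be pictured with a zmq PUSH/PULL pair. The following is a toy illustration of the pattern under my own naming (URL, worker, controller), not SoS's actual controller code:

import zmq
from multiprocessing import Process

URL = 'tcp://127.0.0.1:5556'    # arbitrary endpoint for this toy example

def worker():
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.connect(URL)
    while True:
        substep = pull.recv_string()
        if substep == 'DONE':         # poison pill from the controller
            break
        print('running', substep)     # stand-in for executing a substep
    pull.close(linger=0)
    ctx.term()

def controller(substeps, n_workers):
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind(URL)
    workers = [Process(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for s in substeps:                # substeps from all steps funnel here
        push.send_string(s)
    for _ in workers:
        push.send_string('DONE')
    for w in workers:
        w.join()
    push.close(linger=0)
    ctx.term()

if __name__ == '__main__':
    controller(['substep_%d' % i for i in range(20)], n_workers=2)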

Also, all tests have passed, so it is time for you to test the masters of sos and sos-notebook. The conversion to zmq is not complete, because many other things could still be done the zmq way (e.g., resuming tasks), but those can be done gradually and should be relatively easy with a working zmq framework.

gaow (Member) commented Oct 3, 2018

Great to know! Is doing it the zmq way also a solution for the idle process problem (I seem to recall you mentioning it before)? Is it a good time to update my cluster SoS installation to use master? I've got stuff in production, but I'm willing to take a risk if it is not too high.

BoPeng (Contributor, Author) commented Oct 3, 2018

zmq makes the handling of idle processes a bit easier because things are less tightly integrated, but I still cannot find a good way to put idle processes to sleep or set them aside. The problem is with our global sos_dict, which cannot be shared by threads in the same process. A real solution would have to make sos_dict local to threads, but that is not an easy thing to do... so this ticket will remain open for a while.
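
The thread-local direction could look roughly like this, assuming a threading.local wrapper; this is speculation about a possible fix, not the eventual SoS design:

import threading

_local = threading.local()

def sos_dict():
    # each thread lazily gets its own dictionary, so two steps running in
    # threads of the same process no longer clobber each other's variables
    if not hasattr(_local, 'd'):
        _local.d = {}
    return _local.d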

From my point of view, it is OK to use master in production as long as all tests pass. Your jobs might fail due to incomplete test coverage (e.g., your sos_variable + depends + concurrent + not-using-the-variable-in-step case), but at least these would be new bugs that can be tested and fixed.

BoPeng (Contributor, Author) commented Feb 28, 2019

#1218

Feels good to close this loooooong-standing ticket.

BoPeng closed this as completed Feb 28, 2019