pipeline example waits for more units than it submits? #113

Closed
andre-merzky opened this issue Oct 21, 2015 · 19 comments

@andre-merzky
Member

the output shows:

submit 16 unit(s)
        ................                                                      ok
wait for 16 unit(s)
        ++++++++++++++++                                                      ok
                                                                            done
Waiting for step_2 to complete.                                                \
submit 16 unit(s)
        ................                                                      ok
wait for 32 unit(s)
        ++++++++++++++++++++++++++++++++                                      ok

So the second step submits 16 units but waits for 32? I assume it waits for the units from step_1 again, which is not very intuitive. Also, not needed...

@vivek-bala
Contributor

I think that is from rp. The reporting seems to be cumulative.

@andre-merzky
Member Author

wait for 32 unit(s) - it is not just the reporting: it really waits for that many units! I assume (but did not check) that enmd simply calls umgr.wait_units()? That would wait for all units ever submitted to that umgr -- which explains the numbers. You probably want to pass a list of unit IDs: umgr.wait_units(unit_ids_for_this_step) ...

@andre-merzky
Member Author

PS.: don't bother fixing this for the tutorial...

@vivek-bala
Contributor

https://github.com/radical-cybertools/radical.ensemblemd/blob/master/src/radical/ensemblemd/exec_plugins/pipeline/static.py#L332

Every step has "N" CUs and every step waits for the "N" CUs to finish. So I
submit N at a time and I do wait_units(), doesn't that mean you wait for N
CUs at a time (since the N from the previous step have already finished -
"Done"). In the above case N=16.


@andre-merzky
Member Author

wait_units() will wait on all units which have ever been submitted to that umgr. How would the umgr know which ones not to wait for?

If you want to wait for N CUs, then you need to pass the UIDs for those N CUs to wait_units()...
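
To make the numbers from the report concrete, here is a minimal sketch of the bookkeeping described above. `ToyUnitManager`, its `submit_units(count)` signature, and the state names are made up for illustration; this is not the radical.pilot implementation, only a toy that keeps every submitted unit and, like the call discussed here, falls back to all of them when no IDs are passed.

```python
import itertools


class ToyUnitManager:
    """Toy stand-in for a unit manager (hypothetical, not radical.pilot)."""

    def __init__(self):
        self._states = {}              # uid -> state, for every unit ever submitted
        self._ids = itertools.count()

    def submit_units(self, count):
        uids = ["unit.%04d" % next(self._ids) for _ in range(count)]
        for uid in uids:
            self._states[uid] = "NEW"
        return uids

    def wait_units(self, uids=None):
        # No IDs given: fall back to every unit this manager knows about.
        if uids is None:
            uids = list(self._states)
        for uid in uids:
            self._states[uid] = "DONE"  # pretend the unit ran to completion
        return [self._states[uid] for uid in uids]


umgr = ToyUnitManager()

step_1 = umgr.submit_units(16)
waited_step_1 = umgr.wait_units()        # 16 units known so far

step_2 = umgr.submit_units(16)
waited_all = umgr.wait_units()           # step_1 units are included again
waited_step_2 = umgr.wait_units(step_2)  # only the units of this step

print(len(waited_step_1), len(waited_all), len(waited_step_2))  # 16 32 16
```

Keeping the list of IDs returned at submission time, one list per step, is all that is needed to restrict the wait to the current step.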

@vivek-bala
Contributor

> wait_units() will wait on all units which have ever been submitted to that umgr. How would the umgr know which ones not to wait for?

But the first N CUs have finished executing (reached the 'Done' state). I cannot understand "waiting" for those completed CUs.

@andre-merzky
Member Author

consider (pseudo code):

umgr = UnitManager()
unit_1 = umgr.submit('sleep 1')
sleep(5) # unit 1 is DONE now
unit_2 = umgr.submit('sleep 1')
umgr.wait_units()

Is the umgr checking one or two units? Both obviously, because the application submitted both, and would otherwise not know if the first one is done or not...

Please use

umgr.wait_units(unit_2.uid)

if you only want to wait for the second...
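
A runnable approximation of the pseudo code above, for anyone who wants to try the semantics without a pilot. Everything here is an assumption for illustration: `ToyUnit` and `ToyUnitManager` are hypothetical classes, `submit(runtime)` takes a sleep duration instead of a command, and the units "run" in background threads. The only point is that a finished unit stays registered with the manager, so a bare `wait_units()` still reports on it.

```python
import threading
import time


class ToyUnit:
    """Unit that 'runs' by sleeping in a background thread (illustrative only)."""

    def __init__(self, uid, runtime):
        self.uid = uid
        self.state = "EXECUTING"
        self._finished = threading.Event()
        worker = threading.Thread(target=self._run, args=(runtime,), daemon=True)
        worker.start()

    def _run(self, runtime):
        time.sleep(runtime)
        self.state = "DONE"
        self._finished.set()

    def wait(self):
        # Returns immediately if the unit has already finished.
        self._finished.wait()
        return self.state


class ToyUnitManager:
    def __init__(self):
        self._units = {}  # uid -> ToyUnit; units stay here after they finish

    def submit(self, runtime):
        uid = "unit.%d" % (len(self._units) + 1)
        self._units[uid] = ToyUnit(uid, runtime)
        return self._units[uid]

    def wait_units(self, uids=None):
        if uids is None:
            uids = list(self._units)      # all units ever submitted
        elif isinstance(uids, str):
            uids = [uids]                 # accept a single uid as well
        return [self._units[uid].wait() for uid in uids]


umgr = ToyUnitManager()
unit_1 = umgr.submit(0.01)
time.sleep(0.1)                           # unit_1 should be DONE by now
unit_2 = umgr.submit(0.01)

all_states = umgr.wait_units()            # one state per managed unit
second_only = umgr.wait_units(unit_2.uid) # one state, for unit_2 only
print(all_states, second_only)
```

In this toy, waiting on the already finished unit costs almost nothing because its event is already set; that says nothing about the cost of the real call.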

@vivek-bala
Contributor

Oh ok. So even if the CUs are "Done" they are not flushed out of the unit
manager queue (or some data structure). That would explain it.


@andre-merzky
Member Author

No - they are still managed by that unit manager. wait_units returns the states of the units the umgr manages - it would be inconsistent if it only returned DONE states of units which were not DONE when the call started - it would return a different number every time, and none at all if all are done. Uh... ;)

@marksantcroos
Contributor

I agree with the semantics, but I can also see why it confused Vivek. We probably need to document that better.

@andre-merzky
Member Author

@andre-merzky
Member Author

Maybe add 'that includes previously completed units'?

@marksantcroos
Contributor

Yeah, be more verbose about it. Especially given that we have this reporting now, it is not really intuitive.

@vivek-bala
Contributor

so wait_units() is not designed to be a Barrier function (as in MPI), we just happen to use it in such a mode. (?)

umgr = UnitManager()
unit_1 = umgr.submit('sleep 1')
sleep(5) # unit 1 is DONE now
unit_2 = umgr.submit('sleep 1')
umgr.wait_units()

Even in this case, IF the time taken to check if the unit_1 is "Done" is small ... doesn't this serve the purpose of waiting for unit_2 ? Not complaining against using umgr.wait_units(unit_2.uid), just want to know what I lose (in terms of time spent maybe).

@marksantcroos
Contributor

> so wait_units() is not designed to be a Barrier function (as in MPI), we just happen to use it in such a mode. (?)

Let's not bring in analogies that don't apply :-)

Here is my attempt at a definition: wait_units(units, state) waits for all units under control of the UM (or a user-specified sub-set of them) to reach the specified state (or the default set of final states).
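
The definition above can be sketched as a small function. This is only a model under stated assumptions: `ScriptedUnit`, the free function `wait_units(managed, uids, state)`, and the state names in `FINAL_STATES` are invented for the example, and units advance through a fixed script on each poll instead of really executing.

```python
FINAL_STATES = ("DONE", "FAILED", "CANCELED")  # illustrative state names


class ScriptedUnit:
    """Unit whose state follows a fixed script, one step per poll."""

    def __init__(self, uid, script):
        self.uid = uid
        self._script = list(script)
        self._pos = 0

    @property
    def state(self):
        return self._script[self._pos]

    def poll(self):
        # Advance one step, but never past the last scripted state.
        if self._pos < len(self._script) - 1:
            self._pos += 1


def wait_units(managed, uids=None, state=None):
    """Block until every selected unit is in one of the target states.

    uids=None selects all managed units; state=None means any final state.
    Loops forever if a target state is never reached (no timeout in this toy).
    """
    selected = [u for u in managed if uids is None or u.uid in uids]
    targets = FINAL_STATES if state is None else tuple(state)
    while not all(u.state in targets for u in selected):
        for u in managed:
            u.poll()
    return [u.state for u in selected]


managed = [
    ScriptedUnit("unit.0", ["NEW", "EXECUTING", "DONE"]),
    ScriptedUnit("unit.1", ["NEW", "EXECUTING", "FAILED"]),
]

reached_executing = wait_units(managed, uids=["unit.1"], state=["EXECUTING"])
reached_final = wait_units(managed)  # all units, default final states

print(reached_executing, reached_final)  # ['EXECUTING'] ['DONE', 'FAILED']
```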

@vivek-bala
Contributor

> wait_units(units, state) waits for all units under control of the UM (or a user-specified sub-set of them) to reach the specified state (or the default set of final states).

I understand that.

But isn't waiting for a unit which is already "Done" ~0 work?

umgr = UnitManager()
unit_1 = umgr.submit('sleep 1')
sleep(5) # unit 1 is DONE now
umgr.wait_units(unit_1.uid)          # 0 work
unit_2 = umgr.submit('sleep 1')
umgr.wait_units([unit_1.uid, unit_2.uid]) # essentially similar to umgr.wait_units(unit_2.uid)

@andre-merzky
Member Author

umgr = UnitManager()
unit_1 = umgr.submit('sleep 10')
unit_2 = umgr.submit('sleep 1')
sleep(random(5,15))
umgr.wait_units()

If random picks 10 or more, then unit_1 is already DONE when wait_units is called, otherwise not. So if wait_units skipped units that were already DONE, it would depend on the application workflow whether it returns one or two states. That is not deterministic, so not easy to handle... You would not even know which unit was done and which one would still be running...
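
The argument can be checked with a few lines that compare the two possible semantics for this exact scenario. Both functions are hypothetical models, not library code: `wait_all` reports one state per managed unit, while `wait_pending_only` models the alternative of reporting only units that were not yet DONE when the call started. The random sleep is replaced by two fixed values so the result is reproducible.

```python
# Runtimes in seconds, taken from the pseudo code above.
RUNTIMES = {"unit_1": 10, "unit_2": 1}


def snapshot(elapsed):
    """State of each unit after the application has slept for `elapsed` seconds."""
    return {
        uid: ("DONE" if elapsed >= runtime else "EXECUTING")
        for uid, runtime in RUNTIMES.items()
    }


def wait_all(elapsed):
    """Semantics described in the thread: one state per managed unit."""
    # After the wait completes, every unit has reached DONE.
    return ["DONE" for _ in sorted(RUNTIMES)]


def wait_pending_only(elapsed):
    """Hypothetical alternative: skip units already DONE at call time."""
    before = snapshot(elapsed)
    return ["DONE" for uid in sorted(before) if before[uid] != "DONE"]


for elapsed in (5, 12):
    print(elapsed, wait_all(elapsed), wait_pending_only(elapsed))
# 5 ['DONE', 'DONE'] ['DONE']
# 12 ['DONE', 'DONE'] []
```

With the first semantics the caller always gets two states back; with the second the length of the result depends on how long the application happened to sleep.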

@vivek-bala
Contributor

But I am not interested in what it returns, I simply want to wait till all CUs (say of the current iteration) are "Done". If I use wait_units(), it also returns CUs of the previous iterations (since they are "done") as well, agreed. But it achieves what I wanted.

umgr = UnitManager()
unit_1 = umgr.submit('sleep 10')
sleep(random(5,15))
umgr.wait_units() # not interested in which or how many CUs it returns, as long as it waits for ALL CUs to reach Done

print('check')

If all I require is that "check" is printed after unit_1 is "Done", isn't the above script doing exactly that?

@andre-merzky
Member Author

It is fine that you are not interested in what it returns, but returning the states is what the call does :P
