pipeline example waits for more units than it submits? #113

andre-merzky · 2015-10-21T22:05:59Z

the output shows:

submit 16 unit(s)
        ................                                                      ok
wait for 16 unit(s)
        ++++++++++++++++                                                      ok
                                                                            done
Waiting for step_2 to complete.                                                \
submit 16 unit(s)
        ................                                                      ok
wait for 32 unit(s)
        ++++++++++++++++++++++++++++++++                                      ok

so the second step submits 16 units, but waits for 32? I assume it waits for the units from step_1 again -- but that is not very intuitive. Also, not needed...

vivek-bala · 2015-10-21T22:09:38Z

I think that is from rp. The reporting seems to be cumulative.

andre-merzky · 2015-10-21T22:22:25Z

wait for 32 unit(s) - it is not just the reporting: it really waits for that many units! I assume (but did not check) that enmd simply calls umgr.wait_units()? That would wait for all units ever submitted to that umgr -- which explains the numbers. You probably want to pass a list of unit IDs: umgr.wait_units(unit_ids_for_this_step) ...

andre-merzky · 2015-10-21T22:35:56Z

PS.: don't bother fixing this for the tutorial...

vivek-bala · 2015-10-21T22:41:15Z

https://github.com/radical-cybertools/radical.ensemblemd/blob/master/src/radical/ensemblemd/exec_plugins/pipeline/static.py#L332

Every step has "N" CUs, and every step waits for its "N" CUs to finish. So I
submit N at a time and call wait_units(). Doesn't that mean it waits for N
CUs at a time, since the N from the previous step have already finished
("Done")? In the above case N=16.

andre-merzky · 2015-10-22T00:16:11Z

wait_units() will wait on all units which have ever been submitted to that umgr. How would the umgr know which ones not to wait for?

If you want to wait for N CUs, then you need to pass the UIDs for those N CUs to wait_units()...
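To make the numbers from the report concrete, here is a minimal sketch. `ToyUnitManager` and its `submit` method are hypothetical stand-ins, not the radical.pilot implementation; the class only mimics the behaviour described in this thread: `wait_units()` without arguments covers every unit ever submitted to the manager, while passing UIDs restricts the wait to those units.

```python
import itertools


class ToyUnitManager:
    """Hypothetical stand-in for a unit manager (not the radical.pilot API)."""

    def __init__(self):
        self._states = {}                 # uid -> state, never flushed
        self._ids = itertools.count()

    def submit(self, n):
        """Register n units and return their UIDs."""
        uids = ["unit.%04d" % next(self._ids) for _ in range(n)]
        for uid in uids:
            self._states[uid] = "NEW"
        return uids

    def wait_units(self, uids=None):
        """Return the states of the given units, or of all units if uids is None."""
        if uids is None:
            uids = list(self._states)     # every unit ever submitted
        for uid in uids:
            self._states[uid] = "DONE"    # toy model: waiting completes the unit
        return [self._states[uid] for uid in uids]


umgr = ToyUnitManager()

step_1 = umgr.submit(16)
waited_step_1 = umgr.wait_units()          # 16 units known so far

step_2 = umgr.submit(16)
waited_all = umgr.wait_units()             # covers the step_1 units again
waited_step_2 = umgr.wait_units(step_2)    # only the units of this step

print(len(waited_step_1), len(waited_all), len(waited_step_2))   # prints: 16 32 16
```

Keeping the list of UIDs returned at submission time, per step, is what lets the pipeline wait for exactly the units it just submitted.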

vivek-bala · 2015-10-22T00:56:16Z

wait_units() will wait on all units which have ever been submitted to that umgr. How would the umgr know which ones not to wait for?

But the first N CUs have finished executing (reach 'Done' state). I cannot understand "waiting" for those completed CUs.

andre-merzky · 2015-10-22T07:08:54Z

consider (pseudo code):

umgr = UnitManager()
unit_1 = umgr.submit('sleep 1')
sleep(5) # unit 1 is DONE now
unit_2 = umgr.submit('sleep 1')
umgr.wait_units()

Is the umgr checking one or two units? Both obviously, because the application submitted both, and would otherwise not know if the first one is done or not...

Please use

umgr.wait_units(unit_2.uid)

if you only want to wait for the second...

vivek-bala · 2015-10-22T07:15:03Z

Oh ok. So even if the CUs are "Done" they are not flushed out of the unit
manager queue (or some data structure). That would explain it.

andre-merzky · 2015-10-22T07:23:08Z

No - they are still managed by that unit manager. wait_units returns the states of the units the umgr manages. It would be inconsistent if it only returned DONE states of units which were not DONE when the call started: it would return a different number every time, and none at all if all are done. Uh... ;)

marksantcroos · 2015-10-22T07:25:12Z

I agree with the semantics, but I can also see why it confused Vivek. We probably need to document that better.

andre-merzky · 2015-10-22T07:27:05Z

http://radicalpilot.readthedocs.org/en/latest/apidoc.html#radical.pilot.UnitManager.wait_units

second sentence

andre-merzky · 2015-10-22T07:27:44Z

Maybe add 'that includes previously completed units'?

marksantcroos · 2015-10-22T07:28:57Z

Yeah, be more verbose about it. Especially given that we have this reporting now, it is not really intuitive.

vivek-bala · 2015-10-22T07:46:04Z

so wait_units() is not designed to be a Barrier function (as in MPI), we just happen to use it in such a mode. (?)

umgr = UnitManager()
unit_1 = umgr.submit('sleep 1')
sleep(5) # unit 1 is DONE now
unit_2 = umgr.submit('sleep 1')
umgr.wait_units()

Even in this case, if the time taken to check whether unit_1 is "Done" is small, doesn't this serve the purpose of waiting for unit_2? I am not complaining about using umgr.wait_units(unit_2.uid), I just want to know what I lose (in terms of time spent, maybe).

marksantcroos · 2015-10-22T07:50:39Z

so wait_units() is not designed to be a Barrier function (as in MPI), we just happen to use it in such a mode. (?)

Let's not try to bring in analogies that don't apply :-)

Here is my attempt at a definition: wait_units(units, state) waits for all units under control of the UM (or a user-specified subset of them) to reach the specified state (or the default set of final states).
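The definition above can be sketched as a small simulation. Everything here is illustrative: `ToyUnit`, the free function `wait_units`, the `max_ticks` guard, and the lifecycles are assumptions made for the example, and the final-state names are assumed to be DONE, FAILED and CANCELED. The point is only the shape of the call: an optional subset of units and an optional target state, with sensible defaults for both.

```python
FINAL_STATES = ("DONE", "FAILED", "CANCELED")   # assumed default target states


class ToyUnit:
    """Hypothetical unit that walks through a fixed list of states."""

    def __init__(self, uid, lifecycle):
        self.uid = uid
        self._lifecycle = list(lifecycle)
        self._pos = 0

    @property
    def state(self):
        return self._lifecycle[self._pos]

    def advance(self):
        """Move one step along the lifecycle; the last state is sticky."""
        if self._pos < len(self._lifecycle) - 1:
            self._pos += 1


def wait_units(all_units, uids=None, state=None, max_ticks=100):
    """Advance the simulation until every selected unit is in a target state.

    uids=None selects all units; state=None means the default final states.
    Returns the states of the selected units, in selection order.
    """
    targets = FINAL_STATES if state is None else (state,)
    selected = [u for u in all_units if uids is None or u.uid in uids]
    for _ in range(max_ticks):
        if all(u.state in targets for u in selected):
            return [u.state for u in selected]
        for u in all_units:               # one simulated time step
            u.advance()
    raise TimeoutError("selected units never reached %s" % (targets,))


units = [
    ToyUnit("u1", ["NEW", "EXECUTING", "DONE"]),
    ToyUnit("u2", ["NEW", "SCHEDULING", "EXECUTING", "FAILED"]),
]

first = wait_units(units, uids=["u1"], state="EXECUTING")   # subset, explicit state
final = wait_units(units)                                   # all units, final states
print(first, final)
```

Note that the second call reports on `u1` as well, even though `u1` had already been waited on: the selection defaults to every unit the manager knows about.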

vivek-bala · 2015-10-22T14:24:19Z

wait_units(units, state) waits for all units under control of the UM (or a user-specified sub-set of them) to reach the specified state (or the default set of final states).

I understand that.

But isn't waiting for a unit which is already "Done" ~0 work?

umgr = UnitManager()
unit_1 = umgr.submit('sleep 1')
sleep(5) # unit 1 is DONE now
umgr.wait_units(unit_1.uid)          # 0 work
unit_2 = umgr.submit('sleep 1')
umgr.wait_units([unit_1.uid, unit_2.uid]) # essentially similar to umgr.wait_units(unit_2.uid)

andre-merzky · 2015-10-22T14:48:21Z

umgr = UnitManager()
unit_1 = umgr.submit('sleep 10')
unit_2 = umgr.submit('sleep 1')
sleep(random(5,15))
umgr.wait_units()

If random picks 10 or more, then unit_1 is DONE, otherwise not. So it would depend on the application workflow whether wait_units returns one or two states. That is not deterministic, so not easy to handle... You would not even know which unit was done and which one would still be running...
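The timing dependence can be shown without a real unit manager. The sketch below is a toy model with assumed runtimes taken from the pseudo code above (10 s and 1 s); `wait_all` and `wait_skipping_done` are hypothetical functions that only compare the two possible semantics and are not part of any library.

```python
RUNTIMES = {"unit_1": 10, "unit_2": 1}   # seconds, as in 'sleep 10' and 'sleep 1'


def wait_all(app_sleep):
    """Semantics described in this thread: report on every submitted unit.

    app_sleep is ignored on purpose, the result does not depend on timing.
    """
    return {uid: "DONE" for uid in RUNTIMES}


def wait_skipping_done(app_sleep):
    """Hypothetical alternative: report only units still running when the call starts."""
    return {uid: "DONE" for uid, runtime in RUNTIMES.items() if runtime > app_sleep}


for app_sleep in (5, 12):
    print(app_sleep, len(wait_all(app_sleep)), len(wait_skipping_done(app_sleep)))
```

With the alternative semantics, the size of the result changes with how long the application happened to sleep before calling, and it is empty once everything has finished. Reporting on all managed units gives the caller the same answer shape every time.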

vivek-bala · 2015-10-22T14:56:48Z

But I am not interested in what it returns, I simply want to wait until all CUs (say, of the current iteration) are "Done". If I use wait_units(), it also returns the CUs of the previous iterations (since they are "Done"), agreed. But it achieves what I wanted.

umgr = UnitManager()
unit_1 = umgr.submit('sleep 10')
sleep(random(5,15))
umgr.wait_units() # not interested in which or how many CUs it returns, as long as it waits for ALL CUs to reach Done

print('check')

If all I require is that "check" is printed after unit_1 is "Done", isn't the above script doing exactly that?

andre-merzky · 2015-10-22T15:11:55Z

That is ok that you are not interested in what it returns, but returning the states is what the call does :P

andre-merzky mentioned this issue Oct 22, 2015

CUs waiting count incorrect radical-cybertools/ExTASY#211

Closed

andre-merzky mentioned this issue Oct 22, 2015

fix unit reporting/waiting #119

Merged

vivek-bala closed this as completed Dec 11, 2015

vivek-bala added the type:bug label Apr 15, 2018
