Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the timeout option is not working correctly in parallel computing #11658

Closed
koffie opened this issue Aug 7, 2011 · 15 comments
Closed

the timeout option is not working correctly in parallel computing #11658

koffie opened this issue Aug 7, 2011 · 15 comments

Comments

@koffie
Copy link

koffie commented Aug 7, 2011

With a timeout of 5 seconds I should not be able to perform a sleep of 40 seconds!

sage: from time import sleep
sage: f=parallel(ncpus=4,timeout=5,verbose=True)
sage: g=f(sleep)
sage: time list(g([5,10,20,40]))
[(((5,), {}), None), (((10,), {}), None), (((20,), {}), None), (((40,), {}), None)]
Time: CPU 0.02 s, Wall: 40.07 s

Component: performance

Keywords: sleep time-out

Author: Leif Leonhardy

Reviewer: Volker Braun

Merged: sage-4.7.2.alpha2

Issue created by migration from https://trac.sagemath.org/ticket/11658

@koffie koffie added this to the sage-4.7.2 milestone Aug 7, 2011
@koffie

This comment has been minimized.

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 7, 2011

Changed keywords from none to sleep time-out

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 7, 2011

comment:2

Is this really an issue?

I guess the timeout simply sets a SIGALRM, which the sleep() function(s) override, but I may be wrong.

@nexttime nexttime mannequin changed the title the timout option is not working correctly in parralel computing the timeout option is not working correctly in parallel computing Aug 7, 2011
@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 7, 2011

comment:3

P.S.:

If you use p_iter="fork" (the only variant that's said to support timeout), apparently indeed only ncpus-1 child processes are created, so it seems to be as I guessed.

Depending on the selection / order of arguments, you may well get timeouts for [perhaps only some of] the child processes (i.e. they'll get killed), though not after the time you'd expect.

The behaviour is non-deterministic though, for whatever reason. (Try running the time command repeatedly, i.e. within a loop.)

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 8, 2011

comment:4

Oh, this is a more funny one (and not due to what I first guessed).

Looking at the code, it does spawn ncpus processes, but there's a fat bug in the code (re)computing how long to wait the next time (the call to signal.alarm()):

                    if timeout:
                        def mysig(a,b):
                            raise RuntimeError, "SIGALRM"
                        oldest = min([X[1] for X in workers.values()])
                        signal.signal(signal.SIGALRM, mysig)
                        signal.alarm(int(walltime() - oldest)+1)

(X[1] is the wall time each child process was forked / started.)

This code is executed repeatedly; if wait() was interrupted by a RuntimeError triggered by SIGALRM, it is checked whether any of the child processes timed out, and if so, they get killed.

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 8, 2011

Sage library patch. Corrects time to wait for child processes (before they get killed) in the parallel fork decorator. Based on Sage 4.7.1.rc0.

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 8, 2011

comment:5

Attachment: trac_11658-fix_timeout_in_parallel_decorator.sagelib.patch.gz

A trivial patch is up.

@nexttime nexttime mannequin added the s: needs review label Aug 8, 2011
@vbraun
Copy link
Member

vbraun commented Aug 8, 2011

Author: Leif Leonhardy

@vbraun
Copy link
Member

vbraun commented Aug 8, 2011

comment:6

Yes its definitely better to have the timeout depend on the timeout variable :-)

@vbraun
Copy link
Member

vbraun commented Aug 8, 2011

Reviewer: Volker Braun

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 8, 2011

comment:7

Replying to @vbraun:

Yes its definitely better to have the timeout depend on the timeout variable :-)

:D Just wondering if we should add a doctest for that, but now it has already positive review...

I was going to add one similar to Maarten's example (with some parallel sleeping processes, the default of ncpus, and a timeout of about 5 seconds), marking it # long time.

@koffie
Copy link
Author

koffie commented Aug 10, 2011

comment:8

I already had a patch sorry guy's for making you do double effort but thanks for the quick fix :). I found this bug by reading the source code since I was trying to understand what they where doing. BTW I didn't respond earlier because the mailing system of the trac server was malfunctioning and didn't know others where working on it. The mailing server should work now. At least this message will test that for me :)

@koffie
Copy link
Author

koffie commented Aug 10, 2011

comment:9

Ok trac wasn't fixed yet :(. Another test

@nexttime
Copy link
Mannequin

nexttime mannequin commented Aug 10, 2011

comment:10

Replying to @koffie:

Ok trac wasn't fixed yet :(. Another test

For the record, I did get this one. [Another test :) ]

@jdemeyer
Copy link

Merged: sage-4.7.2.alpha2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants