Sensible defaults for submitJobs() #84

Open
larskotthoff opened this issue Mar 23, 2015 · 10 comments

@larskotthoff

submitJobs() has a maximum number of retries for submit errors and a function to determine the wait time between them. The problem is that "worker busy" counts as such an error. This means that if the first (n - |workers|) jobs take longer than the combined wait time, the function will exit with an error even though there is nothing actually wrong. The final jobs won't be submitted in this case.
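
To make the failure mode concrete, here is a small Python model of the retry budget. It is not BatchJobs code: the exponential back-off of `10 * 2^retries` seconds and the limit of 10 retries are assumptions chosen for illustration, and `total_wait`, `backoff` and `submission_succeeds` are hypothetical names. It computes how long a submission keeps retrying before it gives up, and therefore the longest job runtime the budget can absorb.

```python
def backoff(retries):
    # Assumed exponential back-off: 10, 20, 40, ... seconds.
    return 10 * 2 ** retries

def total_wait(max_retries, wait):
    """Sum of all waits before the retry budget is exhausted."""
    return sum(wait(r) for r in range(max_retries))

def submission_succeeds(job_runtime, max_retries=10, wait=backoff):
    """True if a busy worker frees up before the retries run out."""
    return job_runtime <= total_wait(max_retries, wait)

# Under these assumptions the budget is 10 * (2**10 - 1) = 10230 seconds.
budget = total_wait(10, backoff)
```

In this model a one-hour job frees its worker in time, while a four-hour job (14400 seconds) exhausts the budget and the remaining jobs are never submitted, even though nothing is actually wrong.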

It would be good to have the default be "wait until all jobs are submitted unless there are actual errors". At the moment, I have to go back if the jobs take a long time to complete and resubmit manually.

@berndbischl
Contributor

Well, first of all, all the "errors" we are talking about here are of the "soft / temporary" kind; in "hard" cases we always abort.

The main problem is that SSH is a bit special here: "worker busy" can happen very often.

OTOH what you want can easily be configured with "wait", "max.retries" and "job.delay".
Why not extend the config file so the user can specify this for their given site?

@larskotthoff
Author

This would be a good start, but wouldn't solve the problem that you don't know how long a job will take to complete and the timeout needs to be higher than that. I would expect the command to submit all jobs, just like other batch job management systems.

@berndbischl
Contributor

But we are not on a batch system? We try our best with SSH but we cannot work magic.

And you don't need any calculation? Set max.retries to Inf and wait to a function returning a constant of, e.g., 10 secs?
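
The polling behaviour suggested here can be sketched in a few lines of Python. This is a generic model, not the BatchJobs implementation: `WorkerBusy`, `submit_with_retries`, `try_submit` and the injectable `sleep` parameter are hypothetical names introduced for illustration.

```python
import itertools
import math
import time

class WorkerBusy(Exception):
    """Temporary condition: no worker can accept the job right now."""

def submit_with_retries(try_submit, max_retries=math.inf,
                        wait=lambda retries: 10, sleep=time.sleep):
    """Call try_submit() until it succeeds or the retry budget runs out."""
    for retries in itertools.count():
        try:
            return try_submit()
        except WorkerBusy:
            if retries >= max_retries:
                raise  # budget exhausted: give up with the temporary error
            sleep(wait(retries))  # constant 10 s by default, no back-off
```

With `max_retries=math.inf` the loop ends only on success or on an exception other than `WorkerBusy`, so no estimate of the job runtime is needed. The cost is a process that may keep polling indefinitely.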

@berndbischl
Contributor

I mean this is "basic polling" and we don't know "a priori" how long a worker is still going to work for a certain job?

@berndbischl
Contributor

I am not sure what you mean by "timeout" though? The max.retries?

@larskotthoff
Author

The combination of max.retries and the wait function. Ok, setting the former to Inf would solve the problem, so if that could be specified in the settings file it would be great.

@berndbischl
Contributor

That was my idea.
I also did that on SSH before.

(What I don't want by default in the package is an infinite process that spams computers.)

@larskotthoff
Author

Ok, fair enough. Although I guess you could find out if a worker is busy because it's running other jobs as opposed to a worker that's just busy because of external influences.

@berndbischl
Contributor

Yes, IIRC we have that already.

@larskotthoff
Author

Would it make sense to give those two cases different error codes?
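
One way this could look, as a rough Python sketch rather than anything BatchJobs provides: the `Status` codes, the `submit` function and its parameters are all hypothetical. The idea is that a worker busy with our own jobs only triggers a wait, while a worker busy for external reasons also consumes the retry budget.

```python
from enum import Enum

class Status(Enum):
    OK = 0
    BUSY_OWN_JOBS = 1   # worker is full of jobs we submitted ourselves
    BUSY_EXTERNAL = 2   # worker is loaded by something outside our control
    FATAL = 3           # "hard" error: always abort

def submit(try_submit, max_retries, sleep, wait=lambda retries: 10):
    """Retry until OK; only BUSY_EXTERNAL counts against max_retries."""
    retries = 0
    while True:
        status = try_submit()
        if status is Status.OK:
            return retries  # number of budgeted retries that were used
        if status is Status.FATAL:
            raise RuntimeError("hard submit error")
        if status is Status.BUSY_EXTERNAL:
            if retries >= max_retries:
                raise TimeoutError("retry budget exhausted")
            retries += 1
        sleep(wait(retries))  # both busy cases wait before polling again
```

Under this policy a long-running batch of our own jobs never causes submission to fail, yet a machine that stays overloaded by other users still leads to a bounded number of attempts.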
