Cannot run simple SGE example #48

Closed
jtpoirier opened this issue Jul 22, 2014 · 11 comments

@jtpoirier

I am trying to get a simple example working on my SGE cluster. Session info below. I'd greatly appreciate any ideas for figuring this out. BiocParallel is not available for R 3.1.1.

Maybe my SGE template is not correct? I suspect not, because the code seems to fail on a simple qstat call.

I am using the following configuration:
cluster.functions = makeClusterFunctionsSGE("/home/poirierj/R_libs/BatchJobs/etc/simple.tmpl", list.jobs.cmd = c("qstat", "-u poirierj"))
mail.start = "none"
mail.done = "none"
mail.error = "none"
db.driver = "SQLite"
db.options = list()
debug = TRUE
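
Since the failure shows up right at the qstat step, it can help to run the list.jobs.cmd outside BatchJobs and look at what comes back. The following is a minimal base-R sketch that assumes nothing about BatchJobs internals: `run_cmd` is a hypothetical helper (not part of any package) that only mimics the exit.code/output shape seen in the debug output, and `true` and `echo` stand in for qstat so the snippet runs on any Unix-like machine.

```r
# Hypothetical helper: run a shell command and return its exit code and
# output in a list, similar in shape to the debug output of BatchJobs.
run_cmd <- function(cmd, args = character(0)) {
  out <- suppressWarnings(system2(cmd, args, stdout = TRUE, stderr = TRUE))
  status <- attr(out, "status")  # only set by system2 on a non-zero exit
  list(
    exit.code = if (is.null(status)) 0L else as.integer(status),
    output = as.character(out)   # as.character() drops the status attribute
  )
}

# 'true' prints nothing and exits with 0, like a job listing with no jobs.
empty <- run_cmd("true")
str(empty)

# 'echo' stands in for a command that does print something.
nonempty <- run_cmd("echo", "hello")
str(nonempty)
```

On the cluster the equivalent call would be `run_cmd("qstat", c("-u", "poirierj"))`. An exit code of 0 with empty output is consistent with qstat simply having no jobs to list for that user, which would point at the handling of the result rather than at the command itself.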

library(BatchJobs)
Loading required package: BBmisc
Sourcing configuration file: '/home/poirierj/R_libs/BatchJobs/etc/BatchJobs_global_config.R'
BatchJobs configuration:
cluster functions: SGE
mail.from:
mail.to:
mail.start: none
mail.done: none
mail.error: none
default.resources:
debug: TRUE
raise.warnings: FALSE
staged.queries: FALSE
max.concurrent.jobs: Inf
fs.timeout: NA
library(BiocParallel)
param <- BatchJobsParam(2)
register(param)
x<-bplapply(1:10, identity)
OS cmd: qstat -u poirierj
OS result:
$exit.code
[1] 0

$output
character(0)

Error: $ operator is invalid for atomic vectors

traceback()
15: fun(getBatchJobsConf(), reg)
14: getBatchIds(reg, "Cannot find jobs on system")
13: dbFindOnSystem(reg, unlist(ids))
12: as.vector(y)
11: intersect(unlist(ids), dbFindOnSystem(reg, unlist(ids)))
10: (function (reg, ids, resources = list(), wait, max.retries = 10L,
chunks.as.arrayjobs = FALSE, job.delay = FALSE)
{
getDelays = function(cf, job.delay, n) {
if (is.logical(job.delay)) {
if (job.delay && n > 100L && cf$name %nin% c("Interactive",
"Multicore", "SSH")) {
return(runif(n, n * 0.1, n * 0.2))
}
return(delays = rep.int(0, n))
}
vnapply(seq_along(ids), job.delay, n = n)
}
checkRegistry(reg)
syncRegistry(reg)
if (missing(ids)) {
ids = dbFindSubmitted(reg, negate = TRUE)
if (length(ids) == 0L) {
info("All jobs submitted, nothing to do!")
return(invisible(integer(0L)))
}
}
else {
if (is.list(ids)) {
ids = lapply(ids, checkIds, reg = reg, check.present = FALSE)
dbCheckJobIds(reg, unlist(ids))
}
else if (is.numeric(ids)) {
ids = checkIds(reg, ids)
}
else {
stop("Parameter 'ids' must be a integer vector of job ids or a list of chunked job ids (list of integer vectors)!")
}
}
conf = getBatchJobsConf()
cf = getClusterFunctions(conf)
limit.concurrent.jobs = is.finite(conf$max.concurrent.jobs)
n = length(ids)
assertList(resources)
resources = resrc(resources)
if (missing(wait))
wait = function(retries) 10 * 2^retries
else assertFunction(wait, "retries")
if (is.logical(job.delay)) {
assertFlag(job.delay)
}
else {
checkFunction(job.delay, c("n", "i"))
}
if (is.finite(max.retries))
max.retries = asCount(max.retries)
assertFlag(chunks.as.arrayjobs)
if (chunks.as.arrayjobs && is.na(cf$getArrayEnvirName())) {
warningf("Cluster functions '%s' do not support array jobs, falling back on chunks",
cf$name)
chunks.as.arrayjobs = FALSE
}
if (!is.null(cf$listJobs)) {
ids.intersect = intersect(unlist(ids), dbFindOnSystem(reg,
unlist(ids)))
if (length(ids.intersect) > 0L) {
stopf("Some of the jobs you submitted are already present on the batch system! E.g. id=%i.",
ids.intersect[1L])
}
}
if (limit.concurrent.jobs && (cf$name %in% c("Interactive",
"Local", "Multicore", "SSH") || is.null(cf$listJobs))) {
warning("Option 'max.concurrent.jobs' is enabled, but your cluster functions implementation does not support the listing of system jobs.\n",
"Option disabled, sleeping 5 seconds for safety reasons.")
limit.concurrent.jobs = FALSE
Sys.sleep(5)
}
if (n > 5000L) {
warningf(collapse(c("You are about to submit '%i' jobs.",
"Consider chunking them to avoid heavy load on the scheduler.",
"Sleeping 5 seconds for safety reasons."), sep = "\n"),
n)
Sys.sleep(5)
}
saveConf(reg)
is.chunked = is.list(ids)
info("Submitting %i chunks / %i jobs.", n, if (is.chunked)
sum(viapply(ids, length))
else n)
info("Cluster functions: %s.", cf$name)
info("Auto-mailer settings: start=%s, done=%s, error=%s.",
conf$mail.start, conf$mail.done, conf$mail.error)
fs.timeout = conf$fs.timeout
staged = conf$staged.queries && !is.na(fs.timeout)
interrupted = FALSE
submit.msgs = buffer(type = "list", capacity = 1000L, value = dbSendMessages,
reg = reg, max.retries = 10000L, sleep = function(r) 5,
staged = staged, fs.timeout = fs.timeout)
logger = makeSimpleFileLogger(file.path(reg$file.dir, "submit.log"),
touch = FALSE, keep = 1L)
on.exit({
if (interrupted && exists("batch.result", inherits = FALSE)) {
submit.msgs$push(dbMakeMessageSubmitted(reg, id,
time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked) id1 else NULL,
resources.timestamp = resources.timestamp))
}
info("Sending %i submit messages...\nMight take some time, do not interrupt this!",
submit.msgs$pos())
submit.msgs$clear()
if (logger$getSize()) messagef("%i temporary submit errors logged to file '%s'.\nFirst message: %s",
logger$getSize(), logger$getLogfile(), logger$getMessages(1L))
})
info("Writing %i R scripts...", n)
resources.timestamp = saveResources(reg, resources)
rscripts = writeRscripts(reg, cf, ids, chunks.as.arrayjobs,
resources.timestamp, disable.mail = FALSE, delays = getDelays(cf,
job.delay, n))
waitForFiles(rscripts, timeout = fs.timeout)
dbSendMessage(reg, dbMakeMessageKilled(reg, unlist(ids),
type = "first"), staged = staged, fs.timeout = fs.timeout)
bar = makeProgressBar(max = n, label = "SubmitJobs")
bar$set()
tryCatch({
for (i in seq_along(ids)) {
id = ids[[i]]
id1 = id[1L]
retries = 0L
repeat {
if (limit.concurrent.jobs && length(cf$listJobs(conf,
reg)) >= conf$max.concurrent.jobs) {
batch.result = makeSubmitJobResult(status = 10L,
batch.job.id = NA_character_, "Max concurrent jobs exhausted")
}
else {
interrupted = TRUE
submit.time = now()
batch.result = cf$submitJob(conf = conf, reg = reg,
job.name = sprintf("%s-%i", reg$id, id1),
rscript = rscripts[i], log.file = getLogFilePath(reg,
id1), job.dir = getJobDirs(reg, id1), resources = resources,
arrayjobs = if (chunks.as.arrayjobs)
length(id)
else 1L)
}
if (batch.result$status == 0L) {
submit.msgs$push(dbMakeMessageSubmitted(reg,
id, time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked)
id1
else NULL, resources.timestamp = resources.timestamp))
interrupted = FALSE
bar$inc(1L)
break
}
interrupted = FALSE
if (batch.result$status > 0L && batch.result$status <=
100L) {
if (is.finite(max.retries) && retries > max.retries)
stopf("Retried already %i times to submit. Aborting.",
max.retries)
Sys.sleep(wait(retries))
logger$log(batch.result$msg)
retries = retries + 1L
}
else if (batch.result$status > 100L && batch.result$status <=
200L) {
stopf("Fatal error occured: %i. %s", batch.result$status,
batch.result$msg)
}
else {
stopf("Illegal status code %s returned from cluster functions!",
batch.result$status)
}
}
}
}, error = bar$error)
return(invisible(ids))
})(reg = list(id = "bpmapply", version = list(platform = "x86_64-unknown-linux-gnu",
arch = "x86_64", os = "linux-gnu", system = "x86_64, linux-gnu",
status = "", major = "3", minor = "0.2", year = "2013", month = "09",
day = "25", svn rev = "63987", language = "R", version.string = "R version 3.0.2 (2013-09-25)",
nickname = "Frisbee Sailing"), RNGkind = c("Mersenne-Twister",
"Inversion"), db.driver = "SQLite", db.options = list(), seed = 693613467L,
file.dir = "/home/poirierj//BiocParallel_tmp_55523b7bd6c2",
sharding = TRUE, work.dir = "/home/poirierj", src.dirs = character(0),
src.files = character(0), multiple.result.files = FALSE,
packages = list(BatchJobs = list(version = list(c(1L, 3L))))),
ids = list(c(2L, 4L, 5L, 7L, 9L), c(1L, 3L, 6L, 8L, 10L)))
9: do.call(submitJobs, pars)
8: withCallingHandlers(expr, message = function(c) invokeRestart("muffleMessage"))
7: suppressMessages(do.call(submitJobs, pars))
6: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
resume = resume, BPPARAM = BPPARAM)
5: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
resume = resume, BPPARAM = BPPARAM)
4: bplapply(X, FUN, ..., resume = resume, BPPARAM = x)
3: bplapply(X, FUN, ..., resume = resume, BPPARAM = x)
2: bplapply(1:10, identity)
1: bplapply(1:10, identity)
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiocParallel_0.4.1 BatchJobs_1.3 BBmisc_1.7

loaded via a namespace (and not attached):
[1] brew_1.0-6 checkmate_1.2 codetools_0.2-8 DBI_0.2-7
[5] digest_0.6.4 fail_1.2 foreach_1.4.2 iterators_1.0.7
[9] parallel_3.0.2 RSQLite_0.11.4 sendmailR_1.1-2 stringr_0.6.2
[13] tools_3.0.2
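
For readers unfamiliar with the error message above: R raises it whenever `$` is applied to an atomic vector instead of a list. The sketch below only reproduces that class of mistake in base R. It is not the BatchJobs source, and the variable names are made up for illustration.

```r
# A list shaped like the "OS result" printed above: `$` works on lists.
os_result <- list(exit.code = 0L, output = character(0))
ok <- os_result$output

# If code holds only the bare character vector but still applies `$`,
# R raises the error shown in the report.
bare_output <- character(0)
msg <- tryCatch(
  bare_output$output,
  error = function(e) conditionMessage(e)
)
print(msg)
```

The printed message should match the one in the report, which suggests that somewhere in the call chain a plain character vector is being treated as if it were the full result list (or the other way round).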

@mllg
Member

mllg commented Aug 12, 2014

  1. Does it work w/o BiocParallel wrapped around BatchJobs?
  2. Can we get the template file?
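
A minimal sketch of what question 1 could look like in practice, using only BatchJobs functions (`makeRegistry`, `batchMap`, `submitJobs`, `waitForJobs`, `loadResults`). The registry id, the wrapper function `run_plain_check` and the default `file.dir` are assumptions chosen for illustration. On a real cluster `file.dir` has to live on a filesystem shared with the compute nodes, which a temporary directory usually is not.

```r
# Hypothetical wrapper: run ten trivial jobs through BatchJobs directly,
# without BiocParallel. Returns NULL if BatchJobs is not installed, so the
# snippet can be sourced anywhere.
run_plain_check <- function(file.dir = tempfile("plain_check_")) {
  if (!requireNamespace("BatchJobs", quietly = TRUE)) {
    return(NULL)
  }
  library(BatchJobs)  # attaching also sources the BatchJobs configuration
  reg <- makeRegistry(id = "plain_check", file.dir = file.dir)
  batchMap(reg, identity, 1:10)  # one job per element of 1:10
  submitJobs(reg)                # the step that fails in the report above
  waitForJobs(reg)
  loadResults(reg)               # list with one result per job
}

res <- run_plain_check()
```

If this fails with the same error, BiocParallel can be ruled out as the cause.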

@aluc0

aluc0 commented Aug 14, 2014

Hey!
After updating to BatchJobs_1.3, I experienced the same problem (before that it worked perfectly), though I am not using BiocParallel.
I traced the error back to the function runOSCommandLinux() (called in listJobs() of makeClusterFunctionsSGE()). It seems to originate from the system3() call, which returns an empty character vector for the list.jobs.cmd:

system3(sys.cmd, sys.args, stdin = stdin, stdout = TRUE, stderr = TRUE, wait = TRUE, stop.on.exit.code = stop.on.exit.code)
$exit.code
[1] 0

$output
character(0)
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BatchExperiments_1.2 BatchJobs_1.3        BBmisc_1.7           vimcom_0.9-93        setwidth_1.0-3       colorout_0.9-9      

loaded via a namespace (and not attached):
 [1] brew_1.0-6      checkmate_1.2   DBI_0.2-7       digest_0.6.4    fail_1.2        plyr_1.8.1      Rcpp_0.11.2     RSQLite_0.11.4 
 [9] sendmailR_1.1-2 stringr_0.6.2   tools_3.0.2
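
To illustrate why an empty `$output` should be harmless, here is a hedged sketch of a job-id parser that treats `character(0)` as "no jobs on the system". `extract_job_ids` is a hypothetical function, not the BatchJobs implementation, and the sample qstat lines are made up to resemble typical SGE output.

```r
# Hypothetical parser: take a result list with an `output` element (one
# string per line of qstat output) and return the job ids as strings.
extract_job_ids <- function(res) {
  lines <- res$output
  if (length(lines) == 0L) {
    return(character(0))  # empty listing: no jobs, not an error
  }
  # Job rows start with a numeric id; the header and the dashed separator
  # line do not, so they are filtered out here.
  rows <- grep("^[[:space:]]*[0-9]+[[:space:]]", lines, value = TRUE)
  sub("^[[:space:]]*([0-9]+).*$", "\\1", rows)
}

# Made-up sample output in the usual qstat layout.
sample_output <- c(
  "job-ID  prior   name       user         state submit/start at     queue",
  "-----------------------------------------------------------------------",
  " 123456 0.55500 debug-1    poirierj     r     08/14/2014 09:00:00 all.q",
  " 123457 0.00000 debug-2    poirierj     qw    08/14/2014 09:00:05"
)

ids_some <- extract_job_ids(list(exit.code = 0L, output = sample_output))
ids_none <- extract_job_ids(list(exit.code = 0L, output = character(0)))
print(ids_some)
print(ids_none)
```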

So far, I've been using a pretty simple template file:

#!/bin/bash

<%
walltime <- convertInteger(resources$walltime)
memory = convertInteger(resources$memory)
-%>

#$ -N <%= job.name %>
#$ -j y
#$ -o <%= log.file %>
#$ -cwd
#$ -V
#$ -l h_rt=<%= walltime %>, h_vmem= <%= memory %>

R CMD BATCH --no-save --no-restore "<%= rscript %>" /dev/stdout
exit 0

It would be great if you could help me with that! And thanks for maintaining this pretty awesome package.

Best,
mariëlle

@mllg
Member

mllg commented Aug 14, 2014

Could you post the output of:

library(BatchJobs)
setConfig(debug = TRUE)
reg = makeRegistry("debug")
findRunning(reg)

@berndbischl
Contributor

  1. No need for further info, we found it

  2. Sorry for taking so long; I was moving house, then got sick.

  3. The bug is an idiotic mistake I introduced with a simple change in the last release.
    The real problem is not the bug itself, but that testing this is so hard for us.
    We work on SLURM and Torque, but cannot use SGE.

How much do you use the package at your facility? Would you be willing to test for us BEFORE a release whenever we change the interface for the backends?

@aluc0

aluc0 commented Aug 14, 2014

Thanks for the fast answer! Good to read it's just a tiny fix.
If you need a tester for an SGE environment, I'd be totally willing to help (as long as I have access to one, but no need to worry about that soon). I use BatchExperiments in particular a lot, as it helps me keep track of the algorithms and projects (which are rather data-heavy).

Thanks!

@berndbischl
Contributor

I pushed the mini fix.

Can you pls test and report?

Also, for later, would you send me your name and mail address, so that I can ask you again about an SGE test?

I am here:
http://www.statistik.tu-dortmund.de/bischl.html

It is also good to hear that people use BatchExperiments; we mainly get feedback for BatchJobs.

@aluc0

aluc0 commented Aug 14, 2014

Installed it, and it works a treat. Thanks for fixing it so quickly!
I am a bit further to the South: www.orn.mpg.de/vantoor

All the best,
mariëlle

@berndbischl
Contributor

Thx for the test.

I will try to get this on CRAN soon and can drop you a message when it is there. Or we keep this open until it has happened.

@berndbischl
Contributor

I see our last CRAN release was 5 weeks ago, so they should allow a new one without us having to ask specially.

I will check whether we have a "round" release now.

@jtpoirier
Author

Thanks, everyone, for solving this bug so fast that I missed the entire process! I will test once it is pushed out to CRAN.


@berndbischl
Contributor

Will release today. Closing.
