
Debugging a future_lapply()-powered SLURM workflow #115

Closed
kendonB opened this issue Oct 26, 2017 · 48 comments


@kendonB
Contributor

kendonB commented Oct 26, 2017

I can see that you've been working on integrating the future package which is an exciting development.

I have a project with a stage that requires a lot of memory per CPU and another stage that requires a lot less. Ideally, I'd be able to get drake to schedule a bunch of slurm jobs for the first stage with a lot of memory per CPU, have drake/future wait for it to finish, then schedule a bunch more slurm jobs with the lower memory per CPU.

Slurm can also program dependencies natively which would be nice to have automated through drake.

Is this already possible?

I should also note that my HPC has a limit of 1000 array jobs, and I would expect other science organizations to have similar limits. Breaking the call up into a separate sbatch/srun call per target would work, I think.

@wlandau-lilly
Collaborator

@kendonB thank you for the interest! Integration of future-powered parallel computing is coming along well; I just need access to SLURM and other job schedulers so I can test the more exotic examples.

There is indeed functionality in drake to use different Makefiles for different sets of targets. Each call to make(..., targets = THIS_SUBSET, parallelism = "Makefile") (or just make(..., parallelism = "Makefile")) writes a one-time Makefile, which you can configure with the recipe_command and prepend arguments to make(). See the parallelism vignette for details. I also have a couple different ideas for your use case.

Idea 1: Makefile parallelism

The idea is to have multiple calls to drake::make(..., targets = TARGETS_IN_THIS_STAGE, parallelism = "Makefile", recipe_command = INVOKE_SLURM_FOR_THIS_STAGE). I am not actually invoking SLURM here, so the example runs locally.

library(drake)

simulate <- function(n){
  rnorm(n)
}

# workflow() is not yet defined in current CRAN release (4.3.0).
# It will replace drake::plan() due to a name conflict with future::plan().
# I will still keep drake::plan() exported and deprecated for a long time.
my_plan <- workflow(
  primer = simulate(20),
  data1 = primer + 1,
  data2 = primer + 2,
  result = mean(c(data1, data2))
)

my_plan
##   target               command
## 1 primer          simulate(20)
## 2  data1            primer + 1
## 3  data2            primer + 2
## 4 result mean(c(data1, data2))

Suppose the datasets and the primer can build with low memory and the result requires high memory. You can configure your Makefile recipes differently for different sets of targets. A one-time Makefile is generated for each call to drake::make(). These are mock builds, so I am not actually changing the memory. You would use recipe_command and maybe prepend to set the SLURM configuration differently for each make().

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  recipe_command = "echo 'low memory'; Rscript -e 'R_RECIPE'"
)
## check 1 item: rnorm
## import rnorm
## check 1 item: simulate
## import simulate
## echo 'low memory'; Rscript -e 'drake::mk(target = "primer", cache_path = "/home/wlandau/Desktop/.drake")'
## low memory
## target primer
## echo 'low memory'; Rscript -e 'drake::mk(target = "data1", cache_path = "/home/wlandau/Desktop/.drake")'
## echo 'low memory'; Rscript -e 'drake::mk(target = "data2", cache_path = "/home/wlandau/Desktop/.drake")'
## low memory
## low memory
## load 1 item: primer
## load 1 item: primer
## target data1
## target data2
make(
  plan = my_plan,
  targets = "result",
  parallelism = "Makefile",
  recipe_command = "echo 'high memory'; Rscript -e 'R_RECIPE'"
)
## check 3 items: c, mean, rnorm
## import c
## import mean
## import rnorm
## check 1 item: simulate
## import simulate
## echo 'high memory'; Rscript -e 'drake::mk(target = "result", cache_path = "/home/wlandau/Desktop/.drake")'
## high memory
## load 2 items: data1, data2
## target result

Idea 2: future.batchtools

This one will not work on CRAN drake until I release a post-4.3.0 version. The idea is to plug the previous workflow into the SLURM future.batchtools example for drake.

library(future.batchtools)
library(drake)
backend(batchtools_slurm(template = "batchtools.slurm.tmpl")) # The tmpl file comes with drake::example_drake("slurm")

simulate <- function(n){
  rnorm(n)
}

# workflow() is not yet defined in current CRAN release (4.3.0).
# It will replace drake::plan() due to a name conflict with future::plan().
# I will still keep drake::plan() exported and deprecated for a long time.
my_plan <- workflow(
  primer = simulate(20),
  data1 = primer + 1,
  data2 = primer + 2,
  result = mean(c(data1, data2))
)

my_plan

make(
  plan = my_plan,
  targets = c("data1", "data2"),
  parallelism = "future_lapply"
)

make(
  plan = my_plan,
  targets = "result",
  parallelism = "future_lapply"
)

@wlandau-lilly
Collaborator

Also, unless you are using parallelism = "future_lapply", you won't max out the number of jobs. With make(..., jobs = 4), at most 4 jobs deploy at a time. For "future_lapply", you could limit the number of jobs with a SLURM-specific environment variable, maybe something in ?future.options.
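For example, here is one way that might look. This is an untested assumption on my part: future.batchtools plan functions take a `workers` argument that defaults to Inf, so lowering it should cap the number of concurrent futures.

```r
library(future.batchtools)
library(drake)

# Sketch: cap concurrent SLURM jobs at 8 via the `workers` argument
# (future.batchtools defaults it to Inf).
backend(batchtools_slurm(template = "batchtools.slurm.tmpl", workers = 8))
make(plan = my_plan, parallelism = "future_lapply")
```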

@wlandau-lilly
Collaborator

Another thing: what sort of native dependencies would you like to leverage in SLURM? The ways that drake can talk to the job scheduler are:

  1. recipe_command
  2. prepend
  3. the *.tmpl file for future.batchtools and "future_lapply" parallelism

Does this meet your needs?

@kendonB
Contributor Author

kendonB commented Oct 26, 2017

Thanks for the detailed response. I think I should be able to figure this out now.

I'm not sure how the future_lapply parallelism works in the background but I was referring to, for example, the --dependency option for sbatch (see: http://geco.mines.edu/files/userguides/techReports/slurmchaining/slurm_errors.html).

drake would have to capture the jobids of the earlier jobs and plug them in.

The advantage to using sbatch like this would be that the jobs only briefly rely on the host R process. All the jobs would get scheduled and live on SLURM right away.
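For illustration, chaining stages with --dependency would require capturing job IDs, roughly like this (a hypothetical helper with made-up script names, not drake code):

```r
# Hypothetical sketch: submit a stage with sbatch, capture the job ID
# that sbatch prints, and make the next stage wait on it.
submit <- function(script, depends_on = NULL) {
  args <- c(
    if (!is.null(depends_on))
      paste0("--dependency=afterok:", paste(depends_on, collapse = ":")),
    script
  )
  out <- system2("sbatch", args, stdout = TRUE)
  sub("Submitted batch job ", "", out)  # sbatch prints "Submitted batch job <id>"
}

high_mem_id <- submit("stage1_high_mem.sh")             # assumed script name
submit("stage2_low_mem.sh", depends_on = high_mem_id)   # runs after stage 1 succeeds
```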

@wlandau-lilly
Collaborator

Yeah, it does sound like --dependency might lessen the overhead a bit. I will keep it in mind, but to be honest, it probably will not get implemented.

Please let me know how the rest of the setup goes. Since you said you should be able to figure it out now, I am closing this issue, but we can continue the thread if you like.

@kendonB
Contributor Author

kendonB commented Oct 28, 2017

@wlandau-lilly I'm trying to get this working now and both "ideas" above fail for me.

With the first one, I get the error Makefile:9: *** missing separator. Stop.:

library(drake)

simulate <- function(n){
  rnorm(n)
  print("simulating 3")
  Sys.sleep(20)
}

my_plan <- workflow(
  primer1 = simulate(20),
  primer2 = simulate(10),
  data1 = primer1 + 1,
  data2 = primer2 + 2,
  result = mean(c(data1, data2))
)

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "module load R"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
Makefile:9: *** missing separator.  Stop.

The second one runs great and seems to nicely create multiple jobs on SLURM. However, I can't find where the log files end up, so it's hard to see what actually happened. Do you know?

@wlandau-lilly, did you miss this one?

@kendonB
Contributor Author

kendonB commented Oct 28, 2017

I've noticed a deal-breaking drawback with using future_lapply: it seems to use the SLURM cluster to perform the simple tasks rather than letting the host R process do them.

Right now, I see:

check 67 items: as, c, filter, inner_join, left_join, mean, mutate, paste0, c...

which I presume is just a simple text-processing task. Meanwhile, squeue shows:

          65947247      high jobcc195  PENDING       0:00     1:00:00      1    1  2017-10-28T21:30:00
          65947248      high jobe592f  PENDING       0:00     1:00:00      1    1  2017-10-28T21:30:00
          65947249      high job33fa1  PENDING       0:00     1:00:00      1    1  2017-10-28T21:30:00
          65947241      high job0b2f8  PENDING       0:00     1:00:00      1    1  2017-10-28T21:15:00
          65947242      high job518b6  PENDING       0:00     1:00:00      1    1  2017-10-28T21:15:00
          65947243      high job287d9  PENDING       0:00     1:00:00      1    1  2017-10-28T21:15:00
          65947235      high joba87b5  PENDING       0:00     1:00:00      1    1  2017-10-28T20:30:42
          65947236      high job6edee  PENDING       0:00     1:00:00      1    1  2017-10-28T20:30:42
          65947237      high jobb411a  PENDING       0:00     1:00:00      1    1  2017-10-28T20:30:42
          65947238      high job6ee25  PENDING       0:00     1:00:00      1    1  2017-10-28T20:30:42
          65947239      high jobf9b55  PENDING       0:00     1:00:00      1    1  2017-10-28T20:30:42
          65947240      high jobad15f  PENDING       0:00     1:00:00      1    1  2017-10-28T20:30:42
          65947232      high jobbea8a  PENDING       0:00     1:00:00      1    1  2017-10-28T20:15:00
          65947233      high job795d3  PENDING       0:00     1:00:00      1    1  2017-10-28T20:15:00
          65947234      high job9c977  PENDING       0:00     1:00:00      1    1  2017-10-28T20:15:00
          65947229      high jobe4b78  PENDING       0:00     1:00:00      1    1  2017-10-28T19:15:00
          65947230      high jobdb978  PENDING       0:00     1:00:00      1    1  2017-10-28T19:15:00
          65947231      high jobe3cec  PENDING       0:00     1:00:00      1    1  2017-10-28T19:15:00
          65947226      high jobd3c52  PENDING       0:00     1:00:00      1    1  2017-10-28T18:19:44
          65947227      high job4644b  PENDING       0:00     1:00:00      1    1  2017-10-28T18:19:44
          65947228      high job27849  PENDING       0:00     1:00:00      1    1  2017-10-28T18:19:44
          65947265      high job72433  PENDING       0:00     1:00:00      1    1                  N/A
          65947266      high jobbe6d3  PENDING       0:00     1:00:00      1    1                  N/A
          65947267      high jobedcf3  PENDING       0:00     1:00:00      1    1                  N/A
          65947268      high job4bbdf  PENDING       0:00     1:00:00      1    1                  N/A
          65947269      high job915e6  PENDING       0:00     1:00:00      1    1                  N/A
          65947270      high jobb01ad  PENDING       0:00     1:00:00      1    1                  N/A
          65947271      high jobf3cdc  PENDING       0:00     1:00:00      1    1                  N/A
          65947272      high job1b749  PENDING       0:00     1:00:00      1    1                  N/A
          65947273      high jobcb2f3  PENDING       0:00     1:00:00      1    1                  N/A
          65947274      high jobc91ab  PENDING       0:00     1:00:00      1    1                  N/A
          65947275      high jobf7be7  PENDING       0:00     1:00:00      1    1                  N/A
          65947276      high jobaaf64  PENDING       0:00     1:00:00      1    1                  N/A
          65947277      high job0a254  PENDING       0:00     1:00:00      1    1                  N/A
          65947278      high jobc6dc9  PENDING       0:00     1:00:00      1    1                  N/A
          65947279      high job6df41  PENDING       0:00     1:00:00      1    1                  N/A
          65947280      high job9d78e  PENDING       0:00     1:00:00      1    1                  N/A
          65947281      high job23938  PENDING       0:00     1:00:00      1    1                  N/A
          65947282      high jobcb293  PENDING       0:00     1:00:00      1    1                  N/A
          65947283      high job50340  PENDING       0:00     1:00:00      1    1                  N/A

The scheduler isn't thrilled about allocating all those resources and thus the task takes far longer than it should.

@wlandau-lilly
Collaborator

Yes, for future-powered parallelism, drake is incorrectly submitting a job for every object, file, or function you import. This is superfluous because by the time it calls future_lapply(), everything should already be imported. All I need to do is filter out the imports beforehand. Easy. Please stay tuned.
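Conceptually, the fix is just the following (illustrative names, not drake's actual internals):

```r
# Illustrative only -- the shape of the fix, not drake's internals.
# Imports resolve on the host; only true targets become futures.
items   <- c("rnorm", "simulate", "primer", "data1", "data2")
targets <- intersect(items, my_plan$target)  # real targets from the plan
imports <- setdiff(items, targets)           # handled locally, no jobs
# lapply(imports, process_import_locally)        # hypothetical local pass
# future::future_lapply(targets, build_target)   # one job per target only
```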

@wlandau-lilly wlandau-lilly reopened this Oct 28, 2017
@wlandau-lilly
Collaborator

@kendonB I think I fixed it here. Would you be willing to try again with 041bb50?

@wlandau-lilly
Collaborator

By the way, it goes without saying that this is a super important thing for me to be aware of. Thank you for bringing it to my attention.

@wlandau-lilly
Collaborator

By the way, if you have future-powered SLURM parallelism up and running, would you be willing to share your configuration? I am a batchtools novice, and I currently do not have SLURM access.

@kendonB
Contributor Author

kendonB commented Oct 28, 2017

The fix seemed to work for the above problem. Great!

Tried it again and got a pretty unhelpful error message. Does it make any sense to you?

Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983).. The last few lines of the logged output:
44: try(execJob(job))
45: doJobCollection.JobCollection(obj, output = output)
46: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983/jobs/job2f26f50cddc38aa1f2fc2bef4606efe9.rds")
47: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983/jobs/job2f26f50cddc38aa1f2fc2bef4606efe9.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65947498/slurm_script: line 22: 25373 Illegal instruction     (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn,  :
  Some jobs disappeared from the system

Digging further, I found the associated log file:

### [bt 2017-10-28 18:04:13]: This is batchtools v0.9.6
### [bt 2017-10-28 18:04:13]: Starting calculation of 1 jobs
### [bt 2017-10-28 18:04:13]: Setting working directory to '/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate'
Loading required package: drake
Loading required package: methods
### [bt 2017-10-28 18:04:13]: Memory measurement disabled
### [bt 2017-10-28 18:04:16]: Starting job [batchtools job.id=1]

 *** caught illegal operation ***
address 0x2ae5ae328a68, cause 'illegal operand'

Traceback:
 1: dyn.load(file, DLLpath = DLLpath, ...)
 2: library.dynam(lib, package, package.lib)
 3: loadNamespace(name)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: tryCatchList(expr, classes, parentenv, handlers)
 7: tryCatch(loadNamespace(name), error = function(e) {    warning(gettextf("namespace %s is not available and has been replaced\nby .GlobalEnv when processing object %s",         sQuote(name)[1L], sQuote(where)), domain = NA, call. = FALSE,         immediate. = TRUE)  \
  .GlobalEnv})
 8: ..getNamespace(c("dplyr", "0.7.4"), "")
 9: readRDS(self$name_hash(hash))
10: self$driver$get_object(hash)
11: self$get_value(self$get_hash(key, namespace), use_cache)
12: cache$get("config", namespace = "distributed")
13: ...future.FUN(...future.x_jj, ...)
14: FUN(X[[i]], ...)
15: lapply(seq_along(...future.x_ii), FUN = function(jj) {    ...future.x_jj <- ...future.x_ii[[jj]]    ...future.FUN(...future.x_jj, ...)})
16: (function (...) {    lapply(seq_along(...future.x_ii), FUN = function(jj) {        ...future.x_jj <- ...future.x_ii[[jj]]        ...future.FUN(...future.x_jj, ...)    })})(cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")
17: do.call(function(...) {    lapply(seq_along(...future.x_ii), FUN = function(jj) {        ...future.x_jj <- ...future.x_ii[[jj]]        ...future.FUN(...future.x_jj, ...)    })}, args = future.call.arguments)
18: eval(quote({    do.call(function(...) {        lapply(seq_along(...future.x_ii), FUN = function(jj) {            ...future.x_jj <- ...future.x_ii[[jj]]            ...future.FUN(...future.x_jj, ...)        })    }, args = future.call.arguments)}), new.env())
19: eval(quote({    do.call(function(...) {        lapply(seq_along(...future.x_ii), FUN = function(jj) {            ...future.x_jj <- ...future.x_ii[[jj]]            ...future.FUN(...future.x_jj, ...)        })    }, args = future.call.arguments)}), new.env())
20: eval(expr, p)
21: eval(expr, p)
22: eval.parent(substitute(eval(quote(expr), envir)))
23: local({    do.call(function(...) {        lapply(seq_along(...future.x_ii), FUN = function(jj) {            ...future.x_jj <- ...future.x_ii[[jj]]            ...future.FUN(...future.x_jj, ...)        })    }, args = future.call.arguments)})
24: tryCatchList(expr, classes, parentenv, handlers)
25: tryCatch({    local({        do.call(function(...) {            lapply(seq_along(...future.x_ii), FUN = function(jj) {                ...future.x_jj <- ...future.x_ii[[jj]]                ...future.FUN(...future.x_jj, ...)            })        }, args = future.call.\
arguments)    })}, finally = {    {        {            NULL            future::plan(list(function (expr, envir = parent.frame(),                 substitute = TRUE, globals = TRUE, label = NULL,                 template = "batchtools_slurm.tmpl", resources = list(),    \
             workers = Inf, ...)             {                if (substitute)                   expr <- substitute(expr)                batchtools_by_template(expr, envir = envir, substitute = FALSE,                   globals = globals, label = label, template = templat\
e,                   type = "slurm", resources = resources, workers = workers,                   ...)            }), .cleanup = FALSE, .init = FALSE)        }        options(...future.oldOptions)    }})
26: eval(quote({    {        ...future.oldOptions <- options(future.startup.loadScript = FALSE,             future.globals.onMissing = "error")        {            {                NULL                local({                  for (pkg in "drake") {                    lo\
adNamespace(pkg)                    library(pkg, character.only = TRUE)                  }                })            }            future::plan("default", .cleanup = FALSE, .init = FALSE)        }    }    tryCatch({        local({            do.call(function(...) {   \
             lapply(seq_along(...future.x_ii), FUN = function(jj) {                  ...future.x_jj <- ...future.x_ii[[jj]]                  ...future.FUN(...future.x_jj, ...)                })            }, args = future.call.arguments)        })    }, finally = {     \
   {            {                NULL                future::plan(list(function (expr, envir = parent.frame(),                   substitute = TRUE, globals = TRUE, label = NULL,                   template = "batchtools_slurm.tmpl", resources = list(),                   \
workers = Inf, ...)                 {                  if (substitute) expr <- substitute(expr)                  batchtools_by_template(expr, envir = envir,                     substitute = FALSE, globals = globals, label = label,                     template = template\
, type = "slurm", resources = resources,                     workers = workers, ...)                }), .cleanup = FALSE, .init = FALSE)            }            options(...future.oldOptions)        }    })}), new.env())
27: eval(quote({    {        ...future.oldOptions <- options(future.startup.loadScript = FALSE,             future.globals.onMissing = "error")        {            {                NULL                local({                  for (pkg in "drake") {                    lo\
adNamespace(pkg)                    library(pkg, character.only = TRUE)                  }                })            }            future::plan("default", .cleanup = FALSE, .init = FALSE)        }    }    tryCatch({        local({            do.call(function(...) {   \
             lapply(seq_along(...future.x_ii), FUN = function(jj) {                  ...future.x_jj <- ...future.x_ii[[jj]]                  ...future.FUN(...future.x_jj, ...)                })            }, args = future.call.arguments)        })    }, finally = {     \
   {            {                NULL                future::plan(list(function (expr, envir = parent.frame(),                   substitute = TRUE, globals = TRUE, label = NULL,                   template = "batchtools_slurm.tmpl", resources = list(),                   \
workers = Inf, ...)                 {                  if (substitute) expr <- substitute(expr)                  batchtools_by_template(expr, envir = envir,                     substitute = FALSE, globals = globals, label = label,                     template = template\
, type = "slurm", resources = resources,                     workers = workers, ...)                }), .cleanup = FALSE, .init = FALSE)            }            options(...future.oldOptions)        }    })}), new.env())
28: eval(expr, p)
29: eval(expr, p)
30: eval.parent(substitute(eval(quote(expr), envir)))
31: local({    {        ...future.oldOptions <- options(future.startup.loadScript = FALSE,             future.globals.onMissing = "error")        {            {                NULL                local({                  for (pkg in "drake") {                    loadNam\
espace(pkg)                    library(pkg, character.only = TRUE)                  }                })            }            future::plan("default", .cleanup = FALSE, .init = FALSE)        }    }    tryCatch({        local({            do.call(function(...) {        \
        lapply(seq_along(...future.x_ii), FUN = function(jj) {                  ...future.x_jj <- ...future.x_ii[[jj]]                  ...future.FUN(...future.x_jj, ...)                })            }, args = future.call.arguments)        })    }, finally = {        { \
           {                NULL                future::plan(list(function (expr, envir = parent.frame(),                   substitute = TRUE, globals = TRUE, label = NULL,                   template = "batchtools_slurm.tmpl", resources = list(),                   worke\
rs = Inf, ...)                 {                  if (substitute)                     expr <- substitute(expr)                  batchtools_by_template(expr, envir = envir,                     substitute = FALSE, globals = globals, label = label,                     temp\
late = template, type = "slurm", resources = resources,                     workers = workers, ...)                }), .cleanup = FALSE, .init = FALSE)            }            options(...future.oldOptions)        }    })})
32: eval(expr, envir = envir)
33: eval(expr, envir = envir)
34: (function (expr, substitute = FALSE, envir = .GlobalEnv, ...) {    if (substitute)         expr <- substitute(expr)    eval(expr, envir = envir)})(local({    {        ...future.oldOptions <- options(future.startup.loadScript = FALSE,             future.globals.onMis\
sing = "error")        {            {                NULL                local({                  for (pkg in "drake") {                    loadNamespace(pkg)                    library(pkg, character.only = TRUE)                  }                })            }       \
     future::plan("default", .cleanup = FALSE, .init = FALSE)        }    }    tryCatch({        local({            do.call(function(...) {                lapply(seq_along(...future.x_ii), FUN = function(jj) {                  ...future.x_jj <- ...future.x_ii[[jj]]     \
             ...future.FUN(...future.x_jj, ...)                })            }, args = future.call.arguments)        })    }, finally = {        {            {                NULL                future::plan(list(function (expr, envir = parent.frame(),                  \
 substitute = TRUE, globals = TRUE, label = NULL,                   template = "batchtools_slurm.tmpl", resources = list(),                   workers = Inf, ...)                 {                  if (substitute)                     expr <- substitute(expr)             \
     batchtools_by_template(expr, envir = envir,                     substitute = FALSE, globals = globals, label = label,                     template = template, type = "slurm", resources = resources,                     workers = workers, ...)                }), .cle\
anup = FALSE, .init = FALSE)            }            options(...future.oldOptions)        }    })}), substitute = TRUE)
35: do.call(job$fun, job$pars, envir = .GlobalEnv)
36: with_preserve_seed({    set.seed(seed)    code})
37: with_seed(job$seed, do.call(job$fun, job$pars, envir = .GlobalEnv))
38: execJob.Job(job)
39: execJob(job)
40: doTryCatch(return(expr), name, parentenv, handler)
41: tryCatchOne(expr, names, parentenv, handlers[[1L]])
42: tryCatchList(expr, classes, parentenv, handlers)
43: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        \
LONG <- 75L        msg <- conditionMessage(e)        sm <- strsplit(msg, "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")  \
      if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && identical(getOption("show.error.messages"),         TRUE)) {    \
    cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
44: try(execJob(job))
45: doJobCollection.JobCollection(obj, output = output)
46: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf0c91b37dd1deae3f4b129cf8189c303.rds")
47: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf0c91b37dd1deae3f4b129cf8189c303.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65947499/slurm_script: line 22: 25563 Illegal instruction     (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf\
0c91b37dd1deae3f4b129cf8189c303.rds")'


@wlandau-lilly
Collaborator

Hmm.... good to know, but over my head until I learn batchtools in earnest. If the original problem was solved, I will close this issue. Would you reference and continue this in #113?

@kendonB
Contributor Author

kendonB commented Oct 28, 2017

Other than account name and wall time, I have the same config as in your .tmpl file. As in, yours works for me!

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 28, 2017

@kendonB would you check drake::session() from that last failure? (It returns the cached sessionInfo() of the make() attempt.) This trouble may have something to do with the package environment being different on the compute nodes than on the headnode. I have experienced similar issues with SGE, usually because module load R loads a version of R incompatible with the packages in my local library. I have to do module load R-3.4.2 or similar.
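One quick check, assuming the cache from the failed make() is still intact:

```r
# Compare the R session recorded during make() on the compute node
# with the current headnode session.
cached  <- drake::session()  # cached sessionInfo() from the last make()
current <- sessionInfo()
cached$R.version$version.string
current$R.version$version.string
```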

@wlandau-lilly wlandau-lilly changed the title Option to use different makefiles for different targets? Debugging a future_lapply()-powered SLURM workflow Oct 28, 2017
@wlandau-lilly
Collaborator

Reopening this issue with a different title. Right now, it's really about debugging a SLURM workflow.

@wlandau-lilly wlandau-lilly reopened this Oct 28, 2017
@wlandau-lilly wlandau-lilly added this to the CRAN release 4.4.0 milestone Oct 28, 2017
@kendonB
Contributor Author

kendonB commented Oct 28, 2017

The sessionInfo() outputs for the calling session and for drake are below. They appear to be the same. FWIW, I'm certain that the R environments on the head and compute nodes are identical and have access to the same files/packages when they're loaded.

drake::session()

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)

Matrix products: default
BLAS/LAPACK: /gpfs1m/apps/easybuild/RHEL6.3/sandybridge/software/imkl/2017.1.132-gimpi-2017a/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] prism_0.0.7             xtable_1.8-2            climateimpacts_0.1.0
 [4] dtplyr_0.0.2            data.table_1.10.4-2     stringr_1.2.0
 [7] plm_1.6-5               Formula_1.2-2           lfe_2.5-1998
[10] Matrix_1.2-10           feather_0.3.1           lubridate_1.6.0
[13] assertive_0.3-5         gistools_1.0            weatherdata_0.1.0
[16] raster_2.5-8            sp_1.2-5                bindrcpp_0.2
[19] future.batchtools_0.6.0 future_1.6.2            dplyr_0.7.4
[22] purrr_0.2.4             readr_1.1.1             tidyr_0.7.2
[25] tibble_1.3.4            ggplot2_2.2.1           tidyverse_1.1.1
[28] drake_4.3.1.9000

loaded via a namespace (and not attached):
 [1] minqa_1.2.4                assertive.base_0.0-7
 [3] colorspace_1.3-2           rprojroot_1.2
 [5] listenv_0.6.0              MatrixModels_0.4-1
 [7] assertive.sets_0.0-3       xml2_1.1.1
 [9] splines_3.4.0              assertive.data.uk_0.0-1
[11] codetools_0.2-15           R.methodsS3_1.7.1
[13] mnormt_1.5-5               knitr_1.17
[15] jsonlite_1.5               nloptr_1.0.4
[17] assertive.data.us_0.0-1    pbkrtest_0.4-7
[19] broom_0.4.2                R.oo_1.21.0
[21] compiler_3.4.0             httr_1.3.1
[23] backports_1.1.0            assertthat_0.2.0
[25] lazyeval_0.2.0             quantreg_5.33
[27] visNetwork_2.0.1           htmltools_0.3.6
[29] prettyunits_1.0.2          tools_3.4.0
[31] igraph_1.1.2               gtable_0.2.0
[33] glue_1.1.1                 reshape2_1.4.2
[35] batchtools_0.9.6           rappdirs_0.3.1
[37] Rcpp_0.12.13               cellranger_1.1.0
[39] nlme_3.1-131               assertive.files_0.0-2
[41] assertive.datetimes_0.0-2  assertive.models_0.0-1
[43] lmtest_0.9-35              psych_1.7.5
[45] globals_0.10.3             lme4_1.1-13
[47] testthat_1.0.2             rvest_0.3.2
[49] eply_0.1.0                 MASS_7.3-47
[51] zoo_1.8-0                  scales_0.4.1
[53] hms_0.3                    sandwich_2.4-0
[55] SparseM_1.77               assertive.matrices_0.0-1
[57] assertive.strings_0.0-3    geosphere_1.5-5
[59] bdsmatrix_1.3-2            stringi_1.1.5
[61] checkmate_1.8.5            storr_1.1.2
[63] rlang_0.1.2                pkgconfig_2.0.1
[65] evaluate_0.10.1            lattice_0.20-35
[67] assertive.data_0.0-1       bindr_0.1
[69] htmlwidgets_0.8            assertive.properties_0.0-4
[71] assertive.code_0.0-1       plyr_1.8.4
[73] magrittr_1.5               R6_2.2.2
[75] base64url_1.2              DBI_0.6-1
[77] mgcv_1.8-17                haven_1.1.0
[79] foreign_0.8-68             withr_2.0.0
[81] assertive.numbers_0.0-2    nnet_7.3-12
[83] car_2.1-4                  modelr_0.1.0
[85] crayon_1.3.4               assertive.types_0.0-3
[87] progress_1.1.2             grid_3.4.0
[89] readxl_1.0.0               forcats_0.2.0
[91] digest_0.6.12              brew_1.0-6
[93] R.utils_2.5.0              munsell_0.4.3
[95] assertive.reflection_0.0-4


sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)

Matrix products: default
BLAS/LAPACK: /gpfs1m/apps/easybuild/RHEL6.3/sandybridge/software/imkl/2017.1.132-gimpi-2017a/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] prism_0.0.7             xtable_1.8-2            climateimpacts_0.1.0
 [4] dtplyr_0.0.2            data.table_1.10.4-2     stringr_1.2.0
 [7] plm_1.6-5               Formula_1.2-2           lfe_2.5-1998
[10] Matrix_1.2-10           feather_0.3.1           lubridate_1.6.0
[13] assertive_0.3-5         gistools_1.0            weatherdata_0.1.0
[16] raster_2.5-8            sp_1.2-5                bindrcpp_0.2
[19] future.batchtools_0.6.0 future_1.6.2            dplyr_0.7.4
[22] purrr_0.2.4             readr_1.1.1             tidyr_0.7.2
[25] tibble_1.3.4            ggplot2_2.2.1           tidyverse_1.1.1
[28] drake_4.3.1.9000

loaded via a namespace (and not attached):
 [1] minqa_1.2.4                assertive.base_0.0-7
 [3] colorspace_1.3-2           rprojroot_1.2
 [5] listenv_0.6.0              MatrixModels_0.4-1
 [7] assertive.sets_0.0-3       xml2_1.1.1
 [9] splines_3.4.0              assertive.data.uk_0.0-1
[11] codetools_0.2-15           R.methodsS3_1.7.1
[13] mnormt_1.5-5               knitr_1.17
[15] jsonlite_1.5               nloptr_1.0.4
[17] assertive.data.us_0.0-1    pbkrtest_0.4-7
[19] broom_0.4.2                R.oo_1.21.0
[21] compiler_3.4.0             httr_1.3.1
[23] backports_1.1.0            assertthat_0.2.0
[25] lazyeval_0.2.0             quantreg_5.33
[27] visNetwork_2.0.1           htmltools_0.3.6
[29] prettyunits_1.0.2          tools_3.4.0
[31] igraph_1.1.2               gtable_0.2.0
[33] glue_1.1.1                 reshape2_1.4.2
[35] batchtools_0.9.6           rappdirs_0.3.1
[37] Rcpp_0.12.13               cellranger_1.1.0
[39] nlme_3.1-131               assertive.files_0.0-2
[41] assertive.datetimes_0.0-2  assertive.models_0.0-1
[43] lmtest_0.9-35              psych_1.7.5
[45] globals_0.10.3             lme4_1.1-13
[47] testthat_1.0.2             rvest_0.3.2
[49] eply_0.1.0                 MASS_7.3-47
[51] zoo_1.8-0                  scales_0.4.1
[53] hms_0.3                    sandwich_2.4-0
[55] SparseM_1.77               assertive.matrices_0.0-1
[57] assertive.strings_0.0-3    geosphere_1.5-5
[59] bdsmatrix_1.3-2            stringi_1.1.5
[61] checkmate_1.8.5            storr_1.1.2
[63] rlang_0.1.2                pkgconfig_2.0.1
[65] evaluate_0.10.1            lattice_0.20-35
[67] assertive.data_0.0-1       bindr_0.1
[69] htmlwidgets_0.8            assertive.properties_0.0-4
[71] assertive.code_0.0-1       plyr_1.8.4
[73] magrittr_1.5               R6_2.2.2
[75] base64url_1.2              DBI_0.6-1
[77] mgcv_1.8-17                haven_1.1.0
[79] foreign_0.8-68             withr_2.0.0
[81] assertive.numbers_0.0-2    nnet_7.3-12
[83] car_2.1-4                  modelr_0.1.0
[85] crayon_1.3.4               assertive.types_0.0-3
[87] progress_1.1.2             grid_3.4.0
[89] readxl_1.0.0               forcats_0.2.0
[91] digest_0.6.12              brew_1.0-6
[93] R.utils_2.5.0              munsell_0.4.3
[95] assertive.reflection_0.0-4

@wlandau-lilly
Collaborator

Yup, the sessionInfo()s are the same.

It looks like the root problem is a failed attempt to load dplyr, apparently triggered when drake loads the central configuration list in preparation to build the target. In the past, I have only encountered this error when there is somehow a mismatch between the local node and the compute node. In those cases, it mattered that I compiled dplyr on one node and ran it in another. But in your case, this should not matter. Baffling.

I can think of a couple things to try, but they probably won't be sufficient.

  • Submit a job that just loads dplyr and then exits. This may tell us if drake is really at fault.
  • Try make(..., envir = your_envir), where all your functions and other imported objects are defined in your_envir. You might also set packages equal to c("dplyr", ...). That way, at least the dyn.load() error will be triggered in a different place.
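For concreteness, the second suggestion might look something like this. This is only a sketch: your_envir, munge, and my_plan are hypothetical names standing in for your own environment, imports, and workflow plan.

```r
# Sketch only: define all imports in a dedicated environment
# so drake controls exactly what the workers see.
your_envir <- new.env(parent = globalenv())
your_envir$munge <- function(d) dplyr::mutate(d, x2 = x * 2)

# `packages` asks drake to load dplyr on each worker before building targets.
make(my_plan, envir = your_envir, packages = c("dplyr"))
```

If dplyr itself is broken on the compute node, this should at least move the dyn.load() failure to the package-loading step, which is easier to interpret.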

@wlandau-lilly
Collaborator

By the way, are you loading dplyr with a formal call to library(), using :: to reference functions everywhere instead, or using make(..., packages = c("dplyr", ...))? This probably won't matter either, but it could help.

@kendonB
Contributor Author

kendonB commented Oct 30, 2017

Fantastic that you got SLURM to work!

To start, the above file path is just the location where I was running the drake example, so that's exactly where it should be looking for stuff.

Unfortunately, I still see the same error. This was using your new *.tmpl file with the obvious edits and ran fresh in my home directory.

Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447).. The last few lines of the logged output:
24: try(loadRegistryDependencies(jc, must.work = TRUE), silent = TRUE)
25: doJobCollection.JobCollection(obj, output = output)
26: doJobCollection.character("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")
27: batchtools::doJobCollection("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65960911/slurm_script: line 15: 20301 Illegal instruction     (core dumped) Rscript -e 'batchtools::doJobCollection("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")'
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn,  :
  Some jobs disappeared from the system

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 30, 2017

How well do you know your way around batchtools itself? I am a complete novice, so I will need to do more digging before I can offer further suggestions. In the meantime, we should ask @mllg and @HenrikBengtsson for help.

What are your versions of batchtools and future.batchtools, by the way? Mine are 0.9.6 and 0.6.0, respectively.

@wlandau-lilly
Collaborator

Wait... I see from your session info that your versions agree with mine.

@kendonB
Contributor Author

kendonB commented Oct 30, 2017

Batchtools novice here, unfortunately. At some point, I'll try some minimal examples from that package.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 30, 2017

Hmm... come to think of it, future.batchtools seems much more accessible for both of us. It would be a great help if you would try out the following when you get a chance.

library(future.batchtools)
plan(batchtools_slurm, template = "batchtools.slurm.tmpl") # future::plan(), not drake::plan()
future_lapply(1:2, cat)

Sorry about the awkward back-and-forth again. I guess the trick for me is getting a SLURM installation just buggy enough to fail at the right times.

@kendonB
Contributor Author

kendonB commented Oct 30, 2017

Yep same error - will report in future.batchtools

@wlandau-lilly
Collaborator

Good, now we know it's not actually drake itself. Thanks!

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

As per HenrikBengtsson/future.batchtools#11, I solved it with another configuration flag that was missing (#SBATCH -C sb). The minimal drake with slurm example now works!! Sorry to waste your time, @wlandau-lilly. I really appreciate your effort and speedy responses.

@wlandau-lilly
Collaborator

Best news I have heard all week! (Not saying much for a Monday, but you get the idea.) Totally worth the time. I will close this issue, but I have a couple more questions.

  1. Would you explain what #SBATCH -C sb does? Does sb stand for sandybridge, the architecture you referenced here? I had a look at the --constraint flag in man sbatch, but I am not sure I understand it.
  2. Do you think anything needs to be added to the drake documentation?

@wlandau-lilly
Collaborator

The solution

For completeness: from @kendonB via HenrikBengtsson/future.batchtools#11:

OK, I solved it I believe. It was my fault - we have two architectures on our cluster and I had compiled my R packages on sandybridge but the job was getting sent to westmere. This was as simple as adding another configuration flag to the *.tmpl file. The minimal example now works!

@wlandau-lilly
Collaborator

@kendonB in case it makes you feel any better, I was planning to go to the trouble to install SLURM anyway so I could test the minimal example and fix some of the issues my colleagues from grad school were having. Speaking of whom: @jarad, @emittman, and @nachalca, drake's minimal SLURM example is ready for you to try with development drake.

@wlandau-lilly
Collaborator

(@nachalca, have you had a chance to check out the hpc resources at your new job?)

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

sb is sandybridge, yes. As far as I can tell, the constraint flag is for system-specific constraints, like "hey SLURM, put this one on sandybridge".
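For anyone hitting the same mismatch: the flag just needs to sit with the other #SBATCH directives in the batchtools template. A rough sketch is below; the brew-style placeholders follow batchtools' template conventions, and the resource values are illustrative, not prescriptive.

```shell
#!/bin/bash
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --mem-per-cpu=4096
#SBATCH -C sb   # constraint: only schedule on sandybridge nodes,
                # matching the architecture the R packages were compiled on
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
```

Without the constraint, a job compiled against sandybridge instructions can land on a westmere node and die with "Illegal instruction (core dumped)", exactly as in the log above.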

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

@wlandau-lilly, I find that when running my project using future_lapply, after the slurm jobs complete, the host R process's memory usage blows up (slowly) in htop. Even if this isn't real memory usage, it's still problematic because I'm running the host process on the shared build node. Is this the behavior you expect? Is the host process bringing all the data back to the host before writing to disk?

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

Hmm... I thought I had avoided that problem. I even prune the environment to make sure unnecessary targets are removed from memory at each parallelizable stage.

Do you have the same memory issues if you call make() on an up-to-date project? If not, we can narrow our search to run_future_lapply(). The future_lapply() worker calls build_distributed(), which calls build(). All that should only run on the cluster.

@wlandau-lilly
Collaborator

If you think #117 might work for you, you might compare the host memory usage there. Makefile parallelism is a totally different mechanism.

@wlandau-lilly
Collaborator

Just how slowly does host memory blow up? What is the progression?

@wlandau-lilly
Collaborator

In short, drake should not be writing targets back to host memory before storing them.

@HenrikBengtsson

If this is happening "slowly" and "after the slurm jobs complete", it may suggest that it occurs in the step where the values from all the jobs are gathered and brought back to the master R process by future_lapply(). I don't know what you/drake is returning in each future/job, but if the values are large, or small but very many, then this could happen. FYI, when I implemented these steps in future I did pay attention to memory efficiency - maybe there are more tweaks that can be done, especially if there is a huge number of futures being collected. @kendonB, you've mentioned "large number of jobs" elsewhere - what is "large" in your examples?
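If large return values turn out to be the culprit, one general way to keep that gather step light is to have each future persist its big result on the worker and hand back only a small token. A sketch under those assumptions - make_big_target() and the results/ directory are hypothetical stand-ins:

```r
library(future.batchtools)
plan(batchtools_slurm, template = "batchtools.slurm.tmpl")

paths <- future_lapply(seq_len(200), function(i) {
  x <- make_big_target(i)                           # hypothetical ~163 MB object
  p <- file.path("results", sprintf("target_%03d.rds", i))
  saveRDS(x, p)                                     # write to shared disk from the worker
  p                                                 # only the short path travels back to the master
})
```

The master then gathers 200 short strings instead of 200 large objects, so its memory footprint stays flat.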

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

I think I know what the problem is: build_distributed() returns the whole configuration list. I was unwisely using this to keep track of which targets were attempted. I will fix this today.
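In other words, the shape of the fix is for the worker-side function to stop returning the configuration list. A minimal sketch, not the actual drake source - recover_drake_config() is a hypothetical stand-in for however the worker reconstructs its config:

```r
build_distributed <- function(target, cache_path) {
  config <- recover_drake_config(cache_path)  # hypothetical helper: rebuild config on the worker
  build(target = target, config = config)     # stores the finished target in the cache
  invisible(target)                           # return only the target name,
                                              # not the whole configuration list
}
```

With this shape, future_lapply() on the master gathers a handful of character scalars rather than one full configuration list per target.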

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

I think 1e5daed fixes the memory issues, pending confirmation from @kendonB.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

Re: Do you have the same memory issues if you call make() on an up-to-date project?: I will try to remember to check this once the project is up to date.

Re: Just how slowly does host memory blow up? What is the progression? I first looked around the time the last job finished (about 20 minutes after the first job finished), and the host process was using 10GB in htop. I then watched for about another 20 minutes as it grew to an ultimate ~20GB.

Re: you've mentioned "large number of jobs" elsewhere - what is "large" in your examples? In this particular example I was building 200 targets which are each 163MB. Thankfully the implied total (200 × 163MB ≈ 32.6GB) is certainly higher than 20GB.

Re: 1e5daed fixes the memory issues, pending confirmation from @kendonB. I will try this today.
