Using drake with RStudio job launcher #807

Closed · 2 tasks done · gadenbuie opened this issue Mar 29, 2019 · 6 comments

@gadenbuie (Contributor) commented Mar 29, 2019


Description

I have a drake workflow with a long-running step and in my pre-drake life I would use the RStudio Job Launcher to run these kinds of tasks in the background so I can keep coding in my console.

I've discovered, though, that without additional work, running drake::make() as an RStudio background job invalidates targets that depend on functions sourced into the global environment. Running drake::make() or drake::r_make() from a standard environment after running make() inside the RStudio job launcher will again invalidate old but up-to-date targets, including those just built by the job launcher.

It took me quite a bit of poking around to pare the problem down to the reproducible example in this repo, but in doing so I think I've found that the invalidation results from some minor changes RStudio makes to the global environment in order to monitor script progress and update the job viewer.

A side effect of this is that all of the hashes in the cache log are the same between steps, so it's not immediately obvious why the targets are invalidating. Some digging with deps_profile() eventually provided clues that something was amiss. In the end, sourcing the function dependencies into a dedicated environment clears up the problem.
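For reference, that workaround can be sketched as follows. This is an illustration, not code from the repo: the path `R/functions.R` and the one-target plan are placeholders.

```r
# Sketch of the workaround: source the function definitions into a
# dedicated environment and hand that environment to make(), so drake's
# dependency detection does not depend on whatever environment the job
# runner happens to set up.
library(drake)

envir <- new.env(parent = globalenv())
source("R/functions.R", local = envir)  # placeholder path for this sketch

plan <- drake_plan(data = import_data(file_in("data/mtcars.csv")))
make(plan, envir = envir)
```

Because make() receives `envir` explicitly, the targets see the same dependency environment whether the script runs in the console or as a background job.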

I'm sharing this because I thought you might like to know, and to find out if I'm doing anything very wrong in my setup. I'm not sure there's anything you can or would want to do about this other than document it, but with RStudio 1.2 releasing soon (or eventually), I'd imagine other drake users will try what I did and be equally confused. Also, if the workflow and solution in my reprex repo seem reasonable, I'll wrap them up into a blog post.

@wlandau (Collaborator) commented Mar 29, 2019

Thanks for investigating. I do not have access to the RStudio Job Launcher, but I will try to follow along.

Could we try something even simpler? What if we used the following make.R job script?

library(drake)
library(readr)

import_data <- function(infile) {
  suppressMessages(read_csv(infile))
}

stopifnot(file.exists("data/mtcars.csv"))
plan <- drake_plan(data = import_data(file_in("data/mtcars.csv")))
prefix <- paste(
  Sys.info()["nodename"],
  proc.time()["elapsed"],
  stringi::stri_rand_strings(1, 10),
  sep = "-"
)

make(
  plan,
  cache_log_file = paste0(prefix, "-cache.log"),
  console_log_file = paste0(prefix, "-console.log") # much more useful post-#808
)

config <- drake_config(plan)
vis_drake_graph(config, file = paste0(prefix, ".png")) # requires webshot::install_phantomjs() first

What do the results look like if you run the following with a fresh session and cache?

callr::rscript("make.R", show = TRUE)
rstudioapi::jobRunScript("make.R")

And what if we replace jobRunScript() with a full remote job from the RStudio Job Launcher?

emitProgress and sourceWithProgress should theoretically not be the problem because they should not show up as dependencies of any of the targets or functions. It is good that you are looking at deps_profile() and deps_target(). I am surprised that hashes appear to mostly agree despite the fact that targets are getting invalidated.

I noticed that the dependency profile of data actually does appear to change. At one point in the README, you show:

deps_profile("data", config)
## # A tibble: 4 x 4
##   hash     changed old_hash         new_hash
##   <chr>    <lgl>   <chr>            <chr>
## 1 command  FALSE   40c2ded1562d6fda 40c2ded1562d6fda
## 2 depend   TRUE    ""               4f18907a711e6c41
## 3 file_in  FALSE   a0775797ef1a5066 a0775797ef1a5066
## 4 file_out FALSE   ""               ""

So it looks like data was first built with no global object/function dependencies, and now it suddenly has enough dependencies to compute the hash 4f18907a711e6c41. That tells me import_data() may not have been detected in the first make().

gadenbuie added a commit to gadenbuie/drake-rstudio-jobs-example that referenced this issue Mar 29, 2019
@gadenbuie (Contributor, Author) commented Mar 29, 2019

Thanks for the advice! I tried your suggestions and added them here, but maybe it's too minimal? I didn't see the same behavior I'm seeing in the larger example.

And what if we replace jobRunScript() with a full remote job from the RStudio Job Launcher?

I actually don't have access to this either, but until now I've considered the local job launcher in RStudio to be a great way to spawn background processes in new, clean sessions.

I noticed that the dependency profile of data actually does appear to change. ... So it looks like data was first built with no global object/function dependencies, and now it suddenly has enough dependencies to compute the hash 4f18907a711e6c41. That tells me import_data() may not have been detected in the first make().

I agree, and this is what led me to be suspicious of the environment. Just a shot in the dark, but I think the job launcher uses sourceWithProgress() to run the job's script, and digging around RStudio source code, the first thing it does is set up a new environment and all of the job script lines are executed in that environment.

sourceWithProgress <- function(script, # path to R script
   ...
   ){
   # create a new environment to host any values created; make its parent the global env so any
   # variables inside this function's environment aren't visible to the script
   sourceEnv <- new.env(parent = globalenv())
   ...

   # evaluate the statement
   eval(statements[[idx]], envir = sourceEnv)
   ...
}

Edit (sent too soon): So maybe this is why the import_data() dependency is missed by drake?

@wlandau (Collaborator) commented Mar 29, 2019

Thanks for the advice! I tried your suggestions and added them here, but maybe it's too minimal?

Could be. One thing you could try now is replacing the minimal plan with the larger one you started with and keep everything else the same.

So maybe this is why the import_data() dependency is missed by drake?

Could easily be. It is difficult to micromanage the environment in which you source() scripts. In particular, in nested calls to source(), the inner scripts do not automatically use the environment of the parent scripts: https://stackoverflow.com/questions/55008645/control-the-environment-of-nested-calls-to-source.
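That pitfall can be demonstrated with base R alone, no drake or RStudio required. A minimal sketch using temporary script files:

```r
# Minimal demonstration of the nested source() pitfall: the outer script
# is sourced into a custom environment, but the inner script it source()s
# uses source()'s default local = FALSE, so its definitions land in the
# global environment instead.
outer <- tempfile(fileext = ".R")
inner <- tempfile(fileext = ".R")
writeLines(paste0("source(", deparse(inner), ")"), outer)  # outer sources inner
writeLines('f <- function() "hello"', inner)               # inner defines f

job_env <- new.env(parent = globalenv())  # like the job launcher's sourceEnv
source(outer, local = job_env)

exists("f", envir = job_env, inherits = FALSE)    # FALSE: f skipped job_env
exists("f", envir = globalenv(), inherits = FALSE) # TRUE: f went global
```

So a job script that source()s a helper file does not keep the helper's functions in the job's environment, which matches the missing-dependency symptom above.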

In your original example, you had a top-level _drake_env_rstudio-job.R that sourced _drake.R. So while _drake_env_rstudio-job.R used the clean new environment from the job launcher, _drake.R populated the global environment with all the functions. By default, make() uses the calling environment, which in this case should be the non-global environment in which _drake_env_rstudio-job.R is source()'ed. If I am right, the minimal make.R script will behave as desired with the full plan, and the _drake_env_rstudio-job.R + _drake.R setup from before will fail even with the minimal plan.

@wlandau (Collaborator) commented Mar 29, 2019
gadenbuie added a commit to gadenbuie/drake-rstudio-jobs-example that referenced this issue Mar 29, 2019
gadenbuie added a commit to gadenbuie/drake-rstudio-jobs-example that referenced this issue Mar 29, 2019
@gadenbuie (Contributor, Author) commented Mar 29, 2019

I think you're right @wlandau!

Moving the more complete plan into a single make.R file works fine, as demonstrated here. (FYI I moved the graph so that it shows out of date targets prior to executing a step.)

I then moved one function into a separate file and sourced it, and this invalidated the target that required that function (see here). Interestingly, downstream targets of the invalidated target were also marked as outdated, but they were not re-run once the invalidated target was re-made (which makes sense).

@wlandau (Collaborator) commented Mar 29, 2019

Thanks for the thorough and pedantic detective work, @gadenbuie. I think we know what is going on now. Also, your original solution of source()'ing into a custom envir makes total sense now. It's a way for us to take control of the environment and not leave it up to the RStudio IDE.

Interestingly, downstream targets of the invalidated target were also marked as outdated but not re-run once the invalidated target was re-made (which makes sense).

Yeah, drake ultimately uses hashes instead of timestamps to check whether targets are valid. Once data was rebuilt, its fingerprint turned out to be identical to last time, so the downstream targets were up to date all along.
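The hashes-not-timestamps idea can be illustrated with base R. This is not drake's actual hashing machinery (drake uses proper digests via its cache), just a sketch of the principle: rebuilding the same content at a later time produces the same fingerprint.

```r
# Illustration of content-based (hash-like) invalidation, not drake's
# real implementation: two builds at different times yield the same
# fingerprint, so a hash check sees "unchanged" where a timestamp check
# would see "newer".
fingerprint <- function(x) {
  # serialize() to a raw vector gives a reproducible byte representation
  paste(serialize(x, connection = NULL, version = 2), collapse = "")
}

build <- function() data.frame(x = 1:3, y = c("a", "b", "c"))

h1 <- fingerprint(build())
Sys.sleep(0.2)              # time passes; a timestamp check would now differ
h2 <- fingerprint(build())
identical(h1, h2)           # TRUE: same content, same fingerprint
```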
