
Parametrised Jobs #4176

Closed
EdwinTh opened this issue Jan 18, 2019 · 15 comments
Labels
enhancement · jobs · stale (issues which have been closed automatically due to inactivity)

Comments

@EdwinTh commented Jan 18, 2019

Hi,

Many thanks for the awesome Jobs functionality; it is a fantastic productivity boost! I think it could be further improved if you could parametrise a Job. I currently use the Jobs functionality to train a number of models, each with a long training time. The models are exactly the same; the only difference is that they are trained on different subsets of the data. I added a subgroup column to the data that is read in by the Jobs script and a parameter at the top of the script, so the first lines of the script are:

```r
group_nr <- 1

library(dplyr)
full_set  <- readr::read_csv("path_to_data")
train_set <- full_set %>% filter(subgroup == group_nr)
```

I now manually adjust group_nr, save the script, and start a new Job, several times over. This is a bit tedious and error prone. Another drawback is that all the Jobs have the same name, so it takes some effort to figure out which one is which (which is problematic if some of them fail).

It would be great if there were an option to kick off the same Job several times at once, with the user specifying the parameter values in an interface separate from the script itself. Each Job would then run with one of the parameter values.

Good luck with all the great work you are doing,
Edwin

@jmcphers (Member) commented Jan 18, 2019

You can do this today! There's no UI affordance for it, but it's scriptable. There's a new API: you can make a master .R file which sets the group number in the global environment and then runs the script with a snapshot of that environment.

```r
group_nr <- 1
rstudioapi::jobRunScript("process.R", importEnv = TRUE)
group_nr <- 2
rstudioapi::jobRunScript("process.R", importEnv = TRUE)
group_nr <- 3
...
```

It'd be nicer if you could do this in the UI, and it'd also be nice if these jobs had distinct names, so we'll leave this issue open to track that request.
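The repeated-assignment pattern above can also be written as a loop. This is a sketch; it assumes jobRunScript() snapshots the global environment at the moment it is called, so each job sees the group_nr value from its own iteration:

```r
# Launch one background job per subgroup; importEnv = TRUE copies the
# calling session's global environment (including group_nr) into the job.
for (group_nr in 1:3) {
  rstudioapi::jobRunScript("process.R", importEnv = TRUE)
}
```

Note this only runs inside RStudio, since jobRunScript() talks to the IDE.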

@EdwinTh (Author) commented Jan 19, 2019

Actually, scriptable is even better than the UI solution I proposed, in terms of reproducibility. It hadn't crossed my mind, but it's great that it's already here. Distinct names for the jobs would still be a clear improvement, maybe as an argument to rstudioapi::jobRunScript?
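(For readers finding this later: recent releases of rstudioapi do expose a `name` argument on jobRunScript(), which covers this. A sketch:)

```r
# Give each job a distinct, human-readable name in the Jobs pane so
# failures can be traced back to a subgroup.
for (group_nr in 1:3) {
  rstudioapi::jobRunScript(
    "process.R",
    name      = paste0("train subgroup ", group_nr),
    importEnv = TRUE
  )
}
```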

@jmcphers (Member) commented Jan 22, 2019

Yes, that's a good idea.

@EdwinTh (Author) commented Jan 24, 2019

I'm writing a blog post about using the Jobs functionality for parallelising large tasks, and hit a drawback with the solution you proposed: it imports the entire environment. If there are large objects in it, the RAM used by the different jobs balloons. Of course you can clear the global environment before starting the Jobs, but then you need to rerun everything you were doing beforehand after the Jobs have started; and if some Jobs fail, you need to clear again first, and so on. A solution in which only the parameter values are sent to the Jobs would therefore still be desirable, imo.

@jmcphers (Member) commented Jan 24, 2019

Yes, it'd be nicer to pass some parameters in! I'll reopen this issue to track that request, although I can't promise we'll take care of it in the 1.2 timeframe.

Have you looked at the callr package?

https://callr.r-lib.org/

The Jobs function in RStudio 1.2 is really meant for one-off script runs. If you need to batch run scripts with different parameters in a reproducible way, callr might be better suited to your needs.
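To make the comparison concrete, here is a sketch of the same workflow with callr. Only the parameter value is passed to each background session (unlike importEnv = TRUE, nothing else from the calling session is copied). The function body is adapted from the script in the first post, and "path_to_data" is its placeholder path:

```r
library(callr)

# Launch three background R sessions, one per subgroup, passing only
# the parameter value via args.
jobs <- lapply(1:3, function(g) {
  r_bg(
    function(group_nr) {
      full_set  <- readr::read_csv("path_to_data")
      train_set <- dplyr::filter(full_set, subgroup == group_nr)
      # ... train the model on train_set and return it ...
    },
    args = list(group_nr = g)
  )
})

# Block until each session finishes, then collect the return values.
results <- lapply(jobs, function(p) {
  p$wait()
  p$get_result()
})
```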

@EdwinTh (Author) commented Jan 25, 2019

Will look into callr, thanks!

@jmcphers (Member) commented Mar 11, 2019

We should think a little about what this would look like. Two possible ideas:

a) Free-form -- jobs could take a list of arbitrary keys/values, injected as objects into the job's environment. In this mode you'd be able to check individual items from the parent environment to forward as job parameters, and define additional parameters by adding their names and values.

b) Structured -- jobs could declare up front what parameters they expect, very much like parameterized R Markdown documents. In this mode you'd be presented with UI for defining parameter values when starting the job. This might be nice for publishing "plain" R scripts to RStudio Connect.
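For reference, the parameterized R Markdown mechanism that idea (b) alludes to works like this: parameters are declared in the document's YAML header and supplied at render time, then read inside the document via `params$...`. A sketch (assumes a hypothetical report.Rmd that declares a `group_nr` parameter in its YAML header):

```r
# Render the report for subgroup 2; inside report.Rmd the value is
# available as params$group_nr.
rmarkdown::render("report.Rmd", params = list(group_nr = 2))
```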

@EdwinTh (Author) commented Mar 12, 2019

From a reproducible-research perspective I think (a) is preferable. There will surely be other considerations; just my two cents.

@harryprince commented Apr 6, 2019

Same issue, +1.

In the big-data era, a classic scenario is running Impala or Spark via SQL with partition parameters such as the date or time zone, in async mode.

Currently I am using R Markdown to do something similar, and I hope this can be added to the SQL preview and the job launcher too.

I wish the syntax could support Hive-style parameters, like this:

```
--- !preview conn=sc
SELECT * FROM tbl WHERE date_time = ${date_time}
```

The date_time parameter could then be passed in from the local R environment.

Here is a Hue SQL Editor example: [screenshot omitted]

@hadley (Member) commented Apr 22, 2020

I think an easy win that wouldn't require too much additional thinking would be to add a data argument. This would be a list that was saved to a known location (using saveRDS()) and then automatically loaded into the global environment of the job (using readRDS() + attach() or similar).
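A user can approximate this mechanism by hand today. A sketch, where "params.rds" is an arbitrary file name chosen for illustration:

```r
# Parent session: save only the parameters, not the whole environment,
# then start the job without importEnv.
saveRDS(list(group_nr = 1), "params.rds")
rstudioapi::jobRunScript("process.R")

# At the top of process.R: load the parameters into the job's
# global environment.
list2env(readRDS("params.rds"), envir = globalenv())
```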

@kevinushey (Contributor) commented Apr 22, 2020

We could make data fairly flexible in this regard:

```r
jobRunScript(..., data = list(var1 = 1, var2 = 2))
jobRunScript(..., data = globalenv())
jobRunScript(..., data = "path/to/file.rds")
```

@hadley any thoughts on which form would be the most natural / useful for you?

@hadley (Member) commented Apr 22, 2020

@kevinushey I think exposing environments is generally suboptimal because it's too easy to accidentally include stuff you didn't mean, and it's not entirely clear what you should do with the parent environments. And exposing a single path is easy enough to do inside a list. So I think just a list is adequate.

@stale (bot) commented Feb 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, per https://github.com/rstudio/rstudio/wiki/Issue-Grooming. Thank you for your contributions.

@stale bot added the stale label Feb 5, 2021
@stale (bot) commented Feb 19, 2021

This issue has been automatically closed due to inactivity.

@stale bot closed this as completed Feb 19, 2021
@jmcphers (Member) commented Mar 16, 2022


5 participants