
Initial work on a manual scheduler (#227) #259

Merged 48 commits into master on Feb 20, 2018
Conversation

@wlandau (Member) commented Feb 17, 2018:

This PR addresses #227. Here, make(..., parallelism = "future") uses a manual scheduler to process the targets (imports are still processed with staged parallelism via the default backend). Users may choose between caching = "master" and caching = "worker" to select whether caching happens on the master process or on each worker individually. (Workers may not have cache access, and non-storr_rds() caches may not be thread-safe.) The queue in queue.R is not really a priority queue, just a cheap imitation and placeholder. When dirmeier/datastructures#4 is fixed, we can move to a much better and more scalable one.

This scheduler is a new backend, and none of the other backends are affected. This is a good opportunity to iterate on the new scheduler and make it more efficient as we compare it to the other backends.
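
For reference, a minimal usage sketch of the new interface (my_plan is a placeholder for any drake plan, and multisession is just one possible future backend):

library(drake)
library(future)
plan(multisession)  # pick any future backend

# Targets go through the new manual scheduler; the master
# process does all the caching:
make(my_plan, parallelism = "future", caching = "master")

# Or let each worker cache its own results (requires a
# thread-safe cache the workers can reach, e.g. storr_rds()):
make(my_plan, parallelism = "future", caching = "worker")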

@codecov-io commented Feb 17, 2018:

Codecov Report

Merging #259 into master will not change coverage.
The diff coverage is 100%.


@@          Coverage Diff           @@
##           master   #259    +/-   ##
======================================
  Coverage     100%   100%            
======================================
  Files          64     66     +2     
  Lines        4682   4871   +189     
======================================
+ Hits         4682   4871   +189
Impacted Files Coverage Δ
R/future.R 100% <100%> (ø)
R/parallel_ui.R 100% <100%> (ø) ⬆️
R/make.R 100% <100%> (ø) ⬆️
R/config.R 100% <100%> (ø) ⬆️
R/workplan.R 100% <100%> (ø) ⬆️
R/dependencies.R 100% <100%> (ø) ⬆️
R/queue.R 100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 763eda6...825f146.

This commit documents the experimental "future"
backend and "caching" argument to `make()`.
These features need time to play out in real projects,
but they are ready for code review and hopefully alpha testing.
@wlandau (Member, Author) commented Feb 18, 2018:

FYI: I just wrapped in a solution to #169. It seems to work fine on an old project I keep around for testing purposes. Details are in the parallelism vignette. cc @kendonB

@kendonB (Contributor) commented Feb 18, 2018:

Amazing, and I look forward to testing this! Just glancing at the vignette:

  1. It looks like the evaluator list is not a column of my_plan. If the interface is to have this object just floating in the environment, I strongly recommend changing it to a column in my_plan. What if I sorted my_plan? Or added to the top of my_plan without adding evaluators? Ideally this would be an argument to the *_plan functions, so I could allocate resources as I build the different pieces of my plan. Maybe this is your intention once this is tested and working?

  2. Last I looked, the future::plan() function returns the previous plan if there was one; otherwise it returns the new one. The code in the vignette would then have the same plan twice in local and remote.

So good that this functionality is getting in. No more making in multiple stages!

Sent with thumbs only.

@wlandau (Member, Author) commented Feb 18, 2018:

Glad to hear your enthusiasm, Kendon. I can think of no better alpha/beta tester than you.

  1. In that place in the vignette, I meant to assign the evaluator list to my_plan$evaluator. It really is supposed to be a column in my_plan; I just made a mistake in the vignette. Will fix soon.
  2. You're right. Each call to future::plan() returns the previous plan, so below, remote ends up holding the default sequential plan and local holds multisession:

> library(future)
> remote <- future::plan(multisession)
> local <- future::plan(multicore)
> remote
sequential:
- args: function (expr, envir = parent.frame(), substitute = TRUE, lazy = FALSE, seed = NULL, globals = TRUE, local = TRUE, earlySignal = FALSE, label = NULL, ...)
- tweaked: FALSE
- call: plan("default", .init = FALSE)
> local
multisession:
- args: function (expr, envir = parent.frame(), substitute = TRUE, lazy = FALSE, seed = NULL, globals = TRUE, persistent = FALSE, workers = availableCores(), gc = FALSE, earlySignal = FALSE, label = NULL, ...)
- tweaked: FALSE
- call: future::plan(multisession)
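
For item 1, the intended interface looks roughly like this (a sketch: it assumes my_plan has exactly two targets, and the tweak() settings are illustrative):

library(future)
# One evaluator per target, stored as a list column of the plan.
# tweak() customizes a strategy without changing the active plan().
my_plan$evaluator <- list(
  tweak(multisession, workers = 2), # a heavy computation target
  sequential                        # a light report-knitting target
)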

@HenrikBengtsson, what is the safest way to collect a bunch of heterogeneous custom evaluators in advance?

@HenrikBengtsson commented:

A plain R list should be fine; "evaluators" are just plain functions with some extra attributes.
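
For instance, a plain list along these lines (a sketch; the element names and tweak() settings are illustrative):

library(future)
# Collect heterogeneous evaluators without calling plan(),
# which would switch the active strategy as a side effect.
evaluators <- list(
  local  = multicore,
  remote = tweak(multisession, workers = 4)
)
# Each element is just a function carrying extra attributes:
stopifnot(all(vapply(evaluators, is.function, logical(1))))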

@wlandau (Member, Author) commented Feb 18, 2018:

@HenrikBengtsson, that's great to hear. It means #169 should be possible. Is there a better way to generate that list of evaluators? As the code fragment above shows, it seems like I am either not using future::plan() correctly here, or I need something other than future::plan() altogether.

@HenrikBengtsson commented:

Without going into the details of what you're doing, I should share that there is a long-term(!) plan to support resource specifications when setting up futures. It's on my todo list to write some of these ideas up to get the discussion going and to help others like you plan ahead. I'm thinking of something like:

a <- future({ expr }, resources = c(ram = ">= 5 GB", mount = "/home/alice/data", local = TRUE))
b <- future({ expr }, resources = c(ram = ">= 3 GB"))

This would make it possible to control which worker / type of worker each future will be resolved on. It sounds related to what's in this issue (at least from a quick look). Implementing this will take a lot of time to get right, but it might help converge thoughts and ideas.

I need to find time to fully understand where you're going, but for you to go forward, the rule of thumb would be to avoid hacking into the internals of futures "too much". If there's a use case of yours that the Future API does not currently provide, it's not unlikely we can figure out a way to add it. Your and others' higher-level usage of the Future API helps form its identity.

@wlandau (Member, Author) commented Feb 18, 2018:

Thanks, Henrik. Now I understand future's current capabilities a little better. From what @kendonB has said, the main use case is the ability to specify different resources. But I think it would also be nice to simultaneously tell some futures to run locally and others to run on HPC compute nodes. For example, some drake targets run long computations, and others just knit little R Markdown reports that annotate result summaries.

@krlmlr (Collaborator) left a review:
Thanks. Just a few minor nits. I'm really excited to try this!

workers <- initialize_workers(config)
# While any targets are queued or running...
while (work_remains(queue = queue, workers = workers, config = config)){
for (id in seq_along(workers)){
@krlmlr (Collaborator):
for (id in get_idle_workers(workers)) { would reduce indent by one, but I'm not sure of the consequences.

@wlandau (Member, Author):
Do you mean it should be outside the while(work_remains(...)){ loop?

@krlmlr (Collaborator):
No, I meant collapsing the for and the if. Probably not worth it.

)
}
}
Sys.sleep(1e-9)
@krlmlr (Collaborator):

Does this actually sleep? Maybe it's cleaner to offload to a function that waits for a task to finish; some parallel backends may provide blocking methods that don't poll.

@wlandau (Member, Author):

In my preliminary debugging/testing, I found it helpful to set the interval to 0.1, and I did notice the sleeping. I am definitely in favor of a better alternative; I just do not know specifically what that would be.

@krlmlr (Collaborator):

Just to make sure we're on the same page: there is polling (which always works), and there are notifications/passive waiting (which need backend support). To avoid full CPU load when polling, it's advisable to sleep -- not too long, but also not too briefly.

Here, I'd suggest extracting a function that returns as soon as the scheduler needs to do some work. That allows passive waiting where available, and polling otherwise.
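
A sketch of the extracted function under discussion (wait_for_work() is hypothetical, not drake's actual internals; it polls with future::resolved(), and a backend with true blocking waits could supply its own implementation instead):

# Block until at least one worker's future is resolved.
wait_for_work <- function(futures, interval = 0.01, max_interval = 0.5) {
  if (!length(futures)) {
    return(invisible(integer(0)))
  }
  repeat {
    # resolved() checks completion without blocking.
    done <- vapply(futures, future::resolved, logical(1))
    if (any(done)) {
      return(invisible(which(done)))
    }
    # Back off gradually so polling does not peg the CPU,
    # but cap the interval to keep the scheduler responsive.
    Sys.sleep(interval)
    interval <- min(2 * interval, max_interval)
  }
}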

future::plan("next")
structure(
future::future(
expr = drake_future_task(
@krlmlr (Collaborator):
To avoid exporting drake_future_task, we'd need to sneak in the triple colon here somehow to avoid R CMD check failures. Not sure if it's worth the effort.

@wlandau (Member, Author):
Yeah, I'm not sure it's worth the effort either.
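
For reference, one common workaround avoids both the export and a literal triple colon by resolving the function at run time (a sketch):

# Look up the unexported function by name instead of writing
# drake:::drake_future_task, which R CMD check complains about.
drake_future_task <- utils::getFromNamespace("drake_future_task", "drake")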

R/future.R Outdated
}

all_concluded <- function(workers, config){
lapply(
@krlmlr (Collaborator):

This is one of the rare situations where I'd use a for loop, for the early exit. I haven't found a purrr verb that implements this.

@wlandau (Member, Author):
And as far as looping is concerned, the number of workers is likely to be small. See c753067.
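
The early-exit pattern krlmlr describes looks like this (a sketch; is_concluded() is a hypothetical predicate standing in for the real per-worker check):

# Return FALSE as soon as any worker is not concluded, rather
# than evaluating the predicate for every worker with lapply().
all_concluded <- function(workers, config) {
  for (worker in workers) {
    if (!is_concluded(worker, config)) {
      return(FALSE)
    }
  }
  TRUE
}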

if (is_idle(worker)){
NULL
} else {
# It's hard to make this line run in a small test workflow
@krlmlr (Collaborator):

Would this line be hit in a parallel workflow with Sleep(1) through Sleep(5) and two processors? Maybe also with Sleep(0.1) through Sleep(0.5)?

@wlandau (Member, Author):
I would hope so. I have had problems with Sys.sleep() in the past, but I think those issues are unrelated.

out
}

decrease_revdep_keys <- function(worker, config, queue){
@krlmlr (Collaborator):
Nomenclature: "rev" vs. "downstream"/"upstream"?

@wlandau (Member, Author):
I chose "revdeps" here because drake talks about "dependencies" a lot in its internals. There is far less "downstream"/"upstream" terminology, and maybe we should make all the terms agree everywhere.

# The real priority queue will be
# https://github.com/dirmeier/datastructures
# once the CRAN version has decrease-key
# (https://github.com/dirmeier/datastructures/issues/4).
@krlmlr (Collaborator):
We could add

Remotes:
    dirmeier/datastructures

to DESCRIPTION.

@wlandau (Member, Author):
Thinking about it. I haven't had a chance to play with decrease-key in datastructures, but I will soon.

self$pop(n = 1, what = what)
}
},
# This is all wrong and inefficient.
@krlmlr (Collaborator):
It's vectorized, so it's not too bad ;-) We'll have to implement efficient name lookups (or a mapping from target names to integers) for full efficiency. I wouldn't worry too much unless it becomes a bottleneck.

@wlandau (Member, Author):
Thanks for the encouragement. It was the best I could do for a cheap imitation priority queue given R's suboptimal scalability when it comes to recursion and looping. I will be thinking about this as we move to datastructures.
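
The name lookup krlmlr mentions could use a hashed environment for constant-time access (a sketch; "some_target" is illustrative):

# Build a hashed map from target names to queue positions once,
# instead of scanning the whole name vector on every lookup.
targets <- config$plan$target
index <- new.env(hash = TRUE, size = length(targets))
for (i in seq_along(targets)) {
  assign(targets[i], i, envir = index)
}
get("some_target", envir = index)  # O(1) lookup by name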

R/workplan.R Outdated

@@ -153,7 +153,7 @@ drake_plan_override <- function(target, field, config){
  if (is.null(in_plan)){
    return(config[[field]])
  } else {
-    return(in_plan[config$plan$target == target])
+    return(in_plan[[which(config$plan$target == target)]])
@krlmlr (Collaborator):
What happens if which() returns a vector of length != 1?

@wlandau (Member, Author):
Thanks for catching that. Fixed in 825f146.
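
A defensive sketch of the guarded lookup inside drake_plan_override() (not necessarily the exact fix in 825f146):

idx <- which(config$plan$target == target)
# which() can return length 0 (no match) or > 1 (duplicated
# targets); fail loudly instead of returning the wrong value.
if (length(idx) != 1) {
  stop("expected exactly one plan row for target ", target)
}
return(in_plan[[idx]])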

# suitable enough for unit testing, but
# I did artificially stall targets and verified that this line
# is reached in the future::multisession backend as expected.
next # nocov
@krlmlr (Collaborator):
Do we also need to Sys.sleep() here?

@wlandau (Member, Author):
I did not see a reason, but maybe I am missing something. What would it do there?

@krlmlr (Collaborator) commented Feb 20, 2018:
Avoid full CPU load?

@wlandau merged commit cfac8fd into master on Feb 20, 2018, and deleted the i227-attempt2 branch.
@krlmlr (Collaborator) commented Feb 20, 2018:

I'm now missing progress output with parallelism = "future_lapply" and future::plan(future.callr::callr).

@wlandau (Member, Author) commented Feb 20, 2018:

Do you mean metadata returned by drake::progress() or do you mean stdout/stderr logs?

@krlmlr (Collaborator) commented Feb 20, 2018:

The green target and the target name.

@HenrikBengtsson commented:

Quick comment on the Future API: adding support for querying the progress of a worker/future will have to fall under "optional" features, since we cannot demand that all future backends support it. Adding optional features to the Future API has to be done in a way that stays compatible with whatever backends a random person might invent in the future (pun/sic!) - that is, extending the API has to be done with great care, which is why it does not "just" exist (it's easy for some backends). I've started HenrikBengtsson/future#172 to discuss how to go forward and find minimal common denominators for these types of features.

@wlandau (Member, Author) commented Feb 20, 2018:

Kirill: With parallelism = "future" and caching = "master", you may have more luck. But with future_lapply(), all those messages get printed by the workers and lost in their futures. Related: HenrikBengtsson/future#67, HenrikBengtsson/future#141, HenrikBengtsson/future#171, and HenrikBengtsson/future#172.

Henrik: thanks for clarifying. It sounds like drake might have to do its own job monitoring at some point.

@krlmlr (Collaborator) commented Feb 21, 2018:

Thanks. I'd expect the scheduler (master) to print progress messages. Will file a new issue.
