Integration with crew #1044

Merged: 23 commits merged into main from the 753 branch on Apr 11, 2023

Conversation

wlandau
Member

@wlandau wlandau commented Apr 4, 2023

Prework

Related GitHub issues and pull requests

Summary

This PR integrates targets with crew. crew already has machinery for auto-scaling and heterogeneous workers, and through its launcher plugin interface, its ecosystem will eventually support distributed computing platforms from SLURM to AWS Batch. Once crew matures, it will become the main high-performance computing backend in targets.

@wlandau wlandau self-assigned this Apr 4, 2023
@wlandau
Member Author

wlandau commented Apr 4, 2023

With the enhancement in this PR, crew is easy to activate in targets: simply supply a crew controller or controller group to tar_option_set() in _targets.R. Then tar_make() automatically finds the controller and uses it. (tar_make() still works for non-parallel situations; simply omit the controller.)

# _targets.R file
library(targets)
tar_option_set(controller = crew::crew_controller_local(workers = 2))
do_work <- function() {
  Sys.sleep(3)
  rnorm(n = 1)
}
list(
  tar_target(w, do_work()),
  tar_target(x, do_work()),
  tar_target(y, do_work()), 
  tar_target(z, do_work())
)

Additionally, if you supply a controller group to tar_option_set(), then you can use tar_resources_crew(controller = "...") to select a specific controller from the group for one or more individual targets. This is how targets will support heterogeneous workers going forward.
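
To illustrate, below is a sketch of a hypothetical _targets.R file with a controller group. The controller names ("small" and "big"), the worker counts, and do_work() are placeholders for illustration, not part of this PR.

# _targets.R file (sketch)
library(targets)
library(crew)
tar_option_set(
  controller = crew_controller_group(
    crew_controller_local(name = "small", workers = 2),
    crew_controller_local(name = "big", workers = 1)
  )
)
do_work <- function() {
  Sys.sleep(3)
  rnorm(n = 1)
}
list(
  tar_target(x, do_work()), # no resources: controller choice is left to the group
  tar_target(
    y,
    do_work(),
    resources = tar_resources(
      crew = tar_resources_crew(controller = "big") # run y on the "big" controller
    )
  )
)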

@wlandau
Member Author

wlandau commented Apr 10, 2023

@shikokuchuo, I am trying to integrate crew into targets with this PR, and I am having trouble because the unit tests are hanging on the Ubuntu runners and those workflows are stalling and timing out at around 5.5 hours. I spent several hours troubleshooting, and this only happens in R CMD check on GitHub Actions. Tests run fast and pass locally with R CMD check, even using an Ubuntu Docker image that closely approximates the ubuntu-latest runner. Similarly, tests pass on GitHub Actions outside R CMD check.

I tried to isolate the problem using https://github.com/ropensci/targets/tree/753-debug. In a simple pipeline of 100 independent targets, execution quickly reaches a point where the mirai tasks are stuck in an unresolved state and completion slows to a crawl. (At this point the dispatcher is still running.) I have not been able to reproduce the problem using only crew or only mirai, although it may be possible.

Is there something about GitHub Actions + R CMD check that would prevent tasks from resolving after the first few? Have you seen anything like this before? How would you suggest I troubleshoot?

R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(targets)
> pipeline <- targets:::pipeline_init(
+   lapply(
+     seq_len(100),
+     function(index) {
+       tar_target_raw(
+         name = paste0("task_", index),
+         command = quote(stats::rnorm(1000))
+       )
+     }
+   )
+ )
> targets:::tar_runtime$set_fun("tar_make")
> controller <- crew::crew_controller_local()
> out <- targets:::crew_init(
+   pipeline = pipeline,
+   controller = controller,
+   reporter = "timestamp"
+ )
> out$run()
• +0000 UTC 2023-04-10 02:09 38.19 start target task_60+0000 UTC 2023-04-10 02:09 38.37 start target task_61+0000 UTC 2023-04-10 02:09 38.57 start target task_62+0000 UTC 2023-04-10 02:09 38.66 start target task_63+0000 UTC 2023-04-10 02:09 38.75 start target task_64+0000 UTC 2023-04-10 02:09 38.85 start target task_65+0000 UTC 2023-04-10 02:09 38.95 start target task_66+0000 UTC 2023-04-10 02:09 39.80 start target task_67+0000 UTC 2023-04-10 02:09 39.90 start target task_68+0000 UTC 2023-04-10 02:09 39.99 start target task_69+0000 UTC 2023-04-10 02:09 40.08 start target task_1+0000 UTC 2023-04-10 02:09 40.17 start target task_2
     name
1 task_60
                                                                                                                                     command
1 targets::target_run_worker(target = target, envir = envir, path_store = path_store, \n    fun = fun, options = options, envvars = envvars)
  result seconds      seed error traceback warnings
1   done   0.024 857953997  <NA>      <NA>     <NA>
                                  launcher worker
1 9aaadb8ba81c51e5122bb6ff503bee81f6909ed1      1
                                  instance
1 53db515f68c47be2cb0291d02bc0ca70716397b4+0000 UTC 2023-04-10 02:09 40.60 built target task_60 [0.009 seconds]
• +0000 UTC 2023-04-10 02:09 40.60 start target task_3+0000 UTC 2023-04-10 02:09 40.70 start target task_4+0000 UTC 2023-04-10 02:09 40.80 start target task_5+0000 UTC 2023-04-10 02:09 41.01 start target task_6
     name
1 task_61
                                                                                                                                     command
1 targets::target_run_worker(target = target, envir = envir, path_store = path_store, \n    fun = fun, options = options, envvars = envvars)
  result seconds      seed error traceback warnings
1   done   0.002 244508592  <NA>      <NA>     <NA>
                                  launcher worker
1 9aaadb8ba81c51e5122bb6ff503bee81f6909ed1      1
                                  instance
1 53db515f68c47be2cb0291d02bc0ca70716397b4+0000 UTC 2023-04-10 02:09 41.15 built target task_61 [0 seconds]
• +0000 UTC 2023-04-10 02:09 41.15 start target task_7+0000 UTC 2023-04-10 02:09 41.24 start target task_50+0000 UTC 2023-04-10 02:09 41.33 start target task_40+0000 UTC 2023-04-10 02:09 41.42 start target task_8+0000 UTC 2023-04-10 02:09 41.51 start target task_51+0000 UTC 2023-04-10 02:09 41.61 start target task_41+0000 UTC 2023-04-10 02:09 41.69 start target task_9+0000 UTC 2023-04-10 02:09 41.79 start target task_52+0000 UTC 2023-04-10 02:09 41.88 start target task_42+0000 UTC 2023-04-10 02:09 41.97 start target task_53+0000 UTC 2023-04-10 02:09 42.06 start target task_43+0000 UTC 2023-04-10 02:09 42.15 start target task_54+0000 UTC 2023-04-10 02:09 42.24 start target task_44+0000 UTC 2023-04-10 02:09 42.33 start target task_55+0000 UTC 2023-04-10 02:09 42.42 start target task_45+0000 UTC 2023-04-10 02:09 42.51 start target task_56+0000 UTC 2023-04-10 02:09 42.60 start target task_46+0000 UTC 2023-04-10 02:09 42.70 start target task_57+0000 UTC 2023-04-10 02:09 42.79 start target task_47+0000 UTC 2023-04-10 02:09 42.88 start target task_58+0000 UTC 2023-04-10 02:09 42.97 start target task_48+0000 UTC 2023-04-10 02:09 43.06 start target task_59+0000 UTC 2023-04-10 02:09 43.15 start target task_49+0000 UTC 2023-04-10 02:09 43.25 start target task_30+0000 UTC 2023-04-10 02:09 43.34 start target task_31+0000 UTC 2023-04-10 02:09 43.43 start target task_32+0000 UTC 2023-04-10 02:09 43.53 start target task_33+0000 UTC 2023-04-10 02:09 43.62 start target task_34+0000 UTC 2023-04-10 02:09 43.72 start target task_35+0000 UTC 2023-04-10 02:09 43.81 start target task_36+0000 UTC 2023-04-10 02:09 43.90 start target task_37+0000 UTC 2023-04-10 02:09 43.99 start target task_38+0000 UTC 2023-04-10 02:09 44.09 start target task_39+0000 UTC 2023-04-10 02:09 44.18 start target task_100+0000 UTC 2023-04-10 02:09 44.27 start target task_20+0000 UTC 2023-04-10 02:09 44.39 start target task_21+0000 UTC 2023-04-10 02:09 44.49 start target task_22+0000 UTC 2023-04-10 02:09 44.58 start target task_23+0000 UTC 2023-04-10 02:09 44.67 start target task_24+0000 UTC 2023-04-10 02:09 44.77 start target task_25+0000 UTC 2023-04-10 02:09 44.86 start target task_26+0000 UTC 2023-04-10 02:09 44.96 start target task_27+0000 UTC 2023-04-10 02:09 45.05 start target task_28+0000 UTC 2023-04-10 02:09 45.15 start target task_29+0000 UTC 2023-04-10 02:09 45.24 start target task_10+0000 UTC 2023-04-10 02:09 45.33 start target task_11+0000 UTC 2023-04-10 02:09 45.43 start target task_12+0000 UTC 2023-04-10 02:09 45.52 start target task_13+0000 UTC 2023-04-10 02:09 45.61 start target task_14+0000 UTC 2023-04-10 02:09 45.70 start target task_15+0000 UTC 2023-04-10 02:09 45.80 start target task_16+0000 UTC 2023-04-10 02:09 45.90 start target task_17+0000 UTC 2023-04-10 02:09 45.99 start target task_90+0000 UTC 2023-04-10 02:09 46.09 start target task_18+0000 UTC 2023-04-10 02:09 46.19 start target task_91+0000 UTC 2023-04-10 02:09 46.28 start target task_19+0000 UTC 2023-04-10 02:09 46.37 start target task_92+0000 UTC 2023-04-10 02:09 46.47 start target task_93+0000 UTC 2023-04-10 02:09 46.57 start target task_94+0000 UTC 2023-04-10 02:09 46.66 start target task_95+0000 UTC 2023-04-10 02:09 46.75 start target task_96+0000 UTC 2023-04-10 02:09 46.85 start target task_97+0000 UTC 2023-04-10 02:09 46.94 start target task_98+0000 UTC 2023-04-10 02:09 47.03 start target task_99+0000 UTC 2023-04-10 02:09 47.13 start target task_80+0000 UTC 2023-04-10 02:09 47.22 start target task_81+0000 UTC 2023-04-10 02:09 47.32 start target 
task_82+0000 UTC 2023-04-10 02:09 47.41 start target task_83+0000 UTC 2023-04-10 02:09 47.51 start target task_84+0000 UTC 2023-04-10 02:09 47.60 start target task_85+0000 UTC 2023-04-10 02:09 47.70 start target task_86+0000 UTC 2023-04-10 02:09 47.79 start target task_87+0000 UTC 2023-04-10 02:09 47.89 start target task_88+0000 UTC 2023-04-10 02:09 47.98 start target task_89+0000 UTC 2023-04-10 02:09 48.08 start target task_70+0000 UTC 2023-04-10 02:09 48.18 start target task_71+0000 UTC 2023-04-10 02:09 48.27 start target task_72+0000 UTC 2023-04-10 02:09 48.37 start target task_73+0000 UTC 2023-04-10 02:09 48.47 start target task_74+0000 UTC 2023-04-10 02:09 48.57 start target task_75+0000 UTC 2023-04-10 02:09 48.66 start target task_76+0000 UTC 2023-04-10 02:09 48.76 start target task_77+0000 UTC 2023-04-10 02:09 48.86 start target task_78+0000 UTC 2023-04-10 02:09 48.96 start target task_79
Error in Exception(...) : 
  reached elapsed time limit [cpu=300s, elapsed=300s]
$connections
[1] 1

$daemons
                                                               online instance
ws://10.1.0.128:44051/53db515f68c47be2cb0291d02bc0ca70716397b4      1        1
                                                               assigned
ws://10.1.0.128:44051/53db515f68c47be2cb0291d02bc0ca70716397b4        2
                                                               complete
ws://10.1.0.128:44051/53db515f68c47be2cb0291d02bc0ca70716397b4        2

[1] 9923
[1] "dispatcher is running"
[1] "queue"
[1] 98
[1] "results"
[1] 0
[1] "pending tasks"
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
[1] "finished tasks"+0000 UTC 2023-04-10 02:14 43.42 end pipeline [5.13 minutes]
> unlink("_targets", recursive = TRUE)
> targets:::tar_runtime$unset_fun()
> controller$terminate()
> 
> proc.time()
   user  system elapsed 
255.944  19.628 308.907 

@shikokuchuo
Contributor

shikokuchuo commented Apr 10, 2023

Is there something about GitHub Actions + R CMD check that would prevent tasks from resolving after the first few? Have you seen anything like this before? How would you suggest I troubleshoot?

This is likely new: I have not seen GitHub Actions hang on Ubuntu before with mirai. If this is a local controller, you should probably check whether the individual instances are alive or dead.

So many unresolved tasks seem to imply that something is causing the server to crash. But why this would happen after the first few succeed is strange... I will have to have a think.

@shikokuchuo
Contributor

It must be because GHA is using rcmdcheck rather than R CMD check. If you can replicate locally using the rcmdcheck package, then we know it is due to something that package does.

@wlandau
Member Author

wlandau commented Apr 10, 2023

Thanks for the ideas. At https://github.com/ropensci/targets/actions/runs/4657411118, I reproduced the same problem with raw R CMD check. From the output log at https://github.com/ropensci/targets/suites/12131227600/artifacts/639574667, it looks like almost none of the tasks are assigned (the "assigned" counter is only 3 out of 100). All tasks run correctly locally on my Ubuntu setup using rcmdcheck::rcmdcheck().

@shikokuchuo
Contributor

This is useful. There are 3 assigned and 2 complete. This means the computation is stuck at the server (or it has crashed or ended after 2 tasks - are you able to poll it independently using processx to check its status at this point?). This is exactly what you'd expect to see if the server had a maximum of 2 tasks.

As there is only one server, assigned is expected to increment by 1 each time, so nothing odd there.

@wlandau
Member Author

wlandau commented Apr 10, 2023

Yes, I polled the processx handle. When the pipeline reaches my manual timeout of 5 minutes, the server process is alive and sleeping.
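
For reference, a minimal sketch of the kind of check involved. The "sleep" child process below is only a stand-in for the crew worker process; obtaining the real handle from the controller is not shown here.

library(processx)
library(ps)
handle <- process$new("sleep", "60")    # stand-in for the crew worker process
handle$is_alive()                       # TRUE while the process still exists
ps_status(ps_handle(handle$get_pid()))  # e.g. "sleeping" when the process is idle on Linux
handle$kill()                           # clean up the stand-in process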

@shikokuchuo
Contributor

Yes, I polled the processx handle. When the pipeline reaches my manual timeout of 5 minutes, the server process is alive and sleeping.

Interesting, 'sleeping' means sleeping on a wait or poll rather than computing anything. Are you able to print the results of the 2 successful cases to check they are what we expect? The most likely hypothesis at the moment is that it somehow gets stuck sending back the evaluation result.
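
For example, something along these lines might work, assuming the _targets/ data store from the debug run above is still on disk (task_60 and task_61 were the two targets that completed):

library(targets)
tar_read(task_60)                                     # stored value of the first completed target
tar_read(task_61)                                     # stored value of the second completed target
tar_meta(fields = c("seconds", "error", "warnings"))  # metadata recorded for each target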

@shikokuchuo
Contributor

The other easy thing that can be done is to manually set the URL to say "abstract://mirai" - if this works then it could be some network issue specific to GH. But then you say it works outside of R CMD check, so I don't think it is this.

@wlandau
Member Author

wlandau commented Apr 10, 2023

Are you able to print the results of the 2 successful cases to check they are what we expect?

I will try. The last run completed no tasks within 5 minutes.

The other easy thing that can be done is to manually set the URL to say "abstract://mirai" - if this works then it could be some network issue specific to GH.

Maybe this will be possible, although crew gets the token from the worker websocket path and uses it in the metadata, so that might be tougher.

I am currently trying a spoofed version of the pipeline where targets just goes through the motions and sends only the bare minimum of code and data to mirai. Hopefully that will tell me whether the problem comes from running a target or from the way targets manages crew/mirai.

@wlandau
Member Author

wlandau commented Apr 10, 2023

The spoofed workflow at https://github.com/ropensci/targets/actions/runs/4658550150/jobs/8244363706 completed without the original problem. So it must be something to do with the work that happens inside a task rather than the way targets is invoking the interface of mirai via crew.

Meanwhile, I have been debugging on Windows, and I figured out that targets should not attempt to reset global options after each target. That finding was progress, but it does not solve the problem for Linux.

Are there sensitive environment variables in mirai that I should avoid setting or modifying?

@shikokuchuo
Contributor

shikokuchuo commented Apr 10, 2023

Are there sensitive environment variables in mirai that I should avoid setting or modifying?

Neither mirai nor nanonext relies on any environment variables/options.

The only thing mirai determines on load is Sys.info()[["sysname"]], and I don't think targets can modify this :)

@shikokuchuo
Contributor

shikokuchuo commented Apr 10, 2023

The spoofed workflow at https://github.com/ropensci/targets/actions/runs/4658550150/jobs/8244363706 completed without the original problem. So it must be something to do with the work that happens inside a task rather than the way targets is invoking the interface of mirai via crew.

Just in case, I thought I should mention that the .expr argument of mirai() now takes a language object directly, so if you had added workarounds for this before, they may not be needed any more.
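
For example, a sketch of the intended usage under that enhancement (the daemon setup and the command itself are placeholders for illustration):

library(mirai)
daemons(1)                              # one local daemon for the example
command <- quote(sum(stats::rnorm(10))) # a pre-built language object, like targets constructs internally
m <- mirai(.expr = command)             # pass the language object directly to .expr
call_mirai(m)                           # block until the task resolves
m$data                                  # the evaluation result
daemons(0)                              # reset daemons when done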

@wlandau
Member Author

wlandau commented Apr 10, 2023

Yes, thanks, I did see that. It will definitely be helpful to work directly with language objects. Sorry I have not responded to your original comment; I have been stuck on this bug for the last few days.

@shikokuchuo
Contributor

No need to explain - the integration with targets is much more important! I mentioned it just in case it was messing up the existing evaluation logic.

@wlandau
Member Author

wlandau commented Apr 11, 2023

Some progress: I was finally able to reproduce this using crew without targets: https://github.com/wlandau/crew/actions/runs/4662700123. The problem seems to be related to passing environments as task input and returning environments as output, as opposed to something targets does when it runs a task. I have not yet managed to create a mirai-only reproducible example, but I am still trying.

@wlandau
Member Author

wlandau commented Apr 11, 2023

After isolating the problem in shikokuchuo/mirai#53, I think it may be reasonable in the meantime to skip most of the crew tests on R CMD check. If/when shikokuchuo/mirai#53 is solved, maybe I will be able to unskip them.
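
For the record, a sketch of one possible skip helper. It assumes R CMD check can be detected via the _R_CHECK_PACKAGE_NAME_ environment variable; the helper name and the test are illustrative, not the exact code in this PR.

skip_on_r_cmd_check <- function() {
  testthat::skip_if(
    nzchar(Sys.getenv("_R_CHECK_PACKAGE_NAME_")),
    "crew test skipped during R CMD check"
  )
}

testthat::test_that("crew controller runs a local pipeline", {
  skip_on_r_cmd_check()
  # ... crew-based pipeline assertions go here ...
})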

@wlandau wlandau merged commit 13f5b31 into main Apr 11, 2023
@wlandau wlandau deleted the 753 branch April 11, 2023 20:21