Integration with crew #1044

Merged: 23 commits merged into main from the 753 branch on Apr 11, 2023

Conversation

wlandau
Member

@wlandau wlandau commented Apr 4, 2023

Prework

Related GitHub issues and pull requests

Summary

This PR integrates targets with crew. crew already has machinery for auto-scaling and heterogeneous workers, and through its launcher plugin interface, its ecosystem will eventually support distributed computing platforms from SLURM to AWS Batch. Once crew matures, it will become the main high-performance computing backend in targets.

@wlandau wlandau self-assigned this Apr 4, 2023
@wlandau
Member Author

wlandau commented Apr 4, 2023

With the enhancement in this PR, crew is easy to activate in targets: simply supply a crew controller or controller group to tar_option_set() in _targets.R. Then tar_make() automatically finds the controller and uses it. (tar_make() still works for non-parallel situations; simply omit the controller.)

# _targets.R file
library(targets)
tar_option_set(controller = crew::crew_controller_local(workers = 2))
do_work <- function() {
  Sys.sleep(3)
  rnorm(n = 1)
}
list(
  tar_target(w, do_work()),
  tar_target(x, do_work()),
  tar_target(y, do_work()), 
  tar_target(z, do_work())
)

Additionally, if you supply a controller group to tar_option_set(), then you can use tar_resources_crew(controller = "...") to select a specific controller from the group for one or more individual targets. This is how targets will support heterogeneous workers going forward.
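
To illustrate, below is a sketch of a hypothetical _targets.R file with a controller group. The controller names ("small" and "big"), the worker counts, and do_work() are placeholders for illustration, not part of this PR.

# _targets.R file (sketch)
library(targets)
library(crew)
tar_option_set(
  controller = crew_controller_group(
    crew_controller_local(name = "small", workers = 2),
    crew_controller_local(name = "big", workers = 1)
  )
)
do_work <- function() {
  Sys.sleep(3)
  rnorm(n = 1)
}
list(
  tar_target(x, do_work()), # no resources: controller choice is left to the group
  tar_target(
    y,
    do_work(),
    resources = tar_resources(
      crew = tar_resources_crew(controller = "big") # run y on the "big" controller
    )
  )
)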

@wlandau
Member Author

wlandau commented Apr 10, 2023

@shikokuchuo, I am trying to integrate crew into targets with this PR, and I am having trouble because the unit tests are hanging on the Ubuntu runners and those workflows are stalling and timing out at around 5.5 hours. I spent several hours troubleshooting, and this only happens in R CMD check on GitHub Actions. Tests run fast and pass locally with R CMD check, even using an Ubuntu Docker image that closely approximates the ubuntu-latest runner. Similarly, tests pass on GitHub Actions outside R CMD check.

I tried to isolate the problem using https://github.com/ropensci/targets/tree/753-debug. In a simple pipeline of 100 independent targets, execution quickly reaches a point where the mirai tasks are stuck in an unresolved state and completion slows to a crawl. (At this point the dispatcher is still running.) I have not been able to reproduce the problem using only crew or only mirai, although it may be possible.

Is there something about GitHub Actions + R CMD check that would prevent tasks from resolving after the first few? Have you seen anything like this before? How would you suggest I troubleshoot?

R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(targets)
> pipeline <- targets:::pipeline_init(
+   lapply(
+     seq_len(100),
+     function(index) {
+       tar_target_raw(
+         name = paste0("task_", index),
+         command = quote(stats::rnorm(1000))
+       )
+     }
+   )
+ )
> targets:::tar_runtime$set_fun("tar_make")
> controller <- crew::crew_controller_local()
> out <- targets:::crew_init(
+   pipeline = pipeline,
+   controller = controller,
+   reporter = "timestamp"
+ )
> out$run()
• +0000 UTC 2023-04-10 02:09 38.19 start target task_60+0000 UTC 2023-04-10 02:09 38.37 start target task_61+0000 UTC 2023-04-10 02:09 38.57 start target task_62+0000 UTC 2023-04-10 02:09 38.66 start target task_63+0000 UTC 2023-04-10 02:09 38.75 start target task_64+0000 UTC 2023-04-10 02:09 38.85 start target task_65+0000 UTC 2023-04-10 02:09 38.95 start target task_66+0000 UTC 2023-04-10 02:09 39.80 start target task_67+0000 UTC 2023-04-10 02:09 39.90 start target task_68+0000 UTC 2023-04-10 02:09 39.99 start target task_69+0000 UTC 2023-04-10 02:09 40.08 start target task_1+0000 UTC 2023-04-10 02:09 40.17 start target task_2
     name
1 task_60
                                                                                                                                     command
1 targets::target_run_worker(target = target, envir = envir, path_store = path_store, \n    fun = fun, options = options, envvars = envvars)
  result seconds      seed error traceback warnings
1   done   0.024 857953997  <NA>      <NA>     <NA>
                                  launcher worker
1 9aaadb8ba81c51e5122bb6ff503bee81f6909ed1      1
                                  instance
1 53db515f68c47be2cb0291d02bc0ca70716397b4+0000 UTC 2023-04-10 02:09 40.60 built target task_60 [0.009 seconds]
• +0000 UTC 2023-04-10 02:09 40.60 start target task_3+0000 UTC 2023-04-10 02:09 40.70 start target task_4+0000 UTC 2023-04-10 02:09 40.80 start target task_5+0000 UTC 2023-04-10 02:09 41.01 start target task_6
     name
1 task_61
                                                                                                                                     command
1 targets::target_run_worker(target = target, envir = envir, path_store = path_store, \n    fun = fun, options = options, envvars = envvars)
  result seconds      seed error traceback warnings
1   done   0.002 244508592  <NA>      <NA>     <NA>
                                  launcher worker
1 9aaadb8ba81c51e5122bb6ff503bee81f6909ed1      1
                                  instance
1 53db515f68c47be2cb0291d02bc0ca70716397b4+0000 UTC 2023-04-10 02:09 41.15 built target task_61 [0 seconds]
• +0000 UTC 2023-04-10 02:09 41.15 start target task_7+0000 UTC 2023-04-10 02:09 41.24 start target task_50+0000 UTC 2023-04-10 02:09 41.33 start target task_40+0000 UTC 2023-04-10 02:09 41.42 start target task_8+0000 UTC 2023-04-10 02:09 41.51 start target task_51+0000 UTC 2023-04-10 02:09 41.61 start target task_41+0000 UTC 2023-04-10 02:09 41.69 start target task_9+0000 UTC 2023-04-10 02:09 41.79 start target task_52+0000 UTC 2023-04-10 02:09 41.88 start target task_42+0000 UTC 2023-04-10 02:09 41.97 start target task_53+0000 UTC 2023-04-10 02:09 42.06 start target task_43+0000 UTC 2023-04-10 02:09 42.15 start target task_54+0000 UTC 2023-04-10 02:09 42.24 start target task_44+0000 UTC 2023-04-10 02:09 42.33 start target task_55+0000 UTC 2023-04-10 02:09 42.42 start target task_45+0000 UTC 2023-04-10 02:09 42.51 start target task_56+0000 UTC 2023-04-10 02:09 42.60 start target task_46+0000 UTC 2023-04-10 02:09 42.70 start target task_57+0000 UTC 2023-04-10 02:09 42.79 start target task_47+0000 UTC 2023-04-10 02:09 42.88 start target task_58+0000 UTC 2023-04-10 02:09 42.97 start target task_48+0000 UTC 2023-04-10 02:09 43.06 start target task_59+0000 UTC 2023-04-10 02:09 43.15 start target task_49+0000 UTC 2023-04-10 02:09 43.25 start target task_30+0000 UTC 2023-04-10 02:09 43.34 start target task_31+0000 UTC 2023-04-10 02:09 43.43 start target task_32+0000 UTC 2023-04-10 02:09 43.53 start target task_33+0000 UTC 2023-04-10 02:09 43.62 start target task_34+0000 UTC 2023-04-10 02:09 43.72 start target task_35+0000 UTC 2023-04-10 02:09 43.81 start target task_36+0000 UTC 2023-04-10 02:09 43.90 start target task_37+0000 UTC 2023-04-10 02:09 43.99 start target task_38+0000 UTC 2023-04-10 02:09 44.09 start target task_39+0000 UTC 2023-04-10 02:09 44.18 start target task_100+0000 UTC 2023-04-10 02:09 44.27 start target task_20+0000 UTC 2023-04-10 02:09 44.39 start target task_21+0000 UTC 2023-04-10 02:09 44.49 start target task_22+0000 UTC 2023-04-10 02:09 44.58 start target task_23+0000 UTC 2023-04-10 02:09 44.67 start target task_24+0000 UTC 2023-04-10 02:09 44.77 start target task_25+0000 UTC 2023-04-10 02:09 44.86 start target task_26+0000 UTC 2023-04-10 02:09 44.96 start target task_27+0000 UTC 2023-04-10 02:09 45.05 start target task_28+0000 UTC 2023-04-10 02:09 45.15 start target task_29+0000 UTC 2023-04-10 02:09 45.24 start target task_10+0000 UTC 2023-04-10 02:09 45.33 start target task_11+0000 UTC 2023-04-10 02:09 45.43 start target task_12+0000 UTC 2023-04-10 02:09 45.52 start target task_13+0000 UTC 2023-04-10 02:09 45.61 start target task_14+0000 UTC 2023-04-10 02:09 45.70 start target task_15+0000 UTC 2023-04-10 02:09 45.80 start target task_16+0000 UTC 2023-04-10 02:09 45.90 start target task_17+0000 UTC 2023-04-10 02:09 45.99 start target task_90+0000 UTC 2023-04-10 02:09 46.09 start target task_18+0000 UTC 2023-04-10 02:09 46.19 start target task_91+0000 UTC 2023-04-10 02:09 46.28 start target task_19+0000 UTC 2023-04-10 02:09 46.37 start target task_92+0000 UTC 2023-04-10 02:09 46.47 start target task_93+0000 UTC 2023-04-10 02:09 46.57 start target task_94+0000 UTC 2023-04-10 02:09 46.66 start target task_95+0000 UTC 2023-04-10 02:09 46.75 start target task_96+0000 UTC 2023-04-10 02:09 46.85 start target task_97+0000 UTC 2023-04-10 02:09 46.94 start target task_98+0000 UTC 2023-04-10 02:09 47.03 start target task_99+0000 UTC 2023-04-10 02:09 47.13 start target task_80+0000 UTC 2023-04-10 02:09 47.22 start target task_81+0000 UTC 2023-04-10 02:09 47.32 start target 
task_82+0000 UTC 2023-04-10 02:09 47.41 start target task_83+0000 UTC 2023-04-10 02:09 47.51 start target task_84+0000 UTC 2023-04-10 02:09 47.60 start target task_85+0000 UTC 2023-04-10 02:09 47.70 start target task_86+0000 UTC 2023-04-10 02:09 47.79 start target task_87+0000 UTC 2023-04-10 02:09 47.89 start target task_88+0000 UTC 2023-04-10 02:09 47.98 start target task_89+0000 UTC 2023-04-10 02:09 48.08 start target task_70+0000 UTC 2023-04-10 02:09 48.18 start target task_71+0000 UTC 2023-04-10 02:09 48.27 start target task_72+0000 UTC 2023-04-10 02:09 48.37 start target task_73+0000 UTC 2023-04-10 02:09 48.47 start target task_74+0000 UTC 2023-04-10 02:09 48.57 start target task_75+0000 UTC 2023-04-10 02:09 48.66 start target task_76+0000 UTC 2023-04-10 02:09 48.76 start target task_77+0000 UTC 2023-04-10 02:09 48.86 start target task_78+0000 UTC 2023-04-10 02:09 48.96 start target task_79
Error in Exception(...) : 
  reached elapsed time limit [cpu=300s, elapsed=300s]
$connections
[1] 1

$daemons
                                                               online instance
ws://10.1.0.128:44051/53db515f68c47be2cb0291d02bc0ca70716397b4      1        1
                                                               assigned
ws://10.1.0.128:44051/53db515f68c47be2cb0291d02bc0ca70716397b4        2
                                                               complete
ws://10.1.0.128:44051/53db515f68c47be2cb0291d02bc0ca70716397b4        2

[1] 9923
[1] "dispatcher is running"
[1] "queue"
[1] 98
[1] "results"
[1] 0
[1] "pending tasks"
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
'unresolved' logi NA
[1] "finished tasks"+0000 UTC 2023-04-10 02:14 43.42 end pipeline [5.13 minutes]
> unlink("_targets", recursive = TRUE)
> targets:::tar_runtime$unset_fun()
> controller$terminate()
> 
> proc.time()
   user  system elapsed 
255.944  19.628 308.907 

@shikokuchuo
Contributor

shikokuchuo commented Apr 10, 2023

Is there something about GitHub Actions + R CMD check that would prevent tasks from resolving after the first few? Have you seen anything like this before? How would you suggest I troubleshoot?

This is likely new: I have not seen GitHub Actions hang on Ubuntu before with mirai. If this is a local controller, you should probably check whether the individual instances are alive or dead.

So many unresolved tasks seem to imply that something is causing the server to crash. But why this would happen after the first few succeed is strange... I will have to have a think.

@shikokuchuo
Contributor

It must be because GHA is using rcmdcheck rather than R CMD check. If you can replicate locally using the rcmdcheck package, then we know it is due to something that package does.

@wlandau
Member Author

wlandau commented Apr 10, 2023

Thanks for the ideas. At https://github.com/ropensci/targets/actions/runs/4657411118, I reproduced the same problem with raw R CMD check. From the output log at https://github.com/ropensci/targets/suites/12131227600/artifacts/639574667, it looks like almost none of the tasks are assigned (the "assigned" counter is only 3 out of 100). All tasks run correctly locally on my Ubuntu setup using rcmdcheck::rcmdcheck().

@shikokuchuo
Contributor

This is useful. There are 3 assigned and 2 complete. This means the computation is stuck at the server (or it has crashed or ended after 2 tasks - are you able to poll it independently using processx to check its status at this point?). This is exactly what you'd expect to see if the server had a maximum of 2 tasks.

As there is only one server, assigned is expected to increment by 1 each time, so nothing odd there.

@wlandau
Member Author

wlandau commented Apr 10, 2023

Yes, I polled the processx handle. When the pipeline reaches my manual timeout of 5 minutes, the server process is alive and sleeping.
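
For reference, a minimal sketch of the kind of check involved. The "sleep" child process below is only a stand-in for the crew worker process; obtaining the real handle from the controller is not shown here.

library(processx)
library(ps)
handle <- process$new("sleep", "60")    # stand-in for the crew worker process
handle$is_alive()                       # TRUE while the process still exists
ps_status(ps_handle(handle$get_pid()))  # e.g. "sleeping" when the process is idle on Linux
handle$kill()                           # clean up the stand-in process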

@shikokuchuo
Contributor

Yes, I polled the processx handle. When the pipeline reaches my manual timeout of 5 minutes, the server process is alive and sleeping.

Interesting, 'sleeping' means sleeping on a wait or poll rather than computing anything. Are you able to print the results of the 2 successful cases to check they are what we expect? The most likely hypothesis at the moment is that it somehow gets stuck sending back the evaluation result.
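
For example, something along these lines might work, assuming the _targets/ data store from the debug run above is still on disk (task_60 and task_61 were the two targets that completed):

library(targets)
tar_read(task_60)                                     # stored value of the first completed target
tar_read(task_61)                                     # stored value of the second completed target
tar_meta(fields = c("seconds", "error", "warnings"))  # metadata recorded for each target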

@shikokuchuo
Contributor

The other easy thing that can be done is to manually set the URL to say "abstract://mirai" - if this works then it could be some network issue specific to GH. But then you say it works outside of R CMD check, so I don't think it is this.

@wlandau
Member Author

wlandau commented Apr 10, 2023

Are you able to print the results of the 2 successful cases to check they are what we expect?

I will try. The last run completed no tasks within 5 minutes.

The other easy thing that can be done is to manually set the URL to say "abstract://mirai" - if this works then it could be some network issue specific to GH.

Maybe this will be possible, although crew gets the token from the worker websocket path and uses it in the metadata, so that might be tougher.

I am currently trying a spoofed version of the pipeline where targets just goes through the motions and sends only the bare minimum of code and data to mirai. Hopefully that will tell me whether the problem comes from running a target or from the way targets manages crew/mirai.

@wlandau
Member Author

wlandau commented Apr 10, 2023

The spoofed workflow at https://github.com/ropensci/targets/actions/runs/4658550150/jobs/8244363706 completed without the original problem. So it must be something to do with the work that happens inside a task rather than the way targets is invoking the interface of mirai via crew.

Meanwhile, I have been debugging on Windows, and I figured out that targets should not attempt to reset global options after each target. That finding was progress, but it does not solve the problem for Linux.

Are there sensitive environment variables in mirai that I should avoid setting or modifying?

@shikokuchuo
Contributor

shikokuchuo commented Apr 10, 2023

Are there sensitive environment variables in mirai that I should avoid setting or modifying?

Neither mirai nor nanonext relies on any environment variables/options.

The only thing mirai determines on load is Sys.info()[["sysname"]], and I don't think targets can modify this :)

@shikokuchuo
Contributor

shikokuchuo commented Apr 10, 2023

The spoofed workflow at https://github.com/ropensci/targets/actions/runs/4658550150/jobs/8244363706 completed without the original problem. So it must be something to do with the work that happens inside a task rather than the way targets is invoking the interface of mirai via crew.

Just in case, I thought I should mention that the .expr argument of mirai() now takes a language object directly, so if you had added workarounds for this before, they may not be needed any more.
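
For example, a sketch of the intended usage under that enhancement (the daemon setup and the command itself are placeholders for illustration):

library(mirai)
daemons(1)                              # one local daemon for the example
command <- quote(sum(stats::rnorm(10))) # a pre-built language object, like targets constructs internally
m <- mirai(.expr = command)             # pass the language object directly to .expr
call_mirai(m)                           # block until the task resolves
m$data                                  # the evaluation result
daemons(0)                              # reset daemons when done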

@wlandau
Member Author

wlandau commented Apr 10, 2023

Yes, thanks, I did see that. It will definitely be helpful to work directly with language objects. Sorry I have not responded to your original comment; I have been stuck on this bug for the last few days.

@shikokuchuo
Contributor

No need to explain - the integration with targets is much more important! I mentioned it just in case it was messing up the existing evaluation logic.

@wlandau
Member Author

wlandau commented Apr 11, 2023

Some progress: I was finally able to reproduce this using crew without targets: https://github.com/wlandau/crew/actions/runs/4662700123. The problem seems to be related to passing environments as task input and returning environments as output, as opposed to something targets does when it runs a task. I have not yet managed to create a mirai-only reproducible example, but I am still trying.

@wlandau
Member Author

wlandau commented Apr 11, 2023

After isolating the problem in shikokuchuo/mirai#53, I think it may be reasonable in the meantime to skip most of the crew tests on R CMD check. If/when shikokuchuo/mirai#53 is solved, maybe I will be able to unskip them.
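
For the record, a sketch of one possible skip helper. It assumes R CMD check can be detected via the _R_CHECK_PACKAGE_NAME_ environment variable; the helper name and the test are illustrative, not the exact code in this PR.

skip_on_r_cmd_check <- function() {
  testthat::skip_if(
    nzchar(Sys.getenv("_R_CHECK_PACKAGE_NAME_")),
    "crew test skipped during R CMD check"
  )
}

testthat::test_that("crew controller runs a local pipeline", {
  skip_on_r_cmd_check()
  # ... crew-based pipeline assertions go here ...
})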

@wlandau wlandau merged commit 13f5b31 into main Apr 11, 2023
@wlandau wlandau deleted the 753 branch April 11, 2023 20:21