parallel::makeCluster() freezes RStudio on macOS with R 4.x #6692

Closed

RLumSK opened this issue Apr 19, 2020 · 60 comments

RLumSK commented Apr 19, 2020

System details

RStudio Edition : Desktop
RStudio Version : 1.3.944
OS Version      : macOS Catalina 10.15.4 
R Version       : R-devel (2020-04-18 r78249)

Steps to reproduce the problem

cl <- parallel::makeCluster(1)

Same if the number of nodes is > 1.

Describe the problem in detail

I have a (so far) unpublished R package that uses MC simulations and suddenly stopped working. Every time I tried to start the simulation, RStudio went to 100 % CPU load and became unresponsive. Sometimes, particularly during the first few seconds, I could press the 'stop' button to shut down the R process. If I waited too long, I had to kill the RStudio process. I narrowed the problem down to the single line of code shown above. However, plain R (in the terminal or the R GUI) works without any problem. I tested the following combinations:

R version    RStudio 1.3.938    RStudio 1.3.944    RStudio 1.2.5042
R-devel      FREEZE             FREEZE             FREEZE
R 4.0 RC     NA                 FREEZE             FREEZE
R 3.6.3      NA                 OK                 NA

NA: not tested.

My best guess is that this issue is related to recent changes in 'parallel'.
Important (and again): it works in the R terminal and the standard R GUI, so I believe something needs to be modified in RStudio (I might be wrong, though). Side note: if I use the RStudio check and build functionality, where the examples are also run, RStudio works fine.

Update 2020-04-24

I realised that FREEZE is probably not accurate: the default timeout for the connection between master and worker in parallel::makeCluster() is two minutes. If you wait the full time, R throws the following error message:

> cl <- parallel::makeCluster(2)
Error in makePSOCKcluster(names = spec, ...) : 
  Cluster setup failed. 2 of 2 workers failed to connect.

Describe the behaviour you expected

Calling

cl <- parallel::makeCluster(1)

should not freeze RStudio or fail with the error message shown above.

Update 2020-04-26, potential intermediate work-around

Since I had to continue testing with R 4.x and wanted to use RStudio, I played around with the parameters a little and (at least on my machine) figured out the following:

If I use the parameter setup_timeout and set it to < 1 s, it still works. Example:

cl <- parallel::makeCluster(10, setup_timeout = 0.5)

I tested up to 100 nodes without any problem. Obviously this is not a fix, and it may behave differently on other machines (not to mention that one may want the longer timeouts). However, it may serve as an intermediate solution until a proper fix is available. Values > 1 s consistently did not work for me; according to the manual entry for parallel::makeCluster(), the default is 2 min.
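
A minimal sketch (not from the original report) of applying the same short timeout once per session instead of in every call, assuming parallel's internal setDefaultClusterOptions() accepts setup_timeout as it does setup_strategy:

## Sketch only: make the short setup_timeout the session default so every
## subsequent makeCluster() call picks it up (internal API, may change)
parallel:::setDefaultClusterOptions(setup_timeout = 0.5)
cl <- parallel::makeCluster(10)   # now behaves like setup_timeout = 0.5
parallel::stopCluster(cl)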

RLumSK changed the title from "parallel::makeCluster(1) freezes RStudio on macOS with R 4.x" to "parallel::makeCluster() freezes RStudio on macOS with R 4.x" on Apr 19, 2020

ronblum commented Apr 21, 2020

@RLumSK Thank you for raising the issue! I can reproduce this in RStudio Desktop on macOS Catalina 10.15.4, but not in RStudio Server on Red Hat 7.8. Similarly, if I press the stop sign quickly enough, it stops, but if I wait about 15 seconds, pressing it aborts the R session. 1.2 on Mac behaves a little differently, in that the stop sign does nothing. We'll review this as we continue development of RStudio.

ronblum added the bug label on Apr 21, 2020
@kevinushey

Looks like another version of this issue:

#1997

Looks like R now calls socketSelect(..., timeout = N) when creating a PSOCK cluster:

https://github.com/wch/r-source/blob/37d0a8a7b7f75c1ca15f532be38597d80966048a/src/library/parallel/R/snowSOCK.R#L201-L202


RLumSK commented Apr 22, 2020

@kevinushey Thanks for the quick response and feedback. Actually, before I posted this issue, I had looked into the issue you mention. To me, the connection between the two issues was not so obvious, but you clearly have better insight. The main reason I flagged it was that, whatever it is, it prevents parallel jobs using 'parallel' on macOS from functioning in RStudio, so the issue has (perhaps) now become more 'general'.

kevinushey added this to the v1.3-patch milestone on Apr 22, 2020
@kevinushey

I think I agree. Given that this issue basically means the parallel package no longer works at all with RStudio + R 4.0.0, we need to try and fix this for v1.3. (Unfortunately, it's too late for the imminent v1.3 release, but hopefully this can land in a follow-up patch release.)

@jmcphers

@gtritchie will look at this.

@kevinushey

@gtritchie: You might consider looking at a solution that makes use of .rs.registerReplaceHook(); I am thinking we could inject our own hook for socketSelect() that manages the value of R_wait_usec before and after socketSelect() is called.

The other option of course is re-exploring https://github.com/rstudio/rstudio-pro/pull/155 but seeing if we can figure out why this stalls RPCs and if there's an alternate approach we can take there.
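
A purely hypothetical sketch of the first option (not actual RStudio code): it assumes the hook receives the original function as its first argument, as RStudio's other replace hooks do, and the R_wait_usec handling is only indicated in comments since it would need native code:

.rs.registerReplaceHook("socketSelect", "base", function(original, socklist, write = FALSE, timeout = NULL)
{
   # (native code would lower R_wait_usec here so the blocking select
   #  does not starve the front-end's event loop; not possible from plain R)
   on.exit({
      # (native code would restore the previous R_wait_usec here)
   })
   original(socklist, write = write, timeout = timeout)
})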


RLumSK commented Apr 26, 2020

I updated my initial error report above. I don't know whether it helps, but if the timeout is manually set to < 1 s, it still works (on my machine).


HenrikBengtsson commented Apr 26, 2020

[This is a confirmed workaround]

I don't have macOS, so I can't verify, but try:

cl <- parallel::makeCluster(2, setup_strategy = "sequential")

If this works, you could add the following to your ~/.Rprofile until this is fixed in R/RStudio:

## WORKAROUND: https://github.com/rstudio/rstudio/issues/6692
## Revert to 'sequential' setup of PSOCK cluster in RStudio Console on macOS and R 4.0.0
if (Sys.getenv("RSTUDIO") == "1" && !nzchar(Sys.getenv("RSTUDIO_TERM")) && 
    Sys.info()["sysname"] == "Darwin" && getRversion() >= "4.0.0") {
  parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
}

EDIT 2020-08-03: Updated to use getRversion() >= "4.0.0" - not just ==
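
A minimal sanity check (not part of the original comment) to verify the workaround from the RStudio console:

cl <- parallel::makeCluster(2)            # should return promptly instead of hanging
parallel::clusterEvalQ(cl, Sys.getpid())  # confirm both workers respond
parallel::stopCluster(cl)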

@storopoli

@HenrikBengtsson
This works under Catalina 10.15.4 and R-4.0.0


jeroen commented Apr 27, 2020

@HenrikBengtsson thanks, that solves the problem for me as well.


RLumSK commented Apr 27, 2020

@HenrikBengtsson Thank you, this is an excellent and efficient intermediate solution. It also works for me.

@storopoli

@HenrikBengtsson what does moving from "parallel" to "sequential" do in terms of performance?


serbinsh commented May 22, 2020

Also failing for me:

> cl <- parallel::makeCluster(1)
Error in makePSOCKcluster(names = spec, ...) : 
  Cluster setup failed. 1 worker of 1 failed to connect.

macOS 10.14.6

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

This modification also worked for me

library(doParallel)
no_cores <- parallel::detectCores() - 1
# create the cluster for caret to use
# cl <- makePSOCKcluster(no_cores)
cl <- parallel::makeCluster(no_cores, setup_strategy = "sequential")
registerDoParallel(cl)
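
For context, registerDoParallel() makes the cluster available to foreach/caret; a minimal usage sketch (not from the original comment) with the cluster created above:

library(foreach)                        # attached together with doParallel
res <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)  # runs on the workers
print(res)
parallel::stopCluster(cl)               # release the workers when done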

@serbinsh

Any updates on this?

@gtritchie

It is being worked on; the intent is to get it into a 1.3 bugfix release.

@serbinsh

Awesome, thanks!

@agilebean

@gtritchie as I'm working on publishing a paper, can you give a very rough estimate of when this 1.3 bugfix release will be out?
I am just wondering whether I should revert to R 3.6 as an alternative.

@HenrikBengtsson

@agilebean, see my #6692 (comment) above - it provides a simple workaround.


tfjaeger commented Apr 30, 2021

The issue with the cluster persists in the latest daily build (1.5.17):

cl <- future::makeClusterPSOCK(workers = 1, outfile = "", verbose = TRUE)
[local output] Workers: [n = 1] ‘localhost’
[local output] Base port: 11429
[local output] Creating node 1 of 1 ...
[local output] - setting up node
[local output] - attempt #1 of 3
Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/6n/gc2x63fn3sl1bsp1wxxn__lc0000gn/T//RtmpccKAAE/worker.rank=1.parallelly.parent=7968.1f20433af893.pid")), silent = TRUE)' -e "file.exists('/var/folders/6n/gc2x63fn3sl1bsp1wxxn__lc0000gn/T//RtmpccKAAE/worker.rank=1.parallelly.parent=7968.1f20433af893.pid')"’

  • Possible to infer worker's PID: TRUE
    [local output] Starting worker #1 on ‘localhost’: '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/6n/gc2x63fn3sl1bsp1wxxn__lc0000gn/T//RtmpccKAAE/worker.rank=1.parallelly.parent=7968.1f20433af893.pid")), silent = TRUE)' -e 'workRSOCK <- tryCatch(parallel:::.slaveRSOCK, error=function(e) parallel:::.workRSOCK); workRSOCK()' MASTER=localhost PORT=11429 OUT= TIMEOUT=2592000 XDR=FALSE
    [local output] - Exit code of system() call: 0
    [local output] Waiting for worker #1 on ‘localhost’ to connect back
    starting worker pid=8035 on localhost:11429 at 16:16:43.514


tfjaeger commented May 1, 2021

After 8 hours of trying various things, I'm pretty sure that this was not an RStudio issue (apologies for posting here). I don't know why it started working again, but a combination of starting in safe mode and then doing a hardware check (pressing D while restarting) seems to have repaired what multiple previous restarts didn't. Posting here in case other Mac users experience the same issue. I hope that's ok (if not, I can delete the posts).


tfjaeger commented May 2, 2021

It turns out the issue I describe above recurs whenever I run brms with multiple cores (or perhaps other parallel processes) through RStudio (either the most recent stable release or the daily build 1.5.17) and my computer (MacPro 2013, macOS 10.13) goes into sleep mode. The sampling process completes but the prompt is not returned. Somehow brm() does not recognize that the process has completed.

Following such an event, any parallelization call (through RStudio or R) hangs, including cl <- parallel::makeCluster(1, setup_strategy = "sequential"). It does not freeze RStudio (I can press stop to interrupt the process). If I run cl <- future::makeClusterPSOCK(workers = 1, outfile = "", verbose = TRUE) I get the output posted in previous posts.

Unfortunately, restarting the Mac in safe mode does not seem to fix the issue. The upshot is that I have no idea what 'fixed' the issue last time, and that I once again can't use multiple cores. Any pointers, including where else to ask for help, would be much appreciated.


tfjaeger commented May 2, 2021

Update: apparently, writing

127.0.0.1 localhost

into /etc/hosts does the trick (thanks to a little note at the very end of https://www.javaer101.com/es/article/36653434.html). Removing this line makes the issue reappear; adding it makes it disappear. Does this mean that when RStudio hangs after going to sleep on a multicore process it somehow modifies the hosts file?!

@HenrikBengtsson

@tfjaeger, when doing (*):

cl <- parallelly::makeClusterPSOCK(workers = nbr_of_workers)

internally you end up with the same as:

cl <- parallelly::makeClusterPSOCK(workers = rep("localhost", times = nbr_of_workers))

From your most recent comment, it sounds like your local machine didn't know what localhost is, and you had to manually add a mapping to 127.0.0.1. That's unusual, but I guess it happens.

As an alternative to editing /etc/hosts, which not everyone has the admin rights to do, you can also do:

cl <- parallelly::makeClusterPSOCK(workers = rep("127.0.0.1", times = nbr_of_workers))

The same trick should also work when you use parallel, e.g.

cl <- parallel::makeCluster(rep("127.0.0.1", times = nbr_of_workers))

(*) Note that future::makeClusterPSOCK() is just an "alias" for parallelly::makeClusterPSOCK(), where the implementation lives these days. You can still use both, but one day we might ask everyone to specify parallelly::....

FWIW, I've added it to parallelly's todo list to be able to control the default hostname via an option/env var (HenrikBengtsson/parallelly#51).


tfjaeger commented May 3, 2021

Thank you @HenrikBengtsson. I agree that it's highly unusual that /etc/hosts is empty (it wasn't). Bizarrely, this problem (an empty hosts file) seems to itself be a consequence of RStudio freezing at the end of parallelized sampling in brms::brm (as in a call to brm() with multiple cores that compiles and samples, but then hangs after all chains have completed sampling). When that freeze is resolved by force-killing RStudio on my Mac, this seems to result in some system instability. When I then restart my Mac (OS 10.13), that seems to cause the hosts file to be empty afterwards.

While I have been able to run parallel now (with the hosts file restored), I still run into the exact same brm issue (sampling completes and then it is stuck, as if the main rsession doesn't recognize that the separate R sessions started for the purpose of parallelization have completed). In search of a solution, I just installed the cmdstanr backend. With this alternative backend, brm does not seem to get stuck at the end of sampling (cmdstanr doesn't seem to create multiple rsessions, so I suspect that this is the reason).

@kevinushey

Does this mean that when RStudio hangs after going to sleep on a multicore process it somehow modifies the hosts file?!

No, it does not.

Thank you @HenrikBengtsson. I agree that it's highly unusual that /etc/hosts is empty (it wasn't). Bizarrely, this problem (an empty hosts file) seems to itself be a consequence of RStudio freezing at the end of parallelized sampling in brms::brm (as in a call to brm() with multiple cores that compiles and samples, but then hangs after all chains have completed sampling). When that freeze is resolved by force-killing RStudio on my Mac, this seems to result in some system instability. When I then restart my Mac (OS 10.13), that seems to cause the hosts file to be empty afterwards.

I have no idea why this would be the case, but (unless I'm missing something) RStudio does not touch the /etc/hosts file.

@tfjaeger

Well, fwiw, I just had the same issue recur on my laptop (macOS 10.13.6, MacBook Pro 2017; RStudio 1.2.1335; R 3.6.0), and again adding localhost to /etc/hosts fixed the issue. I note that, like on my other computer, this issue was absent until recently and occurred without a software update.


HenrikBengtsson commented Jun 11, 2021

Just want to inquire about the plan going forward with this setup_strategy = "parallel" issue:

In the most recent version, you're falling back to setup_strategy = "sequential" in the RStudio Console across the board, regardless of operating system (done on 2021-03-08, cf. c9e657d):

.rs.registerPackageLoadHook("parallel", function(...)
{
   # enforce sequential setup of cluster
   # https://github.com/rstudio/rstudio/issues/6692
   parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
})

  1. @kevinushey, in your comment from 2021-01-24, this was only done for macOS users. Were there reports from other OSes that caused you to do this? (FWIW, the "parallel" strategy seems to work for me on Ubuntu 18.04 Linux in the RStudio Console with R 4.1.0.)

  2. Do you have a good understanding of the underlying issue, and if so, is it fixable? Do you anticipate that the RStudio Console will eventually support setup_strategy = "parallel"? Is there a timeline/ETA, or is this tracked under "someday"?

@kevinushey

#6692 (comment) mentioned it occurred on Manjaro Linux as well.

The underlying issue was a bug in R (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=18119) which should be fixed in R-devel now.

My guess is that perhaps the issue only affects certain versions of the C standard library; e.g. perhaps most glibc versions are okay, but musl or other C standard library implementations don't accept the timespec struct produced by R.

@HenrikBengtsson

Thxs. Ah, I see. So there's nothing to be fixed in RStudio, other than rolling back the workaround patch once it's fixed in R.

I didn't make the connection to that bug fix in R-devel on 2021-06-10. Hopefully, it'll also make its way into R 4.1.1.

@HenrikBengtsson

One more thing, just so I get it correct: from your https://bugs.r-project.org/bugzilla/show_bug.cgi?id=18119#c0 report, it looks like this is not specific to the RStudio Console. Is that correct? If so, I don't understand why some report that the problem goes away when they try the same thing running R in the terminal, e.g. #1997 (comment) and HenrikBengtsson/future#511 (comment). So, I'm still a bit confused about what's going on here and whether this workaround is also needed outside the RStudio Console.

@kevinushey

The bug only affects R front-ends that set R_wait_usec; AFAIK the built-in R front-ends don't do that.

You can reproduce it in a plain R console if you explicitly set R_wait_usec (via C++ code or similar; see my example in the ticket).
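
A hypothetical sketch of such a reproduction (not the example from the ticket), assuming the R_wait_usec global exported by R can be linked from compiled code via Rcpp:

Rcpp::cppFunction('
  void set_r_wait_usec(int usec) {
    R_wait_usec = usec;   // event-loop polling interval that GUI front-ends set
  }
', includes = 'extern "C" int R_wait_usec;')

set_r_wait_usec(10000)          # mimic a front-end polling every 10 ms
cl <- parallel::makeCluster(2)  # on affected R versions this call now stalls or fails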

@HenrikBengtsson

Got it. Thanks for clarifying. Very helpful.

@HenrikBengtsson

The bug only affects R front-ends that set R_wait_usec; AFAIK the built-in R front-ends don't do that.

FWIW, the tcltk package may set R_wait_usec. I just tracked down a case where this causes this problem on CRAN's macOS servers, which can be reproduced using rhub::check(platform="macos-highsierra-release-cran") and on GitHub Actions w/ R 4.1.0 on macOS:

R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0 (64-bit)

...

> loadNamespace("tcltk")
<environment: namespace:tcltk>
Warning message:
In fun(libname, pkgname) : couldn't connect to display ":50"
> cl <- parallel::makePSOCKcluster(1, setup_timeout = 15.0)
Error in parallel::makePSOCKcluster(1, setup_timeout = 15) : 
  Cluster setup failed. 1 worker of 1 failed to connect.
Execution halted

There's another package test without loadNamespace("tcltk") and that works just fine.

This tcltk "interaction" seems to only affect macOS.

@kevinushey

@HenrikBengtsson Thank you for the investigation!


squirreltoken commented Jul 19, 2021

I have the same issue. As my simulation depends on it entirely and I will fail my thesis if I cannot fix this problem, I feel very lost and frustrated:

The problem:

cl <- makeCluster(1) results in a computation that hangs and does not resolve for two minutes.

  • After two minutes we get a long error message:

Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘IRMW1234’.
The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
The localhost socket connection that failed to connect to the R worker used port 11085 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
Worker launch call: "C:/PROGRA~1/R/R-41~1.0/bin/x64/Rscript" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e "try(suppressWarnings(cat(Sys.getpid(),file="C:/Users/heja/AppData/Local/Temp/RtmpgvXRoz/worker.rank=1.parallelly.parent=2760.ac82e7022df.pid")), silent = TRUE)" -e "workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()" MASTER=IRMW1234 PORT=11085 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential.
Failed to kill loc

  • Today is Monday. It worked perfectly fine on Friday. On Saturday, the problem started. Nothing changed in between that I can recall. This is the worst.
  • I allowed RStudio as well as R through the Windows Firewall (this was mentioned a couple of times online).
  • I looked into etc/hosts. There we have an entry named localhost. However, when I ping localhost it reports another address, "::1". Might this be a problem? I don't have admin rights and tried this with the IT department. I use a computer which runs on the institutional network.
  • I use Win 10. The problem exists for ALL other computers (3) in this room as well. I don't know if it worked before on them, though.
  • Also tried: makeCluster("127.0.0.1", 2), same result. I also tried to put "IRM1234" there and also "::1".

Any help is much appreciated.

@squirreltoken

I accidentally fixed it. FYI: there were multiple different R versions and R libraries installed on my account, which I used on different computers. After deleting all the R versions found on my account and installing R on the computer itself (not on my account) again, it all works perfectly fine. I don't know more details, but I hope it helps.

@leedrake5

Just here to chime in: this problem has plagued me for years, and it still continues with R 4.3.2.
