parallel::makeCluster() freezes RStudio on macOS with R 4.x #6692
Comments
@RLumSK Thank you for raising the issue! I can reproduce this in RStudio Desktop on MacOS Catalina 10.15.4, but not in RStudio Server on Red Hat 7.8. Similarly, if I press on the stop sign quickly enough, it stops, but if I wait about 15 seconds, pressing on it aborts the R session. 1.2 on Mac behaves a little differently, in that the stop sign does nothing. We'll review this as we continue development of RStudio. |
Looks like another version of this issue: Looks like R now calls |
@kevinushey Thanks for the quick response and feedback. Actually, before I posted this issue, I had looked into the issue you mention. For me, a connection between these two issues was not so obvious, but clearly, you have better insight. The main reason why I flagged it was that, whatever it is, it prevents parallel jobs using 'parallel' on macOS from functioning in RStudio, so basically the issue (perhaps) now became more 'general'. |
I think I agree. Given that this issue basically means that the |
@gtritchie will look at this. |
@gtritchie: You might consider looking at a solution that makes use of The other option of course is re-exploring https://github.com/rstudio/rstudio-pro/pull/155 but seeing if we can figure out why this stalls RPCs and if there's an alternate approach we can take there. |
I updated my initial error report above. I don't know whether it helps, but if the timeout is set manually to < 1 s, then it still works (on my machine). |
[This is a confirmed workaround] I don't have macOS, so can't verify, but try with:
cl <- parallel::makeCluster(2, setup_strategy = "sequential")
If this works, you could add the following to your
## WORKAROUND: https://github.com/rstudio/rstudio/issues/6692
## Revert to 'sequential' setup of PSOCK cluster in RStudio Console on macOS and R 4.0.0
if (Sys.getenv("RSTUDIO") == "1" && !nzchar(Sys.getenv("RSTUDIO_TERM")) &&
    Sys.info()["sysname"] == "Darwin" && getRversion() >= "4.0.0") {
  parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
}
EDIT 2020-08-03: Updated to use |
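For anyone who would rather not rely on the unexported parallel:::setDefaultClusterOptions(), here is a minimal sketch of a per-call variant. It assumes the same conditions as the workaround above (RStudio Console, macOS, R >= 4.0.0). The helper name choose_setup_strategy() is hypothetical and not part of parallel or RStudio; it only decides which value to pass as setup_strategy.

```r
# Hypothetical helper (not part of 'parallel' or RStudio): choose the PSOCK
# setup strategy from the same conditions the workaround above checks.
choose_setup_strategy <- function(rstudio = Sys.getenv("RSTUDIO"),
                                  rstudio_term = Sys.getenv("RSTUDIO_TERM"),
                                  sysname = Sys.info()[["sysname"]],
                                  r_version = getRversion()) {
  # RSTUDIO == "1" with an empty RSTUDIO_TERM means the RStudio Console,
  # not the RStudio terminal pane.
  in_rstudio_console <- identical(rstudio, "1") && !nzchar(rstudio_term)
  affected <- in_rstudio_console &&
    identical(sysname, "Darwin") &&
    as.numeric_version(r_version) >= "4.0.0"
  if (affected) "sequential" else "parallel"
}

# Usage (starts real worker processes, so it is left commented out):
# cl <- parallel::makeCluster(2, setup_strategy = choose_setup_strategy())
# parallel::stopCluster(cl)
```

Passing the strategy explicitly keeps the change local to one call, at the cost of having to touch every place where a cluster is created.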
@HenrikBengtsson |
@HenrikBengtsson thanks that solves the problem for me as well. |
@HenrikBengtsson Thank you, this is an excellent and efficient intermediate solution. It also works for me. |
@HenrikBengtsson what does moving from "parallel" to "sequential" do in terms of performance? |
…luster in R 4.0: rstudio/rstudio#6692. Otherwise beset functions may fail when run in RStudio on macOS after upgrading R to v. 4.0.
Also failing for me:
10.14.16
This modification also worked for me
Any updates on this? |
Is being worked on, intent is to get it into a 1.3 bugfix release. |
Awesome, thanks! |
@gtritchie as I'm working on publishing a paper, can you give a very rough estimate when this 1.3 bug release will be out? |
@agilebean, see my #6692 (comment) above - it provides a simple workaround. |
The issue with the cluster persists in the latest daily build (1.5.17):
After 8 hours of trying various things, I'm pretty sure that this was not an RStudio issue (apologies for posting here). I don't know why it started working again, but a combination of starting in safe mode and then doing a hardware check (pressing D while restarting) seems to have repaired what multiple previous restarts didn't. Posting here in case other Mac users experience the same issue. I hope that's ok (if not, I can delete the posts). |
It turns out the issue I describe above re-occurs whenever I run brms with multiple cores (or perhaps other parallel processes) through RStudio (either the most recent stable release or the daily build 1.5.17) and my computer (MacPro 2013, MacOS 10.13) goes into sleep mode. The sampling process completes but the prompt is not returned. Somehow brm() does not recognize that the process has completed. Following such an event, any parallelization call (through RStudio or R) hangs, including cl <- parallel::makeCluster(1, setup_strategy = "sequential"). It does not freeze RStudio (I can press stop to interrupt the process). If I run cl <- future::makeClusterPSOCK(workers = 1, outfile = "", verbose = TRUE) I get the output posted in previous posts. Unfortunately, restarting the Mac in safe mode does not seem to fix the issue. The upshot is that I have no idea what 'fixed' the issue last time, and that I now again can't use multiple cores. Any pointers---including where else to ask for help---would be much appreciated. |
Update: apparently, writing
into /etc/hosts does the trick (thanks to a little note at the very end of https://www.javaer101.com/es/article/36653434.html). Removing this line makes the issue re-appear, adding it makes it disappear. Does this mean that when RStudio hangs after going to sleep on a multicore process it somehow modifies the hosts file?! |
@tfjaeger, when doing (*):
cl <- parallelly::makeClusterPSOCK(workers = nbr_of_workers)
internally you end up with the same as:
cl <- parallelly::makeClusterPSOCK(workers = rep("localhost", times = nbr_of_workers))
From your most recent comment, it sounds like your local machine didn't know what An alternative to editing
cl <- parallelly::makeClusterPSOCK(workers = rep("127.0.0.1", times = nbr_of_workers))
The same trick should also work when you use parallel, e.g.
cl <- parallel::makeCluster(workers = rep("127.0.0.1", times = nbr_of_workers))
(*) Note that FWIW, I've added it to parallelly's todo list to be able to control the default hostname via an option/env var (HenrikBengtsson/parallelly#51). |
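To make the "localhost" versus "127.0.0.1" choice automatic, something along these lines might help. It is only a sketch. It assumes utils::nsl() is available, which is the case on Unix-alikes only and where it returns NULL when a lookup fails. The helper name local_worker_host() is hypothetical and not part of parallel or parallelly.

```r
# Hypothetical helper: return "localhost" if the name resolves on this
# machine, otherwise fall back to the loopback address "127.0.0.1".
# 'resolver' defaults to utils::nsl(), which exists on Unix-alikes only; any
# error (including a missing nsl()) is treated as "does not resolve".
local_worker_host <- function(resolver = utils::nsl) {
  ip <- tryCatch(resolver("localhost"), error = function(e) NULL)
  resolves <- is.character(ip) && length(ip) == 1L && nzchar(ip)
  if (resolves) "localhost" else "127.0.0.1"
}

# Usage (starts real worker processes, so it is left commented out):
# nbr_of_workers <- 2
# cl <- parallelly::makeClusterPSOCK(
#   workers = rep(local_worker_host(), times = nbr_of_workers)
# )
# parallel::stopCluster(cl)
```

This does not repair a broken /etc/hosts; it only sidesteps the name lookup when creating local workers.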
Thank you @HenrikBengtsson. I agree that it's highly unusual that /etc/hosts is empty (it wasn't). Bizarrely, this problem (empty hosts file) seems to be itself a consequence of RStudio freezing at the end of parallelized sampling in brms::brm (as in a call to brm() with multiple cores that compiles, samples, but then hangs after all chains have completed sampling). When that freeze is resolved by force killing RStudio on my Mac, this seems to result in some system instability. When I then restart my Mac (OS 10.13), that seems to cause the hosts file to be empty afterwards. While I have been able to run parallel now (with the hosts file restored), I still run into the exact same brm issue (sampling completes and then it is stuck, as if the main rsession doesn't recognize that the separate R sessions started for the purpose of parallelization have completed). In search of a solution, I just installed the cmdstanr backend. With this alternative backend, brm does not seem to get stuck at the end of sampling (cmdstanr doesn't seem to create multiple rsessions, so I suspect that this is the reason). |
No, it does not.
I have no idea why this would be the case, but (unless I'm missing something) RStudio does not touch the |
Well, fwiw, I just had the same issue recur on my laptop (MacOS 10.13.6, Macbook Pro 2017; RStudio 1.2.1335; R 3.6.0), and again adding localhost to /etc/hosts fixed the issue. I note that---like for my other computer---this issue was absent until recently, and occurred without a software update. |
Just want to inquire about the plan going forward with this In the most recent version, you're falling back to rstudio/src/cpp/session/modules/SessionPatches.R Lines 35 to 40 in 98979de
#6692 (comment) mentioned it occurred on Manjaro Linux as well. The underlying issue was a bug in R (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=18119) which should be fixed in R-devel now. My guess is that perhaps the issue only affects certain versions of the C standard library; e.g. perhaps most glibc versions are okay, but musl or other C standard library implementations don't accept the timespec struct produced by R. |
Thxs. Ah, I see. So, there's nothing to be fixed in RStudio, other than rolling back the workaround patch when fixed in R. I didn't make the connection to that bug fix in R-devel on 2021-06-10. Hopefully, it'll also make its way into R 4.1.1. |
One more thing, just so I get it correct: From your https://bugs.r-project.org/bugzilla/show_bug.cgi?id=18119#c0 report, it looks like this is not specific to the RStudio Console. Is that correct? If so, I don't understand why some report that the problem goes away when they try the same thing running R in the terminal, e.g. #1997 (comment) and HenrikBengtsson/future#511 (comment). So, I'm still a bit confused about what's going on here and whether this workaround is also needed outside the RStudio Console. |
The bug only affects R front-ends that set R_wait_usec; AFAIK the built-in R front-ends don't do that. You can reproduce in a plain R console if you explicitly set R_wait_usec (via C++ code or similar; see my example in the ticket) |
Got it. Thanks for clarifying. Very helpful. |
FWIW, the tcltk package may set
R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0 (64-bit)
...
> loadNamespace("tcltk")
<environment: namespace:tcltk>
Warning message:
In fun(libname, pkgname) : couldn't connect to display ":50"
> cl <- parallel::makePSOCKcluster(1, setup_timeout = 15.0)
Error in parallel::makePSOCKcluster(1, setup_timeout = 15) :
Cluster setup failed. 1 worker of 1 failed to connect.
Execution halted
There's another package test without This tcltk "interaction" seems to only affect macOS. |
@HenrikBengtsson Thank you for the investigation! |
I have the same issue. As my simulation depends on it entirely and I will fail my thesis if I cannot fix this problem, I feel very lost and frustrated. The problem:
Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘IRMW1234’.
Any help is much appreciated. |
I accidentally fixed it. FYI: there were multiple different versions of R and R libraries installed on my account, which I used on different computers. After deleting all the R versions found on my account and installing R again on the computer rather than on my account, it all works perfectly fine. I don't know more details, but I hope it helps. |
Just here to chime in - this problem has plagued me for years, and still continues with R 4.3.2. |
System details
Steps to reproduce the problem
Same if the number of nodes is > 1.
Describe the problem in detail
I have got a (so far) unpublished R package using MC simulations that suddenly did not work anymore. Every time when I tried to start the simulation RStudio went to 100 % CPU load and did nothing anymore. Sometimes, in particular, during the first seconds, I could press the 'stop' button to shutdown the R process. If I waited too long, I had to kill the RStudio process. I narrowed down the problem to the single code line shown above. However, plain R (in the terminal or the R GUI) does work without any problem. I did run the following combinations:
NA: not tested.
My best guess is that this issue is related to recent changes in 'parallel'
Important (and again), it works in the R terminal and the normal R GUI, so, therefore, I believe that something needs to be modified in RStudio (I might be wrong though). Side note: If I used the RStudio check and build functionality where I also test examples, RStudio works fine.
Update 2020-04-24
I realised that FREEZE is probably not correct; the default timeout for the connection between master and worker in parallel::makeCluster() is two minutes. If you wait the full time, R will throw the following error message:
Describe the behaviour you expected
Calling
does not freeze RStudio or fail with the mentioned error message.
Update 2020-04-26, potential intermediate work-around
Since I had to continue testing with R 4.X and I wanted to use RStudio*, I played a little bit with the parameters, and (at least on my machine), I figured out the following:
If I use the parameter setup_timeout and set it to < 1 s, it still works. Example:
I tested up to 100 nodes, without any problem. Obviously this is not a fix, and it may be different on other machines (not to mention that one may want to have these longer timeouts). However, it may provide an intermediate solution until a fix can be provided. Values > 1 s consistently did not work for me; according to the manual entry for parallel::makeCluster, the default is 2 min.
Not applicable, it does not really crash.
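For readers who want to try the setup_timeout workaround from the 2020-04-26 update, a minimal sketch could look like the following. The wrapper name make_cluster_short_timeout() and the 0.5 s default are assumptions for illustration only. The wrapper does nothing beyond passing setup_timeout (in seconds) to parallel::makeCluster(), and it refuses values of 1 s or more because those did not work on the reporter's machine.

```r
# Hypothetical wrapper: create a PSOCK cluster with a sub-second setup
# timeout, as described in the update above. 'timeout_sec' is in seconds.
make_cluster_short_timeout <- function(n_workers, timeout_sec = 0.5) {
  # Guard mirrors the observation above: only values below 1 s worked there.
  stopifnot(is.numeric(timeout_sec), length(timeout_sec) == 1L,
            timeout_sec > 0, timeout_sec < 1)
  parallel::makeCluster(n_workers, setup_timeout = timeout_sec)
}

# Usage (starts real worker processes, so it is left commented out):
# cl <- make_cluster_short_timeout(2)
# parallel::parSapply(cl, 1:4, function(i) i^2)
# parallel::stopCluster(cl)
```

On a slow machine a sub-second timeout may make worker setup fail for unrelated reasons, so this is best treated as a stopgap.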