Parallelise styling #277

lorenzwalthert · 2017-11-08T21:48:26Z

Functions like style_pkg() could be parallelised well, at least in principle, which would speed up styling considerably.

For the pre-commit hook, we should probably disable this.

Edit: a related issue is caching of results, which also aims to improve speed. See #320.

lorenzwalthert · 2018-03-13T15:55:20Z

Reference: #370.

krlmlr · 2018-03-13T15:58:58Z

Could be on by default; it works for me except for the output of incomplete lines (to indicate progress). I think we can solve this by printing complete lines after completion in the multicore case.
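
One way to read the suggestion above, as a minimal sketch: each worker returns its status as a complete line, and the main process prints the lines once all workers have returned. The sketch uses parallel::mclapply() from base R as a stand-in for a forked (multicore) backend; style_one() and the file names are hypothetical placeholders, not styler functions.

```r
# Hypothetical stand-in for styling a single file: it returns a complete
# status line instead of printing partial progress from the worker.
style_one <- function(path) {
  # ... the actual styling would happen here ...
  sprintf("Styled %s", basename(path))
}

files <- c("R/a.R", "R/b.R") # illustrative paths only

# Forking is unavailable on Windows, so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
status <- parallel::mclapply(files, style_one, mc.cores = cores)

# Print complete lines only after all workers have returned.
cat(unlist(status), sep = "\n")
```

The trade-off is that nothing is shown while the workers are still running, so per-file progress is lost.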

We need to evaluate whether it's worth enabling multiprocess on Windows.

Check if purrr::map() already supports parallel execution when implementing this: tidyverse/purrr#121 (comment).

pat-s · 2019-05-10T13:50:31Z

Can you give me a heads-up on the current status here? I was planning to create a macro for tic (running on Travis CI for every build) that uses multicore support.

I guess no one knows when purrr will have a native integration so using furrr seems to be a good alternative right now?

lorenzwalthert · 2019-05-10T15:16:39Z

Yes, progress bars are an issue. We could first also implement a verbose argument as suggested in #375 and turn verbose off for multicore support.

We need to evaluate whether it's worth enabling multiprocess on Windows.

Why does the operating system matter here, and would you prefer future::multicore (forked processes, not supported on Windows) over future::multisession?

Using furrr sounds like a good plan to me. The only thing I am not sure about is how to choose the strategy. Because in the help file, it says:

Please refrain from modifying the future strategy inside your packages / functions, i.e. do not call plan() in your code. Instead, leave the control on what backend to use to the end user. This idea is part of the core philosophy of the future framework - as a developer you can never know what future backends the user have access to. Moreover, by not making any assumptions about what backends are available, your code will also work automatically with any new backends developed after you wrote your code.

If you think it is necessary to modify the future strategy within a function, then make sure to undo the changes when exiting the function. This can be done using:

oplan <- plan()
on.exit(plan(oplan), add = TRUE)
[...]

So not calling plan() in styler code would mean that users have to call it themselves if they want to turn on parallel processing, right? I guess that is the preferred option?
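
For the case where a function does change the strategy, the pattern quoted from the help file could be wrapped roughly as follows. This is only a sketch: with_temporary_plan() is a hypothetical helper that exists neither in styler nor in future; only future::plan(), future::sequential and future::nbrOfWorkers() are real functions.

```r
# Hypothetical helper (not part of styler or future): evaluate `code` under a
# temporary future strategy and restore the caller's strategy on exit.
with_temporary_plan <- function(strategy, code) {
  oplan <- future::plan() # remember the strategy the user had set
  on.exit(future::plan(oplan), add = TRUE) # restore it, even on error
  future::plan(strategy)
  force(code) # `code` is a promise, so it is evaluated only now
}

# Usage: query the number of workers while the sequential strategy is active.
if (requireNamespace("future", quietly = TRUE)) {
  n_workers <- with_temporary_plan(future::sequential, future::nbrOfWorkers())
}
```

Because the argument is evaluated lazily, any futures created inside `code` use the temporary strategy, and the user's own plan() is back in place once the call returns.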

lorenzwalthert · 2019-08-03T12:38:50Z

I did some more investigation on this.

furrr (which uses future under the hood) might indeed be a good solution, but we can also just use https://github.com/HenrikBengtsson/future.apply and https://github.com/HenrikBengtsson/progressr for progress bars.
However, there is a bigger problem. The multicore strategy is not available on Windows, leaving only multisession and cluster as parallel options for Windows users. Neither of them plays well with withr::with_dir(), however: essentially, they don't respect the working directory. Do you think we should file an issue in the future repo?

get_wd_from_temp_dir <- function() {
  withr::with_dir(tempdir(), {
    future.apply::future_sapply(1:2, function(x) {
      print(getwd())
    })
  })
}

future::plan(future::sequential)
get_wd_from_temp_dir() # works
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/Rtmpoj656A"
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/Rtmpoj656A"
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/Rtmpoj656A"
#> [2] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/Rtmpoj656A"
future::plan(future::multisession)
getwd()
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"
get_wd_from_temp_dir() # does not work
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"
#> [1] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"
#> [2] "/private/var/folders/8l/fhmv3yj12kncddcjqwc19tkr0000gr/T/RtmpdmE1EH/reprex34736a6dec94"

Created on 2019-08-03 by the reprex package (v0.3.0)

lorenzwalthert · 2019-08-03T12:39:55Z

We could probably work around that by converting paths to absolute paths at some point and not using withr::with_dir(), but I am not sure this is a good way to solve the problem. The working directory can be set in the low-level function makeClusterPSOCK(), but I don't know how to pass values there when just using plan(multisession).
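
A minimal sketch of the absolute-path workaround mentioned above, using base R only. absolute_paths() is a hypothetical helper, and the temporary directory and files exist purely for illustration; lapply() stands in for whatever parallel map would actually be used.

```r
# Hypothetical helper: resolve files given relative to `root` into absolute
# paths, so that nothing downstream depends on the working directory.
absolute_paths <- function(files, root = getwd()) {
  normalizePath(file.path(root, files), winslash = "/", mustWork = FALSE)
}

# Illustrative project layout in a temporary directory.
root <- tempfile("pkg")
dir.create(file.path(root, "R"), recursive = TRUE)
writeLines("x<-1", file.path(root, "R", "a.R"))
writeLines("y<-2", file.path(root, "R", "b.R"))

files <- list.files(root, pattern = "\\.R$", recursive = TRUE)
paths <- absolute_paths(files, root)

# A worker can read each file regardless of its own working directory.
contents <- lapply(paths, readLines)
```

The paths are resolved once on the main process, so workers started by multisession or cluster never need to call setwd().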

RichardJActon · 2019-08-25T20:56:10Z

I was going to ask whether you have considered plan(multiprocess), which runs multicore (forked) where available and multisession where not. However, I ran the code below to check how variables are passed to the child process, and apparently it does not use multicore in RStudio but falls back to multisession.

If you define a working-directory variable in package scope, it should, with some caveats, get passed by default to any child process in which it is referenced. It can also be done explicitly; see

tmpDir <- tempdir()

get_wd_from_temp_dir <- function() {
    future.apply::future_sapply(1:2, function(x) {
        withr::with_dir(tmpDir, {
            print(getwd())
        })
    })
}

future::plan(future::sequential)

get_wd_from_temp_dir()
#> [1] "/tmp/RtmpfOVKI1"
#> [1] "/tmp/RtmpfOVKI1"
#> [1] "/tmp/RtmpfOVKI1" "/tmp/RtmpfOVKI1"

future::plan(future::multiprocess)
#> Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
#> in future (>= 1.13.0) when running R from RStudio, because it is
#> considered unstable. Because of this, plan("multicore") will fall
#> back to plan("sequential"), and plan("multiprocess") will fall back to
#> plan("multisession") - not plan("multicore") as in the past. For more
#> details, how to control forked processing or not, and how to silence this
#> warning in future R sessions, see ?future::supportsMulticore

get_wd_from_temp_dir()
#> [1] "/tmp/RtmpfOVKI1"
#> [1] "/tmp/RtmpfOVKI1"
#> [1] "/tmp/RtmpfOVKI1" "/tmp/RtmpfOVKI1"

Created on 2019-08-25 by the reprex package (v0.3.0)

lorenzwalthert · 2020-03-05T20:45:06Z

I opened an issue: https://github.com/HenrikBengtsson/future.apply/issues/50

pat-s · 2020-03-05T21:00:39Z

Regarding parallel backends: Consider the callr backend. It solves many issues which multicore/multisession face.

HenrikBengtsson · 2020-03-06T17:38:07Z

As @RichardJActon suggests in #277 (comment), you could do:

withr::with_dir(tempdir(), {
  pwd <- getwd()
  future.apply::future_sapply(1:2, function(x) {
    setwd(pwd)
    print(getwd())
  })
})

This will work with all future backends that run on the localhost (e.g. multicore, multisession, future.callr::callr, future.batchtools::batchtools_local) and should produce an error for other types of backends that don't have access to the local filesystem of the main R session.

You can robustify this further by also testing for localhost, and provide more informative error messages, by using:

get_wd <- function() {
  structure(getwd(), hostname = Sys.info()[["nodename"]])
}

set_wd <- function(pwd) {
  target_hostname <- attr(pwd, "hostname")
  if (is.null(target_hostname)) {
    stop("Cannot change directory. Argument 'pwd' lacks attribute 'hostname'.")
  }
  hostname <- Sys.info()[["nodename"]]
  if (!identical(target_hostname, hostname)) {
    stop(sprintf("Cannot change directory on %s. Target directory %s is on another machine (%s).", sQuote(hostname), sQuote(pwd), target_hostname))
  }
  if (!utils::file_test("-d", pwd)) {
    stop(sprintf("No such directory on %s: %s", sQuote(hostname), sQuote(pwd)))
  }
  setwd(pwd)
}

and then

withr::with_dir(tempdir(), {
  pwd <- get_wd()
  future.apply::future_sapply(1:2, function(x) {
    set_wd(pwd)
    print(getwd())
  })
})

e.g.

Error in set_wd(pwd) : 
  Cannot change directory on ‘remote.machine.org’. Target directory '/tmp/hb/Rtmp8JlYCf' is on another machine ('alice-laptop').

For a full explanation, see HenrikBengtsson/future#363 (comment).

lorenzwalthert · 2020-03-07T12:17:40Z

Thanks for taking the time for such a detailed answer, @HenrikBengtsson. As of now, style_dir() and style_pkg() work like this:

change to root directory to style.

# simplified
for (file in files_to_style) {
  # read content
  # style content
  # write content
  # print status of styling
}

To make styling work with remote workers, I think we would have to change this to:

change to root directory to style.
read all content before calling a future backend

out <- future.apply::future_sapply(file_contents,
  function(content) {
    # style content
    # communicate before writing back? or only later
  }
)

writing back contents, potentially communicate styling results.
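
The read-first, write-last workflow outlined above could look roughly like this. It is a sketch under assumptions, not styler's implementation: it uses a PSOCK cluster from base R's parallel package in place of a future backend (separate R sessions, similar in spirit to multisession), and style_content() is a hypothetical placeholder that merely strips trailing whitespace.

```r
# Illustrative files in a temporary directory.
root <- tempfile("pkg")
dir.create(root)
writeLines("x <- 1   ", file.path(root, "a.R"))
writeLines("y <- 2", file.path(root, "b.R"))

# Placeholder for the real styling step (assumption): strip trailing blanks.
style_content <- function(lines) sub("[ \t]+$", "", lines)

# 1. Read all content on the main process before involving any worker.
paths <- list.files(root, pattern = "\\.R$", full.names = TRUE)
contents <- lapply(paths, readLines)

# 2. Style on the workers; they only ever see text, never the file system.
cl <- parallel::makeCluster(2)
styled <- tryCatch(
  parallel::parLapply(cl, contents, style_content),
  finally = parallel::stopCluster(cl)
)

# 3. Write back on the main process, touching only files that changed.
changed <- !mapply(identical, contents, styled)
for (i in which(changed)) writeLines(styled[[i]], paths[[i]])
```

Since the workers never touch the file system, the same structure would also work with workers that cannot see the local files, at the cost of writing everything only at the end.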

I think I don't like the fact that we can write files only at the very end of the process (please correct me if this assumption is wrong). I think it's nice when you style many files and you see progress (on the console and with git diff) as you go. Also, I don't know who would want to distribute a task like styling files on a cluster and it's almost always done interactively 😄. So, the simplest solution would probably be to:

Only allow strategies that resolve on localhost. Is there a way to check whether the currently active strategy is going to be resolved on localhost, other than enumerating the local backends (which is dangerous, because new local backends may appear in the future that we won't have added to our list)? I only found future:::is_localhost().
If checking the above is easy: if the current strategy is not going to resolve futures on localhost, temporarily change the strategy to sequential (is this a good fallback?), emit a warning, proceed, and restore the initial strategy on exit as described in ?plan.

This essentially would boil down to what @HenrikBengtsson suggested in #277 (comment).

Just one edge case: What if a remote machine has the same node name and the directory to change to exists, will it fail or not? Because I think it should fail. Otherwise, it will have side effects on the remote machine (i.e. styling the files there instead of the localhost I think).

HenrikBengtsson · 2020-03-07T15:56:50Z

Is there a way to check if a strategy is going to be resolved on the localhost or not?

No. The reason is that such features need to come in at the design level (as I mention in the corresponding future issue). That is major work to solve. I don't want to open up for half-baked things before then, because we would end up with too many hacks and a dead end in the long run.

I found future:::is_localhost()

That's completely different and definitely not meant for others to use.

Just one edge case: What if a remote machine has the same node name and the directory to change to exists, will it fail or not?

You'd need to add even more protections, e.g. write a unique file on master and verify that the worker sees it.
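
A minimal sketch of that extra protection, in base R. Both helpers are hypothetical names introduced for illustration: the main session writes a file with a random token into the target directory, and a worker would call worker_sees_token() before changing directory or writing anything.

```r
# Hypothetical helper, run on the main R session: write a uniquely named file
# holding a random token into `dir` and return its path and content.
write_token <- function(dir) {
  token <- paste(sample(c(letters, 0:9), 32, replace = TRUE), collapse = "")
  path <- tempfile(pattern = ".token-", tmpdir = dir)
  writeLines(token, path)
  list(path = path, token = token)
}

# Hypothetical helper, meant to run on a worker: TRUE only if the worker sees
# the very file the main session wrote, with the same content.
worker_sees_token <- function(marker) {
  file.exists(marker$path) &&
    identical(readLines(marker$path, warn = FALSE), marker$token)
}

# Usage, here within a single session for illustration.
dir <- tempfile("project")
dir.create(dir)
marker <- write_token(dir)
seen <- worker_sees_token(marker)
unlink(marker$path) # clean up the marker file afterwards
gone <- worker_sees_token(marker)
```

A remote machine that happens to share the node name and the directory path would still fail this check, because the token file exists only on the main session's file system.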

RichardJActon · 2020-03-09T12:18:30Z

I just tried a quick implementation of this in my fork https://github.com/RichardJActon/styler passing the directory to the workers and changing this line: https://github.com/RichardJActon/styler/blob/4a28e70617aaafbe9713c465983e562d9da8ee5f/R/transform-files.R#L23 to the furrr version which seems to work fine. The progress only shows up in chunks as the workers return though.

lorenzwalthert · 2020-03-09T13:16:37Z

Nice. Interested in a PR? Yeah I think that's not so desirable. Any idea how to fix it? Also, why furrr and not future.apply directly?

pat-s · 2020-03-09T13:35:05Z

Also, why furrr and not future.apply directly?

Just my 2 cents while watching: {furrr} hasn't seen a commit in 2 years, so I'd rely on {future.apply}, especially for package use.
There is nothing {furrr} can do that {future.apply} cannot.

RichardJActon · 2020-03-09T14:16:09Z

Using future.apply instead is fine; I just use furrr a fair bit day-to-day for the drop-in purrr::map* replacements. I'm not really sure where to start on getting progress to show as individual files are completed; I suspect that it would be pretty complex. I could do a PR. I'm a first-time contributor here, so is there anything in particular I need to do before I open one? Also, should I add an example of using the parallelised versions to the examples in the style_pkg & style_dir docs, something like:

library(future)
plan(multisession, workers = 4)
style_pkg()

lorenzwalthert · 2020-05-19T08:09:06Z

Some new updates to {progressr} that are relevant here too: https://twitter.com/henrikbengtsson/status/1260336782421323777

lorenzwalthert · 2021-02-02T16:12:15Z

@krlmlr any progress on that with the {callr} approach?

krlmlr · 2021-02-02T16:55:28Z

Not yet.

lorenzwalthert added the Complexity: High, Priority: Low, Status: Postponed, Type: Infrastructure and Priority: Medium labels and removed the Priority: Low label Nov 8, 2017

lorenzwalthert closed this as completed Nov 8, 2017

lorenzwalthert mentioned this issue Mar 13, 2018

Multicore support #370

Closed

lorenzwalthert reopened this Aug 3, 2019

lorenzwalthert mentioned this issue Mar 6, 2020

Working directory set with withr::with_dir() not respected in multisession HenrikBengtsson/future#363

Open

RichardJActon pushed a commit to RichardJActon/styler that referenced this issue Mar 10, 2020

added issue number (r-lib#277) to news entry

2a9646f

RichardJActon mentioned this issue Mar 10, 2020

Parallel styling of files by style_pkg & style_dir (#277) #617

Closed

krlmlr linked a pull request Oct 20, 2020 that will close this issue

Parallel styling #682

Draft

lorenzwalthert mentioned this issue Nov 28, 2020

Future for package developers and temporary overriding of plans HenrikBengtsson/future#450

Open
