Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel styling #682

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from
Draft

Parallel styling #682

wants to merge 4 commits into from

Conversation

krlmlr
Copy link
Member

@krlmlr krlmlr commented Oct 20, 2020

via furrr.

Closes #277.

Closes #617 (=supersedes).


Things to consider:

  • check how suggested and imported CRAN packages use styler and what it means for parallelisation.
  • check impact on speed with pre-commit.
  • check impact with /style command in r-lib/actions.

@lorenzwalthert
Copy link
Collaborator

lorenzwalthert commented Oct 20, 2020

Thanks @krlmlr. Couple of points (some also discussed in #277):

  • I am not sure we should use {furrr}. Is it because of progress bars? We could just use future.apply or similar to not expand the dependency graph (I haven't checked if it actually does).
  • I think for this to work with future.apply() in Parallel styling of files by style_pkg & style_dir (#277) #617, we needed to pass absolute file paths to the styler because the current working directory set with withr::with_dir(). There must be workarounds as in Parallel styling of files by style_pkg & style_dir (#277) #617 or as discussed in Parallelise styling #277 (comment). Seems like {furrr} handles that out of the box?
  • The authors of the future package do not recommend changing the plan in package code (see help(future::plan())). I am not sure we really want to modify the plan. Why can't we leave to the user and they can set an appropriate plan, e.g. in their .Rprofile? Also, what numbers of cores should/are used with your suggestion. For styling one file or for testing, it's desirable to not turn multicore support on.

@codecov-io
Copy link

Codecov Report

Merging #682 into master will decrease coverage by 0.27%.
The diff coverage is 30.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #682      +/-   ##
==========================================
- Coverage   90.43%   90.15%   -0.28%     
==========================================
  Files          47       47              
  Lines        2226     2234       +8     
==========================================
+ Hits         2013     2014       +1     
- Misses        213      220       +7     
Impacted Files Coverage Δ
R/transform-files.R 93.91% <30.00%> (-6.09%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c50338f...e8615e8. Read the comment docs.

@krlmlr
Copy link
Member Author

krlmlr commented Oct 21, 2020

  • furrr::future_map_lgl() is a drop-in replacement for map_lgl(), gives a progress bar, and has seen an update very recently. It's a suggested package, the default is to revert to serial styling. (We might want to show a one-time message, though, and/or offer a parallel = NA argument.

  • It just works with furrr.

  • From ?plan:

    If you think it is necessary to modify the future strategy within a function, then make sure to undo the changes when exiting the function. This can be done using: ...

    and this is what we do here. I'd say this improves user experience -- users who style their code might not want to even learn about future and whatnot.

@lorenzwalthert
Copy link
Collaborator

lorenzwalthert commented Oct 21, 2020

I see.

  • With regard to the future strategy, I'd like to 1) check if a plan has been set explicitly, if yes, follow this (I don't know if this is possible?). If not, set temporarily to future::plan("multiprocess") and restore on exit. Because with the current implementation and furrr installed, the user has no option to decide on the strategy, which I think is against the spirit of future (apart from setting the testthat env variable, but that shouldn't be the way to avoid a certain future strategy).
  • Also, setting the future strategy to plan('multiprocess') on my macOS almost takes 2s, so we should really only do it if there are two or more files to style.
  • Regarding progress bar, I note that the .progress argument will be deprecated in the future in favour of an approach compatible with the progressr package, so not sure we should introduce this here. It is a very visible user-facing change. As I understand from Adding a progress bar by default in a package HenrikBengtsson/progressr#78, progressr is only at the beginning and at the moment, you can either force progress display as a developer (which is against the principles of progressr) or ask the user to wrap every call to styler into with_progress(), which 99% of the users won't do. Other global options are potentially considered later.
  • Another problem is that we don't get real time updates on the styled files as with sequential processing. Seems like a future.apply approach with progressr also does not allow this, so we might need to drop it.
slow_sqrt <- function(xs) {
  p <- progressor(along = xs)
  future.apply::future_lapply(xs, function(x) {
    message("Calculating the square root of ", x)
    Sys.sleep(2)
    p(sprintf("x=%g", x))
    sqrt(x)
  })
}

library(progressr)
handlers("progress")
with_progress(y <- slow_sqrt(1:10))
  • No matter how we decide in the end, we should add clear documentation and unit test various cases.

That said, I understand your point that 99% of the users probably don't want to bother with future strategies and progress bar settings but I think it's important to keep things transparent if possible.

@lorenzwalthert
Copy link
Collaborator

OK, I will try to finish this up.

@krlmlr
Copy link
Member Author

krlmlr commented Nov 29, 2020

Thanks! Happy to review.

The startup costs are a problem. On Linux we could start one child process with callr, which then could spawn subprocesses very efficiently with the multicore plan. We could do the same on Windows and multiprocess: In both cases this would also eliminate the problem of temporarily altering the local plan. What do you think?

@lorenzwalthert
Copy link
Collaborator

But going through another sub-process will probably not decrease the start-up costs, would it? Because on top of future::plan(), a new R session has to be started, so costs add up? What would be the advantages of using a sub-process? To avoid the RStudio incompatibility of forked processes on UNIX? I don't think much can go wrong with capturing the plan and setting it on exit.

@lorenzwalthert
Copy link
Collaborator

lorenzwalthert commented Dec 5, 2020

I thought about it again. There is an additional problem: When using future, setting a plan is time consuming, even when it's sequential. Hence, when there are less than 3 files to style and we don't set multisession, we should rather use map_lgl() instead of future_map_lgl(). Also, the user should have the option to fully control the process, as discussed in HenrikBengtsson/future#450. For that reason< I think we should use this heuristic with the env variable R_STYLER_USE_FUTURE: (to be added to the docs):
IMG_1237

I think we should make a short vignette for this.

@lorenzwalthert lorenzwalthert marked this pull request as draft June 3, 2021 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallelise styling
3 participants