Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot very slow on larger package libraries #347

Closed
jmcphers opened this issue Dec 13, 2016 · 13 comments
Closed

Snapshot very slow on larger package libraries #347

jmcphers opened this issue Dec 13, 2016 · 13 comments

Comments

@jmcphers
Copy link
Member

@jmcphers jmcphers commented Dec 13, 2016

When the Packrat library is large, the .snapshotImpl() function can take a very long time. This is relevant because this function is run on RStudio's startup for Packrat projects, in order to annotate records for the files pane with their Packrat status. For example, for a project with 143 packages in the library, this takes almost three quarters of a minute:

> system.time(packrat:::getActionMessages("snapshot", "."))
   user  system elapsed 
 38.674   1.147  39.991 

All of the time appears to be spent in a recursive-descent getPackageRecords, which performs a huge amount of disk I/O and eventually generates full a package dependency tree weighing in at 2.4mb.

packrat/R/pkg.R

Lines 236 to 252 in b13dcaa

if (recursive) {
allRecords <- lapply(allRecords, function(record) {
deps <- getPackageDependencies(pkgs = record$name,
lib.loc = lib.loc,
available.packages = available)
if (!is.null(deps)) {
record$depends <- getPackageRecords(
deps,
project = project,
available,
TRUE,
lib.loc = lib.loc,
missing.package = missing.package,
check.lockfile = check.lockfile,
fallback.ok = fallback.ok
)
}

It seems like this could be made much more efficient by either:

a) re-using an existing dependency subtree if we've already generated it, or

b) changing the callers of getPackageRecords to accept a flat list of packages rather than a tree, and changing the recursive descent code to append packages to end of the list if not present.

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Dec 22, 2016

Unfortunately, the snapshot code is especially crufty and barring a full rewrite I think it's going to be quite difficult to do this correctly and without regression. :/

This commit contains a minimal change that should greatly improve performance in this case, however:

ac1a18c

The new assumption made here is that the package library is in a consistent state, and so recursive dependency search of packages installed within a library is unnecessary. This implies that, if a user were to get themselves into a state where a required package dependency was not installed, Packrat may no longer detect that and just literally snapshot what the user has.

Since an R user would have to take some explicit action to put their library into an inconsistent state like this, I think that in practice this won't be an issue.

@ras44
Copy link
Contributor

@ras44 ras44 commented Apr 2, 2018

@kevinushey I've run into this same issue and noticed that the change ac1a18c mentioned above looks like it has since been reverted:

recursive = TRUE)

My temporary solution to this has been:

packrat::set_opts(ignored.directories=c("all","my","R","src","directories")

Then for snapshots:

run packrat::snapshot(ignore.stale=TRUE, infer.dependencies=FALSE)

I am only using the first workflow mentioned in the README (snapshot as-you-go). By turning off the whole dependency inference functionality, it runs quite a bit faster.

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Apr 2, 2018

Yes, we unfortunately had to revert that change as it broke a couple internal RStudio applications using Packrat (some required recursive dependencies were not being discovered with that change).

I think this could ultimately be fixed with some intelligent caching of discovered dependencies.

@ras44
Copy link
Contributor

@ras44 ras44 commented Apr 4, 2018

Would you consider a global option that turns off all inference functionality? For people who exclusively work in the snapshot-as-you-go workflow, it might speed up functionality.

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Apr 4, 2018

I would be open to that, but the default would need to stay off.

@tchevri
Copy link

@tchevri tchevri commented Apr 11, 2019

That would be great to have - like everyone else, I love packrat - i think it's a fantastic tool. However, it does slow things down dramatically - up to the point of defeating the purpose. In my case, I have a project with an .Rnw file, calling R code with subfunctions and the constant error messages about failing to parse that file for dependencies are not just annoying - installing any package can take up to 20 or even 30 min... that's way too much. So I have tried the new packrat 0.50 option, options("packrat.dependency.discovery.disabled"=TRUE), but i still feel (mistakenly?) that the same dependency checks are happening (rstudio/rstudio#2859) - it's just no longer displaying messages... (the right panel with packrat and libraries keep flickering, Rstudio keeps hanging, and top right showing Git panel shows file appearing / disappearing - everything as before, i just do no longer get the failed to parse ....Rnw for dependencies) - I am still a newbie despite using packrat since almost the beginning - so apologies if anything i typed is dead wrong, it's my user impression! This said, with packrat off, everything runs lightening fast, narrowing down the issue... many thanks. PS: I am on alienware m17x running windows 10 and everything is fully updated, e.g. packrat 0.50, R 3.5.3, latest rstudio, etc...

@slopp
Copy link

@slopp slopp commented Apr 11, 2019

You might be interested in trying rstudio/renv which addresses many of these challenges while maintaining packrats API

@ras44
Copy link
Contributor

@ras44 ras44 commented Apr 12, 2019

I had the same experience as @tchevri, particularly when one project contained multiple packages and R Markdown documents: flickering right panel as libraries are rebuilt/loaded, but with little output as to what was going on to diagnose. I assumed it was due to knitting during the attempt to discover dependencies each time I snapshotted or installed libraries. Since I was using the snapshot-as-you-go approach, I just ignored all R/Rmd source files and manually snapshot whenever I felt the need to do so.

@slopp it looks like rstudio/renv still attempts to crawl R files, but it sounds like that might only be in the init() step. Does the dependency discovery only happen on init? If not, then it seems like this could lead to the same slow knitting/dependency issues as packrat- unless there have been some performance improvements. Is RStudio recommending transitioning to renv? Thank you!

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Apr 12, 2019

We're not recommending a transition to renv yet -- it's still in the early stages, but we welcome testing from users. We will eventually plan to transition away from packrat towards renv, but not until we're absolutely sure everything is rock-solid and stable.

renv has an improved dependency discovery system relative to Packrat. In particular, we no longer attempt to use rmarkdown / knitr to render R Markdown files to discover dependencies; instead, we implement our own simple parser that should handle such files better. Furthermore, this is only called during init(); snapshot() will not need to perform that kind of dependency discovery.

In addition, renv introduces a mechanism for ignoring certain files during dependency discovery if required. renv will respect the ignores in a project's .gitignore files; or, alternatively, you can write .renvignore files using the same format as a .gitignore. This can be helpful if you need renv to ignore directories with a large number of data files.

@ras44
Copy link
Contributor

@ras44 ras44 commented Apr 12, 2019

That sounds very promising. I will gladly chip in to test it out 👍

@tchevri
Copy link

@tchevri tchevri commented Apr 17, 2019

very interesting - just read the renv vignette
Cool stuff, as usual.
Do you happen to have an (approximate) timeline for when we should or can transition to this renv?
Will there be a recommended way to switch from packrat to renv?
Many thanks
thomas

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Apr 17, 2019

It's tough to give a timeline, but I would say hopefully by summer we'll be getting ready to submit to CRAN. Before that, we would welcome any and all user testing if you have the time to spare. In particular, it'd be useful to know if there's something that Packrat does (and you rely on) that we haven't yet implemented in renv that we should try to bring over.

@aronatkins
Copy link
Contributor

@aronatkins aronatkins commented Nov 4, 2021

The changes in #615 avoid repeatedly processing the same package dependency. Most of the cost in getPackageRecords was due to redundant work.

Given a packrat library with 130+ package created by installing tidyverse, rmarkdown, shiny, and devtools:

> system.time(packrat:::getActionMessages("snapshot", "."))
   user  system elapsed 
  2.366   0.409   2.841 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants