Caching the set of packages installed in a package library #78

Open
hturner opened this issue Aug 31, 2023 · 2 comments

@hturner

hturner commented Aug 31, 2023

As described in Uwe's talk in the kick-off session on Day 1: create a database for each library of installed R packages.

This can help to speed up functions that check which packages are installed.

@gmbecker

We had a very good discussion with Uwe. Uwe and I are going to collaborate on implementing this feature. I will keep the issue updated. @bnaras can you put the notes you took during the meeting in a comment here?

@bnaras

bnaras commented Aug 31, 2023

Problem

installed.packages() takes a long time to execute the first time in a session when a large number of packages are installed in a library. (The subsequent invocations are fast because of caching.)
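
The difference between a cold and a cached call can be observed directly. This is a minimal sketch using only base R: `noCache = TRUE` is a documented argument of installed.packages() that bypasses the per-session cache. The variable names are illustrative, and the timings will depend on the machine, the file system and the size of the library.

```r
# First default call in a session reads the library and populates the cache.
t_first <- system.time(pkgs_first <- installed.packages())[["elapsed"]]

# A second default call can reuse the cached result.
t_cached <- system.time(pkgs_cached <- installed.packages())[["elapsed"]]

# noCache = TRUE forces the DESCRIPTION files to be read again.
t_fresh <- system.time(pkgs_fresh <- installed.packages(noCache = TRUE))[["elapsed"]]

cat(sprintf("first: %.3fs  cached: %.3fs  noCache: %.3fs\n",
            t_first, t_cached, t_fresh))
```

On a local disk with few packages the three numbers may be almost indistinguishable; the gap is expected to show mainly with large or network-mounted libraries.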

Impact

The issue is acute in settings where the library is shared via network-mounted drives, as is not uncommon in educational labs and similar environments. On Windows installations, even with fewer than 100 packages, the function takes 2 seconds or more on a reasonably powerful machine, as Uwe demonstrated. This is also a problem for RStudio users because, upon startup, RStudio seeks to ascertain all installed packages, which makes it unusable with a shared network library.

Core Issue

The running time of installed.packages() is dominated by reading the DESCRIPTION file of every installed package.
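
To make the cost concrete, here is a rough sketch of a scan that performs one file read per package. It is not the actual implementation of installed.packages(); `make_fake_library()` and `scan_library()` are hypothetical helpers, and the library is a throwaway directory of fake packages so that nothing real is touched.

```r
# Hypothetical helper: build a temporary "library" of n fake packages,
# each consisting only of a DESCRIPTION file.
make_fake_library <- function(n) {
  lib <- tempfile("lib")
  dir.create(lib)
  for (i in seq_len(n)) {
    pkg <- paste0("pkg", i)
    dir.create(file.path(lib, pkg))
    write.dcf(
      data.frame(Package = pkg, Version = "1.0.0", stringsAsFactors = FALSE),
      file = file.path(lib, pkg, "DESCRIPTION")
    )
  }
  lib
}

# Hypothetical helper: one read.dcf() call per package directory.
# The number of file reads grows linearly with the size of the library.
scan_library <- function(lib, fields = c("Package", "Version")) {
  desc_files <- file.path(list.dirs(lib, recursive = FALSE), "DESCRIPTION")
  desc_files <- desc_files[file.exists(desc_files)]
  rows <- lapply(desc_files, function(f) read.dcf(f, fields = fields)[1, ])
  do.call(rbind, rows)
}

lib <- make_fake_library(50)
pkgs <- scan_library(lib)
dim(pkgs)
```

On a network-mounted drive each of those reads is a round trip, which is where the delay accumulates.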

Proposal

Maintain an up-to-date database (we use the term loosely, for now) of installed packages so that the information is readily available for installed.packages() to exploit.

Desiderata are:

  • Ensuring integrity
  • Caching and synchronization/invalidation/rebuilding as needed
  • Ensuring it works with parallel installation processes (the Ncpus argument of install.packages() set greater than 1). The parallel installation process already calculates dependencies and puts the most important dependencies first.
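
One possible shape for the invalidation check is sketched below, purely as an illustration. `library_fingerprint()` and `cache_is_fresh()` are hypothetical helpers, and the fingerprint used here (the sorted names of the top-level package directories) is an assumption chosen for simplicity: it detects installs and removals but not an in-place upgrade of an existing package, so a real design would need something stronger.

```r
# Hypothetical helper: a cheap fingerprint of a library, here just the
# sorted names of its top-level package directories.
library_fingerprint <- function(lib) {
  sort(basename(list.dirs(lib, recursive = FALSE)))
}

# Hypothetical helper: trust a cache only if the fingerprint stored with it
# matches the library as it looks right now.
cache_is_fresh <- function(cache, lib) {
  !is.null(cache) && identical(cache$fingerprint, library_fingerprint(lib))
}

# Demonstration on a throwaway library.
lib <- tempfile("lib")
dir.create(lib)
dir.create(file.path(lib, "pkgA"))

cache <- list(fingerprint = library_fingerprint(lib))
fresh_before <- cache_is_fresh(cache, lib)  # nothing has changed yet

dir.create(file.path(lib, "pkgB"))          # simulate installing a package
fresh_after <- cache_is_fresh(cache, lib)   # stored fingerprint no longer matches
```

With parallel installs, several processes could reach the rebuild step at once, so whatever check is chosen would have to be paired with an atomic update of the stored object.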

Initial Approach

  • Figure out a mechanism to build up a serialized R object such as PACKAGES.RDS that reflects what's actually installed.
  • Allow an environment variable to be set that enables keeping the object up to date by default, i.e. rebuilding the database whenever packages are installed or uninstalled. This may mostly be used by system administrators to keep things up to date automatically, but users could enable it as well if they wish.
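
The two points above could be sketched roughly as follows. Every name here is hypothetical and not part of R: the environment variable `R_LIBRARY_DB_AUTO_UPDATE`, the file name `installed_db.rds`, and the functions `build_library_db()`, `auto_update_enabled()` and `after_install_hook()`. Storing the file inside the library directory is likewise only an assumption for the sake of the example.

```r
# Hypothetical: scan the library once and store the result as a serialized object.
build_library_db <- function(lib, db_file = file.path(lib, "installed_db.rds")) {
  desc_files <- file.path(list.dirs(lib, recursive = FALSE), "DESCRIPTION")
  desc_files <- desc_files[file.exists(desc_files)]
  rows <- lapply(desc_files, function(f) {
    read.dcf(f, fields = c("Package", "Version"))[1, ]
  })
  db <- do.call(rbind, rows)
  # Write to a temporary file first and rename it afterwards, so that a
  # reader is less likely to see a half-written database.
  tmp <- tempfile(tmpdir = lib, fileext = ".rds")
  saveRDS(db, tmp)
  file.rename(tmp, db_file)
  invisible(db)
}

# Hypothetical opt-in switch read from an environment variable.
auto_update_enabled <- function() {
  tolower(Sys.getenv("R_LIBRARY_DB_AUTO_UPDATE", "false")) %in% c("true", "1", "yes")
}

# Hypothetical hook to call after packages are installed or removed.
after_install_hook <- function(lib) {
  if (auto_update_enabled()) build_library_db(lib)
  invisible(NULL)
}

# Demonstration on a throwaway library with two fake packages.
lib <- tempfile("lib")
dir.create(lib)
for (p in c("pkgA", "pkgB")) {
  dir.create(file.path(lib, p))
  write.dcf(
    data.frame(Package = p, Version = "0.1.0", stringsAsFactors = FALSE),
    file = file.path(lib, p, "DESCRIPTION")
  )
}

Sys.setenv(R_LIBRARY_DB_AUTO_UPDATE = "true")
after_install_hook(lib)
db <- readRDS(file.path(lib, "installed_db.rds"))
```

A reader such as installed.packages() could then load the single .rds file instead of opening every DESCRIPTION file, provided the stored object is known to be current.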
