Maybe use a cache for larger data sets #2

Closed
DavisVaughan opened this issue Dec 2, 2019 · 12 comments
Labels
feature: a feature request or enhancement

Comments

@DavisVaughan
Member

DavisVaughan commented Dec 2, 2019

To get around CRAN's package size limit, we could host the data sets on GitHub in this repo, point URLs at them, and cache the downloaded files on the user's machine.

I imagine it would look like:

data_ames <- function() {
  if (has_data_in_cache("ames")) {
    get_data_from_cache("ames")
  } else {
    get_data_from_url_and_cache_it("ames")
  }
}

We could follow the lead of pak, which uses the following function to determine where R's global permanent cache is:

https://github.com/r-lib/pak/blob/e65de1e9630dbfcaf1044718b742bf806486b107/R/utils.R#L84

and then we could save into <cache-path>/model-data/ames.rds
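A minimal sketch of the disk-cache helpers named in the pseudocode above. The function bodies, the `cache_dir` and `base_url` arguments, and the placeholder URL are all assumptions for illustration; `tools::R_user_dir()` (R >= 4.0) stands in for the pak-style cache lookup and is not necessarily what the package would use.

```r
# Hypothetical helpers: names come from the pseudocode, bodies are assumptions.
cache_file <- function(name, cache_dir) {
  file.path(cache_dir, "model-data", paste0(name, ".rds"))
}

has_data_in_cache <- function(name, cache_dir) {
  file.exists(cache_file(name, cache_dir))
}

get_data_from_cache <- function(name, cache_dir) {
  readRDS(cache_file(name, cache_dir))
}

get_data_from_url_and_cache_it <- function(name, cache_dir, base_url) {
  dest <- cache_file(name, cache_dir)
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  # Download to a temporary file first so a failed download
  # never leaves a partial file in the cache.
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp), add = TRUE)
  utils::download.file(paste0(base_url, "/", name, ".rds"), tmp,
                       mode = "wb", quiet = TRUE)
  file.copy(tmp, dest, overwrite = TRUE)
  readRDS(dest)
}

# `base_url` is a placeholder, not the real location of the data.
data_ames <- function(cache_dir = tools::R_user_dir("modeldata", which = "cache"),
                      base_url = "https://example.com/modeldata") {
  if (has_data_in_cache("ames", cache_dir)) {
    get_data_from_cache("ames", cache_dir)
  } else {
    get_data_from_url_and_cache_it("ames", cache_dir, base_url)
  }
}
```

Passing `cache_dir` explicitly keeps the helpers easy to test against a temporary directory.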

To be even faster, we would only load the data once per R session. Once we load it from the cache directory, we would store it in an environment internal to modeldata and pull it from there each time data_ames() is called. So it might look more like:

data_ames <- function() {
  if (has_data_in_internal_environment("ames")) {
    get_data_from_internal_environment("ames")
  } else if (has_data_in_cache("ames")) {
    get_data_from_cache("ames")
  } else {
    get_data_from_url_and_cache_it("ames")
  }
}
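The once-per-session layer could be sketched as below. The environment name `.data_env` and the `load_once()` wrapper are hypothetical; `loader` stands in for whatever cache or URL lookup is used underneath.

```r
# Package-internal environment; it lives for the duration of the R session.
.data_env <- new.env(parent = emptyenv())

has_data_in_internal_environment <- function(name) {
  exists(name, envir = .data_env, inherits = FALSE)
}

get_data_from_internal_environment <- function(name) {
  get(name, envir = .data_env, inherits = FALSE)
}

# Hypothetical wrapper: call `loader` only on the first request for `name`,
# then serve every later request from the in-memory environment.
load_once <- function(name, loader) {
  if (!has_data_in_internal_environment(name)) {
    assign(name, loader(name), envir = .data_env)
  }
  get_data_from_internal_environment(name)
}
```

With this in place, `data_ames()` would reduce to something like `load_once("ames", loader)`, where `loader` does the disk-cache or download step.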

The datasets themselves would actually live in a folder in this repo that would be .Rbuildignore-d. For example: inst/data/ames.rds and then ignore inst/data

@EmilHvitfeldt
Member

https://github.com/emilhvitfeldt/textdata is an expanded version of what you are proposing. You are free to take bits and pieces as you need.

@DavisVaughan
Member Author

Yea that looks great! I also didn't know rappdirs was an actual package. This will be helpful too

https://github.com/EmilHvitfeldt/textdata/blob/2b5e9f7bd8b6b722970d2c5b54a8989f542d252f/R/load_dataset.R#L10

@EmilHvitfeldt
Member

yes rappdirs is gonna save you a lot of headaches

@DavisVaughan
Member Author

DavisVaughan commented Dec 2, 2019

Alternatively, gargle stores things in <os-home>/.R/gargle/gargle-oauth, with the additional opportunity to override this with a global option: options("gargle_oauth_cache" = <path>)

@DavisVaughan
Member Author

DavisVaughan commented Dec 2, 2019

Should have:

  • A function to clean the cache
  • A function to check if a data set is outdated
  • Maybe create subfolders by package version?
  • Provide a way to access the package version specific data if you have downloaded it in the past, even if you are on a higher version of the package (for reproducibility)
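A rough sketch of how version subfolders, cache cleaning, and access to older versions might fit together. All function names and the `<cache-dir>/model-data/<version>/<name>.rds` layout are assumptions; the outdated-data check is left out because it depends on how the remote files would be versioned.

```r
# Assumed layout: <cache_dir>/model-data/<version>/<name>.rds
versioned_cache_dir <- function(cache_dir, version) {
  file.path(cache_dir, "model-data", as.character(version))
}

# List the package versions that have data in the cache.
cached_versions <- function(cache_dir) {
  root <- file.path(cache_dir, "model-data")
  if (!dir.exists(root)) {
    return(character())
  }
  basename(list.dirs(root, recursive = FALSE))
}

# Remove one version's data, or the whole cache when `version` is NULL.
clean_cache <- function(cache_dir, version = NULL) {
  target <- if (is.null(version)) {
    file.path(cache_dir, "model-data")
  } else {
    versioned_cache_dir(cache_dir, version)
  }
  unlink(target, recursive = TRUE)
  invisible(!dir.exists(target))
}

# Read data that was cached under a specific (possibly older) package version.
get_versioned_data <- function(name, cache_dir, version) {
  path <- file.path(versioned_cache_dir(cache_dir, version),
                    paste0(name, ".rds"))
  if (!file.exists(path)) {
    stop("No cached copy of '", name, "' for version ", version, call. = FALSE)
  }
  readRDS(path)
}
```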

@EmilHvitfeldt
Member

should also have a way to overwrite the file path

@DavisVaughan
Member Author

DavisVaughan commented Dec 2, 2019

yea i think the order of figuring out what path to use should look something like:

data_ames(path = NULL)

  • Use path if it is not NULL
  • Otherwise, use the global option modeldata.cache if it is not NA
  • Otherwise, fall back to the global cache path, saving into <cache-path>/model-data/ames.rds
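The lookup order above might be sketched like this. `resolve_cache_path()` is a hypothetical helper, `path` is assumed to be a directory rather than a file, and `tools::R_user_dir()` (R >= 4.0) is only a stand-in for the global cache location.

```r
# Hypothetical helper: returns the directory the data should be read from or
# written to, following the three-step order proposed in the thread.
resolve_cache_path <- function(path = NULL,
                               default_cache = tools::R_user_dir("modeldata", which = "cache")) {
  # 1. An explicit `path` argument always wins.
  if (!is.null(path)) {
    return(path)
  }
  # 2. Then the `modeldata.cache` option; unset is treated the same as NA.
  opt <- getOption("modeldata.cache", default = NA_character_)
  if (!is.na(opt)) {
    return(opt)
  }
  # 3. Finally the global cache location.
  file.path(default_cache, "model-data")
}
```

A user could then redirect every data set at once with `options(modeldata.cache = "some/dir")`, or override a single call through `path`.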

@topepo
Member

topepo commented Dec 3, 2019

Perfect timing for some new data too. @trang1618 😃

@EmilHvitfeldt
Member

Would pins be able to solve this problem as well?

@topepo
Member

topepo commented Jan 6, 2020

datastorr looks perfect for this application.

@topepo added the "feature" (a feature request or enhancement) label Dec 4, 2020

@EmilHvitfeldt
Member

We are doing all this in https://github.com/tidymodels/modeldatatoo 😄

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 23, 2023