cache filtered tables, suggestion of functionality #144
This might be useful. One problem is that (to my understanding) CRAN guidelines say packages should not write to the user's disk. If caching could be implemented in working memory, that would help.
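A purely in-memory cache along these lines could be sketched with an R environment used as a session-local key-value store; the names `.cache_env` and `cached` are hypothetical, not part of the package:

```r
# Minimal sketch of caching in working memory only (no disk writes).
# An environment serves as a simple key-value store for the session.
.cache_env <- new.env(parent = emptyenv())

cached <- function(key, compute) {
  # Compute and store the value on first use; return the stored copy after.
  if (!exists(key, envir = .cache_env, inherits = FALSE)) {
    assign(key, compute(), envir = .cache_env)
  }
  get(key, envir = .cache_env, inherits = FALSE)
}
```

With this, e.g. `cached("educ_thflds", function() get_eurostat("educ_thflds"))` would hit the web service only once per session.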
The current caching already writes to the user's disk, so to me it is a bit unclear what this new functionality would add. I also regret a bit that we added caching inside the download function itself.
Oh, I forgot this; nice that we went through it. We could consider taking the caching outside.
To @jhuovari: I think that get_eurostat() does not cache tables with filters; there is:
OK, I forgot that. The reason was, I think, that the caching check would have had to change for filtered data. Currently the arguments are appended to the file name, and based on that name it is checked whether the data is already in the cache; a filter argument, however, could be too long for a file name. If you could make a PR, I think your solution would be fine. Right, @antagomir?
Yes. Looking fwd!
I return to this vintage issue from 2019, since I have been working on it now in 2023. I have been thinking about how to rework caching so that it would satisfy the following conditions:
While I loved the elegant solution of embedding parts of the query in the cache file name, as mentioned above by @jhuovari, it made caching filtered datasets unfeasible, since filters can be of any length. Modern filesystems are not that finicky about long file names, but even a 260-character file name, the Windows maximum, is quite long in my opinion. By hashing the query into an md5 string we can circumvent this limitation and keep mostly the same code for checking whether a file has already been cached.

By saving queries to a separate .json file in the temp folder, we can detect when a superset of a filtered dataset has already been saved to disk, in which case it can be filtered locally instead of re-downloaded.

By making caching possible for all downloads, instead of just some, we remove the confusion that came from caching only a subset of them.

See draft PR #267 for code changes and testing. Any feedback is much appreciated.
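The hashing idea can be sketched in base R; the helper name `query_hash` and the key format below are illustrative only, and the actual PR may build the hash differently:

```r
# Sketch: reduce an arbitrarily long query (id + filters) to a fixed-length
# md5 string usable as a cache file name. Base R has no string-md5 helper,
# so the key is round-tripped through a temp file for tools::md5sum().
query_hash <- function(id, filters) {
  key <- paste0(
    id, "/",
    paste(names(filters),
          vapply(filters, function(x) paste(x, collapse = ","), character(1)),
          sep = "=", collapse = ";")
  )
  tf <- tempfile()
  on.exit(unlink(tf))
  writeLines(key, tf)
  unname(tools::md5sum(tf))
}
```

The resulting 32-character hex string has the same length however long the filter list grows, so a file name such as `paste0(query_hash(id, filters), ".rds")` never exceeds filesystem limits.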
Closed with the CRAN release of package version 4.0.0 |
Hi,
for off-line work, or to minimize hitting the web database while debugging scripts, I think the ability to cache filtered tables could be useful.
(The filters are also cached, in a separate file, and simply tested for equality.)
I attach a suggestion.
Marek
get_eurostat_flt <- function(file, filters, cache_dir = NULL, ...) {
  # Resolve the cache directory: argument > option > tempdir fallback
  if (is.null(cache_dir)) {
    cache_dir <- getOption("eurostat_cache_dir", NULL)
    if (is.null(cache_dir)) {
      cache_dir <- file.path(tempdir(), "eurostat")
      if (!file.exists(cache_dir)) {
        dir.create(cache_dir)
      }
    }
  }
  if (!file.exists(cache_dir)) {
    stop("Cache dir ", cache_dir, " does not exist.")
  }
  cache_file <- file.path(cache_dir, paste0(file, ".rds"))
  # New functionality: cache the filters in a separate file next to the data
  cache_filters <- file.path(cache_dir, paste0(file, "_flt", ".rds"))
  if (file.exists(cache_filters)) {
    flt <- readRDS(cache_filters)
  } else {
    flt <- NA
  }
  # Reuse the cached table only if the filters match exactly
  if (identical(flt, filters) && file.exists(cache_file)) {
    tmp <- readRDS(cache_file)
    message("Reading cache file ", cache_file)
  } else {
    tmp <- get_eurostat(..., filters = filters)
    saveRDS(tmp, cache_file)
    message("Table is cached at ", cache_file)
    saveRDS(filters, cache_filters)
    message("Filters are cached at ", cache_filters)
  }
  return(tmp)
}
Usage:

get_eurostat_flt(
  file = "cz_sk", id = "educ_thflds", time_format = "num",
  filters = list(
    indic_ed = "TC02_12",
    geo = c("CZ", "SK"),
    time = 2009:2010
  )
)
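Since the function consults `getOption("eurostat_cache_dir")`, the cache location can also be set once per session instead of per call; the directory name below is just an example:

```r
# Point get_eurostat_flt() at a session-wide cache directory.
options(eurostat_cache_dir = file.path(tempdir(), "my_eurostat_cache"))
dir.create(getOption("eurostat_cache_dir"), showWarnings = FALSE)
```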