cache filtered tables, suggestion of functionality #144

Closed
pompm opened this issue May 16, 2019 · 8 comments · Fixed by #267
pompm commented May 16, 2019

Hi,
for offline work, or to minimize hitting the web database while debugging scripts, I think the ability to cache filtered tables could be useful.
(The filters themselves are also cached in a separate file and simply tested for equality.)
I attach a suggestion.
Marek

get_eurostat_flt <- function(file, filters, cache_dir = NULL, ...) {
  # Resolve the cache directory: argument > option > tempdir fallback
  if (is.null(cache_dir)) {
    cache_dir <- getOption("eurostat_cache_dir", NULL)
    if (is.null(cache_dir)) {
      cache_dir <- file.path(tempdir(), "eurostat")
      if (!file.exists(cache_dir)) {
        dir.create(cache_dir)
      }
    }
  }
  if (!file.exists(cache_dir)) {
    stop("Cache dir ", cache_dir, " does not exist.")
  }
  cache_file <- file.path(cache_dir, paste0(file, ".rds"))

  # New functionality: cache the filters alongside the table and reuse
  # the cached table only when the filters match exactly
  cache_filters <- file.path(cache_dir, paste0(file, "_flt", ".rds"))
  if (file.exists(cache_filters)) {
    flt <- readRDS(cache_filters)
  } else {
    flt <- NA
  }
  if (identical(flt, filters) && file.exists(cache_file)) {
    tmp <- readRDS(cache_file)
    message("Reading cache file ", cache_file)
  } else {
    tmp <- get_eurostat(..., filters = filters)
    saveRDS(tmp, cache_file)
    message("Table is cached at ", cache_file)
    saveRDS(filters, cache_filters)
    message("Filters are cached at ", cache_filters)
  }
  tmp
}

Usage:
get_eurostat_flt(
  file = "cz_sk", id = "educ_thflds", time_format = "num",
  filters = list(
    indic_ed = "TC02_12",
    geo = c("CZ", "SK"),
    time = 2009:2010
  )
)

@antagomir (Member)

This might be useful. One problem is that (to my understanding) packages should not write to the user's disk, per CRAN guidelines. If caching could be implemented in working memory, that would help.

@jhuovari

The current caching already writes to the user's disk.

To me it is a bit unclear what this new functionality would add. I think get_eurostat() is already caching, with or without filters.

I also regret a bit that we added caching inside the get_eurostat() function, as it writes to the user's disk by default.

@antagomir (Member)

Oh, I forgot this. Nice that we went through it. We could consider taking caching outside the function.

@pompm (Author)

pompm commented May 16, 2019

To @jhuovari: I think get_eurostat() does not cache tables when filters are used; the code contains:

if (is.null(filters) || filters != "none") {
  cache <- FALSE
}

@jhuovari

OK, I forgot that.

The reason was, I think, that the caching check would have needed changes for filtered data. Currently the arguments are added to the file name, and that name is used to check whether the data is already in the cache. However, the filters argument could be too long for a file name.
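As an illustration of the length problem, here is a naive scheme that pastes the query arguments into the file name; args_to_name is a hypothetical helper, not the package's actual naming code:

```r
# Hypothetical sketch: build a cache file name directly from the query.
# Even a modest filters list pushes the name past sensible limits.
args_to_name <- function(id, filters) {
  paste0(id, "_", paste(unlist(filters), collapse = "_"), ".rds")
}

nm <- args_to_name(
  "educ_thflds",
  list(geo = c("AT", "BE", "BG", "CY", "CZ"), time = 2000:2020)
)
nchar(nm)  # already well over 100 characters for this small query
```

With a full list of geo codes and several dimensions, such names quickly exceed the 260-character limit on Windows.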

If you could make a PR, I think your solution would be fine. Right, @antagomir?

@antagomir (Member)

Yes. Looking fwd!

@pitkant (Member)

pitkant commented Aug 11, 2023

I return to this vintage issue from 2019, since I have been working on it now in 2023.

I have been thinking about how to rework caching so that it satisfies the following conditions:

  • Caching would be available for all downloads, not only whole datasets (bulk downloads, or complete datasets downloaded from the JSON API without a filters argument), as mentioned in this issue
  • The package would check whether a superset of a filtered dataset (essentially a bulk download or a complete dataset from the JSON API) is already available, making it unnecessary to download the filtered dataset from the API, as mentioned in issue Cached datasets #257
  • Fix, or actually remove, the bits of code that caused confusion about when to filter datasets and when not to, as mentioned in issue get_eurostat() does not save .rds files #258

While I loved the elegant solution of embedding parts of the query in the cache data file name, as mentioned above by @jhuovari, it made caching filtered datasets unfeasible, since filters can be of any length. Modern filesystems are not that finicky about long file names, but even 260 characters, the maximum on Windows, is quite limiting in my opinion. By hashing the query into an MD5 hash string we can circumvent this limitation and continue to use mostly the same code for checking whether a file has already been cached.
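The hashing idea can be sketched in a few lines of base R; query_hash below is a hypothetical helper for illustration, not the package's actual implementation:

```r
# Sketch: derive a short, fixed-length cache file name from an
# arbitrarily long query by hashing its serialized form.
# tools::md5sum works on files, so we serialize to a temp file first.
query_hash <- function(id, filters) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(list(id = id, filters = filters), NULL), tmp)
  unname(tools::md5sum(tmp))  # always a 32-character hex string
}

# The same query always maps to the same name; any change in the
# filters yields a different name, so the existing "is this file
# already cached?" check keeps working unchanged.
query_hash("educ_thflds", list(geo = c("CZ", "SK"), time = 2009:2010))
```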

By saving queries to a separate .json file in the temp folder, we can see whether a superset of a filtered dataset has already been saved to disk, in which case it can be filtered locally instead of downloading again.
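The superset check might look roughly like this; covers is a hypothetical helper assuming filters are named lists of accepted values per dimension, not the actual code from the PR:

```r
# Sketch: a previously cached query covers a new one if every dimension
# it restricts is also restricted in the new query to a subset of the
# cached values. Dimensions the cached query does not mention are
# unrestricted, i.e. fully present in the cached data.
covers <- function(cached_filters, new_filters) {
  # An unfiltered (bulk) download covers every query on that dataset.
  if (is.null(cached_filters)) return(TRUE)
  all(vapply(names(cached_filters), function(dim) {
    !is.null(new_filters[[dim]]) &&
      all(new_filters[[dim]] %in% cached_filters[[dim]])
  }, logical(1)))
}

covers(list(geo = c("CZ", "SK")), list(geo = "CZ"))  # TRUE: subset
covers(list(geo = "CZ"), list(geo = c("CZ", "SK")))  # FALSE: needs more
```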

By making caching possible for all downloads, instead of just some, we remove the confusion caused by a cache = TRUE argument that was silently overridden whenever the filters argument was something other than NULL or "none".

See draft PR #267 for code changes and testing. Any feedback much appreciated.

@pitkant pitkant linked a pull request Aug 11, 2023 that will close this issue
@pitkant pitkant added this to In progress in eurostat 4.0.0 Aug 11, 2023
@pitkant pitkant moved this from In progress to Done in eurostat 4.0.0 Aug 17, 2023
@pitkant pitkant mentioned this issue Nov 3, 2023
@pitkant (Member)

pitkant commented Dec 20, 2023

Closed with the CRAN release of package version 4.0.0.

@pitkant pitkant closed this as completed Dec 20, 2023