cache filtered tables, suggestion of functionality #144

Closed
pompm opened this issue May 16, 2019 · 8 comments · Fixed by #267
pompm commented May 16, 2019

Hi,
for offline work, or to minimize hitting the web database while debugging scripts, I think the ability to cache filtered tables could be useful.
(The filters themselves are also cached in a separate file and simply tested for equality.)
I attach a suggestion.
Marek

get_eurostat_flt <- function(file, filters, cache_dir = NULL, ...) {
  # Resolve the cache directory: argument > option > tempdir fallback
  if (is.null(cache_dir)) {
    cache_dir <- getOption("eurostat_cache_dir", NULL)
    if (is.null(cache_dir)) {
      cache_dir <- file.path(tempdir(), "eurostat")
      if (!file.exists(cache_dir)) {
        dir.create(cache_dir)
      }
    }
  }
  if (!file.exists(cache_dir)) {
    stop("Cache dir ", cache_dir, " does not exist.")
  }
  cache_file <- file.path(cache_dir, paste0(file, ".rds"))

  # New functionality: cache the filters alongside the table and reuse
  # the cached table only when the filters match exactly
  cache_filters <- file.path(cache_dir, paste0(file, "_flt", ".rds"))
  if (file.exists(cache_filters)) {
    flt <- readRDS(cache_filters)
  } else {
    flt <- NA
  }
  if (identical(flt, filters) && file.exists(cache_file)) {
    tmp <- readRDS(cache_file)
    message("Reading cache file ", cache_file)
  } else {
    tmp <- get_eurostat(..., filters = filters)
    saveRDS(tmp, cache_file)
    message("Table is cached at ", cache_file)
    saveRDS(filters, cache_filters)
    message("Filters are cached at ", cache_filters)
  }
  tmp
}

Usage:
get_eurostat_flt(
  file = "cz_sk", id = "educ_thflds", time_format = "num",
  filters = list(
    indic_ed = "TC02_12",
    geo = c("CZ", "SK"),
    time = 2009:2010
  )
)

@antagomir (Member)

This might be useful. One problem is that (to my understanding) packages should not write to the user's disk, per CRAN guidelines. If caching could be implemented in working memory, that would help.

@jhuovari

The current caching already writes to the user's disk.

To me it is a bit unclear what this new functionality would add. I think get_eurostat() is already caching, with or without filters.

I also regret a bit that we added caching inside the get_eurostat() function, as it writes to the user's disk by default.

@antagomir (Member)

Oh, I forgot this. Nice that we went through it. We could consider taking caching outside the function.

@pompm (Author)

pompm commented May 16, 2019

To @jhuovari: I think get_eurostat() does not cache tables when filters are used; the code contains:

if (is.null(filters) || filters != "none") {
  cache <- FALSE
}

@jhuovari

OK, I forgot that.

The reason was, I think, that the caching check would have needed changes for filtered data. Currently the arguments are added to the file name, and that name is used to check whether the data is already in the cache. However, the filters argument could be too long for a file name.
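As an illustration of the length problem, here is a naive scheme that pastes the query arguments into the file name; args_to_name is a hypothetical helper, not the package's actual naming code:

```r
# Hypothetical sketch: build a cache file name directly from the query.
# Even a modest filters list pushes the name past sensible limits.
args_to_name <- function(id, filters) {
  paste0(id, "_", paste(unlist(filters), collapse = "_"), ".rds")
}

nm <- args_to_name(
  "educ_thflds",
  list(geo = c("AT", "BE", "BG", "CY", "CZ"), time = 2000:2020)
)
nchar(nm)  # already well over 100 characters for this small query
```

With a full list of geo codes and several dimensions, such names quickly exceed the 260-character limit on Windows.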

If you could make a PR, I think your solution would be fine. Right, @antagomir?

@antagomir (Member)

Yes. Looking fwd!

@pitkant (Member)

pitkant commented Aug 11, 2023

I return to this vintage issue from 2019, since I have been working on it now in 2023.

I have been thinking about how to rework caching so that it satisfies the following conditions:

  • Caching would be available for all downloads, not only whole datasets (bulk downloads, or complete datasets downloaded from the JSON API without a filters argument), as mentioned in this issue
  • The package would check whether a superset of a filtered dataset (essentially a bulk download or a complete dataset from the JSON API) is already available, making it unnecessary to download the filtered dataset from the API, as mentioned in issue Cached datasets #257
  • Fix, or actually remove, the bits of code that caused confusion about when to filter datasets and when not to, as mentioned in issue get_eurostat() does not save .rds files #258

While I loved the elegant solution of embedding parts of the query in the cache data file name, as mentioned above by @jhuovari, it made caching filtered datasets unfeasible, since filters can be of any length. Modern filesystems are not that finicky about long file names, but even 260 characters, the maximum on Windows, is quite limiting in my opinion. By hashing the query into an MD5 hash string we can circumvent this limitation and continue to use mostly the same code for checking whether a file has already been cached.
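The hashing idea can be sketched in a few lines of base R; query_hash below is a hypothetical helper for illustration, not the package's actual implementation:

```r
# Sketch: derive a short, fixed-length cache file name from an
# arbitrarily long query by hashing its serialized form.
# tools::md5sum works on files, so we serialize to a temp file first.
query_hash <- function(id, filters) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(list(id = id, filters = filters), NULL), tmp)
  unname(tools::md5sum(tmp))  # always a 32-character hex string
}

# The same query always maps to the same name; any change in the
# filters yields a different name, so the existing "is this file
# already cached?" check keeps working unchanged.
query_hash("educ_thflds", list(geo = c("CZ", "SK"), time = 2009:2010))
```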

By saving queries to a separate .json file in the temp folder, we can see whether a superset of a filtered dataset has already been saved to disk, in which case it can be filtered locally instead of downloading again.
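The superset check might look roughly like this; covers is a hypothetical helper assuming filters are named lists of accepted values per dimension, not the actual code from the PR:

```r
# Sketch: a previously cached query covers a new one if every dimension
# it restricts is also restricted in the new query to a subset of the
# cached values. Dimensions the cached query does not mention are
# unrestricted, i.e. fully present in the cached data.
covers <- function(cached_filters, new_filters) {
  # An unfiltered (bulk) download covers every query on that dataset.
  if (is.null(cached_filters)) return(TRUE)
  all(vapply(names(cached_filters), function(dim) {
    !is.null(new_filters[[dim]]) &&
      all(new_filters[[dim]] %in% cached_filters[[dim]])
  }, logical(1)))
}

covers(list(geo = c("CZ", "SK")), list(geo = "CZ"))  # TRUE: subset
covers(list(geo = "CZ"), list(geo = c("CZ", "SK")))  # FALSE: needs more
```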

By making caching possible for all downloads, instead of just some, we remove the confusion caused by a cache = TRUE argument that was silently overridden whenever the filters argument was something other than NULL or "none".

See draft PR #267 for code changes and testing. Any feedback much appreciated.

@pitkant pitkant linked a pull request Aug 11, 2023 that will close this issue
@pitkant pitkant added this to In progress in eurostat 4.0.0 Aug 11, 2023
@pitkant pitkant moved this from In progress to Done in eurostat 4.0.0 Aug 17, 2023
@pitkant pitkant mentioned this issue Nov 3, 2023
@pitkant (Member)

pitkant commented Dec 20, 2023

Closed with the CRAN release of package version 4.0.0.

@pitkant pitkant closed this as completed Dec 20, 2023