Cached datasets #257

lz1nwm · 2023-03-09T16:49:59Z

By default, eurostat caches a dataset the first time it is requested during a session, but it does not check whether the cached table contains all the data needed to serve subsequent requests to the same Eurostat table. I'm not sure if this is the intended behaviour. Please see the following example:

> get_eurostat('nama_10_gdp', filters = list(geo = c('EA'), 
+                                            unit = c('CP_MEUR'), 
+                                            na_item = c('B1GQ'),
+                                            time = c(2020:2022)), 
+              time_format = "date_last")
Table nama_10_gdp cached at ...
# A tibble: 3 x 6
  freq  unit    na_item geo   time          values
  <chr> <chr>   <chr>   <chr> <date>         <dbl>
1 A     CP_MEUR B1GQ    EA    2020-12-31 11456918.
2 A     CP_MEUR B1GQ    EA    2021-12-31 12318505.
3 A     CP_MEUR B1GQ    EA    2022-12-31 13338550.

> get_eurostat('nama_10_gdp', filters = list(geo = c('DE'), 
+                                            unit = c('CP_MEUR'), 
+                                            na_item = c('B1GQ'),
+                                            time = c(2020:2022)), 
+              time_format = "date_last")
Reading cache file ...
# A tibble: 3 x 6
  freq  unit    na_item geo   time          values
  <chr> <chr>   <chr>   <chr> <date>         <dbl>
1 A     CP_MEUR B1GQ    EA    2020-12-31 11456918.
2 A     CP_MEUR B1GQ    EA    2021-12-31 12318505.
3 A     CP_MEUR B1GQ    EA    2022-12-31 13338550.


antagomir · 2023-03-09T18:55:34Z

Ah, right. Probably not intended, and it should be fixed as soon as time allows.

Could you consider making a PR?

lz1nwm · 2023-03-10T07:53:08Z

> Could you consider making a PR?

Unfortunately, I have no experience with PRs, but I'll see if I can do something.

antagomir · 2023-03-11T12:29:52Z

Here are some instructions:
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

pitkant · 2023-03-13T12:51:49Z

I can see the inconvenience, but I think it's debatable whether this is unintended behaviour. The point of caching is to make as few requests to Eurostat servers as possible, and a fix that constantly compared the cached file with the unfiltered remote file would create unnecessary web traffic between end users and Eurostat.

Caching can be easily disabled, although it is currently enabled by default. Maybe this is more of an issue related to documentation? Would adding some explicit messages when downloading and caching data make users more aware of this limitation?
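For readers who want to sidestep the cache in the meantime, a minimal sketch follows. It assumes the `cache` argument of `get_eurostat()` as documented in the package; the wrapper name `fetch_uncached` is hypothetical and only there for illustration.

```r
# Hypothetical wrapper: always query Eurostat directly instead of reading
# a previously cached (and possibly differently filtered) table.
fetch_uncached <- function(id, filters) {
  eurostat::get_eurostat(
    id,
    filters = filters,
    time_format = "date_last",
    cache = FALSE  # skip both reading from and writing to the local cache
  )
}

# Usage (needs the eurostat package and network access, so not run here):
# fetch_uncached("nama_10_gdp",
#                list(geo = "DE", unit = "CP_MEUR", na_item = "B1GQ"))
```

The trade-off is the one described above: every call goes to the Eurostat servers.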

lz1nwm · 2023-03-13T14:19:22Z

Just to clarify, my point was that I would expect the second query in my example to return an empty table and/or send the query to Eurostat. Basically, the cached table after the first query is only a small part of the dataset and obviously it could not be used for broader queries.

pitkant · 2023-03-13T14:43:05Z

Thank you for clarifying. The reason it works like that (whether it is a good one or not, you decide) is that the query parameters are passed on to the request made to the Eurostat database. For some query parameters no filtering is done locally, whereas in other cases at least some processing (if not filtering) is done locally. An example of the latter is parsing Eurostat date strings and turning them into date objects.

Yes, it would be possible to add some additional local checks before printing the output, to see whether the geo column has the desired areas or whether the time frame is as desired; if not, print a message to the user or attempt to refresh the cached dataset. Or maybe the query could be saved with the cached dataset, and the cached data used only if the queries are identical.
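Both ideas can be sketched in base R. This is only an illustration under stated assumptions, not how the package does it: `make_cache_key` and `cache_covers` are hypothetical helper names, and the coverage check compares values as plain strings, so a date-typed `time` column would need format-aware handling.

```r
# Hypothetical helper: derive a cache key from the dataset id and the filters,
# so a cached table is only reused for an identical query.
make_cache_key <- function(id, filters = NULL) {
  if (length(filters) == 0) {
    return(id)
  }
  # Sort filter names and values so that argument order does not matter.
  filters <- filters[order(names(filters))]
  parts <- vapply(
    names(filters),
    function(nm) {
      paste0(nm, "=", paste(sort(as.character(filters[[nm]])), collapse = "+"))
    },
    character(1)
  )
  paste(id, paste(parts, collapse = "&"), sep = "?")
}

# Hypothetical helper: check that every requested value of each filtered
# dimension is present in the cached table (string comparison only).
cache_covers <- function(cached, filters) {
  cols <- intersect(names(filters), names(cached))
  all(vapply(
    cols,
    function(nm) all(as.character(filters[[nm]]) %in% as.character(cached[[nm]])),
    logical(1)
  ))
}

key_ea <- make_cache_key("nama_10_gdp", list(geo = "EA", unit = "CP_MEUR"))
key_de <- make_cache_key("nama_10_gdp", list(geo = "DE", unit = "CP_MEUR"))
identical(key_ea, key_de)  # FALSE: the two queries would not share a cache file

cached <- data.frame(geo = "EA", unit = "CP_MEUR")  # toy stand-in for a cached table
cache_covers(cached, list(geo = "DE"))  # FALSE: a refresh or a message is needed
```

With either check in place, the second query in the original example would no longer silently return the rows cached for `EA`.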

pitkant · 2023-05-09T13:01:33Z

As referenced in issue #258 it might make more sense to cache datasets that were downloaded without filtering than caching filtered datasets. Then, if the complete dataset was cached locally, it could also be filtered locally, solving both issues at a single stroke.
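A rough sketch of that approach, assuming the complete table is already available as a data frame: `filter_locally` is a hypothetical name, the toy table and its values are made up for illustration, and matching is done on string values only.

```r
# Hypothetical helper: apply the requested filters to a complete, locally
# cached table instead of sending a filtered query to Eurostat.
filter_locally <- function(data, filters) {
  keep <- rep(TRUE, nrow(data))
  # Only filter on dimensions that exist as columns in the cached table.
  for (nm in intersect(names(filters), names(data))) {
    keep <- keep & as.character(data[[nm]]) %in% as.character(filters[[nm]])
  }
  data[keep, , drop = FALSE]
}

# Toy stand-in for an unfiltered cached dataset (placeholder values).
full <- data.frame(
  geo = c("EA", "EA", "DE", "DE"),
  year = c(2020, 2021, 2020, 2021),
  values = 1:4
)

de <- filter_locally(full, list(geo = "DE", year = 2020:2021))
ea <- filter_locally(full, list(geo = "EA", year = 2020:2021))
```

Because both subsets come from the same complete cache, a later query for a different `geo` cannot be answered with stale rows from an earlier one.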

pitkant · 2023-12-20T08:36:20Z

Closed with the CRAN release of package version 4.0.0

pitkant added the enhancement label Apr 18, 2023

pitkant added the help wanted label May 9, 2023

pitkant mentioned this issue May 9, 2023

get_eurostat() does not save .rds files #258

Closed

This was referenced Aug 11, 2023

Improved cache handling #267

Merged

cache filtered tables, suggestion of functionality #144

Closed

pitkant linked a pull request Aug 11, 2023 that will close this issue

Improved cache handling #267

Merged

pitkant mentioned this issue Nov 3, 2023

4.0.0 rc1 #281

Merged

pitkant closed this as completed Dec 20, 2023
