Cached datasets #257

Closed
lz1nwm opened this issue Mar 9, 2023 · 8 comments · Fixed by #267

@lz1nwm

lz1nwm commented Mar 9, 2023

By default, eurostat caches a dataset the first time it is requested during a session, but it does not check whether the cached table contains all the data needed to serve subsequent requests to the same Eurostat table. I'm not sure if this is the intended behaviour. Please see the following example:

> get_eurostat('nama_10_gdp', filters = list(geo = c('EA'), 
+                                            unit = c('CP_MEUR'), 
+                                            na_item = c('B1GQ'),
+                                            time = c(2020:2022)), 
+              time_format = "date_last")
Table nama_10_gdp cached at ...
# A tibble: 3 x 6
  freq  unit    na_item geo   time          values
  <chr> <chr>   <chr>   <chr> <date>         <dbl>
1 A     CP_MEUR B1GQ    EA    2020-12-31 11456918.
2 A     CP_MEUR B1GQ    EA    2021-12-31 12318505.
3 A     CP_MEUR B1GQ    EA    2022-12-31 13338550.

> get_eurostat('nama_10_gdp', filters = list(geo = c('DE'), 
+                                            unit = c('CP_MEUR'), 
+                                            na_item = c('B1GQ'),
+                                            time = c(2020:2022)), 
+              time_format = "date_last")
Reading cache file ...
# A tibble: 3 x 6
  freq  unit    na_item geo   time          values
  <chr> <chr>   <chr>   <chr> <date>         <dbl>
1 A     CP_MEUR B1GQ    EA    2020-12-31 11456918.
2 A     CP_MEUR B1GQ    EA    2021-12-31 12318505.
3 A     CP_MEUR B1GQ    EA    2022-12-31 13338550.

@antagomir
Member

Ah, right. This is probably not intended and should be fixed as soon as time allows.

Could you consider making a PR?

@lz1nwm
Author

lz1nwm commented Mar 10, 2023

> Could you consider making a PR?

Unfortunately, I have no experience with PRs, but I'll see if I can do something.

@pitkant
Member

pitkant commented Mar 13, 2023

I can see the inconvenience, but I think it's debatable whether this is unintended behaviour or not. The point of caching is to make as few requests to Eurostat servers as possible, and a fix that constantly compared the cached file with the unfiltered remote file would create unnecessary web traffic between end users and Eurostat.

Caching can be easily disabled, although it is currently enabled by default. Maybe this is more of an issue related to documentation? Would adding some explicit messages when downloading and caching data make users more aware of this limitation?
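
For readers who hit this, a minimal sketch of turning caching off for a single request. It assumes the `cache` argument of `get_eurostat()` as described in the package documentation; `fetch_uncached` and its `fetch` parameter are hypothetical names introduced here only so the wrapper can be tried without network access.

```r
# Hypothetical helper: always bypass the local cache for this request.
# `fetch` defaults to eurostat::get_eurostat and is only looked up when
# no replacement function is supplied.
fetch_uncached <- function(id, filters, ..., fetch = eurostat::get_eurostat) {
  fetch(id, filters = filters, cache = FALSE, ...)
}

# Usage (needs the eurostat package and network access):
# fetch_uncached("nama_10_gdp",
#                filters = list(geo = "DE", unit = "CP_MEUR",
#                               na_item = "B1GQ", time = 2020:2022),
#                time_format = "date_last")
```

With caching off, every call goes to Eurostat, which is exactly the extra traffic the cache is meant to avoid, so this is a workaround rather than a fix.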

@lz1nwm
Author

lz1nwm commented Mar 13, 2023

Just to clarify, my point was that I would expect the second query in my example to return an empty table and/or send the query to Eurostat. Basically, the cached table after the first query is only a small part of the dataset and obviously it could not be used for broader queries.

@pitkant
Member

pitkant commented Mar 13, 2023

Thank you for clarifying. The reason it works like that (whether it is a good reason or not, you decide) is that the query parameters are passed on to the request made to the Eurostat database. For some query parameters no filtering is done locally, whereas in other cases there is at least some processing done locally (if not filtering). An example of the latter is handling Eurostat date strings and turning them into date objects.

Yes, it would be possible to add some additional local checks before printing the output, to see whether the geo column has the desired areas or whether the time frame is as desired; if not, print a message to the user or attempt to refresh the cached dataset. Or maybe the query could be saved with the cached dataset, and the cached data used only if the queries are identical.
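
Both ideas above can be sketched in base R. This is only an illustration, not how the package implements anything: `cache_covers_filters` and `query_key` are hypothetical names, and the time check assumes annual data where a `Date` column can be compared by year.

```r
# Hypothetical check: does the cached table contain every requested value?
cache_covers_filters <- function(cached, filters) {
  all(vapply(names(filters), function(col) {
    if (!col %in% names(cached)) return(FALSE)
    values <- cached[[col]]
    # Assumption: annual data, so Date columns are compared by year.
    have <- if (inherits(values, "Date")) format(values, "%Y") else as.character(values)
    all(as.character(filters[[col]]) %in% have)
  }, logical(1)))
}

# Hypothetical cache key: identical queries give identical keys,
# regardless of the order in which filters or values were written.
query_key <- function(id, filters) {
  filters <- filters[order(names(filters))]
  parts <- vapply(filters, function(v) paste(sort(as.character(v)), collapse = "+"),
                  character(1))
  paste(id, paste(names(filters), parts, sep = "=", collapse = "&"), sep = "?")
}

# The cached table from the first query in this issue (geo = "EA" only).
cached <- data.frame(
  geo = "EA", unit = "CP_MEUR", na_item = "B1GQ",
  time = as.Date(c("2020-12-31", "2021-12-31", "2022-12-31")),
  values = c(11456918, 12318505, 13338550)
)

cache_covers_filters(cached, list(geo = "EA", time = 2020:2022))  # cache is usable
cache_covers_filters(cached, list(geo = "DE", time = 2020:2022))  # cache must be refreshed
```

The key-based variant is stricter: it would refuse to reuse the cache for any query that differs from the one that created it, even a narrower one.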

@pitkant
Member

pitkant commented May 9, 2023

As referenced in issue #258, it might make more sense to cache datasets that were downloaded without filtering rather than caching filtered datasets. Then, if the complete dataset were cached locally, it could also be filtered locally, solving both issues at a single stroke.
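
A rough sketch of what filtering a complete cached table locally could look like, in base R. `filter_locally` is a hypothetical name, the toy table stands in for an unfiltered download (its values are made up), and `Date` columns are matched by year on the assumption of annual data.

```r
# Hypothetical local filter applied to a complete, unfiltered cached table.
filter_locally <- function(full, filters) {
  keep <- rep(TRUE, nrow(full))
  for (col in names(filters)) {
    stopifnot(col %in% names(full))
    values <- full[[col]]
    # Assumption: annual data, so Date columns are matched by year.
    have <- if (inherits(values, "Date")) format(values, "%Y") else as.character(values)
    keep <- keep & have %in% as.character(filters[[col]])
  }
  full[keep, , drop = FALSE]
}

# Toy stand-in for a fully cached dataset (illustrative values only).
full <- data.frame(
  geo = c("EA", "EA", "DE", "DE"),
  time = as.Date(c("2020-12-31", "2021-12-31", "2020-12-31", "2021-12-31")),
  values = c(1, 2, 3, 4)
)

filter_locally(full, list(geo = "DE", time = 2021))
```

Because the cache always holds the whole table, a second query with different filters reads the same file and simply selects different rows.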

@pitkant
Member

pitkant commented Dec 20, 2023

Closed with the CRAN release of package version 4.0.0.

pitkant closed this as completed Dec 20, 2023