Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_eurostat_json fails with time / time_period parameter on quarterly data #260

Closed
pitkant opened this issue May 24, 2023 · 5 comments · Fixed by #271
Closed

get_eurostat_json fails with time / time_period parameter on quarterly data #260

pitkant opened this issue May 24, 2023 · 5 comments · Fixed by #271

Comments

@pitkant
Copy link
Member

pitkant commented May 24, 2023

The following code produces an error:

tbl_hp <- get_eurostat("TIPSHO40", filter = list(time = c(2005:2022)))
Error in `[.data.frame`(variables, rev(names(variables))) : 
  undefined columns selected

whereas this works normally:

tbl_hp <- get_eurostat("TIPSHO40", filter = list(sinceTimePeriod = 2005))
Table TIPSHO40 cached at /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//RtmpjICMVw/eurostat/TIPSHO40_date_code_FF.rds

With other datasets using time parameters (e.g. filter = list(time(2005:2023)) ) works normally so this issue might be related to the fact that this dataset is quarterly. We will have to explore this.

> head(tbl_hp)
# A tibble: 6 × 5
  freq  unit  geo   time       values
  <chr> <chr> <chr> <date>      <dbl>
1 Q     INX_Q BE    2005-01-01   69.7
2 Q     INX_Q BE    2005-04-01   71.2
3 Q     INX_Q BE    2005-07-01   73.8
4 Q     INX_Q BE    2005-10-01   75.2
5 Q     INX_Q BE    2006-01-01   76.7
6 Q     INX_Q BE    2006-04-01   78.8
> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Helsinki
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] eurostat_3.8.3

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3     utf8_1.2.3         generics_0.1.3     tidyr_1.3.0        class_7.3-21      
 [6] xml2_1.3.4         KernSmooth_2.23-20 stringi_1.7.12     hms_1.1.3          magrittr_2.0.3    
[11] countrycode_1.4.0  timechange_0.2.0   ISOweek_0.6-2      cellranger_1.1.0   rprojroot_2.0.3   
[16] plyr_1.8.8         jsonlite_1.8.4     e1071_1.7-13       backports_1.4.1    httr_1.4.6        
[21] purrr_1.0.1        fansi_1.0.4        regions_0.1.8      bibtex_0.5.1       cli_3.6.1         
[26] crayon_1.5.2       rlang_1.1.1        bit64_4.0.5        withr_2.5.0        parallel_4.3.0    
[31] tools_4.3.0        tzdb_0.4.0         dplyr_1.1.2        here_1.0.1         broom_1.0.4       
[36] curl_5.0.0         assertthat_0.2.1   vctrs_0.6.2        R6_2.5.1           proxy_0.4-27      
[41] lifecycle_1.0.3    lubridate_1.9.2    classInt_0.4-9     RefManageR_1.4.0   stringr_1.5.0     
[46] bit_4.0.5          vroom_1.6.3        pkgconfig_2.0.3    pillar_1.9.0       glue_1.6.2        
[51] Rcpp_1.0.10        tibble_3.2.1       tidyselect_1.2.0   rstudioapi_0.14    readr_2.1.4       
[56] compiler_4.3.0     readxl_1.4.2      
@pitkant pitkant added the bug label May 24, 2023
@pitkant
Copy link
Member Author

pitkant commented May 25, 2023

Some notes:

  • Eurostat website states that "Using more than one Time parameter in the same query is not accepted." and that "The only exception is using both sinceTimePeriod and untilTimePeriod. This will return the filtered file based on the specific values for the two Time parameters.". However, using both sinceTimePeriod and untilTimePeriod in querying actually produces the following error message in the browser:
"{ "error": [{"status": 400,"id": 140,"label": "ERR_TIME_DIMENSION_DUPLICATED: Time dimension is duplicated. Time dimension can not be defined in more than one clause and can not be defined as TimedimensionValue and TIME_PERIOD in the same request."}]}"

and the following error message in R:

eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(sinceTimePeriod = 2005, untilTimePeriod = 2007))
Request failed [400]. Retrying in 1 seconds...
Request failed [400]. Retrying in 1 seconds...
Error in get_eurostat_json(id, filters, type = type, stringsAsFactors = stringsAsFactors,  : 
  The requested url cannot be found within the get_eurostat_json function: https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/TIPSHO40?sinceTimePeriod=2005&untilTimePeriod=2007&format=JSON&lang=EN"

(Slightly unrelated note to this issue: It would be a good idea to take the error message label and print it to R users as well)

  • Contradictory to Eurostat documentation, we have been using time = c(2005:2022) style notation for a long time and it hasn't been a problem. It has also been handy in retrieving data from years that are not necessarily sequential from:to values but can also be select years, like c(2005, 2011, 2017) so I wouldn't stop using it in examples
  • It would seem that the query also fails when only one time parameter is given:
eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(time_period = 2005))
Error in `[.data.frame`(variables, rev(names(variables))) : 
  undefined columns selected
  • The only time filters that work are fromTimePeriod, untilTimePeriod and lastTimePeriod:
eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(sinceTimePeriod = 2005))
eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(untilTimePeriod = 2007))
eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(lastTimePeriod = 3))

This issue might actually be a problem on Eurostat's side.

@pitkant
Copy link
Member Author

pitkant commented May 25, 2023

I have received a reply from Eurostat User Support. Their workaround was to (in the specific case of time frame 2005-2007) retrieve each time dimension 1 by 1 with the following query:

https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/TIPSHO40?format=JSON&lang=en&freq=Q&unit=INX_Q&geo=BE&geo=BG&geo=CZ&geo=DK&geo=DE&geo=EE&geo=IE&geo=EL&geo=ES&geo=FR&geo=HR&geo=IT&geo=CY&geo=LV&geo=LT&geo=LU&geo=HU&geo=MT&geo=NL&geo=AT&geo=PL&geo=PT&geo=RO&geo=SI&geo=SK&geo=FI&geo=SE&time=2005-Q1&time=2005-Q2&time=2005-Q3&time=2005-Q4&time=2006-Q1&time=2006-Q2&time=2006-Q3&time=2006-Q4&time=2007-Q1&time=2007-Q2&time=2007-Q3&time=2007-Q4

It is unlikely we will implement this workaround to eurostat package as Eurostat developers are apparently working with the issue right now. We will continue monitoring this.

@pitkant
Copy link
Member Author

pitkant commented Jun 9, 2023

I tested the queries above and things seem to work now, somewhat.

The method as described in the documentation (using both sinceTimePeriod and untilTimePeriod in the same query) work:

get_eurostat("TIPSHO40", filter = list(sinceTimePeriod = 2005, untilTimePeriod = 2007))

Table TIPSHO40 cached at /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//RtmpNkdn9b/eurostat/TIPSHO40_date_code_FF.rds
# A tibble: 972 × 5
   freq  unit  geo   time       values
   <chr> <chr> <chr> <date>      <dbl>
 1 Q     INX_Q BE    2005-01-01   69.7
 2 Q     INX_Q BE    2005-04-01   71.2
 3 Q     INX_Q BE    2005-07-01   73.8
 4 Q     INX_Q BE    2005-10-01   75.2
 5 Q     INX_Q BE    2006-01-01   76.7
 6 Q     INX_Q BE    2006-04-01   78.8
 7 Q     INX_Q BE    2006-07-01   81.1
 8 Q     INX_Q BE    2006-10-01   81.7
 9 Q     INX_Q BE    2007-01-01   83.6
10 Q     INX_Q BE    2007-04-01   85.3
# ℹ 962 more rows

However, the following example still doesn't work:

> eurostat::clean_eurostat_cache()
Deleted .rds/.RData files from /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//RtmpNkdn9b/eurostat

> get_eurostat("TIPSHO40", filter = list(time = c(2005:2008)))
Error in `[.data.frame`(variables, rev(names(variables))) : 
  undefined columns selected

@pitkant
Copy link
Member Author

pitkant commented Aug 25, 2023

Eurostat user support answered me that the allowed filter = list(sinceTimePeriod = 2005, untilTimePeriod = 2007) works now and, as already seen before, I can confirm it.

The explicitly not-accepted filtering filters = list(time = c(2005:2008) does not work in case of TIPSHO40 dataset but it works with other datasets. Maybe this is related to TIPSHO40 being quarterly data? I asked Eurostat to clarify their documentation on this.

What should work also on quarterly data, according to Eurostat documentation, is filters = list(time = 2005). It does not work and I've asked Eurostat for clarification. As a workaround data for a singular year can be filtered with filters = list(sinceTimePeriod = 2005, untilTimePeriod = 2005) or maybe with eurostat own suggestion filters = list(time = 2005-Q1, time = 2005-Q2, time = 2005-Q3, time = 2005-Q4) (which funnily enough was supposed to be not-accepted).

If quarterly data works differently from other types of data, I will clarify this in eurostat function documentation

@pitkant pitkant linked a pull request Sep 4, 2023 that will close this issue
@pitkant pitkant mentioned this issue Nov 3, 2023
@pitkant
Copy link
Member Author

pitkant commented Dec 20, 2023

Closed with the CRAN release of package version 4.0.0

@pitkant pitkant closed this as completed Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant