All NAs output fable::season() #220

Closed
AEBilgrau opened this issue Jan 23, 2020 · 9 comments

@AEBilgrau

AEBilgrau commented Jan 23, 2020

I have an issue where season(), used in a model context, creates an all-NA factor. This is apparently because my index is not exactly integer-valued. I imagine this is a general problem for anyone who has sub-second data, or data that has been through an imperfect aggregation (in this case to hourly data).

The issue seems to be present both on CRAN and the latest commit from GitHub.

Please see the reproducible example below, where I also propose some solutions. If you wish, I can try to make a pull request with a solution.

# remotes::install_github('tidyverts/fable') 
# install.packages('fable')
library("fable")

# A data subset
x <- structure(list(data = c(0.654340987099764, 0.306863543295109, 0.472474420817171, -1.09341948531794, 
    1.20966833172894, 0.265420322089116, -1.91999831324977, -0.276682839817029, 0.159697643465573, 0.611967188546101), 
    timestamp = structure(c(1560254400.0001, 1560258000.0001, 1560261600.0001, 1560265200.0001, 1560268800.0001, 
        1560272400.0001, 1560276000.0001, 1560279600.0001, 1560283200.0001, 1560286800.0001), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC")), row.names = c(NA, 10L), envir = "prod", key = structure(list(.rows = list(1:10)), 
    row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame")), index = structure("timestamp", 
    ordered = TRUE), index2 = "timestamp", interval = structure(list(year = 0, quarter = 0, month = 0, 
    week = 0, day = 0, hour = 1, minute = 0, second = 0, millisecond = 0, microsecond = 0, nanosecond = 0, 
    unit = 0), class = "interval"), class = c("tbl_ts", "tbl_df", "tbl", "data.frame"))

print(x)
## # A tsibble: 10 x 2 [1h] <UTC>
##      data timestamp          
##  *  <dbl> <dttm>             
##  1  0.654 2019-06-11 12:00:00
##  2  0.307 2019-06-11 13:00:00
##  3  0.472 2019-06-11 14:00:00
##  4 -1.09  2019-06-11 15:00:00
##  5  1.21  2019-06-11 16:00:00
##  6  0.265 2019-06-11 17:00:00
##  7 -1.92  2019-06-11 18:00:00
##  8 -0.277 2019-06-11 19:00:00
##  9  0.160 2019-06-11 20:00:00
## 10  0.612 2019-06-11 21:00:00

The high-level issue and error message:

x %>% model(TSLM(data ~ season()))
## A mable: 1 x 1
#  `TSLM(data ~ season())`
#  <model>                
#1 <NULL model>           
#Warning message:
#1 error encountered for TSLM(data ~ season())
#[1] 0 (non-NA) cases

The error message when calling model(TSLM(data ~ trend() + season())) is much harder to understand.

Anyway, calling fable:::season(x) in the model context (I cannot readily see why it does not work when called directly) ultimately results in:

#fable:::season(x) ->
fable:::season.tbl_ts(x, NULL)
## # A tibble: 10 x 1
##    day  
##    <fct>
##  1 <NA> 
##  2 <NA> 
##  3 <NA> 
##  4 <NA> 
##  5 <NA> 
##  6 <NA> 
##  7 <NA> 
##  8 <NA> 
##  9 <NA> 
## 10 <NA> 

As far as I can tell, that ultimately calls fable:::season.numeric and creates what is equivalent to:

idx_num <- c(433404.000000028, 433405.000000028, 433406.000000028, 433407.000000028, 433408.000000028, 
    433409.000000028, 433410.000000028, 433411.000000028, 433412.000000028, 433413.000000028)
factor((idx_num%%24) + 1, levels = seq_len(24))
##  [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

It's easy to see here why it fails, but it was not so obvious when printing idx_num, because the trailing digits are not printed by default.
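
To illustrate the point about hidden digits, here is a minimal base R sketch using one value from idx_num. The variable names `pos` and `f` are only for illustration and are not taken from fable's source.

```r
# One value from the issue's idx_num vector (hours since the epoch).
idx_num <- 433404.000000028
pos <- (idx_num %% 24) + 1

print(pos)                # default printing shows 7 significant digits
format(pos, digits = 15)  # more digits reveal the fractional offset
as.character(pos)         # the string that factor() matches against the levels

# factor() matches as.character(x) against as.character(levels),
# so a value that is not exactly 13 has no matching level and becomes NA.
f <- factor(pos, levels = seq_len(24))
is.na(f)
```

Because the match is done on character representations, even a tiny offset is enough to miss every level.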

Some possible solutions. Make the code evaluate to:

# Solution 1 (unsafe?)
factor(as.integer((idx_num%%24) + 1), levels = seq_len(24))
##  [1] 13 14 15 16 17 18 19 20 21 22
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24


# Solution 2  - something _equivalent_ to
l <- (idx_num%%24) + 1
if (isTRUE(all.equal(ll <- as.integer(l), l))) {
    result <- factor(ll, levels = seq_len(24))
} else {
    result <- factor(l, levels = seq_len(24))
}

Or, get users to fix it themselves. In this case that is simple, but I wonder what would also work for subsecond data.

# Solution 3 - 
y <- x
y$timestamp <- as.POSIXct(as.integer(y$timestamp), origin = "1970-01-01 00:00", tz = "UTC")
fable:::season.tbl_ts(y, NULL)
## # A tibble: 10 x 1
##    day  
##    <fct>
##  1 13   
##  2 14   
##  3 15   
##  4 16   
##  5 17   
##  6 18   
##  7 19   
##  8 20   
##  9 21   
## 10 22   
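
As a variant of Solution 3, the sketch below snaps timestamps to the nearest multiple of the interval instead of truncating to whole seconds. The helper `snap_to_interval()` is hypothetical (it is not part of fable or tsibble), and it assumes the offsets are small relative to the interval.

```r
# Hypothetical helper: round each timestamp to the nearest multiple of
# `seconds` since the epoch and rebuild a POSIXct vector.
snap_to_interval <- function(x, seconds = 3600, tz = "UTC") {
  snapped <- round(as.numeric(x) / seconds) * seconds
  as.POSIXct(snapped, origin = "1970-01-01", tz = tz)
}

# Two of the timestamps from the example data, each 0.0001 s past the hour.
stamps <- as.POSIXct(c(1560254400.0001, 1560258000.0001),
                     origin = "1970-01-01", tz = "UTC")
snapped <- snap_to_interval(stamps)

# Remainder after dividing by one hour; zero means exactly on the hour.
as.numeric(snapped) %% 3600
```

This would not be appropriate for a series that is deliberately offset, for example one observed at half past each hour, since rounding would move or merge observations.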

Some session info for completeness:

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] fable_0.1.1.9000 fabletools_0.1.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3       formatR_1.7      pillar_1.4.3     compiler_3.6.2   remotes_2.1.0   
##  [6] tools_3.6.2      zeallot_0.1.0    packrat_0.5.0-24 lubridate_1.7.4  tsibble_0.8.5   
## [11] lifecycle_0.1.0  tibble_2.1.3     gtable_0.3.0     anytime_0.3.7    pkgconfig_2.0.3 
## [16] rlang_0.4.2      cli_2.0.1        rstudioapi_0.10  curl_4.2         dplyr_0.8.3     
## [21] stringr_1.4.0    generics_0.0.2   vctrs_0.2.1      grid_3.6.2       tidyselect_0.2.5
## [26] glue_1.3.1       R6_2.4.1         fansi_0.4.1      purrr_0.3.3      ggplot2_3.2.1   
## [31] tidyr_1.0.0      magrittr_1.5     backports_1.1.5  scales_1.1.0     assertthat_0.2.1
## [36] colorspace_1.4-1 utf8_1.1.4       stringi_1.4.5    lazyeval_0.2.2   munsell_0.5.0   
## [41] crayon_1.3.4    
@mitchelloharawild mitchelloharawild self-assigned this Jan 23, 2020
@mitchelloharawild mitchelloharawild added the bug Something isn't working label Jan 23, 2020
@mitchelloharawild
Member

The bug is that the code assumes an integer base (via seq_len()), but in reality an hourly interval doesn't need to start on the hour.
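
A small base R sketch of that situation, assuming the seasonal position is computed as hours since the epoch modulo 24 (as in the idx_num example above). The names `stamps`, `pos`, `bad` and `good` are illustrative only.

```r
# Four hourly observations that start at half past the hour.
stamps <- as.POSIXct("2019-06-11 12:30:00", tz = "UTC") + 3600 * 0:3

# Hours since the epoch, then position within a 24-hour season.
hours_since_epoch <- as.numeric(stamps) / 3600
pos <- (hours_since_epoch %% 24) + 1

# Integer levels from seq_len() never match 13.5, 14.5, ...
bad <- factor(pos, levels = seq_len(24))

# Flooring first maps each observation to the hour it falls in.
good <- factor(floor(pos), levels = seq_len(24))

bad
good
```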

@mitchelloharawild
Member

Thanks for the detailed bug report. A minimal reproducible example (MRE) is below:

library(tsibble)
library(lubridate)
library(dplyr)
library(fable)
pedestrian %>% 
  mutate(Date_Time = Date_Time + minutes(30)) %>% 
  model(TSLM(Count ~ season()))
#> Warning: 4 errors (1 unique) encountered for TSLM(Count ~ season())
#> [4] 0 (non-NA) cases
#> # A mable: 4 x 2
#> # Key:     Sensor [4]
#>   Sensor                        `TSLM(Count ~ season())`
#>   <chr>                         <model>                 
#> 1 Birrarung Marr                <NULL model>            
#> 2 Bourke Street Mall (North)    <NULL model>            
#> 3 QV Market-Elizabeth St (West) <NULL model>            
#> 4 Southern Cross Station        <NULL model>

Created on 2020-01-23 by the reprex package (v0.3.0)

@mitchelloharawild
Member

mitchelloharawild commented Jan 23, 2020

I think your first solution is reasonable:

factor(as.integer((idx_num%%24) + 1), levels = seq_len(24))

I've changed this to use floor() instead, but it achieves the same purpose.

Any reason why you think this is unsafe? It seems fine to me.
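
For reference, a short sketch of how as.integer(), floor() and round() treat values just above and just below a whole number. The values `above` and `below` are made up for illustration; whether an index can actually end up slightly below the intended value depends on how it was produced.

```r
above <- 13 + 2.8e-8  # offset like the one in this issue
below <- 13 - 2.8e-8  # hypothetical offset in the other direction

# For non-negative input, as.integer() truncates and agrees with floor().
as.integer(above)
floor(above)

# Just below a whole number, both drop to the previous integer,
# while round() goes to the nearest one.
as.integer(below)
floor(below)
round(below)
```

So as.integer() and floor() serve the same purpose here, and both assign a value slightly below 13 to season 12.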

@AEBilgrau
Author

AEBilgrau commented Jan 23, 2020

Thanks for such a quick fix and for the package as well!

My thinking was mostly about sub-second data, where things seem to go wrong. The call to tsibble::units_since in fable:::season.tbl_ts will cause downstream issues in such cases. So the fix itself is probably OK (for our examples) but will not work for data like the following (I guess):

x <- Sys.time() + lubridate::milliseconds(1:6)
x
# [1] "2020-01-23 10:09:53.922 UTC" "2020-01-23 10:09:53.923 UTC" "2020-01-23 10:09:53.924 UTC"
# [4] "2020-01-23 10:09:53.925 UTC" "2020-01-23 10:09:53.926 UTC" "2020-01-23 10:09:53.927 UTC"

format(tsibble::units_since(x), digits = 13)
# [1] "1579774193.922" "1579774193.923" "1579774193.924" "1579774193.925" "1579774193.926"
# [6] "1579774193.927" 

But I do not know if a general solution is feasible. I work a lot with subsecond data, so I bang my head against similar walls often.

@mitchelloharawild
Member

Haha, good edit 😄
I'll make a reprex with simulated millisecond data, but I think it should work fine.

@mitchelloharawild
Member

Looks like there is an unrelated issue with tsibble's interval calculation for subsecond data.
This leads to problems generating the correct seasonal period in fable.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(tsibble)
#> 
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:lubridate':
#> 
#>     interval, new_interval
tsibble(
  time = Sys.time() + seconds((1:1000)*.5),
  y = rnorm(1000),
  index = time
)
#> # A tsibble: 1,000 x 2 [499ms] <?>
#>    time                      y
#>    <dttm>                <dbl>
#>  1 2020-01-23 21:27:59  0.443 
#>  2 2020-01-23 21:27:59  0.438 
#>  3 2020-01-23 21:28:00 -0.610 
#>  4 2020-01-23 21:28:00 -1.83  
#>  5 2020-01-23 21:28:01 -0.137 
#>  6 2020-01-23 21:28:01  0.0330
#>  7 2020-01-23 21:28:02  1.75  
#>  8 2020-01-23 21:28:02  0.262 
#>  9 2020-01-23 21:28:03 -1.61  
#> 10 2020-01-23 21:28:03 -0.538 
#> # … with 990 more rows

Created on 2020-01-23 by the reprex package (v0.3.0)

@AEBilgrau
Author

Yeah, problems are everywhere with sub-second and irregular data. Hence the edit! :)

@mitchelloharawild
Member

Fortunately, it's looking like a simple fix in tsibble.
I like to think it's a good thing that sub-second data is being used with tsibble and fable. I can't imagine the trouble it would cause in the forecast package.

@mitchelloharawild
Member

With the PR to tsibble fixing sub-second intervals, the fix in edece72 works fine for sub-second data.

Unfortunately fable/fabletools isn't updated for the dev version of tsibble yet, so installing dev tsibble won't fix the issue yet.

earowang pushed a commit to tidyverts/tsibble that referenced this issue Feb 19, 2020