forecast.TSLM does not use t distribution for prediction intervals #343

Closed

mbg-unsw opened this issue Aug 17, 2021 · 6 comments
Assignees: mitchelloharawild
Labels: enhancement (New feature or request)

Comments

@mbg-unsw

When I compared the output of predict.lm with forecast.TSLM, the prediction intervals from predict.lm were wider. Looking at the code, it appears predict.lm uses the t distribution (as I'd expect) while forecast.TSLM uses the normal distribution.

Perhaps this is a convention in the time series world, but if so, could it be documented?

Happy to provide a MWE if that helps.
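
For reference, the kind of comparison I have in mind looks roughly like this (an illustrative sketch, not a polished MWE):

library(fpp3)

# Base R: predict.lm() builds its prediction intervals from the t distribution
fit_lm <- lm(log(USAccDeaths) ~ time(USAccDeaths) + factor(cycle(USAccDeaths)))
exp(head(predict(fit_lm, interval = "prediction", level = 0.95)))

# fable: the equivalent TSLM, where forecast() intervals use a normal approximation
as_tsibble(USAccDeaths) %>%
  model(lm = TSLM(log(value) ~ trend() + season())) %>%
  forecast(h = "1 year") %>%
  hilo(level = 95)

# (in-sample predict.lm rows are shown for brevity; a full comparison would
#  predict the same future periods from both fits)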

@robjhyndman
Member

Because we assume normality when creating combination forecasts and in reconciliation, it would be problematic to use t distributions in fables by default. So it is probably best to add some comment in the documentation for forecast() to say we use a normal approximation.

@mbg-unsw
Author

Thanks Rob, that sounds good to me.

mitchelloharawild self-assigned this Aug 26, 2021
mitchelloharawild added the enhancement (New feature or request) label Aug 26, 2021
@mitchelloharawild
Member

I've updated the docs and added an `approx_normal` argument, so that `forecast(<TSLM>, approx_normal = FALSE)` returns Student's t forecasts.

library(fpp3)
#> ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
#> ✓ tibble      3.1.3          ✓ tsibble     1.0.1     
#> ✓ dplyr       1.0.7          ✓ tsibbledata 0.3.0     
#> ✓ tidyr       1.1.3          ✓ feasts      0.2.2     
#> ✓ lubridate   1.7.10         ✓ fable       0.3.1.9000
#> ✓ ggplot2     3.3.5.9000
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> x lubridate::date()    masks base::date()
#> x dplyr::filter()      masks stats::filter()
#> x tsibble::intersect() masks base::intersect()
#> x tsibble::interval()  masks lubridate::interval()
#> x dplyr::lag()         masks stats::lag()
#> x tsibble::setdiff()   masks base::setdiff()
#> x tsibble::union()     masks base::union()
as_tsibble(USAccDeaths) %>%
  model(lm = TSLM(log(value) ~ trend() + season())) %>%
  forecast(approx_normal = FALSE) %>% 
  hilo()
#> # A tsibble: 24 x 6 [1M]
#> # Key:       .model [1]
#>    .model    index                value .mean                   `80%`
#>    <chr>     <mth>               <dist> <dbl>                  <hilo>
#>  1 lm     1979 Jan t(t(59, 8.9, 0.055)) 7620. [7083.734,  8171.471]80
#>  2 lm     1979 Feb t(t(59, 8.8, 0.055)) 6899. [6413.748,  7398.606]80
#>  3 lm     1979 Mar t(t(59, 8.9, 0.055)) 7639. [7101.707,  8192.203]80
#>  4 lm     1979 Apr   t(t(59, 9, 0.055)) 7841. [7289.238,  8408.530]80
#>  5 lm     1979 May t(t(59, 9.1, 0.055)) 8645. [8036.810,  9270.895]80
#>  6 lm     1979 Jun t(t(59, 9.1, 0.055)) 9087. [8447.703,  9744.882]80
#>  7 lm     1979 Jul t(t(59, 9.2, 0.055)) 9909. [9211.088, 10625.489]80
#>  8 lm     1979 Aug t(t(59, 9.1, 0.055)) 9237. [8587.107,  9905.692]80
#>  9 lm     1979 Sep   t(t(59, 9, 0.055)) 8237. [7657.235,  8833.035]80
#> 10 lm     1979 Oct   t(t(59, 9, 0.055)) 8517. [7917.291,  9133.023]80
#> # … with 14 more rows, and 1 more variable: 95% <hilo>
as_tsibble(USAccDeaths) %>%
  model(lm = TSLM(log(value) ~ trend() + season())) %>%
  forecast(approx_normal = TRUE) %>% 
  hilo()
#> # A tsibble: 24 x 6 [1M]
#> # Key:       .model [1]
#>    .model    index            value .mean                   `80%`
#>    <chr>     <mth>           <dist> <dbl>                  <hilo>
#>  1 lm     1979 Jan t(N(8.9, 0.003)) 7620. [7089.402,  8164.937]80
#>  2 lm     1979 Feb t(N(8.8, 0.003)) 6899. [6418.880,  7392.690]80
#>  3 lm     1979 Mar t(N(8.9, 0.003)) 7639. [7107.390,  8185.653]80
#>  4 lm     1979 Apr   t(N(9, 0.003)) 7841. [7295.070,  8401.807]80
#>  5 lm     1979 May t(N(9.1, 0.003)) 8645. [8043.240,  9263.482]80
#>  6 lm     1979 Jun t(N(9.1, 0.003)) 9087. [8454.462,  9737.091]80
#>  7 lm     1979 Jul t(N(9.2, 0.003)) 9908. [9218.459, 10616.993]80
#>  8 lm     1979 Aug t(N(9.1, 0.003)) 9237. [8593.978,  9897.772]80
#>  9 lm     1979 Sep   t(N(9, 0.003)) 8237. [7663.362,  8825.973]80
#> 10 lm     1979 Oct   t(N(9, 0.003)) 8516. [7923.626,  9125.721]80
#> # … with 14 more rows, and 1 more variable: 95% <hilo>

Created on 2021-08-26 by the reprex package (v2.0.0)

@mitchelloharawild
Member

On a related issue, @robjhyndman: when we generate() future samples, shouldn't these come from a Student's t distribution as well?

fable/R/lm.R, line 347 in 9b46cd8:

new_data$.innov <- stats::rnorm(length(pred), sd = sqrt(vars))
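
For illustration, a hypothetical alternative (not something fable currently does) would be to draw location-scale t innovations there instead, using the model's residual degrees of freedom (the df variable below is assumed to be available):

# hypothetical: t innovations with scale sqrt(vars) and the lm's residual df
new_data$.innov <- sqrt(vars) * stats::rt(length(pred), df = df)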

@robjhyndman
Member

  • Using t for the t-distribution as well as for transformations is potentially confusing. Perhaps transformations should be f?
  • The model assumes Gaussian errors. The reason t comes up for the forecast distribution is that the parameters of the normal distribution are estimated, so if we use simulation, we should really draw from the distribution of the parameter estimates and then use normal variates. I think that's too complicated as it would require estimating the parameter covariance matrix which is not stored (or easily calculated) for most models. So I think just using Gaussian errors in simulations is fine, conditional on the parameter estimates. If someone really wants to include parameter uncertainty in the simulations, they can write their own code.
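
For anyone who does want to include parameter uncertainty, a minimal sketch outside of fable might look like the following. It refits the same USAccDeaths model with plain lm() and uses MASS::mvrnorm() to draw coefficients; everything here is illustrative rather than fable's implementation.

library(MASS)  # for mvrnorm()

fit <- lm(log(USAccDeaths) ~ time(USAccDeaths) + factor(cycle(USAccDeaths)))
X_new <- model.matrix(fit)[1:12, ]  # design rows to simulate for (illustrative)
sigma <- summary(fit)$sigma         # residual standard error

simulate_path <- function() {
  beta_draw <- mvrnorm(1, mu = coef(fit), Sigma = vcov(fit))   # draw coefficients
  drop(X_new %*% beta_draw) + rnorm(nrow(X_new), sd = sigma)   # then Gaussian errors
}
sims <- replicate(100, simulate_path())  # 12 x 100 matrix of simulated log-deaths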

@mitchelloharawild
Member

* Using `t` for the t-distribution as well as for transformations is potentially confusing. Perhaps transformations should be `f`?

I also noticed this. I was thinking of using `*`, but perhaps that isn't evident enough, e.g. `t(3,4,5)*`.

* The model assumes Gaussian errors. The reason t comes up for the forecast distribution is that the parameters of the normal distribution are estimated, so if we use simulation, we should really draw from the distribution of the parameter estimates and then use normal variates. I think that's too complicated as it would require estimating the parameter covariance matrix which is not stored (or easily calculated) for most models. So I think just using Gaussian errors in simulations is fine, conditional on the parameter estimates. If someone really wants to include parameter uncertainty in the simulations, they can write their own code.

👍
