Confusion with se_mean and standard deviation #8699

Open
Elisei-Kungurov opened this issue Feb 22, 2023 · 3 comments

Comments

@Elisei-Kungurov

I would like to check that I have understood the concepts of se_mean and standard deviation in statsmodels correctly. Could you help me with this?

In the documentation for statsmodels.tsa.base.prediction.PredictionResults.se_mean, se_mean is described as the standard deviation of the predicted mean. At the same time, the Release 0.8.0 notes contain a passage saying that get_forecast provides standard errors. As far as I know, the standard deviation and the standard error of the mean are not the same thing.

As a rookie in statistics, I read in the wiki that the standard deviation describes the variation in measurements, while the standard error of the mean is a probabilistic statement about how the sample size gives a better bound on estimates of the population mean, in light of the central limit theorem. However, the standard error can also be described as an estimate of that standard deviation. Does this mean that in statespace.sarimax we estimate possible future values of the standard deviation, and that the model outputs a standard deviation which depends on the number of time series points (the more time points, the better the prediction of the standard deviation)?

I built a SARIMAX model and want to construct a confidence interval for the forecast that reflects the variation in measurements. Is it possible to use mean_se for this, or do I need to convert these values to a standard deviation by multiplying the SE by sqrt(n)? And does n equal the number of data points in the time series before forecasting?

Thank you in advance for your reply!

@ChadFulton
Member

ChadFulton commented Feb 23, 2023

For a simple state space model (of which SARIMAX is a special case), we have:

$$y_t = Z \alpha_t + \varepsilon_t, \varepsilon_t \sim N(0, H)$$

$$\alpha_t = T \alpha_{t-1} + \zeta_t, \zeta_t \sim N(0, Q)$$

Here we will assume that the matrices $Z, H, T, Q$ are known. (Strictly speaking, the estimated parameters of the model enter through those matrices, but the standard errors and confidence intervals that Statsmodels reports for state space prediction results never account for parameter uncertainty, so we can ignore that for now.)

By default, get_prediction (or get_forecast) gives one-step-ahead predictions of $y_t$, so that:

  • PredictionResults.predicted_mean = $E[y_t | y_{t-1}, y_{t-2}, \dots]$
  • PredictionResults.se_mean = $StdDev[y_t | y_{t-1}, y_{t-2}, \dots]$
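
To spell out where that standard deviation comes from, here is a short derivation sketch. It assumes the usual Kalman filter notation, which is not used elsewhere in this thread: $a_{t|t-1}$ and $P_{t|t-1}$ denote the mean and covariance of $\alpha_t$ given $y_{t-1}, y_{t-2}, \dots$, and $P_{t-1|t-1}$ is the filtered state covariance.

$$E[y_t | y_{t-1}, y_{t-2}, \dots] = Z a_{t|t-1}$$

$$Var[y_t | y_{t-1}, y_{t-2}, \dots] = Z P_{t|t-1} Z' + H, \quad P_{t|t-1} = T P_{t-1|t-1} T' + Q$$

For a univariate $y_t$, se_mean is the square root of that variance. It is a forecast-error standard deviation built from $H$ and $Q$, not a quantity of the form $\sigma / \sqrt{n}$, so no rescaling by the sample size is involved.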

(Aside: @josef-pkt pointed out that this actually doesn't match the intended/typical Statsmodels usage of the _mean suffix, which I believe would be intended to capture e.g. $E[Z \alpha_t | y_{t-1}, y_{t-2}, \dots]$. But things are a little bit different in state space models, because (a) many models do not have an $\varepsilon_t$ term anyway, e.g. the SARIMAX model, and (b) you can always rewrite any state space model such that it doesn't have an $\varepsilon_t$ term, by placing that term into the state vector $\alpha$.)

I'm not sure if that answers your question or not, but please feel free to follow up.

@josef-pkt
Member

  • PredictionResults.predicted_mean = $E[y_t | y_{t-1}, y_{t-2}, \dots]$

"mean" here sounds fine, it's a conditional expectation of y

  • PredictionResults.se_mean = $StdDev[y_t | y_{t-1}, y_{t-2}, \dots]$

In OLS I used se_obs for something similar (it includes parameter uncertainty plus the residual standard deviation), corresponding to the prediction interval.

se_mean would be the uncertainty of the conditional expectations (coming from parameter uncertainty)
y_hat = $E[y_t | y_{t-1}, y_{t-2}, \dots]$
se_mean = std(y_hat | ...)

aside:
In the newer prediction results class I use only se, because get_prediction for discrete models can predict statistics other than the mean. Outside of tsa and linear models, we don't have prediction intervals and se_obs yet.

@ChadFulton
Member

Thanks @josef-pkt!
