tsa.statespace.sarimax.SARIMAXResults.get_prediction().conf_int() returns missing values #7305

Open
ahgraber opened this issue Feb 2, 2021 · 4 comments

Comments

@ahgraber

ahgraber commented Feb 2, 2021

I have a fit SARIMAX model with 3 binary exogenous variables - in pseudocode:

df = pd.DataFrame(
  data={
    'endog': [...],
    'exog_1': [...],
    'exog_n': [...]
  },
  index=dates
)

# ...

model = SARIMAX(
    endog=df[endog],
    exog=df[exog_list],
    order=params[:3],
    seasonal_order=params[3:],
    ...
).fit(method='powell', disp=0)

# ...

fcast = model.get_prediction(startdate, forecastend, exog=exog_df)
fcast_means = fcast.predicted_mean
fcast_intervals = fcast.conf_int(alpha=0.05)
lower_bounds = fcast_intervals['lower ' + endog]
upper_bounds = fcast_intervals['upper ' + endog]

In some edge cases, I get a complete point forecast (no missing values in fcast_means) but intermittent NaN values in fcast_intervals. The cause appears to be negative values in the variance of the predicted mean, fcast.var_pred_mean, which become np.nan when the square root is taken to compute fcast.se_mean.

Example output:

date        mean         mean_se     mean_ci_lower  mean_ci_upper
2020-01-01  1513.416667  288.675135  947.623800     2079.209534
2020-02-01  1291.583333  288.675135  725.790466     1857.376200
2020-03-01  1906.000000  0.000002    1905.999995    1906.000005
2020-04-01  1037.703432  NaN         NaN            NaN
2020-05-01  2474.406863  0.000002    2474.406859    2474.406868
2020-06-01  1816.406863  0.000002    1816.406859    1816.406868
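
The mechanism described above can be sketched with NumPy alone. This is a minimal illustration, not statsmodels' own code: it assumes a normal-approximation interval of the form mean ± z * sqrt(variance), and `normal_conf_int` is a hypothetical helper name. The means come from the example output; the tiny negative variance is a made-up value, since the actual variance at the NaN row was not reported.

```python
import numpy as np
from statistics import NormalDist


def normal_conf_int(mean, var, alpha=0.05):
    """Normal-approximation interval: mean +/- z * sqrt(var)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    with np.errstate(invalid="ignore"):  # silence the sqrt-of-negative warning
        se = np.sqrt(var)  # NaN wherever var < 0
    return se, mean - z * se, mean + z * se


means = [1513.416667, 1906.000000, 1037.703432]
# The last variance is hypothetical: a slightly negative value such as
# floating-point error might produce.
variances = [288.675135 ** 2, 4e-12, -4e-12]

se, lower, upper = normal_conf_int(means, variances)
for row in zip(means, se, lower, upper):
    print(row)
```

Under these assumptions, the point forecast stays finite while the standard error and both bounds become NaN for any step whose variance is even slightly negative, which is the pattern in the table.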
@ChadFulton
Member

Thanks for the report! This would be a good issue to track down, but it might be tricky if you can't post a fully replicable example.

A couple of questions:

  1. Is 2020-04-01 the first out-of-sample date?
  2. Can you post data that replicates this problem?

Thanks!

@ahgraber
Author

ahgraber commented Feb 3, 2021

Yes, 2020-04-01 is the first out-of-sample (OOS) date. I get another set of NaN values, caused by the same issue, at the last OOS date, 2021-03-01. I don't believe I can share the data, but I may be able to obfuscate it if I can find the time.

@Misha123457

Hello everyone,
I have also noticed this situation, where NaN values appear.

@d-a-bunin

I managed to create an example that reproduces the error on my machine:

import numpy as np

from statsmodels.tsa.statespace.sarimax import SARIMAX


def make_data() -> np.ndarray:
    """Return 100 noise-free points: the ramp 10, 20, ..., 100 repeated ten times."""
    base = np.array([10.0, 20, 30, 40, 50, 60, 70, 80, 90, 100])
    result = np.repeat(base[:, np.newaxis], 20, axis=1).T.ravel()
    result = result[:100]
    return result


def main():
    data = make_data()

    model = SARIMAX(
        endog=data,
        exog=None,
        order=(10, 0, 10),
        seasonal_order=(0, 1, 0, 52)
    )
    fitted_model = model.fit()

    result = fitted_model.get_forecast(5)

    print(result.conf_int(0.5))


if __name__ == "__main__":
    main()
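
To check whether a forecast hits the cause suggested in the original report, one could look for negative entries in the forecast variance before calling conf_int. The sketch below assumes the result object exposes a `var_pred_mean` attribute, as mentioned in the report. `negative_variance_steps` is a hypothetical helper name, and a stand-in object with made-up numbers is used in place of a fitted model so the snippet runs without statsmodels.

```python
from types import SimpleNamespace

import numpy as np


def negative_variance_steps(prediction):
    """Return the positions at which the forecast variance is negative."""
    var = np.asarray(prediction.var_pred_mean, dtype=float).ravel()
    return np.flatnonzero(var < 0)


# Stand-in for a prediction result; the variances are invented for illustration.
fake_prediction = SimpleNamespace(var_pred_mean=np.array([4e-12, -3e-13, 4e-12]))

bad_steps = negative_variance_steps(fake_prediction)
print(bad_steps)
```

With the reproduction script, the same helper could be applied to the object returned by `fitted_model.get_forecast(5)` to see which forecast steps, if any, carry a negative variance.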

I have a suspicion that this problem is closely related to #5459, because adding noise like this:

def make_data() -> np.ndarray:
    base = np.array([10.0, 20, 30, 40, 50, 60, 70, 80, 90, 100])
    result = np.repeat(base[:, np.newaxis], 20, axis=1).T.ravel()
    result = result[:100]

    rng = np.random.default_rng(0)
    noise = rng.normal(scale=1.0, size=len(result))
    result += noise

    return result

resulted in an LU decomposition error.
