ARIMA fit failing for small set of data due to invalid maxlag #1146

Closed
feathj opened this issue Oct 24, 2013 · 15 comments · Fixed by #1149
Comments

feathj commented Oct 24, 2013

I am not sure if this is expected behavior or not, but ARIMA models whose endog has length less than or equal to 5 will always fail due to the maxlag < nobs check.

If not provided, maxlag is calculated with the following formula:

int(round(12*(nobs/100.)**(1/4.)))
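
Evaluating that formula for nobs = 5 shows why the check trips (a quick illustration, not code from statsmodels):

nobs = 5
maxlag = int(round(12 * (nobs / 100.) ** (1 / 4.)))
print(maxlag)  # 6, which is not < nobs == 5, so lagmat raises ValueError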

Since the call to the parent "fit" method does not explicitly pass the maxlag argument, it is calculated using the formula above and results in this error:

error with ARIMA model (4, 1, 2):  maxlag should be < nobs
Traceback (most recent call last):
  File "/Users/jfeatherstone/dev/deos/post_processing/market_data/mrforecaster.py", line 126, in createModel
    results = model.fit(disp=-1, exog=range(1,len(items)), maxiter=100)
  File "/Users/jfeatherstone/dev/virtualenvs/deos/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py", line 1005, in fit
    full_output, disp, callback, **kwargs)
  File "/Users/jfeatherstone/dev/virtualenvs/deos/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py", line 828, in fit
    start_params = self._fit_start_params((k_ar,k_ma,k), method)
  File "/Users/jfeatherstone/dev/virtualenvs/deos/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py", line 453, in _fit_start_params
    start_params = self._fit_start_params_hr(order)
  File "/Users/jfeatherstone/dev/virtualenvs/deos/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py", line 399, in _fit_start_params_hr
    armod = AR(endog).fit(ic='bic', trend='nc')
  File "/Users/jfeatherstone/dev/virtualenvs/deos/lib/python2.7/site-packages/statsmodels/tsa/ar_model.py", line 548, in fit
    k_ar = self.select_order(k_ar, ic, trend, method)
  File "/Users/jfeatherstone/dev/virtualenvs/deos/lib/python2.7/site-packages/statsmodels/tsa/ar_model.py", line 425, in select_order
    X = self._stackX(maxlag, trend) # sets k_trend
  File "/Users/jfeatherstone/dev/virtualenvs/deos/lib/python2.7/site-packages/statsmodels/tsa/ar_model.py", line 393, in _stackX
    X = lagmat(endog, maxlag=k_ar, trim='both')
  File "/Users/jfeatherstone/dev/virtualenvs/deos/lib/python2.7/site-packages/statsmodels/tsa/tsatools.py", line 305, in lagmat
    raise ValueError("maxlag should be < nobs")
ValueError: maxlag should be < nobs

Note that an arima() call in R with 5 observations completes successfully.

@jseabold (Member)

Can you post some example data that returns successfully in R?


feathj commented Oct 24, 2013

Given:
(2008, -1214.360173), (2009, -1848.209905), (2010, -2100.918158), (2011, -3647.483678), (2012, -4711.186773)
Order (2, 0, 3)

Forecast annually out to 2022 in R:
-4803.626759 -6027.364361 -7466.85036 -7631.179054 -8431.198696 -10081.71807 -10528.78902 -10929.02852 -12563.78066 -13421.09138
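
For reference, a minimal sketch of how to hit the same failure on this data with the statsmodels API from the traceback (illustrative only; the exog and maxiter arguments from the original call are omitted):

import numpy as np
from statsmodels.tsa.arima_model import ARIMA

y = np.array([-1214.360173, -1848.209905, -2100.918158,
              -3647.483678, -4711.186773])
model = ARIMA(y, order=(2, 0, 3))
results = model.fit(disp=-1)  # raises ValueError: maxlag should be < nobs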


feathj commented Oct 24, 2013

Apologies for the formatting

@josef-pkt (Member)

The forecasts in R are not stationary. Did you include a trend? Or does R not impose stationarity for ARIMA?

@jseabold (Member)

Yeah, I'm wary of a 'successful' completion here. That's why I asked to see the data. My guess is that R just dumped something, anything, back. That's been my experience in the past. You'll almost never get the same answer across stats packages for a problem like this. There's just not enough information to estimate anything meaningful here.

y <- c(-1214.360173, -1848.209905, -2100.918158, -3647.483678, -4711.186773)
arima(y, c(2,0,3))

Call:
arima(x = y, order = c(2, 0, 3))

Coefficients:
         ar1      ar2      ma1     ma2     ma3  intercept
      1.7587  -0.9983  -1.0965  0.4163  0.0255  -3767.755
s.e.  0.7167   0.0117   1.5237  1.0616  0.3522   3872.650

sigma^2 estimated as 87412:  log likelihood = -40.25,  aic = 94.5
Warning messages:
1: In log(s2) : NaNs produced
2: In log(s2) : NaNs produced
3: In log(s2) : NaNs produced
4: In log(s2) : NaNs produced

@jseabold (Member)

R does impose stationarity, but in my experience it doesn't guarantee that what you get back makes any sense.

@josef-pkt (Member)

However, adjusting the maxlag is still a valid bug/enhancement.

We might be able to estimate an ARMA(1,1) or ARMA(0,1) "reasonably ok" in this case.

@jseabold (Member)

gretl and X-12-ARIMA refuse to estimate the model. Stata estimates it (after going through 90 iterations with a lot of switches in the optimization algorithm and backing up), but warns that there is not sufficient data and is unable to calculate any inferential statistics for the constant or the sigma. The parameters it gives are different from R's (and are more or less meaningless).

            v1:       ARMA:       ARMA:       ARMA:       ARMA:       ARMA:      sigma:
                         L.         L2.          L.         L2.         L3.            
         _cons          ar          ar          ma          ma          ma       _cons
y1  -3475.7847   1.6594282  -.99558649  -1.9151185   1.9151354  -1.0000227    179.6545

I somewhat recall doing this exercise before and deciding it's better to just refuse to do any estimation here rather than try to give some garbage back. Thoughts?

@jseabold (Member)

I can certainly fix the maxlags issue though. Maybe make the error a little more informative if and when we can't estimate a model.
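
One possible shape of such a fix, purely as a sketch (the actual change landed in PR #1149 and may differ):

def default_maxlag(nobs):
    # Illustrative only: clamp the formula-based default so the lag
    # selection stays valid for very short series.
    formula_lag = int(round(12 * (nobs / 100.) ** (1 / 4.)))
    return max(1, min(formula_lag, nobs - 1))  # lagmat requires maxlag < nobs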


feathj commented Oct 24, 2013

Your comments make sense. I am developing a system that was prototyped using the R arima package, and I am seeing how far off we are from the original forecasting logic. I am not sure if returning a series (albeit not very accurate) or giving a better error message makes more sense, but either would really help anyone with a similar problem going forward. Thanks for your prompt response @jseabold @josef-pkt.

@josef-pkt (Member)

@feathj Please report back with any agreement or disagreement you find.

Regarding your original data, just so we have an idea about use cases: are you using ARIMA or time series analysis on very short samples, or was this just a test case?

There are several places where the very-small-sample behavior for time series analysis hasn't been checked, because most cases in our background (economics) have at least a hundred or a few thousand observations. For example, PR #1018 just adds the minimum nobs check to the grangercausalitytest.

@josef-pkt (Member)

I am not sure if returning a series (albeit not very accurate) or giving a better error message makes more sense

If I had to bet my money on the outcome of the ARIMA forecast in this case, I would refuse the bet.


feathj commented Oct 24, 2013

@josef-pkt Unfortunately, some of our datasets are that small. We are working with annual data and some of our sources do not have anything before 2008. Thankfully, human interaction occurs after the fact to fix most of the obvious issues.


feathj commented Oct 24, 2013

If I had to bet my money on the outcome of the ARIMA forecast in this case, I would refuse the bet.

Agreed :)

jseabold added a commit to jseabold/statsmodels that referenced this issue Oct 25, 2013
@jseabold (Member)

See #1149 for a discussion of the behavior I decided on. I'll wait for any comments on that PR.

jseabold added a commit that referenced this issue Nov 23, 2013
This closes #1146 and #1046.

1. We now check to make sure that we have at least one degree of freedom to estimate the problem. If so, then we try the estimation (see the sketch after this list).
   1. Most, if not all, of these estimations will return garbage. We have an extra check that we can estimate stationary initial params. Usually we can't in these cases, so the usual error will be raised asking the user to set start_params. This should be enough of a warning to the user that something is "odd." In the small chance that the estimation goes through for a model with 5 observations and 1 degree of freedom, it is then on the user to determine that the results are no good.
2. We now avoid maxlag >= nobs in the call to AR, which also resolves the problem of #1046 that presented itself as part of #1146.
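
A minimal sketch of the degrees-of-freedom check described in item 1 (the function name and exact condition are illustrative, not necessarily the statsmodels code):

def check_estimable(nobs, n_params):
    # Illustrative: refuse to attempt a fit unless at least one degree of
    # freedom remains after estimating the ARMA parameters.
    if nobs - n_params < 1:
        raise ValueError("Insufficient degrees of freedom to estimate the model")

For the 5-point series above, an ARMA(2, 3) with a constant already has 6 mean-equation parameters, so a check along these lines rejects the fit before any optimization is attempted.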
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014