ENH: Statespace: Add diagnostics. #2431
Conversation
I should note that I haven't yet added unit tests - I thought it best to wait until I see if / how we want to proceed with this.
Looks fine overall, nice notebook. I didn't look at the details yet, except to check that matplotlib usage is optional. Two things: So far we don't have much of a convention for attaching plots to models/results. There are open issues, but OLS didn't get any plots or diagnostic tests because I didn't know which of those should be added and in which way. Kerby also started to add plots, and the tsa models have several plots. Do you know if or when the diagnostic tests are appropriate?
Yes, this is what I was curious about also. I'll just wait and see how things shape up on naming conventions, etc. My impression is that these are the basic appropriate / interesting plots for Statespace diagnostics (at least they're what are presented by the two main Statespace reference books: Harvey 1989, and Durbin and Koopman 2012).
I believe that these tests per se should be appropriate, in the sense that the standardized residuals should be approximately iid Normal(0, sigma2) if the given statespace model is well-specified. And these are the three tests recommended as diagnostics by both Harvey and Durbin and Koopman. As you say, the heteroskedasticity and Normality tests are pretty straightforward and shouldn't require any modifications in subclasses. The Ljung-Box test is appropriate, but right now the statsmodels … We could also eventually add CUSUM, CUSUM-of-squares, and non-linearity tests.
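For concreteness, here is a minimal sketch (not the PR's code) of the three diagnostics applied directly to a vector of standardized residuals, using functions that already exist in statsmodels and scipy; the wrappers discussed in this PR additionally handle burn-in truncation and multiple endog, and `resid` here is just stand-in data.

import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.stats.stattools import jarque_bera

# Stand-in for the standardized residuals of a well-specified model.
resid = np.random.standard_normal(200)

# Normality: Jarque-Bera on the standardized residuals.
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid)

# Serial correlation: Ljung-Box at a handful of lags (the return type differs
# across statsmodels versions, so the result is not unpacked here).
ljungbox = acorr_ljungbox(resid, lags=10)

# Heteroskedasticity: compare the sum of squared residuals in the last third
# of the sample against the first third, Goldfeld-Quandt style.
h = len(resid) // 3
het_stat = np.sum(resid[-h:] ** 2) / np.sum(resid[:h] ** 2)
het_pvalue = 2 * min(stats.f.sf(het_stat, h, h), stats.f.cdf(het_stat, h, h))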
I read another chapter by Luetkepohl about VAR. He mentions the Portmanteau test for autocorrelation (Ljung-Box, AFAIR) but also the Breusch-Godfrey LM test for residual autocorrelation, and besides that, normality tests, the ARCH LM test, and some tests for structural breaks, essentially the Chow test with mention of others. Stata seems to have a similar collection after VAR; I haven't checked any other TSA models in the Stata manual. The main problem is the same as what I had before. If we want to have more than just 3 diagnostic tests, then we need to find a way to avoid a proliferation of test_xxx methods attached to every result. The options that I thought about are based on grouping diagnostic tests, or adding a …
Yes, I think you're right. As I understand it, the distributional assumptions on the standardized residuals make all of these tests fair game.
I see what you mean. I'm not sure about other models, but the problem with simply applying the existing tests to the statespace class of models is that there are different sets of residuals that are sometimes appropriate ("innovations"/residuals, standardized residuals, and auxiliary residuals). Also, sometimes the initial periods are truncated, and the degrees of freedom might be different for different classes of models. I haven't thought about this too much, but it does seem like the user would benefit from some sort of statespace-specific test wrappers. Or maybe it's better just to have good documentation and/or additional methods for getting the appropriate residuals and degrees of freedom?
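As a reference point, this is roughly what a user currently has to do by hand to get the "appropriate" residuals; the attribute names match the snippet reviewed further down, but the model and data are stand-ins chosen only for illustration.

import numpy as np
import statsmodels.api as sm

# Fit a simple statespace model to stand-in data.
endog = np.random.standard_normal(200).cumsum()
res = sm.tsa.statespace.SARIMAX(endog, order=(1, 0, 0)).fit(disp=False)

# Standardized residuals, dropping the burn-in observations; the array has
# shape (k_endog, nobs), so rows index the (possibly multiple) endog series.
d = res.loglikelihood_burn
std_resid = res.filter_results.standardized_forecasts_error[:, d:]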
I like these ideas. I particularly like the last one, but I guess I'm not familiar enough with all the possible hypothesis tests to know if we could easily group them all.
My only other thought here is that the three tests that I've implemented appear to be "standard" as first-pass diagnostics, even though they are not the only ones that can be run. Similarly, there are other diagnostic plots that could be produced, but these four are seen often. Maybe we could rename them to make it explicit that they are not the only ones; e.g. something like …
P.S. the test failure here is related to an old Numpy version; I just need to figure out how to make it happy.
#2461 will fix the test failure here. Other than that, I think this PR is just waiting on a decision about whether or not / how we want to include these diagnostics. Edit: that is to say there is more work to be done on it, but I'm waiting to do it until a decision is made.
I'm almost done clearing my computer of other things and will be able to come back to this soon. |
Rebased so tests pass. |
Rebased again, and hopefully fixed. |
# Store some values
squared_resid = self.filter_results.standardized_forecasts_error**2
d = self.loglikelihood_burn
h = np.round((self.nobs - d) / 3)
josef-pkt
Jul 17, 2015
Member
I guess this needs an int in front or we get a deprecation warning with numpy indexing below.
ChadFulton
Jul 17, 2015
Author
Member
Thanks!
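A sketch of the suggested change, with stand-in values replacing the attributes from the snippet above (the exact indexing in the PR may differ); the only point here is the int() cast.

import numpy as np

# Stand-ins for self.nobs, self.loglikelihood_burn and the squared
# standardized residuals from the snippet above.
nobs, d = 200, 4
squared_resid = np.random.standard_normal(nobs) ** 2

# Casting to int keeps h usable as an index without triggering numpy's
# deprecation warning for float indices.
h = int(np.round((nobs - d) / 3))
statistic = np.sum(squared_resid[-h:]) / np.sum(squared_resid[d:d + h])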
statsmodels.graphics.tsaplots.plot_acf
"""
from statsmodels.graphics.utils import _import_mpl, create_mpl_fig
_ = _import_mpl()
josef-pkt
Jul 17, 2015
Member
why do you make an assignment?
ChadFulton
Jul 17, 2015
Author
Member
No reason, I think I just copy/pasted it in from somewhere. I'll remove it.
Looks fine overall. I think we can still fine-tune some details after merging, and add additional options. What's the test coverage? What's the behavior with respect to several or multivariate endog?
Bonus, here or in a follow-up PR: it would be nice to use this in one of the examples in the notebooks.
All of the tests can accommodate multiple endog; I guess I have to do it explicitly for Ljung-Box and the Goldfeld-Quandt-like test, but since … Speaking of multiple endog, here's a notebook showing how I currently put the results from the multiple endog in a results table. I just put the test and comma-separate the values for each endog. Does that seem reasonable? http://nbviewer.ipython.org/gist/ChadFulton/7e404ed4499763802632
No test coverage yet, but since it looks like we'll start with these diagnostics, I'll go ahead and write some.
In this branch, but possibly elsewhere (I do not know), sarimax.predict breaks when simple_differencing=True. You get a size-matching error like: ValueError: Invalid dimensions for time-varying state intercept vector. Requires shape (*,673), got (50, 648). This seems to be related to not adjusting properly for the points dropped by the prior differencing, which seems to affect other things too. For example, the new plot_diagnostics has a similar problem when using simple_differencing. It seems that it may be related to loglikelihood_burn not being set properly? I'm sorry I can't make more specific and well-informed statements at this time -- I'm new here. By the way, I do like and appreciate the diagnostic plots. I may want to modify a few things to my liking that I can't do just from the figure axes, such as the range of lags for the autocorrelation of residuals.
Thanks for the bug report! If you like, create an issue for the problem, otherwise I can do that. Then I'll take a look and see what we can do.
If you have specific comments and suggestions, then we can still include them. Feedback when you use or try out these things is very helpful. @ChadFulton briefly raised the issue of selecting lags for autocorrelation plots. Because I didn't see a use for options other than maxlag and maybe dropping the zero lag, more information would help. Which changes to the plots would you make?
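For concreteness, this is the kind of adjustment being discussed, calling plot_acf directly on the residuals; `lags` and (in more recent statsmodels versions) `zero` are the relevant arguments, and `std_resid` here is a stand-in array rather than output of this PR.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Stand-in for the standardized residuals of a fitted model.
std_resid = np.random.standard_normal(200)

fig, ax = plt.subplots()
# `lags` sets the range shown; `zero=False` drops the lag-0 bar.
plot_acf(std_resid, ax=ax, lags=24, zero=False)
plt.show()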
I have fixed a bug due to the merging (#2615) and I have rebased and fixed the merge path. I have also put the commit from #2616 into this PR at the end. My limited understanding is that if I rebase after #2616 is merged, it will eliminate that commit. So I can rebase again tomorrow after it is merged. But will it also eliminate the duplicate commit in a merge? I don't know.
Are you thinking a default
I'm also not sure about the best solution. I haven't looked into alternative diagnostic tests much. Durbin and Koopman (2012) suggest that you might use multivariate generalizations on the standardized residuals, but don't pursue it. They also suggest simply using the univariate approach on each element individually. Practically speaking, I've mostly only seen in use the three univariate methods that I have in place. Perhaps we should plan to keep these as only vectorized tests (and I could update the docs accordingly), and then we could later add …
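A sketch of the "vectorized" univariate approach described here: the same univariate test applied to each endog column separately (the data and shapes are purely illustrative stand-ins).

import numpy as np
from statsmodels.stats.stattools import jarque_bera

# Stand-in for standardized residuals from a model with three endog series.
std_resid = np.random.standard_normal((200, 3))

# One univariate Jarque-Bera result per endog column.
for i in range(std_resid.shape[1]):
    jb, pvalue, skew, kurtosis = jarque_bera(std_resid[:, i])
    print(f"endog {i}: JB={jb:.2f}, p-value={pvalue:.3f}")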
return output

def test_heteroskedasticity(self, method='', alternative='two-sided',
josef-pkt
Sep 9, 2015
Member
method has an empty string as default, which differs from the others.
alternative : string, 'increasing', 'decreasing' or 'two-sided'
    This specifies the alternative for the p-value calculation. Default
    is two-sided.
asymptotic : boolean, optional
josef-pkt
Sep 9, 2015
Member
I used use_f or use_t in other models to choose between t/f and normal/chisquare. I don't think f would be exact in this case given that it's estimated residuals; often it's just a better small-sample approximation than chisquare.
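To make the distributional choice concrete, here is a hedged sketch of two possible reference distributions for the variance-ratio statistic: an F(h, h) small-sample approximation versus a chi-square(h) approximation applied to h times the statistic. Whether either is exact for estimated residuals is exactly the caveat raised above, and the numbers below are arbitrary stand-ins.

from scipy import stats

# Illustrative group size and variance-ratio statistic for the
# break-in-variance test discussed above.
h, statistic = 50, 1.8

# Two-sided p-value under the F(h, h) reference distribution.
pvalue_f = 2 * min(stats.f.sf(statistic, h, h), stats.f.cdf(statistic, h, h))

# Two-sided p-value under a chi-square(h) reference applied to h * statistic.
pvalue_chi2 = 2 * min(stats.chi2.sf(h * statistic, h),
                      stats.chi2.cdf(h * statistic, h))

print(pvalue_f, pvalue_chi2)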
Parameters
----------
method : string {'sumsquares'}
josef-pkt
Sep 9, 2015
Member
I'm not a fan of sumsquares; IMO sum of squares is an implementation detail. The main feature is that it tests a break in the variance (like the Chow test).
statsmodels.stats.stattools.jarque_bera
"""
if method == 'jarquebera':
josef-pkt
Sep 9, 2015
Member
The docstring uses jb, not the full name.
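Loosely, the shape of the wrapper under review (a sketch, not the PR's exact code): dispatch on `method` and delegate to the existing statsmodels jarque_bera on the standardized residuals; the function name and return format here are illustrative.

import numpy as np
from statsmodels.stats.stattools import jarque_bera

def test_normality(std_resid, method='jarquebera'):
    """Sketch of a normality-test wrapper; the `method` dispatch mirrors the
    `if method == 'jarquebera'` branch in the snippet above."""
    if method == 'jarquebera':
        # jarque_bera returns (statistic, p-value, skewness, kurtosis).
        return np.array(jarque_bera(std_resid))
    raise NotImplementedError('Invalid normality test method.')

output = test_normality(np.random.standard_normal(200))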
return output

def test_serial_correlation(self, method, lags=None, boxpierce=False):
josef-pkt
Sep 9, 2015
Member
boxpierce could be an option for method and not a separate keyword.
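As background for this comment: statsmodels' acorr_ljungbox already computes the Box-Pierce variant alongside Ljung-Box when boxpierce=True, so the question is only whether the wrapper exposes that as a separate keyword (as in the signature above) or as another `method` value. A small sketch with stand-in data:

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

# Stand-in for the standardized residuals of a fitted model.
std_resid = np.random.standard_normal(200)

# With boxpierce=True both the Ljung-Box and Box-Pierce statistics (and their
# p-values) are computed; the exact return type differs across statsmodels
# versions, so it is not unpacked here.
results = acorr_ljungbox(std_resid, lags=10, boxpierce=True)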
heteroskedasticity, and serial correlation diagnostic tests.
I have rebased and fixed the issues mentioned above (thanks for pointing those out, by the way). I have also added the possibility of … As long as we are comfortable with this level of generality in the tests (so e.g. assuming that joint tests will have their own test method rather than being shared with these vectorized tests), then this PR is complete.
I'd like to get these PRs merged. I'm pretty sure we want to think about this some more before a release. I still guess that we will want joint tests in the multivariate case, or both with joint as the default, so we get a single test result in both univariate and multivariate cases, if that's feasible. I think we'll start to expand the test and plot methods in other models in a similar way, but this might be the only case where we have univariate and multivariate endog in the same model. Merging, thanks.
ENH: Statespace: Add diagnostics. test and plot methods
This is an example PR to get the ball rolling with respect to diagnostic tests in the Statespace results class. I don't know if it currently matches the Statsmodels approach to diagnostic tests / output, so I'm perfectly happy to move things around or remove things, or whatever is appropriate.
I add to the statespace.MLEResults class the following methods:
- test_normality, which is just a wrapper of the statsmodels implementation of the Jarque–Bera test
- test_heteroskedasticity, which is analogous to the Goldfeld-Quandt test
- test_serial_correlation, which is just a wrapper of the statsmodels implementation of the Ljung-Box test

As an illustration analogous to the OLS summary, I've added the results of these tests to the statespace summary printout.
I've also added a plot_diagnostics method to the MLEResults class which produces a figure with four subplots: residuals against time, histogram / density estimate, Normal Q-Q plot, and correlogram - again just wrapping the current statsmodels results. The only quirk associated with producing these is that the standardized residuals are the preferred object on which to test, and we may need to ignore the first few residuals associated with burned likelihoods.
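A rough usage sketch of the additions described above, assuming a fitted SARIMAX model on stand-in data; the specific method names ('jarquebera', 'breakvar', 'ljungbox') reflect how these options eventually look in statsmodels and were still under discussion in the review comments at the time of this PR.

import numpy as np
import statsmodels.api as sm

# Fit a simple statespace model to stand-in data.
endog = np.random.standard_normal(200).cumsum()
res = sm.tsa.statespace.SARIMAX(endog, order=(1, 0, 0)).fit(disp=False)

print(res.summary())             # the summary now includes the diagnostic tests
fig = res.plot_diagnostics()     # residuals, histogram/KDE, Q-Q plot, correlogram

norm = res.test_normality(method='jarquebera')
het = res.test_heteroskedasticity(method='breakvar')
serial = res.test_serial_correlation(method='ljungbox')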
These appear to be reasonably standard ways of assessing the output of generic statespace models, see e.g. Durbin and Koopman (2012), sections 2.12 and 7.5, as well as Harvey (1989).
To see what the output looks like right now, here is a link to an example notebook:
http://nbviewer.ipython.org/gist/ChadFulton/bcfd05a2b39705d33070/