ENH: Statespace: Add diagnostics. #2431

Merged
merged 16 commits into statsmodels:master from ChadFulton:tst-std-res on Sep 19, 2015

Conversation

4 participants
@ChadFulton
Member

ChadFulton commented Jun 2, 2015

This is an example PR to get the ball rolling with respect to diagnostic tests in the Statespace results class. I don't know if it currently matches the Statsmodels approach to diagnostic tests / output, so I'm perfectly happy to move things around or remove things, or whatever is appropriate.

I add to the statespace.MLEResults class the following methods:

  • test_normality, which is just a wrapper of the statsmodels implementation of the Jarque-Bera test
  • test_heteroskedasticity, which is analogous to the Goldfeld-Quandt test
  • test_serial_correlation, which is just a wrapper of the statsmodels implementation of the Ljung-Box test

As an illustration analogous to the OLS summary, I've added the results of these tests to the statespace summary printout.

I've also added a plot_diagnostics method to the MLEResults class which produces a figure with four subplots: residuals against time, histogram / density estimate, Normal Q-Q plot, and Correlogram - again just wrapping the current statsmodels results.

The only quirk in producing these is that the standardized residuals are the preferred object on which to run the tests, and we may need to ignore the first few residuals associated with burned likelihoods.
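To make the trimming concrete, here is a minimal numpy sketch; the function name and the standardization step are illustrative only (the PR itself works directly with `filter_results.standardized_forecasts_error`):

```python
import numpy as np

def trimmed_standardized_residuals(resid, loglikelihood_burn=0):
    """Drop the first `loglikelihood_burn` residuals before running
    diagnostics, then standardize the remainder.  Illustrative sketch,
    not the PR's implementation."""
    resid = np.asarray(resid, dtype=float)
    trimmed = resid[loglikelihood_burn:]
    return (trimmed - trimmed.mean()) / trimmed.std(ddof=0)

# ignore the first residual, standardize the rest
r = trimmed_standardized_residuals([5.0, 1.0, 2.0, 3.0, 4.0], loglikelihood_burn=1)
```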

These appear to be reasonably standard ways of assessing the output of generic statespace models; see e.g. Durbin and Koopman (2012), sections 2.12 and 7.5, as well as Harvey (1989).

To see what the output looks like right now, here is a link to an example notebook:

http://nbviewer.ipython.org/gist/ChadFulton/bcfd05a2b39705d33070/

@ChadFulton


Member

ChadFulton commented Jun 2, 2015

I should note that I haven't yet added unit tests - I thought it best to wait until I see if / how we want to proceed with this.

@coveralls


coveralls commented Jun 2, 2015


Coverage decreased (-0.03%) to 83.87% when pulling 7664e04 on ChadFulton:tst-std-res into 4b55fa4 on statsmodels:master.

@josef-pkt


Member

josef-pkt commented Jun 2, 2015

Looks fine overall, nice notebook. I didn't look at the details yet, except to check that matplotlib usage is optional.

2 things:

So far we have not much of a convention for attaching plots to models/results. There are open issues, but OLS didn't get any plots or diagnostic tests because I didn't know which ones of those should be added and in which way. Kerby also started to add plots, and the tsa models have several plots.
It looks fine to me, and hopefully before 0.8 comes out, we can agree on naming conventions and similar (which will be easy to change)

Do you know if or when the diagnostic tests are appropriate?
Mainly, does Ljung-Box have the standard distribution after estimating an ARIMA model, or an AR model? I don't have an overview for tsa, but today I printed out a VAR summary by Luetkepohl that has the diagnostic tests that are available after VAR.
Heteroscedasticity and normality tests don't have a problem with tsa models, I'm pretty sure.

@ChadFulton


Member

ChadFulton commented Jun 3, 2015

So far we have not much of a convention for attaching plots to models/results. There are open issues, but OLS didn't get any plots or diagnostic tests because I didn't know which ones of those should be added and in which way. Kerby also started to add plots, and the tsa models have several plots.
It looks fine to me, and hopefully before 0.8 comes out, we can agree on naming conventions and similar (which will be easy to change)

Yes, this is what I was curious about also. I'll just wait and see how things shape up on naming conventions, etc. My impression is that these are the basic appropriate / interesting plots for Statespace diagnostics (at least they're what are presented by the two main Statespace reference books: Harvey 1989, and Durbin and Koopman 2012).

Do you know if or when the diagnostic tests are appropriate?
Mainly, does Ljung-Box have the standard distribution after estimating an ARIMA model, or an AR model? I don't have an overview for tsa, but today I printed out a VAR summary by Luetkepohl that has the diagnostic tests that are available after VAR.
Heteroscedasticity and normality tests don't have a problem with tsa models, I'm pretty sure.

I believe that these tests per se should be appropriate, in the sense that the standardized residuals should be approximately iid Normal(0, sigma2) if the given statespace model is well-specified. And these are the three tests recommended as diagnostics by both Harvey and Durbin and Koopman.

As you say, the heteroskedasticity and normality tests are pretty straightforward and shouldn't require any modifications in subclasses. The Ljung-Box test is appropriate, but right now the statsmodels acorr_ljungbox doesn't allow specifying the degrees of freedom. In these cases, there should be a penalty based on the number of parameters, but the test statistic is asymptotically chi2 (at least according to Harvey).
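As a sketch of the degrees-of-freedom adjustment being discussed, here is a hand-rolled Ljung-Box statistic in numpy/scipy; the `model_df` argument is hypothetical (acorr_ljungbox did not expose it at the time), and the names are illustrative:

```python
import numpy as np
from scipy import stats

def ljung_box(x, lags, model_df=0):
    """Ljung-Box Q statistic, with the chi2 degrees of freedom reduced
    by the number of estimated parameters.  `model_df` is a hypothetical
    knob, following Harvey's asymptotic argument."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.sum(x**2)
    # sample autocorrelations at lags 1..lags
    acf = np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, lags + 1)])
    q = n * (n + 2) * np.sum(acf**2 / (n - np.arange(1, lags + 1)))
    pvalue = stats.chi2.sf(q, lags - model_df)
    return q, pvalue
```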

We could also eventually add CUSUM, CUSUM-of-squares, non-linearity tests.

@josef-pkt


Member

josef-pkt commented Jun 3, 2015

I read another chapter by Luetkepohl about VAR. He mentions the Portmanteau test for autocorrelation (Ljung-Box AFAIR) but also the Breusch-Godfrey LM test for residual autocorrelation. And besides that, normality tests and the ARCH LM test, and some tests for structural breaks, essentially the Chow test, with mention of others.

Stata seems to have a similar collection after VAR, I haven't checked any other TSA models in the Stata manual.

The main problem is the same as what I had before. If we want to have more than just 3 diagnostic tests, then we need to find a way to avoid a proliferation of test_xxx methods attached to every result.

The options that I thought about are based on grouping diagnostic tests: either add a get_diagnostics similar to get_influence_outliers (or whatever these things are called), or group by hypothesis category: test_normality, test_serial_correlation, ...

@ChadFulton


Member

ChadFulton commented Jun 4, 2015

I read another chapter by Luetkepohl about VAR. He mentions the Portmanteau test for autocorrelation (Ljung-Box AFAIR) but also the Breusch-Godfrey LM test for residual autocorrelation. And besides that, normality tests and the ARCH LM test, and some tests for structural breaks, essentially the Chow test, with mention of others.

Yes, I think you're right. As I understand it, the distributional assumptions on the standardized residuals make all of these tests fair game.

The main problem is the same as what I had before. If we want to have more than just 3 diagnostic tests, then we need to find a way to avoid a proliferation of test_xxx methods attached to every result.

I see what you mean. I'm not sure about other models, but the problem with simply applying the existing tests to the statespace class of models is that there are different sets of residuals that are sometimes appropriate ("innovations"/residuals, standardized residuals, and auxiliary residuals). Also, sometimes the initial periods are truncated, and the degrees of freedom might be different for different classes of models. I haven't thought about this too much, but it does seem like the user would benefit from some sort of statespace-specific test wrappers.

Or maybe it's better just to have good documentation and/or additional methods for getting the appropriate residuals and degrees of freedom?

The options that I thought about are based on grouping diagnostic tests: either add a get_diagnostics similar to get_influence_outliers (or whatever these things are called), or group by hypothesis category: test_normality, test_serial_correlation, ...

I like these ideas. I particularly like the last one, but I guess I'm not familiar enough with all the possible hypothesis tests to know if we could easily group them all.

@ChadFulton


Member

ChadFulton commented Jun 4, 2015

My only other thought here is that the three tests that I've implemented appear to be "standard" as first-pass diagnostics, even though they are not the only ones that can be run. Similarly, there are other diagnostic plots that could be produced, but these four are seen often.

Maybe we could rename them to make it explicit that they are not the only ones; e.g. something like plot_basic_diagnostics or similar, and instead of three separate test methods, I could group them into test_basic_diagnostics (these probably aren't the best names, but just to give the idea...).

@ChadFulton


Member

ChadFulton commented Jun 4, 2015

P.S. the test failure here is related to an old NumPy version; I just need to figure out how to make it happy.

@ChadFulton


Member

ChadFulton commented Jun 23, 2015

#2461 will fix the test failure here. Other than that, I think this PR is just waiting on a decision about whether or not / how we want to include these diagnostics.

Edit: that is to say there is more work to be done on it, but I'm waiting to do it until a decision is made.

@josef-pkt


Member

josef-pkt commented Jun 23, 2015

I'm almost done clearing my computer of other things and will be able to come back to this soon.

@josef-pkt josef-pkt added the comp-tsa label Jun 23, 2015

@josef-pkt josef-pkt modified the milestone: 0.8 Jun 23, 2015

@josef-pkt josef-pkt referenced this pull request Jun 24, 2015

Open

SUMM: statespace additions - GSOC 2015 #2467

3 of 7 tasks complete

@ChadFulton ChadFulton referenced this pull request Jun 29, 2015

Closed

SUMM: follow-up state space kalman filter merge #2252

3 of 10 tasks complete

@ChadFulton ChadFulton force-pushed the ChadFulton:tst-std-res branch from 7664e04 to 0e0c204 Jul 3, 2015

@ChadFulton


Member

ChadFulton commented Jul 3, 2015

Rebased so tests pass.

@ChadFulton ChadFulton force-pushed the ChadFulton:tst-std-res branch from 0e0c204 to 4e84db5 Jul 17, 2015

@ChadFulton


Member

ChadFulton commented Jul 17, 2015

Rebased again, and hopefully fixed.

# Store some values
squared_resid = self.filter_results.standardized_forecasts_error**2
d = self.loglikelihood_burn
h = np.round((self.nobs - d) / 3)


@josef-pkt

josef-pkt Jul 17, 2015

Member

I guess this needs an int in front or we get a deprecation warning with numpy indexing below.


@ChadFulton

ChadFulton Jul 17, 2015

Member

Thanks!
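For reference, the statistic the snippet above builds toward can be sketched in a few lines; the names are illustrative, and the F(h, h) null distribution follows Durbin and Koopman (2012), section 2.12:

```python
import numpy as np
from scipy import stats

def heteroskedasticity_statistic(std_resid, d=0):
    """H(h): sum of the last h squared standardized residuals over the
    sum of the first h, F(h, h) distributed under homoskedasticity
    (one-sided 'increasing' alternative shown).  Sketch only."""
    squared = np.asarray(std_resid, dtype=float)[d:] ** 2
    # int() here avoids the numpy indexing deprecation noted in review
    h = int(np.round(len(squared) / 3))
    stat = np.sum(squared[-h:]) / np.sum(squared[:h])
    pvalue = stats.f.sf(stat, h, h)
    return stat, pvalue
```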

statsmodels.graphics.tsaplots.plot_acf
"""
from statsmodels.graphics.utils import _import_mpl, create_mpl_fig
_ = _import_mpl()


@josef-pkt

josef-pkt Jul 17, 2015

Member

why do you make an assignment?


@ChadFulton

ChadFulton Jul 17, 2015

Member

No reason, I think I just copy/pasted it in from somewhere. I'll remove it.

@josef-pkt


Member

josef-pkt commented Jul 17, 2015

Looks fine overall. I think we can still finetune some details after merging, and add additional options.

What's the test coverage? test_heteroskedasticity uses its own code and doesn't delegate (needs verifying numbers).

What's the behavior with respect to several or multivariate endog? From what I can see in browsing, it is handled in one case for multiple endog but not in others. Or am I not reading this correctly?
(I never looked at the multivariate endog case.)

@josef-pkt


Member

josef-pkt commented Jul 17, 2015

Bonus (here or in a follow-up PR): it would be nice to use this in one of the example notebooks.

@ChadFulton


Member

ChadFulton commented Jul 17, 2015

What's the behavior with respect to several or multivariate endog? From what I can see in browsing, it is handled in one case for multiple endog but not in others. Or am I not reading this correctly?
(I never looked at the multivariate endog case.)

All of the tests can accommodate multiple endog; I guess I have to do it explicitly for Ljung-Box and the Goldfeld-Quandt-like test, but since standardized_forecasts_error is a k_endog x nobs matrix, the built-in Jarque-Bera can handle it immediately.

Speaking of multiple endog, here's a notebook showing how I currently put the results from the multiple endog into a results table. I just name the test and comma-separate the values for each endog. Does that seem reasonable?

http://nbviewer.ipython.org/gist/ChadFulton/7e404ed4499763802632
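A sketch of the vectorized approach: apply Jarque-Bera row-by-row to a (k_endog, nobs) array. This is hand-rolled with numpy/scipy for illustration; the PR wraps statsmodels' own jarque_bera rather than reimplementing it:

```python
import numpy as np
from scipy import stats

def vectorized_jarque_bera(std_forecast_errors):
    """One (statistic, pvalue) pair per endog for a (k_endog, nobs)
    array of standardized forecast errors - the vectorized tests
    discussed above, not a joint multivariate normality test."""
    e = np.atleast_2d(np.asarray(std_forecast_errors, dtype=float))
    n = e.shape[1]
    m = e - e.mean(axis=1, keepdims=True)
    s = np.sqrt((m**2).mean(axis=1))
    skew = (m**3).mean(axis=1) / s**3
    kurt = (m**4).mean(axis=1) / s**4
    jb = n / 6.0 * (skew**2 + (kurt - 3.0)**2 / 4.0)
    return jb, stats.chi2.sf(jb, 2)
```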

@ChadFulton


Member

ChadFulton commented Jul 17, 2015

What's the test coverage? test_heteroskedasticity uses its own code and doesn't delegate (needs verifying numbers).

No test coverage yet, but since it looks like we'll start with these diagnostics, I'll go ahead and write some.

@jamesuber


jamesuber commented Jul 17, 2015

In this branch, but possibly elsewhere (I do not know), sarimax.predict breaks when simple_differencing=True. You get a size-matching error like:

ValueError: Invalid dimensions for time-varying state intercept vector. Requires shape (*,673), got (50, 648)

This seems to be related to not adjusting properly for the points dropped by the prior differencing, which seems to affect other things too. For example, the new plot_diagnostics has a similar problem when using simple_differencing. Could it be that loglikelihood_burn is not being set properly?

I'm sorry I can't make more specific and well informed statements at this time -- I'm new here.

By the way, I do like and appreciate the diagnostic plots. I may want to modify a few things to my liking that I can't just do from the figure axes, such as the range of lags for the autocorrelation of residuals.

@ChadFulton


Member

ChadFulton commented Jul 17, 2015

Thanks for the bug report! If you like, create an issue for the problem, otherwise I can do that. Then I'll take a look and see what we can do.

@josef-pkt


Member

josef-pkt commented Jul 17, 2015

@jamesuber

By the way, I do like and appreciate the diagnostic plots. I may want to modify a few things to my liking that I can't just do from the figure axes, such as the range of lags for the autocorrelation of residuals.

If you have specific comments and suggestions, then we can still include them. Feedback when you use or try out these things is very helpful.

@ChadFulton briefly raised the issue of selecting lags for autocorrelation plots. Because I didn't see a use for any option other than maxlag and maybe dropping the zero lag, more information would help. Which changes to the plots would you make?

@ChadFulton


Member

ChadFulton commented Sep 9, 2015

I have fixed a bug due to the merging (#2615) and I have rebased and fixed the merge path.

I have also put the commit from #2616 into this PR at the end. My limited understanding is that if I rebase after #2616 is merged, it will eliminate that commit. So I can rebase again tomorrow after it is merged. But will it also eliminate the duplicate commit in a merge? I don't know.

@ChadFulton


Member

ChadFulton commented Sep 9, 2015

I'm still not sure about this

  • maybe method=None as the general case, where None means we use the "appropriate" test, which might change, so if users want replicability then they should specify a specific method.

Are you thinking of a default method=None, or just allowing that to be one of the options (where they still have to pass it in manually or else an error is raised)?

  • for the multivariate case we can either do the vectorized tests, which is what is used now, AFAICS, or we can do the joint test, or return both.
  • if there is a large amount of flexibility in the options and returns, then it would be better to return a Bunch or a class instance instead of a tuple. E.g. normality Jarque-Bera with individual and joint test statistics and p-values would need some description of what the return means. (I'm not sure; given that the residuals are orthogonalized and vectorized, individual/marginal distribution tests might not be very informative.)

I'm also not sure about the best solution. I haven't looked into alternative diagnostic tests much. Durbin and Koopman (2012) suggest that you might use multivariate generalizations on the standardized residuals, but don't pursue it. They also suggest simply using the univariate approach on each element individually. Practically speaking, I've mostly only seen in use the three univariate methods that I have in place.

Perhaps we should plan to keep these as only vectorized tests (and I could update the docs accordingly), and then we could later add test_joint_normality or something? Otherwise I don't know how to keep the output sane.

return output
def test_heteroskedasticity(self, method='', alternative='two-sided',


@josef-pkt

josef-pkt Sep 9, 2015

Member

method has an empty string as default, which differs from the others

alternative : string, 'increasing', 'decreasing' or 'two-sided'
This specifies the alternative for the p-value calculation. Default
is two-sided.
asymptotic : boolean, optional


@josef-pkt

josef-pkt Sep 9, 2015

Member

I used use_f or use_t in other models to choose between t/F and normal/chi-square.

I don't think the F would be exact in this case, given that these are estimated residuals; often it's just a better small-sample approximation than the chi-square.

Parameters
----------
method : string {'sumsquares'}


@josef-pkt

josef-pkt Sep 9, 2015

Member

I'm not a fan of 'sumsquares'; IMO sum of squares is an implementation detail. The main feature is that it tests for a break in the variance (like the Chow test).

statsmodels.stats.stattools.jarque_bera
"""
if method == 'jarquebera':


@josef-pkt

josef-pkt Sep 9, 2015

Member

the docstring uses 'jb', not the full name.

return output
def test_serial_correlation(self, method, lags=None, boxpierce=False):


@josef-pkt

josef-pkt Sep 9, 2015

Member

boxpierce could be an option for method rather than a separate keyword

@ChadFulton ChadFulton force-pushed the ChadFulton:tst-std-res branch from f3151b3 to 1fd4a67 Sep 19, 2015

@ChadFulton


Member

ChadFulton commented Sep 19, 2015

I have rebased and fixed the issues mentioned above (thanks for pointing those out, by the way).

I have also added the possibility of method=None, where that selects an "appropriate" test. Of course it is a trivial selection for test_normality or test_heteroskedasticity since they currently have only one option each. Currently the method argument still has no default; None would have to be passed explicitly. Should I switch it to be the default?

As long as we are comfortable with this level of generality in the tests (so e.g. assume that joint tests will have their own test method rather than be shared with these vectorized tests), then this PR is complete.

@josef-pkt


Member

josef-pkt commented Sep 19, 2015

I'd like to get these PRs merged. I'm pretty sure we want to think about this some more before a release. I still guess that we will want joint tests in the multivariate case, or both with joint as default, so we get a single test result in both univariate and multivariate cases, if that's feasible.
E.g. multivariate normal as in VAR
(I briefly checked the RATS manual today, initially because of other things, and it looks like they have the same multivariate normality test as our current VAR.)

I think we'll start to expand the test and plot methods in other models in a similar way, but this might be the only case where we have univariate and multivariate endog in the same model.

Merging, thanks.

josef-pkt added a commit that referenced this pull request Sep 19, 2015

Merge pull request #2431 from ChadFulton/tst-std-res
ENH: Statespace: Add diagnostics. test and plot methods

@josef-pkt josef-pkt merged commit 1c25877 into statsmodels:master Sep 19, 2015

2 checks passed

continuous-integration/travis-ci/pr — The Travis CI build passed
coverage/coveralls — First build on master at 84.498%

@ChadFulton ChadFulton deleted the ChadFulton:tst-std-res branch Jan 16, 2016

@josef-pkt josef-pkt referenced this pull request Feb 9, 2016

Closed

release 0.8 #2176
