add hasconstant indicator for R-squared and df calculations #27

Closed
wesm opened this Issue Aug 25, 2011 · 4 comments


@wesm
Member
wesm commented Aug 25, 2011

Original Launchpad bug 574004: https://bugs.launchpad.net/statsmodels/+bug/574004
Reported by: josef-pktd (joep).

see also Bug #440151: fvalue and mse_model are -inf if only 1 regressor

The R-squared and df calculations assume that there is a constant among the regressors.

todo: add a hasconstant indicator and adjust calculations for df and R-squared

Note: pandas recently made a similar change.

R-squared is not really well defined without a constant, and there are several competing definitions. But for the simple case of regression without a constant, shouldn't we use the uncentered total sum of squares instead of the mean-corrected one?
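Concretely, the two candidates differ only in the total sum of squares. A minimal numpy sketch of the idea (`rsquared` and `hasconst` are illustrative names here, not the statsmodels API):

```python
import numpy as np

def rsquared(y, yhat, hasconst):
    """Centered R^2 if the model has a constant, uncentered otherwise.

    A sketch of the proposed behavior, not the statsmodels implementation.
    """
    ssr = np.sum((y - yhat) ** 2)          # residual sum of squares
    if hasconst:
        tss = np.sum((y - y.mean()) ** 2)  # mean-corrected (centered) TSS
    else:
        tss = np.sum(y ** 2)               # uncentered TSS
    return 1.0 - ssr / tss
```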

I would keep the with-constant R-squared for models whose transformed regressors lose the constant in the regression (for example, after a heteroscedasticity correction).

@BenDundee

This one just bit me :)

For reference, please see my question and response in this thread: http://stats.stackexchange.com/questions/36064/calculating-r-squared-coefficient-of-determination-with-centered-vs-un-center

The common wisdom seems to be that the correct definition of R^2 is the one you mentioned, using the uncentered total sum of squares. This is also how R does it. That isn't a reason (in and of itself) to change things, but the community seems to have a certain expectation in this regard, at least.

I would say this is an issue that deserves some bandwidth. Linear regression is a pretty basic thing for people to want to do with a statistical package...

@jseabold
Member

Thanks for the links. Will have a look and sort this out.

A quick comment. Linear regression is indeed a pretty basic thing, and we provide an OLS class that is fully correct. But as soon as you fit the model without an intercept, you're not doing OLS, as far as I know: you're fitting a regression-through-the-origin (RTO) model, which is a strong substantive claim.

The reason I've kicked this down the road so far is that it's still not really clear to me what a "correct" R^2 is in this case. The existence of the R^2 measure depends on the model being fit with an intercept. This (and other) R_0^2 measures [1] may be somewhat analogous to R^2 in that they're forced into (0, 1), like pseudo-R^2 measures for nonlinear models, but R_0^2 is not the same as R^2. I.e., you can't compare R^2 and R_0^2 values across the two models.

For the most part, I've seen it recommended not to rely on R^2 for the RTO model, since it can be wildly over-inflated (the uncentered total sum of squares is larger), and in the models I deal with it's almost never the case that you want to force the predicted response to be zero when the regressors are zero. I always assumed this is why it's almost never discussed in textbooks.

In any event, I'll sit down with this before 0.5 and see if I can sort out the theory and the implications for the other inferential statistics. I definitely do not want to silently use a different definition for the no-constant model the way R does. For the record, SAS includes a big warning that R^2 is redefined and also uses the uncentered TSS.

[1] http://stats.stackexchange.com/questions/26176/removal-of-statistically-significant-intercept-term-boosts-r2-in-linear-model#26205

See also #60.
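To make the over-inflation concrete, here is a toy example with made-up data (plain numpy, not the OLS class): the slope is weak but the intercept is large, so the RTO fit's uncentered R_0^2 looks impressive while the ordinary R^2 is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 20, size=100)
y = 50 + 0.1 * x + rng.normal(size=100)   # weak slope, large intercept

# Regression through the origin: beta = <x, y> / <x, x>
beta = (x @ y) / (x @ x)
ssr_rto = np.sum((y - beta * x) ** 2)
r0_squared = 1 - ssr_rto / (y @ y)        # uncentered TSS

# OLS with an intercept, for comparison
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
ssr_ols = np.sum((y - b0 - b1 * x) ** 2)
r_squared = 1 - ssr_ols / np.sum((y - y.mean()) ** 2)

print(r0_squared)  # ~0.97: the RTO fit looks great
print(r_squared)   # near zero: x explains little of the variation in y
```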

@BenDundee

I've been thinking about your comments a bit. I'd like to offer some feedback, but I want to be clear that I'm sensitive to the fact that you (or y'all) are the one(s) actually doing the work :)

In my mind, R^2 is a property of two data sets, not of the ordinary least squares algorithm for dealing with residuals. One must choose a model first, then use OLS (or GLS, or NNLS, or...) to estimate the regression coefficients. Whether to include an intercept is a modeling choice: either I believe the data is best represented by a model with an intercept, or I believe it is best represented by a model without one. The algorithm doesn't care whether you include a constant or not: it is perfectly happy to deal with the extra 1's.

The real reason for the change in definition (as I've come to understand it) is a consequence of the null hypothesis (our old friend). In either instance, the null hypothesis is "no relationship exists", which means "set the slopes to zero". With an intercept, the null model predicts the sample mean, so R^2 compares against the centered total sum of squares; without one, the null model predicts zero, so the comparison is against the uncentered total sum of squares. I hate to reference myself here, but I've seen no other concise explanation of the subject online (see the link above). Couched in this way, it makes sense to me that there is a "right" definition of R^2 in the no-intercept case (however bad a statistic it may be).

Given this, it's not clear (to me, at least) why silently doing the right thing is a bad idea. The definition of R^2 implemented in 0.4 is only correct for a model with an intercept; we both agree on that. The same definition cannot be applied to a model without an intercept; we agree on that too. Given that you provide a function called add_constant, I would (naively) expect the OLS class to detect the constant, adjust the relevant calculations, and return the correct version of R^2 when queried.
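For concreteness, the detection step might look something like this. This is just a hypothetical sketch of the idea (`has_constant` is an illustrative name, not the statsmodels API), and it misses the case where a constant only lies in the span of several non-constant columns:

```python
import numpy as np

def has_constant(exog, tol=1e-14):
    """Check for a constant, non-zero column in the design matrix.

    Hypothetical sketch of the auto-detection idea, not statsmodels code.
    """
    exog = np.asarray(exog)
    constant_cols = np.ptp(exog, axis=0) < tol        # max - min per column
    nonzero_cols = np.abs(exog).max(axis=0) > tol     # exclude all-zero columns
    return bool(np.any(constant_cols & nonzero_cols))
```

The fitted model could then choose between the centered and uncentered total sum of squares based on this flag.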

Anyway, I definitely want to say thanks for a great piece of software!

@jseabold
Member
jseabold commented Oct 4, 2012

Closing this as a duplicate of #423 so there's less to keep track of. Fixed the RTO linear case as per your suggestions in this branch:

https://github.com/jseabold/statsmodels/tree/handle-constant

but I need to look at how this will affect the rest of the code base.

jseabold closed this Oct 4, 2012