WIP: Handle constant #499

Merged
jseabold merged 11 commits into statsmodels:master on Nov 16, 2012

2 participants
Owner

jseabold commented Oct 4, 2012

Going ahead and starting a PR to make review and discussion easier. This PR addresses #423 and partially #157.

@josef-pkt josef-pkt commented on the diff Oct 4, 2012

statsmodels/base/model.py
@@ -41,6 +41,7 @@ class Model(object):
def __init__(self, endog, exog=None, **kwargs):
@josef-pkt

josef-pkt Oct 4, 2012

Owner

we need a keyword argument so the user can, if necessary, determine whether we have a constant included.
for example:
With full dummy set, constant is implicit
(full set of spline basis functions also has constant implicit)
with nonlinear model, constant might be a parameter.

what we would also need is a ddof correction, for example if the user demeans the data (removes panel fixed effect, within group), not for Rsquared but for tests and pvalues
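
To illustrate the "full dummy set, constant is implicit" point, here is a minimal NumPy sketch. The data and variable names are made up for illustration and are not part of this PR. It shows that the columns of a full dummy set sum to a column of ones, so adding an explicit constant does not change the rank.

```python
import numpy as np

# Hypothetical data: 6 observations falling into 3 categories.
groups = np.array([0, 0, 1, 1, 2, 2])
dummies = (groups[:, None] == np.arange(3)).astype(float)  # full dummy set

ones = np.ones(len(groups))
# Each row has exactly one 1, so the dummy columns sum to a column of ones:
# the constant is already in the column space of the dummies.
implicit_constant = np.allclose(dummies.sum(axis=1), ones)

# Adding an explicit constant therefore does not increase the rank.
rank_without = np.linalg.matrix_rank(dummies)
rank_with = np.linalg.matrix_rank(np.column_stack([ones, dummies]))
```

In this setup no single column is constant, so a check that only looks for a column of ones would miss the implicit constant.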

@jseabold

jseabold Oct 4, 2012

Owner

Makes sense. Stata uses hascons. Could we also just check if it's full rank since we're computing the rank anyway? I think this is how R does this for the purposes of ANOVA, etc. Then we handle the panel case on its own. Not sure.

@josef-pkt

josef-pkt Oct 4, 2012

Owner

I think checking the rank does not help for this. For dummies, users could leave out one category for a no-constant regression, or keep all dummies and not include a constant, even in the pure ANOVA case.
If users use a formula and have a more standard/predefined encoding for dummies/categoricals, then we have the information anyway from the formula (unless users create it themselves as a basic exog variable).
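
A small sketch of why rank alone does not settle this, assuming made-up data and two hypothetical helpers (`spans_constant`, `is_full_rank`) that are not statsmodels functions. Both designs below have full column rank, but only one of them contains a constant in its column space.

```python
import numpy as np

# Hypothetical data: 6 observations in 3 categories.
groups = np.array([0, 0, 1, 1, 2, 2])
full = (groups[:, None] == np.arange(3)).astype(float)  # all dummies, no constant
drop_first = full[:, 1:]  # reference category left out, no constant added


def spans_constant(exog):
    """Return True if a column of ones lies in the column space of exog."""
    ones = np.ones(exog.shape[0])
    coef, *_ = np.linalg.lstsq(exog, ones, rcond=None)
    return bool(np.allclose(exog @ coef, ones))


def is_full_rank(exog):
    """Return True if exog has full column rank."""
    return bool(np.linalg.matrix_rank(exog) == exog.shape[1])
```

Here `is_full_rank` is true for both `full` and `drop_first`, while `spans_constant` is true only for `full`, which matches the point that there is no direct link between rank and whether a constant is present.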

@jseabold

jseabold Oct 5, 2012

Owner

Hmm, it still seems to me we can get by with rank checking. I'll have to think about this and do some testing. Stata may be a bit different because it isn't really designed to handle ANOVA well and has no concept of contrasts AFAIK.

[Edit: I see what you mean with the leave one out and no constant, but I don't know how to interpret this model without thinking more about it. Is this model estimable? You'd be constraining the reference category == 0 and then the other coefficients are based off of it?]

@josef-pkt

josef-pkt Oct 5, 2012

Owner

OLS or linear models don't know about ANOVA and contrasts either, they just see the exog, only when using formulas we might get the information about the contrast or dummy coding of categorical variables.

at edit: Yes, that's what I mean, maintained assumption is: coefficient of reference category == 0 and no constant effect.

I'm just illustrating that there is no direct link between rank and has_constant. Whether those are "commonly" used models is a different issue.

@jseabold

jseabold Oct 5, 2012

Owner

What do you propose that has_constant do? I started with this.

hasconst : bool or None
    If None, the RHS matrix is automatically checked for a constant. If True, k_constant = 1, const_idx = [] and no checking is done. If False, k_constant = 0, const_idx = [] and no checking is done.
@jseabold

jseabold Oct 5, 2012

Owner

I started looking at ANOVA again to see if we could separate it from formula information, but we can't calculate things without the formula information either way, so it's a moot point.

Owner

jseabold commented Oct 4, 2012

Would like to rebase this on the WLS fixes before continuing.

Owner

jseabold commented Oct 5, 2012

See if 5e9769e is what you had in mind.

Owner

josef-pkt commented Nov 14, 2012

looks pretty good. (without going through all the details)

The only case that is not covered yet is prior demeaned data, with a ddof different from whether a constant is included in the exog.
I don't see right now whether this can be added to this without having to change the interface again, besides additional keyword.
main usage: fixed effects model (balanced or unbalanced where both endog and exog are deviations from group mean)

Owner

jseabold commented Nov 16, 2012

I think the hasconst keyword now handles this use case based on the last commit and the above comment unless I'm misunderstanding your concern. It's why I added it. You can have hasconst=True but const_idx=None.

    hasconst : None or bool
        Indicates whether the RHS includes a user-supplied constant. If True,
        a constant is not checked for and k_constant is set to 1 and all
        result statistics are calculated as if a constant is present. If
        False, a constant is not checked for and k_constant is set to 0.

So if hasconst=True, k_constant=1 and const_idx=None. If False, k_constant=0 and const_idx=None, if None, a constant is checked for.
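
The three cases described above can be sketched roughly as follows. This is only an illustration of the stated semantics, not the code in this PR: `resolve_constant` is a hypothetical helper, it assumes a 2-D `exog`, and the automatic check used here (a column with zero range that is not all zeros) is an assumption.

```python
import numpy as np


def resolve_constant(exog, hasconst=None):
    """Hypothetical helper returning (k_constant, const_idx).

    hasconst=True  -> (1, None), no checking
    hasconst=False -> (0, None), no checking
    hasconst=None  -> look for a constant column in exog
    """
    if hasconst is not None:
        return int(bool(hasconst)), None

    exog = np.asarray(exog, dtype=float)
    # Assumed check: a column that never varies and is not all zeros.
    is_const = (np.ptp(exog, axis=0) == 0) & (exog[0] != 0)
    idx = np.flatnonzero(is_const)
    if idx.size:
        return 1, int(idx[0])
    return 0, None


# Usage with made-up data: a constant column plus one regressor.
X = np.column_stack([np.ones(5), np.arange(5.0)])
detected = resolve_constant(X)                           # constant found at column 0
forced = resolve_constant(X[:, 1:], hasconst=True)       # treated as having a constant
```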

If this sounds good, I'll merge this today.

Owner

jseabold commented Nov 16, 2012

After fixing the bugs...

Owner

josef-pkt commented Nov 16, 2012

Yes the case of a single constant looks fine.

The case that I have in mind that is not covered is has_const=5, or ddof=5. (There was a question like this for pandas panel on one of the mailing lists about removing nuisance fixed effects.)
Suppose we have 5 groups/categories/panels and we group-demean all variables, then the actual number of parameters for the degrees of freedom is exog.shape[1]+5.
Suppose we detrend all variables beforehand, then the df_model = exog.shape[1] * (1+2), I guess.

However, we need to keep track of has any constant (boolean) for R_squared, versus df_model (count).
So maybe this can be added independent of this PR.
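
A minimal NumPy sketch of the group-demeaning case, using simulated data and a hypothetical `group_demean` helper (nothing here is statsmodels API). It shows that regressing group-demeaned `y` on group-demeaned `x` gives the same slopes as including the 5 group dummies explicitly, so the residual degrees of freedom should account for `exog.shape[1] + 5` parameters rather than `exog.shape[1]`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per = 5, 8
groups = np.repeat(np.arange(n_groups), n_per)

# Simulated data: two regressors plus a group-specific fixed effect.
x = rng.normal(size=(groups.size, 2))
effects = rng.normal(size=n_groups)
y = x @ np.array([1.0, -2.0]) + effects[groups] + rng.normal(size=groups.size)


def group_demean(a):
    """Subtract the group mean from every observation (within transformation)."""
    a = np.asarray(a, dtype=float)
    out = a.copy()
    for g in np.unique(groups):
        mask = groups == g
        out[mask] -= a[mask].mean(axis=0)
    return out


# Slopes from the demeaned regression (no constant, no dummies).
beta_within, *_ = np.linalg.lstsq(group_demean(x), group_demean(y), rcond=None)

# Slopes from the regression with an explicit full set of group dummies.
dummies = (groups[:, None] == np.arange(n_groups)).astype(float)
beta_dummy, *_ = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)

nobs, k = x.shape
df_resid_naive = nobs - k                # ignores the removed group means
df_resid_correct = nobs - k - n_groups   # counts the 5 absorbed parameters
```

The slopes agree, but tests and p-values based on `df_resid_naive` would use too many degrees of freedom, which is the ddof correction being asked for.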

Owner

jseabold commented Nov 16, 2012

Yeah I'd have to think more about this. I would assume this would be handled in the subclasses. I haven't looked at systems stuff in a while, but I just imagined something like neqs * has_const. I don't think this PR precludes handling that, but if we need to refactor in the future, we can.

Owner

josef-pkt commented Nov 16, 2012

It's not really systems stuff. It's just OLS where one of the categorical factor fixed effects is removed by demeaning. I just consider it as univariate, because there is no assumption that we have balanced equations/panels.

Correction: detrending all variables only adds +2 to df_model.

Owner

jseabold commented Nov 16, 2012

Ah, right. Still something that can fit into the current framework though no?

Owner

josef-pkt commented Nov 16, 2012

I guess so. (I'm late for a meeting)

Owner

jseabold commented Nov 16, 2012

It should be, assuming that we are handling the demeaning under the hood in some kind of model, we can just change k_constant after the fact. If the user is passing in already group demeaned data, then I don't know. As a user, I'd want this handled automatically. If for some reason, you don't, then the way it's written we could allow for hasconst to be an integer and then assign it instead of 1 to k_constant. It's just a one line change in base/data.py. I'm not going to do it now, but it's possible and wouldn't break anything to change this in the future. Going to go ahead and merge these changes in.

jseabold added a commit that referenced this pull request Nov 16, 2012

@jseabold jseabold merged commit 5f7ee03 into statsmodels:master Nov 16, 2012

1 check passed

default The Travis build passed

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014
