anova_lm throws error on models created from api.ols but not formula.api.ols #1855

Closed
jeffmax opened this Issue Jul 30, 2014 · 7 comments

@jeffmax
jeffmax commented Jul 30, 2014

If I fit a linear regression using the array based api, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-58bd3b88eadf> in <module>()
----> 1 anova_lm(model.fit())

/usr/local/lib/python3.4/site-packages/statsmodels/stats/anova.py in anova_lm(*args, **kwargs)
    324     if len(args) == 1:
    325         model = args[0]
--> 326         return anova_single(model, **kwargs)
    327 
    328     try:

/usr/local/lib/python3.4/site-packages/statsmodels/stats/anova.py in anova_single(model, **kwargs)
     61 
     62     response_name = model.model.endog_names
---> 63     design_info = model.model.data.orig_exog.design_info
     64     exog_names = model.model.exog_names
     65     # +1 for resids

/usr/local/lib/python3.4/site-packages/pandas/core/generic.py in __getattr__(self, name)
   1841                 return self[name]
   1842             raise AttributeError("'%s' object has no attribute '%s'" %
-> 1843                                  (type(self).__name__, name))
   1844 
   1845     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'design_info'

This does not occur when I fit the same OLS model with formula.api.ols instead. This is on Python 3.
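For comparison, a minimal sketch of the formula-based route that anova_lm accepts. The data and the column names `x` and `y` are made up for illustration and are not from the original report.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic data (assumption): a quadratic relationship plus noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 1.0 + 2.0 * df["x"] + 0.5 * df["x"] ** 2 + rng.normal(size=50)

# Fitting through the formula interface keeps the term structure
# that anova_lm needs to build one row per term
formula_results = smf.ols("y ~ x + I(x ** 2)", data=df).fit()
table = anova_lm(formula_results)
print(table)
```

The resulting table has one row per formula term plus a `Residual` row.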

@jeffmax
jeffmax commented Jul 30, 2014

I just saw the note that the formula api is a pre-requisite. Is this the direction that the library is going?

@jeffmax jeffmax closed this Jul 30, 2014
@jseabold
Member

It's not the direction the library is going in, but there are things in ANOVA you just can't do if you don't have information about how the variables were created. As of now, the only way to do this is to use information in the formula.

@jeffmax
jeffmax commented Jul 30, 2014

Thanks.

@jseabold
Member

We could probably improve the error message here.

@josef-pkt
Member

@jeffmax What's your use case? What's the structure of your design matrix, exog?
If you open a wishlist/enhancement issue, then we might be able to take it into account in future extensions or refactoring.

To expand a bit on Skipper's answer:
ANOVA or similar functionality (that we don't have yet) drops whole terms, where one variable may be coded with several columns of the exog, for example in the case of categorical variables, polynomials or splines.
Currently the only way we can get this information is through the formulas. It might be possible to add a non-formula API to specify which columns belong together. But, we don't have any case like that yet.

If we don't know which columns represent the same underlying explanatory variable, then we can only look at one column at a time. In that case, ANOVA type 3 is essentially the same as the t-test for the params table in summary.
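A small sketch of that equivalence, assuming synthetic data with two continuous regressors named `x1` and `x2` (both names are made up): when every term is a single column, the type 3 F statistic equals the squared t statistic from the params table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic data (assumption): each term corresponds to exactly one column
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=60), "x2": rng.normal(size=60)})
df["y"] = 1.0 + 0.8 * df["x1"] - 0.3 * df["x2"] + rng.normal(size=60)

results = smf.ols("y ~ x1 + x2", data=df).fit()
type3 = anova_lm(results, typ=3)

# Compare the per-term F statistics with the squared t statistics
f_from_anova = type3.loc[["x1", "x2"], "F"].to_numpy(dtype=float)
t_squared = (results.tvalues[["x1", "x2"]] ** 2).to_numpy(dtype=float)
print(np.allclose(f_from_anova, t_squared))
```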

similar issues in planned features:
The current GSOC project for multiple imputation, MICE, uses formulas throughout, because we will also need to use the variable transformation that patsy provides (again mainly categorical and similar). I don't know yet if there will be a non-formula API if each column represents a different variable.

stepwise regression or similar:
Initially I thought we would only add or drop columns. However, Stata (and IIRC R) has the option to add or drop "terms", where several columns that belong together are dropped or added at the same time. In some cases there are additional restrictions, for example that the main terms can only be dropped if the interaction terms have also been dropped. Currently the only way to get this information would be through formulas.

anova_lm still has to be extended to other models, where we essentially have a list of models that are compared with different tests, compare_xxx. This is not fully supported yet, but I was also looking at the implementation in anova_lm, where Skipper figured out a relatively straightforward way to get the relevant information from the patsy formulas.

@jeffmax
jeffmax commented Jul 30, 2014

@josef-pkt Thanks for the explanation. I am a student taking a linear regression course where most of the instruction is given in terms of how things are done in Minitab. I use Python a lot at work and want to know how to use the statistics libraries, so I typically try to duplicate my results from Minitab in Python. I do not believe we have gone over the different types of ANOVA or how derivative terms are dropped. In fact, the way we are doing this in Minitab, I don't think it has any idea about how variables are constructed. If I want to do a 2nd order regression, I have to create a new column of data that is derived from the first order column (squaring each value and putting the result in a new column), and then the regression is done on those two variables.
Minitab actually just reports two rows for the ANOVA: one row with the SS, MS, F and P values attributed to the regression, and one row for the error, whereas statsmodels has one row for each variable.

The issue here could be that this is just an introductory course on regression? We typically use the ANOVA to determine whether or not all of the exog variables are insignificant.

I found this problem initially because I was using the regular api OLS to do regression, since it was quicker than writing out a formula (and it was the first way I discovered to do it), but I kept running into the error when I tried to do the ANOVA on it. I think, as @jseabold commented, it would be helpful if the error pointed the user towards the formula api instead of just showing a DataFrame attribute error.

@josef-pkt
Member

@jeffmax Thanks for the explanation.

We don't have the simple ANOVA table associated with a regression. Stata also reports it for the linear regression.
However, the F-value and the associated p-value for the hypothesis that all slope coefficients are zero are shown in the summary table, and are available as results.fvalue and results.f_pvalue.

Some other models, like the discrete models, have llnull instead of the fvalue, because the ANOVA F-statistics are only appropriate for the linear model and a few others. llnull and the associated p-value are based on the likelihood ratio test and are similar to the simple ANOVA table test that all slope coefficients are zero (i.e. comparing to a model with only a constant).
