Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

df_diff in anova_lm #1213

Closed
mngu2382 opened this issue Nov 29, 2013 · 4 comments

Comments

Projects
None yet
2 participants
@mngu2382
Copy link
Contributor

commented Nov 29, 2013

Working through the Whiteside example in chapter 6 of MASS. Expecting
the difference in degrees of freedom to be 1 instead of 2, error flowing on
to F statistic and p-value.

import statsmodels as sm
import statsmodels.api as smapi
import statsmodels.formula.api as smf
import statsmodels.stats.anova as smaov

whiteside = smapi.datasets.get_rdataset("whiteside", "MASS")
whiteside = whiteside.data

gasBA = smf.ols("Gas ~ Insul/Temp - 1", data=whiteside)
gasPR = smf.ols("Gas ~ Insul + Temp", data=whiteside)

smaov.anova_lm(gasPR.fit(), gasBA.fit())
#    df_resid       ssr  df_diff   ss_diff         F    Pr(>F)
#0        53  6.770382        0       NaN       NaN       NaN
#1        52  5.425247        2  1.345135  6.446436  0.003155

sm.version.full_version
# '0.5.0'

The output from R:

anova(gasPR, gasBA)
# Analysis of Variance Table
# 
# Model 1: Gas ~ Insul + Temp
# Model 2: Gas ~ Insul/Temp - 1
#   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
#1     53 6.7704                                  
#2     52 5.4252  1    1.3451 12.893 0.0007307 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Thanks

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Nov 29, 2013

Hi, Thanks for reporting

There might be a bug in recognizing the implicit constant in the first model.

>>> gasBA2 = smf.ols("Gas ~ Insul/Temp", data=whiteside)
>>> print smaov.anova_lm(gasPR.fit(), gasBA2.fit())
   df_resid       ssr  df_diff   ss_diff          F    Pr(>F)
0        53  6.770382        0       NaN        NaN       NaN
1        52  5.425247        1  1.345135  12.892872  0.000731


>>> gasBA.fit().compare_f_test(gasPR.fit())
(12.892872395818717, 0.0007306851862609817, 1.0)
@josef-pkt

This comment has been minimized.

Copy link
Member

commented Nov 29, 2013

Yes, clearly a bug, when models are compared that have implicit and explicit constant.

From what I can see anoval_lm uses df_model to count the difference in df
table.ix[1:]["df_diff"] = np.diff(map(getattr, args, ["df_model"]*n_models))

df_model is unreliable, because it's defined as self.rank - self.k_constant

If we provide the information that the first model has a constant, then it's also correct

>>> gasBA3 = smf.ols("Gas ~ Insul/Temp - 1", data=whiteside, hasconst=True)
>>> print smaov.anova_lm(gasPR.fit(), gasBA3.fit())
   df_resid       ssr  df_diff   ss_diff          F    Pr(>F)
0        53  6.770382        0       NaN        NaN       NaN
1        52  5.425247        1  1.345135  12.892872  0.000731
>>> gasBA.df_model
4.0
>>> gasBA3.df_model
3.0
>>> gasBA.df_resid
52.0
>>> gasBA3.df_resid
52.0
@josef-pkt

This comment has been minimized.

Copy link
Member

commented Nov 29, 2013

possible solutions

  • use df_resid
    compare_f_test is not affected by hasconst/k_constant because it uses the difference in df_resid instead of in df_model

  • make the the k_constant detection more reliable, so it also detects implicit constants, see
    #1212 (comment)

    patsy also should have the information about k_constant already

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Jun 18, 2014

PR #1770 adds unit tests

but I get the same results as R for example in issue description with current master

   df_resid       ssr  df_diff   ss_diff          F    Pr(>F)
0        53  6.770382        0       NaN        NaN       NaN
1        52  5.425247        1  1.345135  12.892872  0.000731

somwhere df_model is corrected for the implicit constant now

>>> gasBA.exog.shape
(56, 4)
>>> gasBA.df_model
3.0

>>> gasBA.exog.mean(0)
array([ 0.53571429,  0.46428571,  2.39107143,  2.48392857])
>>> (gasBA.exog[:,:2].sum(1) == 1).all()
True
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.