df_diff in anova_lm #1213

Closed
mngu2382 opened this Issue Nov 29, 2013 · 4 comments

Projects

None yet

2 participants

@mngu2382
Contributor

Working through the Whiteside example in chapter 6 of MASS. Expecting
the difference in degrees of freedom to be 1 instead of 2, error flowing on
to F statistic and p-value.

import statsmodels as sm
import statsmodels.api as smapi
import statsmodels.formula.api as smf
import statsmodels.stats.anova as smaov

whiteside = smapi.datasets.get_rdataset("whiteside", "MASS")
whiteside = whiteside.data

gasBA = smf.ols("Gas ~ Insul/Temp - 1", data=whiteside)
gasPR = smf.ols("Gas ~ Insul + Temp", data=whiteside)

smaov.anova_lm(gasPR.fit(), gasBA.fit())
#    df_resid       ssr  df_diff   ss_diff         F    Pr(>F)
#0        53  6.770382        0       NaN       NaN       NaN
#1        52  5.425247        2  1.345135  6.446436  0.003155

sm.version.full_version
# '0.5.0'

The output from R:

anova(gasPR, gasBA)
# Analysis of Variance Table
# 
# Model 1: Gas ~ Insul + Temp
# Model 2: Gas ~ Insul/Temp - 1
#   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
#1     53 6.7704                                  
#2     52 5.4252  1    1.3451 12.893 0.0007307 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Thanks

@josef-pkt
Member

Hi, Thanks for reporting

There might be a bug in recognizing the implicit constant in the first model.

>>> gasBA2 = smf.ols("Gas ~ Insul/Temp", data=whiteside)
>>> print smaov.anova_lm(gasPR.fit(), gasBA2.fit())
   df_resid       ssr  df_diff   ss_diff          F    Pr(>F)
0        53  6.770382        0       NaN        NaN       NaN
1        52  5.425247        1  1.345135  12.892872  0.000731


>>> gasBA.fit().compare_f_test(gasPR.fit())
(12.892872395818717, 0.0007306851862609817, 1.0)
@josef-pkt
Member

Yes, clearly a bug, when models are compared that have implicit and explicit constant.

From what I can see anoval_lm uses df_model to count the difference in df
table.ix[1:]["df_diff"] = np.diff(map(getattr, args, ["df_model"]*n_models))

df_model is unreliable, because it's defined as self.rank - self.k_constant

If we provide the information that the first model has a constant, then it's also correct

>>> gasBA3 = smf.ols("Gas ~ Insul/Temp - 1", data=whiteside, hasconst=True)
>>> print smaov.anova_lm(gasPR.fit(), gasBA3.fit())
   df_resid       ssr  df_diff   ss_diff          F    Pr(>F)
0        53  6.770382        0       NaN        NaN       NaN
1        52  5.425247        1  1.345135  12.892872  0.000731
>>> gasBA.df_model
4.0
>>> gasBA3.df_model
3.0
>>> gasBA.df_resid
52.0
>>> gasBA3.df_resid
52.0
@josef-pkt
Member

possible solutions

  • use df_resid
    compare_f_test is not affected by hasconst/k_constant because it uses the difference in df_resid instead of in df_model

  • make the the k_constant detection more reliable, so it also detects implicit constants, see
    #1212 (comment)

    patsy also should have the information about k_constant already

@josef-pkt
Member

PR #1770 adds unit tests

but I get the same results as R for example in issue description with current master

   df_resid       ssr  df_diff   ss_diff          F    Pr(>F)
0        53  6.770382        0       NaN        NaN       NaN
1        52  5.425247        1  1.345135  12.892872  0.000731

somwhere df_model is corrected for the implicit constant now

>>> gasBA.exog.shape
(56, 4)
>>> gasBA.df_model
3.0

>>> gasBA.exog.mean(0)
array([ 0.53571429,  0.46428571,  2.39107143,  2.48392857])
>>> (gasBA.exog[:,:2].sum(1) == 1).all()
True
@josef-pkt josef-pkt closed this Jun 18, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment