df_diff in anova_lm #1213

Closed
opened this issue Nov 29, 2013 · 4 comments

Projects
None yet
Contributor

mngu2382 commented Nov 29, 2013

 Working through the Whiteside example in chapter 6 of MASS. Expecting the difference in degrees of freedom to be 1 instead of 2, error flowing on to F statistic and p-value. ```import statsmodels as sm import statsmodels.api as smapi import statsmodels.formula.api as smf import statsmodels.stats.anova as smaov whiteside = smapi.datasets.get_rdataset("whiteside", "MASS") whiteside = whiteside.data gasBA = smf.ols("Gas ~ Insul/Temp - 1", data=whiteside) gasPR = smf.ols("Gas ~ Insul + Temp", data=whiteside) smaov.anova_lm(gasPR.fit(), gasBA.fit()) # df_resid ssr df_diff ss_diff F Pr(>F) #0 53 6.770382 0 NaN NaN NaN #1 52 5.425247 2 1.345135 6.446436 0.003155 sm.version.full_version # '0.5.0'``` The output from R: ``````anova(gasPR, gasBA) # Analysis of Variance Table # # Model 1: Gas ~ Insul + Temp # Model 2: Gas ~ Insul/Temp - 1 # Res.Df RSS Df Sum of Sq F Pr(>F) #1 53 6.7704 #2 52 5.4252 1 1.3451 12.893 0.0007307 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 `````` Thanks
Member

josef-pkt commented Nov 29, 2013

 Hi, Thanks for reporting There might be a bug in recognizing the implicit constant in the first model. ``````>>> gasBA2 = smf.ols("Gas ~ Insul/Temp", data=whiteside) >>> print smaov.anova_lm(gasPR.fit(), gasBA2.fit()) df_resid ssr df_diff ss_diff F Pr(>F) 0 53 6.770382 0 NaN NaN NaN 1 52 5.425247 1 1.345135 12.892872 0.000731 >>> gasBA.fit().compare_f_test(gasPR.fit()) (12.892872395818717, 0.0007306851862609817, 1.0) ``````
Member

josef-pkt commented Nov 29, 2013

 Yes, clearly a bug, when models are compared that have implicit and explicit constant. From what I can see anoval_lm uses df_model to count the difference in df `table.ix[1:]["df_diff"] = np.diff(map(getattr, args, ["df_model"]*n_models))` df_model is unreliable, because it's defined as `self.rank - self.k_constant` If we provide the information that the first model has a constant, then it's also correct ``````>>> gasBA3 = smf.ols("Gas ~ Insul/Temp - 1", data=whiteside, hasconst=True) >>> print smaov.anova_lm(gasPR.fit(), gasBA3.fit()) df_resid ssr df_diff ss_diff F Pr(>F) 0 53 6.770382 0 NaN NaN NaN 1 52 5.425247 1 1.345135 12.892872 0.000731 `````` ``````>>> gasBA.df_model 4.0 >>> gasBA3.df_model 3.0 >>> gasBA.df_resid 52.0 >>> gasBA3.df_resid 52.0 ``````
Member

josef-pkt commented Nov 29, 2013

 possible solutions use df_resid compare_f_test is not affected by hasconst/k_constant because it uses the difference in `df_resid` instead of in `df_model` make the the k_constant detection more reliable, so it also detects implicit constants, see #1212 (comment) patsy also should have the information about `k_constant` already

Merged

Member

josef-pkt commented Jun 18, 2014

 PR #1770 adds unit tests but I get the same results as R for example in issue description with current master `````` df_resid ssr df_diff ss_diff F Pr(>F) 0 53 6.770382 0 NaN NaN NaN 1 52 5.425247 1 1.345135 12.892872 0.000731 `````` somwhere df_model is corrected for the implicit constant now ``````>>> gasBA.exog.shape (56, 4) >>> gasBA.df_model 3.0 >>> gasBA.exog.mean(0) array([ 0.53571429, 0.46428571, 2.39107143, 2.48392857]) >>> (gasBA.exog[:,:2].sum(1) == 1).all() True ``````

Closed