Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Degrees of freedom is incorrect in the glance function #273

Closed
jbryer opened this issue Jan 26, 2018 · 16 comments
Closed

Degrees of freedom is incorrect in the glance function #273

jbryer opened this issue Jan 26, 2018 · 16 comments

Comments

@jbryer
Copy link

jbryer commented Jan 26, 2018

Note that the summary function correctly shows 2 and 29 degrees of freedom for the F, but the glance function returns 3 and 29.

> library(broom)
> data(mtcars)
> lm.out <- lm(mpg ~ disp + hp, data = mtcars)
> summary(lm.out)

Call:
lm(formula = mpg ~ disp + hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7945 -2.3036 -0.8246  1.8582  6.9363 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.735904   1.331566  23.083  < 2e-16 ***
disp        -0.030346   0.007405  -4.098 0.000306 ***
hp          -0.024840   0.013385  -1.856 0.073679 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.127 on 29 degrees of freedom
Multiple R-squared:  0.7482,	Adjusted R-squared:  0.7309 
F-statistic: 43.09 on 2 and 29 DF,  p-value: 2.062e-09

> glance(lm.out)
  r.squared adj.r.squared    sigma statistic      p.value df    logLik      AIC      BIC deviance
1 0.7482402     0.7308774 3.126601  43.09458 2.062068e-09  3 -80.30928 168.6186 174.4815 283.4934
  df.residual
1          29
@gavinsimpson
Copy link
Contributor

Depends whether df is intended to be the degrees of freedom for the F test or for the model? The description. ?lm_tidiers has:

df Degrees of freedom used by the coefficients

of which there are 3 in this model. Also df + df.residual = nrow(mtcars).

The omnibus F test in the summary is the equivalent of doing:

> lm.1 <- lm(mpg ~ disp + hp, data = mtcars)
> lm.0 <- lm(mpg ~ 1, data = mtcars)
> anova(lm.0, lm.1)
Analysis of Variance Table

Model 1: mpg ~ 1
Model 2: mpg ~ disp + hp
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     31 1126.05                                  
2     29  283.49  2    842.55 43.095 2.062e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

so the 2 stated for one of the degrees of freedom in the F test is related to the difference in degrees of freedom for two models.

@dgrtwo
Copy link
Collaborator

dgrtwo commented Jan 29, 2018

broom counts the intercept term as a degree of freedom used in the model. This matches the results from summary(lm.out)$df, though not from summary(lm.out)$fstatistic.

One case where this version would be moderately useful would be comparing dfs used between lm(mpg ~ disp + hp, data = mtcars) and lm(mpg ~ 0 + disp + hp, data = mtcars).

I agree an argument could be made for either (though as Gavin notes this version matches the current documentation), but I don't think it's worth breaking the reverse compatibility of this function to switch the meaning.

@dgrtwo dgrtwo closed this as completed Jan 29, 2018
@ghost
Copy link

ghost commented Jan 30, 2018

The p.value in the glance table is a correct p.value for an F with 2, 29 df, and not for one with 3, 29 df.

From the point of view of a long-time instructor, there is no question that what should be reported is the numerator df associated with the F test that is also reported (since its denominator df and its p value are also in the table). The total "model" df is not reported when reports and publications of analyses are written. As it stands the output will easily lead to mis-reporting of the numerator df by researchers, although statistical programmers may understand the nuance. The extra 1 df is associated with the intercept, but no test of the intercept is provided, so why include the df?

When I use glance in a markdown document, I do the following:
fit1 <- lm(mpg~disp+hp, data=mtcars)
fit1.g <- glance(fit1)
fit1.g$df <- fit1.g$df-1
kable(fit1.g)

Manually "kludging" the df this way is something I would rather not have to waste my time teaching my students.

I love the broom package for so many reasons. But right now I cannot recommend glance to my students because of this df issue. I would like to teach statistics, not R idiosyncracies.

Thanks for considering this.

Bruce Dudek

@dgrtwo
Copy link
Collaborator

dgrtwo commented Feb 10, 2018

I've heard from enough people that I've around on this! I'd like to update it to use the df used for the F-statistic.

However, I don't want to do the update until the next minor (0.6.0) broom release since there's a chance it could break backwards compatibility (I'm nervous if people use glance() in statistical tests).

@hardin47
Copy link

I agree that pedagogically the mis-match in df is a problem. Particularly because the df sits right next to the p-value in glance.

In every text or help page I've read, I don't see df=p ("degrees of freedom used by the coefficients" is a new idea for me). When the ANOVA table is broken down, it is done with df.resid + df.model = n-1 (because n-1 is the degrees of freedom associated with the error if there is no model: the total sums of square).

Oh... just saw @dgrtwo response.... Thank you!!!

@vqv
Copy link

vqv commented Mar 26, 2018

I just ran into this problem myself while teaching an applied regression class.

I think there are arguments to be made for both interpretations of df. Is it the model degrees of freedom or the extra degrees of freedom (in comparing the intercept only model versus the full model)? If you look to the left, there is p.value which suggests that it's the extra degrees of freedom. But if you look to the right, there is logLik and AICwhich suggests that it is the model degrees of freedom.

How about giving df a more explicit name like df.model?

@alexpghayes
Copy link
Collaborator

Related to #212 / possible duplicate.

@alexpghayes
Copy link
Collaborator

Closed in #624 and #625. Thanks all for your patience. If you have strong opinions about this please take a look at #212 where I raised a question about what should happen for aov objects. Tentatively I'm throwing an error.

@lindeloev
Copy link

I still see this issue in broom 0.5.1. summary(lm(...)) returns 2 and 57 df while glance returns 3 and 57 df.

@alexpghayes alexpghayes reopened this Apr 7, 2019
@alexpghayes
Copy link
Collaborator

Can you provide a reprex? On the dev version of broom the example above gives

library(broom)

fit <- lm(mpg ~ disp + hp, data = mtcars)

summary(fit)
#> 
#> Call:
#> lm(formula = mpg ~ disp + hp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.7945 -2.3036 -0.8246  1.8582  6.9363 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 30.735904   1.331566  23.083  < 2e-16 ***
#> disp        -0.030346   0.007405  -4.098 0.000306 ***
#> hp          -0.024840   0.013385  -1.856 0.073679 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.127 on 29 degrees of freedom
#> Multiple R-squared:  0.7482, Adjusted R-squared:  0.7309 
#> F-statistic: 43.09 on 2 and 29 DF,  p-value: 2.062e-09
anova(fit)
#> Analysis of Variance Table
#> 
#> Response: mpg
#>           Df Sum Sq Mean Sq F value    Pr(>F)    
#> disp       1 808.89  808.89 82.7454 5.406e-10 ***
#> hp         1  33.67   33.67  3.4438   0.07368 .  
#> Residuals 29 283.49    9.78                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(fit)[c("df", "df.residual")]
#> # A tibble: 1 x 2
#>      df df.residual
#>   <dbl>       <int>
#> 1     2          29

packageVersion("broom")
#> [1] '0.5.2.9001'

Created on 2019-04-08 by the reprex package (v0.2.1)

I'm not sure what example you're referring to.

@lindeloev
Copy link

I get df = 3 on this example using the latest broom from CRAN:

> packageVersion('broom')
[1] ‘0.5.2’
> fit = lm(mpg ~ disp + hp, data=mtcars)
> broom::glance(fit)
# A tibble: 1 x 11
  r.squared adj.r.squared sigma statistic       p.value    df logLik   AIC   BIC deviance df.residual
      <dbl>         <dbl> <dbl>     <dbl>         <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
1     0.748         0.731  3.13      43.1 0.00000000206     3  -80.3  169.  174.     283.          29

Tried installing latest broom from GitHub, but got an error (probably not broom-related). Was this fixed since 0.5.2?

@alexpghayes
Copy link
Collaborator

alexpghayes commented Apr 9, 2019 via email

@nicholasjhorton
Copy link

My students just pointed out the same issue:

mod <- lm(mpg ~ cyl + disp, data = mtcars)
summary(mod)  # 2 df overall test
#> 
#> Call:
#> lm(formula = mpg ~ cyl + disp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.4213 -2.1722 -0.6362  1.1899  7.0516 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
#> cyl         -1.58728    0.71184  -2.230   0.0337 *  
#> disp        -0.02058    0.01026  -2.007   0.0542 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.055 on 29 degrees of freedom
#> Multiple R-squared:  0.7596, Adjusted R-squared:  0.743 
#> F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09
broom::glance(mod) # returns df = 3
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
#> 1     0.760         0.743  3.06      45.8 1.06e-9     3  -79.6  167.  173.
#> # … with 2 more variables: deviance <dbl>, df.residual <int>

Created on 2020-04-03 by the reprex package (v0.3.0)

@alexpghayes
Copy link
Collaborator

The df has been updated in the 0.7.0 dev release, but I'm not sure when I'll have time to get it onto CRAN. In the short term you can try the dev version of broom.

@simonpcouch
Copy link
Collaborator

Fixed with the release of 0.7.0 on CRAN!

@github-actions
Copy link

github-actions bot commented Mar 7, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants