Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of tidy.lm and tidy.glm for full-rank fits #1112

Merged
merged 2 commits into from
Aug 12, 2022

Conversation

capnrefsmmat
Copy link
Contributor

tidy.lm and tidy.glm use summary(x)$coefficients, then join with coef(x)
to handle the case where the fit is rank deficient and some terms with
NA coefficients are missing from summary(x)$coefficients. However, in a
simple experiment, the left_join() consumes a disproportionate amount of
time. Only run the left_join() when the fit is rank-deficient and it is
necessary, and add tests to ensure the rows are added correctly for
rank-deficient fits.

An example:

library(broom)
library(profvis)

profvis(replicate(1000, tidy(lm(dist ~ speed, data = cars))))

Before this change, the expression ran in 3470 ms, of which 2570
ms (74%) was in tidy.lm() and 1590 ms (46%) was in
left_join.data.frame(). After this change, the expression ran in 1740
ms, of which 1010 ms was in tidy.lm(), cutting 60% from the total time
in tidy.lm(). This is useful for simulation studies that use tidy() to
get estimates for each simulated fit.

Thanks for making a pull request to broom! A few things to keep in mind that will probably help this PR be merged more quickly:

  • Have you documented the change in NEWS.md?
    • As this doesn't change any user-facing API, I haven't, but can add it if performance changes are significant
  • If this is your first time PRing to broom, have you added yourself as a contributor in the DESCRIPTION?
  • Have you added any new vocabulary to inst/WORDLIST? (New vocabulary will be noted in the R-CMD-check from GitHub Actions.)
  • Does R-CMD-check pass on GitHub Actions? (Sometimes, checks may not be passing on the main branch already—if that's the case, just try to make sure your changes don't add any additional errors/warnings.)
  • Have you updated DESCRIPTION if your feature/bug fix requires a specific version of a package?
  • Have you added unit tests for any new functionality?

tidy.lm and tidy.glm use summary(x)$coefficients, then join with coef(x)
to handle the case where the fit is rank deficient and some terms with
NA coefficients are missing from summary(x)$coefficients. However, in a
simple experiment, the left_join() consumes a disproportionate amount of
time. Only run the left_join() when the fit is rank-deficient and it is
necessary, and add tests to ensure the rows are added correctly for
rank-deficient fits.

An example:

```
library(broom)
library(profvis)

profvis(replicate(1000, tidy(lm(dist ~ speed, data = cars))))
```

Before this change, the expression ran in 3470 ms, of which 2570
ms (74%) was in tidy.lm() and 1590 ms (46%) was in
left_join.data.frame(). After this change, the expression ran in 1740
ms, of which 1010 ms was in tidy.lm(), cutting 60% from the total time
in tidy.lm(). This is useful for simulation studies that use tidy() to
get estimates for each simulated fit.
@simonpcouch
Copy link
Collaborator

Thanks for the PR and thorough description! Just gave a thumbs up on triggering Actions, and will review more thoroughly once those run. :)

@simonpcouch
Copy link
Collaborator

Looks great! I appreciate you making this happen. Merging!👍

@simonpcouch simonpcouch merged commit 520a933 into tidymodels:main Aug 12, 2022
@github-actions
Copy link

This pull request has been automatically locked. If you believe the issue addressed here persists, please file a new PR (with a reprex: https://reprex.tidyverse.org) and link to this one.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants