-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes to allow sparsity predictor representations #373
Conversation
Here is some benchmarking FWIW: library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.0-2
library(parsnip)
n <- 1e5
y <- sample(0:1, n, replace = TRUE)
x1 <- rnorm(n)
x2 <- sample(1:400, n, replace = TRUE)
X1 <- sparse.model.matrix(~ x1 + factor(x2))
X2 <- model.matrix(~ x1 + factor(x2))
lasso_spec <- linear_reg(mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
bench::mark(
check = FALSE, iterations = 10,
lasso_spec %>% fit_xy(x = X2, y = y),
lasso_spec %>% fit_xy(x = X1, y = y),
glmnet(x = X2, y = y),
glmnet(x = X1, y = y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 lasso_spec %>% fit_xy(x = X2, y = y) 6.76s 6.81s 0.147 931.9MB
#> 2 lasso_spec %>% fit_xy(x = X1, y = y) 276.37ms 282.1ms 3.49 16.7MB
#> 3 glmnet(x = X2, y = y) 6.57s 6.68s 0.149 623MB
#> 4 glmnet(x = X1, y = y) 164.27ms 166.48ms 5.75 16.4MB
#> # … with 1 more variable: `gc/sec` <dbl> Created on 2020-10-02 by the reprex package (v0.3.0.9001) Looking so great! We can see the small parsnip overhead here, and the huge improvement to being able to use a sparse data structure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super happy about these changes. 😄
In the benchmarking I have been working on, this makes a big difference for glmnet especially in terms of fitting time.
We don't have this change/option documented anywhere in user-facing docs, and I think it should be surfaced at least mildly somewhere (not be entirely undocumented). Some options:
- In
fit.R
, change thex
parameter to something like
A matrix, sparse matrix, or data frame of predictors. Only some models have support for sparse matrix input.
See `parsnip::get_encoding()` for details.
- Add something to the Details of the models that have the support.
I think I like option 1 better.
Quick thought, only version 0.12.0 of ranger supports the x/y interface (the newest one), so you may want to add a version requirement to ranger in Suggests https://cran.r-project.org/web/packages/ranger/NEWS |
Co-authored-by: Davis Vaughan <davis@rstudio.com>
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
Related to discussion in tidymodels/tidymodels#42
Adds a new encoding fields called
allow_sparse_x
that is TRUE forglmnet
,xgboost
, andranger
models.ranger
now uses their x/y interface. Tests indicates that there is no difference.Some refactoring of
xgboost
code.Only changes the use of
fit_xy()
. For example:Created on 2020-09-29 by the reprex package (v0.3.0)