Changes to allow sparsity predictor representations #373

Merged
merged 14 commits into from
Oct 5, 2020

topepo (Member) commented Sep 29, 2020

Related to discussion in tidymodels/tidymodels#42

  • Adds a new encoding field called allow_sparse_x that is TRUE for glmnet, xgboost, and ranger models.

  • ranger now uses its x/y interface. Tests indicate that there is no difference.

  • Some refactoring of xgboost code.

This only affects fit_xy(). For example:

library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0          ✓ recipes   0.1.13    
#> ✓ dials     0.0.9          ✓ rsample   0.0.8     
#> ✓ dplyr     1.0.2          ✓ tibble    3.0.3     
#> ✓ ggplot2   3.3.2          ✓ tidyr     1.1.2     
#> ✓ infer     0.5.2          ✓ tune      0.1.1.9000
#> ✓ modeldata 0.0.2          ✓ workflows 0.2.0     
#> ✓ parsnip   0.1.3.9000     ✓ yardstick 0.0.7     
#> ✓ purrr     0.3.4
#> ── Conflicts ────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
library(Matrix)
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack

xgb_spec <-
  boost_tree(trees = 10) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

mtcar_x <- mtcars[, -1]
mtcar_mat <- as.matrix(mtcar_x)
mtcar_smat <- Matrix(mtcar_mat, sparse = TRUE)

set.seed(1)
from_df <- xgb_spec %>% fit_xy(mtcar_x, mtcars$mpg)
set.seed(1)
from_mat <- xgb_spec %>% fit_xy(mtcar_mat, mtcars$mpg)
set.seed(1)
from_sparse <- xgb_spec %>% fit_xy(mtcar_smat, mtcars$mpg)

all.equal(from_df$fit, from_mat$fit)
#> [1] TRUE
all.equal(from_df$fit, from_sparse$fit)
#> [1] TRUE

Created on 2020-09-29 by the reprex package (v0.3.0)
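
To see which engines have the new flag set, the encoding table can be inspected directly. This is a minimal sketch, assuming a parsnip version that includes this PR so that `get_encoding()` returns an `allow_sparse_x` column; the object names `enc` and `xgb_enc` are only for illustration.

```r
library(parsnip)

# Encoding information registered for every boost_tree() engine/mode pair.
# Assumption: the installed parsnip already has the allow_sparse_x field.
enc <- get_encoding("boost_tree")

# Keep only the xgboost rows and the columns relevant to sparse input.
xgb_enc <- enc[enc$engine == "xgboost", c("model", "engine", "mode", "allow_sparse_x")]
xgb_enc
```

The same call with `"linear_reg"` or `"rand_forest"` should show the flag for the glmnet and ranger engines.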

juliasilge (Member) commented Oct 2, 2020

Here is some benchmarking FWIW:

library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.0-2
library(parsnip)

n <- 1e5 
y <- sample(0:1, n, replace = TRUE) 
x1 <- rnorm(n)
x2 <- sample(1:400, n, replace = TRUE)
X1 <- sparse.model.matrix(~ x1 + factor(x2))
X2 <- model.matrix(~ x1 + factor(x2))

lasso_spec <- linear_reg(mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

bench::mark(
  check = FALSE, iterations = 10,
  lasso_spec %>% fit_xy(x = X2, y = y),
  lasso_spec %>% fit_xy(x = X1, y = y),
  glmnet(x = X2, y = y),
  glmnet(x = X1, y = y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#>   expression                                min   median `itr/sec` mem_alloc
#>   <bch:expr>                           <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 lasso_spec %>% fit_xy(x = X2, y = y)    6.76s    6.81s     0.147   931.9MB
#> 2 lasso_spec %>% fit_xy(x = X1, y = y) 276.37ms  282.1ms     3.49     16.7MB
#> 3 glmnet(x = X2, y = y)                   6.57s    6.68s     0.149     623MB
#> 4 glmnet(x = X1, y = y)                164.27ms 166.48ms     5.75     16.4MB
#> # … with 1 more variable: `gc/sec` <dbl>

Created on 2020-10-02 by the reprex package (v0.3.0.9001)

Looking so great! We can see the small parsnip overhead here, and the huge improvement from being able to use a sparse data structure.
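
A rough way to see where the gap comes from is to compare the two design matrices directly. This is only a sketch that reuses the simulated predictors from the benchmark with a smaller `n` (an assumption, to keep it quick); it needs only the Matrix package and does not fit any model.

```r
library(Matrix)

set.seed(373)
n  <- 1e4                                  # smaller than the benchmark's 1e5
x1 <- rnorm(n)
x2 <- sample(1:400, n, replace = TRUE)

dense  <- model.matrix(~ x1 + factor(x2))         # base R dense matrix
sparse <- sparse.model.matrix(~ x1 + factor(x2))  # Matrix sparse (dgCMatrix) version

# Approximate in-memory size of each representation, in bytes.
dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))

# Fraction of cells that are non-zero: each row has an intercept, x1,
# and at most one indicator column for factor(x2).
nonzero_frac <- nnzero(sparse) / prod(dim(sparse))

c(dense_MB = dense_bytes / 1e6, sparse_MB = sparse_bytes / 1e6, nonzero = nonzero_frac)
```

Almost every cell of the dense matrix is a stored zero, which is consistent with the memory and timing differences in the table above.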

juliasilge (Member) left a comment

Super happy about these changes. 😄

In the benchmarking I have been working on, this makes a big difference for glmnet especially in terms of fitting time.

We don't have this change/option documented anywhere in user-facing docs, and I think it should be surfaced at least mildly somewhere rather than left entirely undocumented. Some options:

  • In fit.R, change the x parameter to something like:
      A matrix, sparse matrix, or data frame of predictors. Only some models have support for sparse matrix input.
      See `parsnip::get_encoding()` for details.
  • Add something to the Details of the models that have the support.

I think I like option 1 better.

DavisVaughan (Member) commented

Quick thought: only version 0.12.0 of ranger (the newest one) supports the x/y interface, so you may want to add a version requirement to ranger in Suggests. See https://cran.r-project.org/web/packages/ranger/NEWS
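
Besides a versioned `Suggests` entry, a runtime check is one possible guard. The helper below is hypothetical (the name `ranger_supports_xy` is not part of parsnip or this PR); it only uses `requireNamespace()` and `utils::packageVersion()`, with the minimum version taken from the comment above.

```r
# Hypothetical helper, not parsnip code: TRUE only when ranger is installed
# and is at least the version that introduced the x/y interface.
ranger_supports_xy <- function(min_version = "0.12.0") {
  requireNamespace("ranger", quietly = TRUE) &&
    utils::packageVersion("ranger") >= min_version
}

has_xy <- ranger_supports_xy()
has_xy
```

A check like this returns FALSE rather than erroring when ranger is missing, so it could gate either the fit call or the tests that exercise it.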

@topepo topepo merged commit 88e23c4 into master Oct 5, 2020
@topepo topepo deleted the sparsity branch October 5, 2020 23:31
github-actions bot commented Mar 6, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021