Changes to allow sparsity predictor representations #373

topepo · 2020-09-29T22:50:04Z

Related to discussion in tidymodels/tidymodels#42

Adds a new encoding field called allow_sparse_x that is TRUE for glmnet, xgboost, and ranger models.
ranger now uses its x/y interface. Tests indicate that there is no difference.
Some refactoring of the xgboost code.

Only changes the use of fit_xy(). For example:

library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0          ✓ recipes   0.1.13    
#> ✓ dials     0.0.9          ✓ rsample   0.0.8     
#> ✓ dplyr     1.0.2          ✓ tibble    3.0.3     
#> ✓ ggplot2   3.3.2          ✓ tidyr     1.1.2     
#> ✓ infer     0.5.2          ✓ tune      0.1.1.9000
#> ✓ modeldata 0.0.2          ✓ workflows 0.2.0     
#> ✓ parsnip   0.1.3.9000     ✓ yardstick 0.0.7     
#> ✓ purrr     0.3.4
#> ── Conflicts ────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
library(Matrix)
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack

xgb_spec <-
  boost_tree(trees = 10) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

mtcar_x <- mtcars[, -1]
mtcar_mat <- as.matrix(mtcar_x)
mtcar_smat <- Matrix(mtcar_mat, sparse = TRUE)

set.seed(1)
from_df <- xgb_spec %>% fit_xy(mtcar_x, mtcars$mpg)
set.seed(1)
from_mat <- xgb_spec %>% fit_xy(mtcar_mat, mtcars$mpg)
set.seed(1)
from_sparse <- xgb_spec %>% fit_xy(mtcar_smat, mtcars$mpg)

all.equal(from_df$fit, from_mat$fit)
#> [1] TRUE
all.equal(from_df$fit, from_sparse$fit)
#> [1] TRUE

Created on 2020-09-29 by the reprex package (v0.3.0)

juliasilge · 2020-10-02T14:29:44Z

Here is some benchmarking FWIW:

library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.0-2
library(parsnip)

n <- 1e5 
y <- sample(0:1, n, replace = TRUE) 
x1 <- rnorm(n)
x2 <- sample(1:400, n, replace = TRUE)
X1 <- sparse.model.matrix(~ x1 + factor(x2))
X2 <- model.matrix(~ x1 + factor(x2))

lasso_spec <- linear_reg(mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

bench::mark(
  check = FALSE, iterations = 10,
  lasso_spec %>% fit_xy(x = X2, y = y),
  lasso_spec %>% fit_xy(x = X1, y = y),
  glmnet(x = X2, y = y),
  glmnet(x = X1, y = y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#>   expression                                min   median `itr/sec` mem_alloc
#>   <bch:expr>                           <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 lasso_spec %>% fit_xy(x = X2, y = y)    6.76s    6.81s     0.147   931.9MB
#> 2 lasso_spec %>% fit_xy(x = X1, y = y) 276.37ms  282.1ms     3.49     16.7MB
#> 3 glmnet(x = X2, y = y)                   6.57s    6.68s     0.149     623MB
#> 4 glmnet(x = X1, y = y)                164.27ms 166.48ms     5.75     16.4MB
#> # … with 1 more variable: `gc/sec` <dbl>

Created on 2020-10-02 by the reprex package (v0.3.0.9001)

Looking so great! We can see the small parsnip overhead here, and the huge improvement from being able to use a sparse data structure.
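To make the `mem_alloc` gap above a bit more concrete, here is a minimal sketch (not part of the PR) that compares the storage size of the dense and sparse model matrices. It assumes only base R and the Matrix package, and uses a smaller `n` than the benchmark so it runs quickly; the object names are made up for illustration.

```r
library(Matrix)

set.seed(123)
n  <- 1e4                                  # smaller than the benchmark above
x1 <- rnorm(n)
x2 <- sample(1:400, n, replace = TRUE)

# Same design, two storage formats
dense_x  <- model.matrix(~ x1 + factor(x2))
sparse_x <- sparse.model.matrix(~ x1 + factor(x2))

# Share of cells that are non-zero: each row has at most an intercept,
# x1, and a single indicator column for x2
nonzero_frac <- nnzero(sparse_x) / prod(dim(sparse_x))

# Storage in bytes for each representation
dense_bytes  <- as.numeric(object.size(dense_x))
sparse_bytes <- as.numeric(object.size(sparse_x))

nonzero_frac
dense_bytes / sparse_bytes
```

With a 400-level factor almost every cell is zero, which is why the sparse representation should be far smaller and why passing it straight through `fit_xy()` avoids the dense conversion cost.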

juliasilge

Super happy about these changes. 😄

In the benchmarking I have been working on, this makes a big difference for glmnet especially in terms of fitting time.

We don't have this change/option documented anywhere in user-facing docs, and I think it should be surfaced at least mildly somewhere (not be entirely undocumented). Some options:

1. In fit.R, change the x parameter to something like

   A matrix, sparse matrix, or data frame of predictors. Only some models have support for sparse matrix input. See `parsnip::get_encoding()` for details.

2. Add something to the Details of the models that have the support.

I think I like option 1 better.
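For readers following the `parsnip::get_encoding()` pointer, a minimal sketch of how the new field could be inspected. It assumes a parsnip version that includes this PR, so that the returned tibble has an `allow_sparse_x` column; the object names are illustrative only.

```r
library(parsnip)

# Encoding information for every registered boost_tree engine
enc <- get_encoding("boost_tree")

# Keep only the columns relevant to sparse input
sparse_info <- enc[, c("model", "engine", "mode", "allow_sparse_x")]
sparse_info

# Engines flagged as accepting a sparse matrix in fit_xy()
unique(sparse_info$engine[sparse_info$allow_sparse_x])
```

Per the PR description, xgboost should appear among the flagged engines; the same call with "linear_reg" or "rand_forest" would show the glmnet and ranger entries.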

DavisVaughan · 2020-10-05T15:31:59Z

Quick thought: only ranger 0.12.0 (the newest release) supports the x/y interface, so you may want to add a version requirement for ranger in Suggests. See https://cran.r-project.org/web/packages/ranger/NEWS
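In DESCRIPTION, such a requirement is usually written as `ranger (>= 0.12.0)` in the Suggests field. As a rough sketch of the matching runtime check, the helper below is hypothetical (it is not a parsnip function) and uses only base R to test whether an installed package meets a minimum version.

```r
# Hypothetical helper, not part of parsnip: TRUE when `pkg` is installed
# at `min_version` or newer, FALSE when it is missing or too old.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# The ranger x/y interface needs at least 0.12.0
has_min_version("ranger", "0.12.0")
```

A check like this could guard tests or code paths that rely on the x/y interface, so that older ranger installations fail with a clear message instead of an obscure error.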


github-actions · 2021-03-06T00:31:43Z

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

topepo added 7 commits September 28, 2020 20:56

initial loosening of x input format

3255172

TODO placeholders for tidymodels/tidymodels#42

168fb93

test new matrix conversion function

9e30b6b

enable xgboost to use sparse X for tidymodels/tidymodels#42

c9d21e1

name changes for data conversion utilities

99950ec

changes for sparse matrices with ranger

0cb413c

added encoding field for sparse matrices for tidymodels/tidymodels#42

659b5ad

topepo requested a review from juliasilge September 30, 2020 13:04

updated news file

2f2737a

topepo requested a review from DavisVaughan October 1, 2020 17:38

juliasilge requested changes Oct 2, 2020


DavisVaughan approved these changes Oct 5, 2020


NAMESPACE (resolved)

NEWS.md (outdated, resolved)

R/boost_tree.R (resolved)

R/fit.R (outdated, resolved)

R/fit.R (outdated, resolved)

topepo and others added 6 commits October 5, 2020 15:54

ranger version req

7bc0bd3

Update R/fit.R

bb25ada

Co-authored-by: Davis Vaughan <davis@rstudio.com>

typo in interface results for #373

078d10c

Merge branch 'sparsity' of https://github.com/tidymodels/parsnip into sparsity

4083321

more info in news file

516c6f8

doc change from Julia's suggestion

9ddf191

juliasilge approved these changes Oct 5, 2020


topepo merged commit 88e23c4 into master Oct 5, 2020

topepo deleted the sparsity branch October 5, 2020 23:31

github-actions bot locked and limited conversation to collaborators Mar 6, 2021
