-
Notifications
You must be signed in to change notification settings - Fork 90
/
Copy pathparsnip.Rmd
157 lines (118 loc) · 6.43 KB
/
parsnip.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
---
title: "Introduction to parsnip"
output: rmarkdown::html_vignette
description: |
The goal of parsnip is to provide a tidy, unified interface to models to avoid
getting bogged down in the syntactical minutiae of the underlying software.
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Introduction to parsnip}
%\VignetteEncoding{UTF-8}
---
```{r ex_setup, include=FALSE}
knitr::opts_chunk$set(
message = FALSE,
digits = 3,
collapse = TRUE,
comment = "#>"
)
options(digits = 3)
library(parsnip)
set.seed(368783)
```
This package provides functions and methods to create and manipulate functions commonly used during modeling (e.g. fitting the model, making predictions, etc). It allows the user to manipulate how the same type of model can be created from different sources.
### Motivation
Modeling functions across different R packages can have very different interfaces. If you would like to try different approaches, there is a lot of syntactical minutiae to remember. The problem worsens when you move in-between platforms (e.g. doing a logistic regression in R's `glm` versus Spark's implementation).
parsnip tries to solve this by providing similar interfaces to models. For example, if you are fitting a random forest model and would like to adjust the number of trees in the forest there are different argument names to remember:
* `randomForest::randomForest` uses `ntree`,
* `ranger::ranger` uses `num.trees`,
* Spark's `sparklyr::ml_random_forest` uses `num_trees`.
Rather than remembering these values, a common interface to these models can be used with
```{r rf-ex}
library(parsnip)
rf_mod <- rand_forest(trees = 2000)
```
The package makes the translation between `trees` and the real names in each of the implementations.
Some terminology:
* The **model type** differentiates models. Example types are: random forests, logistic regression, linear support vector machines, etc.
* The **mode** of the model denotes how it will be used. Two common modes are _classification_ and _regression_. Others would include "censored regression" and "risk regression" (parametric and Cox PH models for censored data, respectively), as well as unsupervised models (e.g. "clustering").
* The **computational engine** indicates how the actual model might be fit. These are often R packages (such as `randomForest` or `ranger`) but might also be methods outside of R (e.g. Stan, Spark, and others).
The parsnip package, similar to ggplot2, dplyr and recipes, separates the specification of what you want to do from the actual doing. This allows us to create broader functionality for modeling.
### Placeholders for Parameters
There are times where you would like to change a parameter from its default but you are not sure what the final value will be. This is the basis for _model tuning_ where we use the [tune](https://tune.tidymodels.org/) package. Since the model is not executing when created, these types of parameters can be changed using the `tune()` function. This provides a simple placeholder for the value.
```{r rf-tune}
tune_mtry <- rand_forest(trees = 2000, mtry = tune())
tune_mtry
```
This will come in handy later when we fit the model over different values of `mtry`.
### Specifying Arguments
Commonly used arguments to the modeling functions have their parameters exposed in the function. For example, `rand_forest` has arguments for:
* `mtry`: The number of predictors that will be randomly sampled at each split when creating the tree models.
* `trees`: The number of trees contained in the ensemble.
* `min_n`: The minimum number of data points in a node that are required for the node to be split further.
The arguments to the default function are:
```{r rf-def}
args(rand_forest)
```
However, there might be other arguments that you would like to change or allow to vary. These are accessible using `set_engine`. For example, `ranger()` from the ranger package has an option to set the internal random number seed. To set this to a specific value:
```{r rf-seed}
rf_with_seed <-
rand_forest(trees = 2000, mtry = tune(), mode = "regression") %>%
set_engine("ranger", seed = 63233)
rf_with_seed
```
### Process
To fit the model, you must:
* have a defined model, including the _mode_,
* have no `tune()` parameters, and
* specify a computational engine.
For example, `rf_with_seed` above is not ready for fitting due the `tune()` parameter. We can set that parameter's value and then create the model fit:
```{r, eval = FALSE}
rf_with_seed %>%
set_args(mtry = 4) %>%
set_engine("ranger") %>%
fit(mpg ~ ., data = mtcars)
```
```
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~4, x), num.trees = ~2000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type: Regression
#> Number of trees: 2000
#> Sample size: 32
#> Number of independent variables: 10
#> Mtry: 4
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 5.57
#> R squared (OOB): 0.847
```
Or, using the `randomForest` package:
```{r, eval = FALSE}
set.seed(56982)
rf_with_seed %>%
set_args(mtry = 4) %>%
set_engine("randomForest") %>%
fit(mpg ~ ., data = mtcars)
```
```
#> parsnip model object
#>
#>
#> Call:
#> randomForest(x = maybe_data_frame(x), y = y, ntree = ~2000, mtry = min_cols(~4, x))
#> Type of random forest: regression
#> Number of trees: 2000
#> No. of variables tried at each split: 4
#>
#> Mean of squared residuals: 5.52
#> % Var explained: 84.3
```
Note that the call objects show `num.trees = ~2000`. The tilde is the consequence of `parsnip` using [quosures](https://adv-r.hadley.nz/evaluation.html#quosures) to process the model specification's arguments.
Normally, when a function is executed, the function's arguments are immediately evaluated. In the case of parsnip, the model specification's arguments are _not_; the [expression is captured](https://www.tidyverse.org/blog/2019/04/parsnip-internals/) along with the environment where it should be evaluated. That is what a quosure does.
parsnip uses these expressions to make a model fit call that is evaluated. The tilde in the call above reflects that the argument was captured using a quosure.