# Indicator Variable Details
```{r echo=FALSE}
options(cli.width = 70, width = 70, cli.unicode = FALSE)
# Load them early on so package conflict messages don't show up
suppressPackageStartupMessages({
  library(parsnip)
  library(recipes)
  library(workflows)
  library(modeldata)
})
```
Some modeling functions in R create indicator/dummy variables from categorical data when you use a model formula, and some do not. When you specify and fit a model with a `workflow()`, parsnip and workflows match and reproduce the underlying behavior of the user-specified model's computational engine.
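To see the formula behavior in isolation, here is a minimal base R sketch using `model.matrix()`, which formula-based functions such as `lm()` rely on to expand factors. The `toy` data frame and its values are made up for illustration and are not part of any real data set.

```{r}
# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  type = factor(c("Condo", "Multi_Family", "Residential", "Condo")),
  sqft = c(850, 1900, 1400, 920)
)

# model.matrix() expands the factor into indicator columns, dropping the
# first level ("Condo") as the baseline under the default treatment contrasts
indicator_cols <- colnames(model.matrix(~ type + sqft, data = toy))
indicator_cols
```

An engine that accepts the data frame directly, rather than a model matrix, would instead receive `type` as a single factor column.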
## Formula Preprocessor
In the [modeldata::Sacramento] data set of real estate prices, the `type` variable has three levels: `"Residential"`, `"Condo"`, and `"Multi_Family"`. This base `workflow()` contains a formula added via [add_formula()] to predict property price from property type, square footage, number of beds, and number of baths:
```{r}
set.seed(123)
library(parsnip)
library(recipes)
library(workflows)
library(modeldata)
data("Sacramento")
base_wf <- workflow() %>%
  add_formula(price ~ type + sqft + beds + baths)
```
This first model does create dummy/indicator variables:
```{r}
lm_spec <- linear_reg() %>%
  set_engine("lm")

base_wf %>%
  add_model(lm_spec) %>%
  fit(Sacramento)
```
There are **five** independent variables in the fitted model for this OLS linear regression. With this model type and engine, the factor predictor `type` of the real estate properties was converted to two binary predictors, `typeMulti_Family` and `typeResidential`. (The third type, for condos, does not need its own column because it is the baseline level.)
This second model does not create dummy/indicator variables:
```{r}
rf_spec <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger")

base_wf %>%
  add_model(rf_spec) %>%
  fit(Sacramento)
```
Note that there are **four** independent variables in the fitted model for this ranger random forest. With this model type and engine, indicator variables were not created for the `type` of real estate property being sold. Tree-based models such as random forest models can handle factor predictors directly, and don't need any conversion to numeric binary variables.
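One way to check what each engine actually received is to look at the preprocessed predictors stored in the fitted workflow. The sketch below assumes that `extract_mold()` is available in your installed version of workflows and that the ranger package is installed. The names `lm_fit`, `rf_fit`, `lm_cols`, and `rf_cols` are introduced here for illustration only.

```{r}
library(parsnip)
library(workflows)
library(modeldata)

data("Sacramento")

base_wf <- workflow() %>%
  add_formula(price ~ type + sqft + beds + baths)

# Fit the same formula with both engines
lm_fit <- base_wf %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(Sacramento)

rf_fit <- base_wf %>%
  add_model(rand_forest() %>% set_mode("regression") %>% set_engine("ranger")) %>%
  fit(Sacramento)

# Column names of the preprocessed predictors for each fitted workflow
lm_cols <- names(extract_mold(lm_fit)$predictors)
rf_cols <- names(extract_mold(rf_fit)$predictors)

lm_cols
rf_cols
```

If the behavior matches the description above, `rf_cols` keeps `type` as a single column while `lm_cols` contains the expanded indicator columns instead. Depending on the engine's encoding settings, the lm predictors may also include an intercept column.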
## Recipe Preprocessor
When you specify a model with a `workflow()` and a recipe preprocessor via [add_recipe()], the _recipe_ controls whether dummy variables are created or not; the recipe overrides any underlying behavior from the model's computational engine.
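As a minimal sketch of this, the recipe below adds `step_dummy()` so that the ranger random forest, which would otherwise use the factor directly, is fit on indicator columns. It assumes the ranger package is installed and that `extract_mold()` is available in your version of workflows. The names `dummy_rec`, `rf_dummy_fit`, and `rec_cols` are hypothetical.

```{r}
library(parsnip)
library(recipes)
library(workflows)
library(modeldata)

data("Sacramento")

# The recipe, not the engine, decides that `type` becomes indicator columns
dummy_rec <- recipe(price ~ type + sqft + beds + baths, data = Sacramento) %>%
  step_dummy(type)

rf_spec <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger")

rf_dummy_fit <- workflow() %>%
  add_recipe(dummy_rec) %>%
  add_model(rf_spec) %>%
  fit(Sacramento)

# Predictor columns produced by the recipe and passed on to ranger
rec_cols <- names(extract_mold(rf_dummy_fit)$predictors)
rec_cols
```

Leaving `step_dummy()` out of the recipe would pass `type` through as a factor, so the choice sits entirely in the preprocessing steps you write.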