Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Effect encoding extras #136

Merged
merged 4 commits into from
Jul 9, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion index.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ These slides are designed to use with live teaching and are published for worksh

### Advanced tidymodels

- 01: [Feature engineering using recipes]()
- 01: [Feature engineering using recipes](slides/extras-effect-encodings.html)
- 02: [Tuning hyperparameters (grid search)](slides/advanced-02-tuning-hyperparameters.html)
- 03: [Grid search via racing](slides/advanced-03-racing.html)
- 04: [Iterative search](slides/advanced-04-iterative.html)
Expand Down
356 changes: 356 additions & 0 deletions slides/extras-effect-encodings.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,356 @@
---
title: "Extras - Effect Encodings"
subtitle: "Advanced tidymodels"
format:
revealjs:
slide-number: true
footer: <https://workshops.tidymodels.org>
include-before-body: header.html
include-after-body: footer-annotations.html
theme: [default, tidymodels.scss]
width: 1280
height: 720
knitr:
opts_chunk:
echo: true
collapse: true
comment: "#>"
fig.path: "figures/"
---

```{r}
#| label: setup
#| include: false
#| file: setup.R
```

```{r more-setup}
#| include: false
library(modeldatatoo)
library(probably)
library(countdown)
library(finetune)

cores <- parallelly::availableCores(logical = FALSE)
cl <- parallel::makePSOCKcluster(cores)
doParallel::registerDoParallel(cl)

options(width = 200)

ggplot2::theme_set(ggplot2::theme_bw())
```

## Previously - Setup

:::: {.columns}

::: {.column width="40%"}

```{r}
#| label: tune-startup
library(tidymodels)
library(modeldatatoo)
library(textrecipes)
library(bonsai)

# Max's usual settings:
tidymodels_prefer()
theme_set(theme_bw())
options(
pillar.advice = FALSE,
pillar.min_title_chars = Inf
)
```

:::

::: {.column width="60%"}

```{r}
#| label: data-import
set.seed(295)
hotel_rates <-
data_hotel_rates() %>%
sample_n(5000) %>%
arrange(arrival_date) %>%
select(-arrival_date_num, -arrival_date) %>%
mutate(
company = factor(as.character(company)),
country = factor(as.character(country)),
agent = factor(as.character(agent))
)
```


:::

::::


## Previously - Data Usage

```{r}
#| label: hotel-split
set.seed(4028)
hotel_split <-
initial_split(hotel_rates, strata = avg_price_per_room)

hotel_tr <- training(hotel_split)
hotel_te <- testing(hotel_split)

set.seed(472)
hotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)
```


## What do we do with the agent and company data?

There are `r length(unique(hotel_tr$agent))` unique agent values and `r length(unique(hotel_tr$company))` companies in our training set. How can we include this information in our model?

. . .

We could:

- make the full set of indicator variables 😳

- lump agents and companies that rarely occur into an "other" group

- use [feature hashing](https://www.tmwr.org/categorical.html#feature-hashing) to create a smaller set of indicator variables

- use effect encoding to replace the `agent` and `company` columns with the estimated effect of that predictor



```{r}
#| label: effects-calcs
#| include: false

agent_stats <-
hotel_tr %>%
group_by(agent) %>%
summarize(
ADR = mean(avg_price_per_room),
num_reservations = n(),
.groups = "drop"
) %>%
mutate(agent = reorder(agent, ADR))

library(embed)

estimates <-
recipe(avg_price_per_room ~ ., data = hotel_tr) %>%
step_lencode_mixed(agent, outcome = vars(avg_price_per_room), id = "encoding") %>% #<<
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
prep() %>%
tidy(id = "encoding") %>%
select(agent = level, estimate = value)

before <- hotel_tr %>% select(avg_price_per_room, agent) %>% slice(1:7) %>% add_rowindex()
after <- left_join(before, estimates, by = "agent") %>%
select(avg_price_per_room, agent = estimate, .row)
```


## Per-player statistics

::: columns
::: {.column width="50%"}
```{r}
#| label: effects-freq
#| echo: false
#| out-width: '90%'
#| fig-width: 6
#| fig-height: 3
#| fig-align: 'center'
#| dev: 'svg'
#| dev-args: list(bg = "transparent")

agent_stats %>%
ggplot(aes(x = num_reservations)) +
geom_histogram(bins = 30, col = "blue", fill = "blue", alpha = 1/3) +
labs(x = "Number of reservations per player")
```
:::

::: {.column width="50%"}
```{r}
#| label: effects-adr
#| echo: false
#| out-width: '90%'
#| fig-width: 6
#| fig-height: 3
#| fig-align: 'center'
#| dev: 'svg'
#| dev-args: list(bg = "transparent")

agent_stats %>%
ggplot(aes(x = ADR)) +
geom_histogram(bins = 30, col = "red", fill = "red", alpha = 1/3) +
labs(x = "Average ADR per agent")
```
:::
:::


## What is an effect encoding?

We replace the qualitative’s predictor data with their _effect on the outcome_.

::: columns
::: {.column width="50%"}
Data before:

```{r}
#| label: before
before
```

:::

::: {.column width="50%"}

Data after:

```{r}
#| label: after
after
```

:::
:::

The `agent` column is replaced with an estimate of the ADR.


## Per-agent statistics again {.annotation}

::: columns
::: {.column width="50%"}
```{r}
#| label: effects-again
#| echo: false
#| out-width: '90%'
#| fig-width: 6
#| fig-height: 3
#| fig-align: 'center'
#| dev: 'svg'
#| dev-args: list(bg = "transparent")

agent_stats %>%
ggplot(aes(x = num_reservations)) +
geom_histogram(bins = 30, col = "blue", fill = "blue", alpha = 1/3) +
labs(x = "Number of reservations per agent")
agent_stats %>%
ggplot(aes(x = ADR)) +
geom_histogram(bins = 30, col = "red", fill = "red", alpha = 1/3) +
labs(x = "Average ADR per agent")
```
:::

::: {.column width="50%"}

- Good statistical methods for estimating these rates use *partial pooling*.


- Pooling borrows strength across agents and shrinks extreme values towards the mean for agents with very few transations


- The embed package has recipe steps for effect encodings.

:::
:::


:::notes
Partial pooling gives better estimates for agents with fewer reservations by shrinking the estimate to the overall ADR mean


:::

## Partial pooling

```{r}
#| label: effect-compare
#| echo: false
#| fig-width: 6
#| fig-height: 6
#| fig-align: 'center'
#| dev: 'svglite'
#| dev-args: list(bg = "transparent")

inner_join(agent_stats, estimates, by = "agent") %>%
ggplot(aes(x = ADR, y = estimate)) +
geom_abline(col = "green", lty = 2) +
geom_point(aes(size = num_reservations), alpha = 1/3) +
coord_obs_pred() +
scale_size(range = c(1/3, 5)) +
labs(x = "ADR Sample Mean", y = "Estimated via Effects Encoding")
```

## Agent effects `r hexes("recipes","embed")` {.annotation}

```{r}
#| label: lencode-rec
#| code-line-numbers: "1,6|"
library(embed)

hotel_effect_rec <-
recipe(avg_price_per_room ~ ., data = hotel_tr) %>%
step_YeoJohnson(lead_time) %>%
step_lencode_mixed(agent, company, outcome = vars(avg_price_per_room)) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
```

. . .

It is very important to appropriately validate the effect encoding step to make sure that we are not overfitting.

## Recipes are estimated `r hexes("recipes")`

Preprocessing steps in a recipe use the *training set* to compute quantities.

. . .

What kind of quantities are computed for preprocessing?

- Levels of a factor
- Whether a column has zero variance
- Normalization
- Feature extraction
- Effect encodings

. . .

When a recipe is part of a workflow, this estimation occurs when `fit()` is called.
topepo marked this conversation as resolved.
Show resolved Hide resolved

## Effect encoding results `r hexes("recipes","embed", "workflows", "tune")`

```{r}
#| label: resample-encoding
#| code-line-numbers: "4|"
hotel_effect_wflow <-
workflow() %>%
add_model(linear_reg()) %>%
update_recipe(hotel_effect_rec)

reg_metrics <- metric_set(mae, rsq)

hotel_effect_res <-
hotel_effect_wflow %>%
fit_resamples(hotel_rs, metrics = reg_metrics)

collect_metrics(hotel_effect_res)
```

Slightly worse but it can handle new agents (if they occur).

```{r}
#| label: teardown
#| include: false

parallel::stopCluster(cl)

```