take look at hockey explainer results #78

topepo · 2022-07-22T17:16:06Z

topepo · 2022-07-22T19:56:26Z

tl;dr

The slide is correct as-is. The code needed a small change, but that change does not affect the results.

details:

The data have:

levels(nhl_train$on_goal)
#> [1] "yes" "no"

# This is greater than 1 / 2 (so a positive log-odds)
mean(nhl_train$on_goal == "yes")
#> [1] 0.5515917

What is parsnip doing?

The intercept in our model is negative:

final_glm_spline_wflow %>%
  tidy() %>%
  filter(grepl("Intercept", term))
#> # A tibble: 1 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)   -0.241    0.0381     -6.32 2.57e-10

glm() via parsnip is modeling the probability of being not on-goal (as expected).
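As a quick sanity check on that direction, here is a minimal base-R sketch using made-up data (not the NHL data; the `toy_*` names are illustrative). With a factor outcome, `glm()` treats the first level as the "failure" and models the probability of the other level, so with levels "yes" then "no" an intercept-only fit should recover the log-odds of "no":

```r
# Toy outcome (illustrative only): 55 "yes" and 45 "no", with "yes" as the first level
toy_outcome <- factor(rep(c("yes", "no"), times = c(55, 45)), levels = c("yes", "no"))

# Intercept-only logistic regression
toy_fit <- glm(toy_outcome ~ 1, family = binomial)

# The intercept matches the log-odds of "no" (negative, since "no" is below 1/2)
intercept <- unname(coef(toy_fit))
log_odds_no <- qlogis(mean(toy_outcome == "no"))
c(intercept = intercept, log_odds_no = log_odds_no)
```

This mirrors the real data: "yes" is the majority class, so the log-odds of "no" is negative.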

A positive slope for defenseman indicates that:

being a defenseman increases the probability of a shot not being on goal, or
defensemen are less likely to have shots on goal

final_glm_spline_wflow %>%
  tidy() %>%
  filter(grepl("position", term))
#> # A tibble: 4 × 5
#>   term                estimate std.error statistic  p.value
#>   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 position_defenseman  0.129      0.0333    3.88   0.000103
#> 2 position_goalie     -0.104      2.06     -0.0506 0.960   
#> 3 position_left_wing   0.00438    0.0266    0.164  0.870   
#> 4 position_right_wing  0.0287     0.0273    1.05   0.294
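
For scale, the defenseman estimate can be moved from the log-odds scale to an odds ratio. This is only arithmetic on the printed coefficient (0.129), holding the other terms in the model fixed. The comparison is against the reference position, which is assumed to be the level that does not appear in the table:

```r
# Estimate for position_defenseman, copied from the tidy() output above
defenseman_log_odds <- 0.129

# Odds ratio for "not on goal": defenseman vs. the reference position
defenseman_odds_ratio <- exp(defenseman_log_odds)
round(defenseman_odds_ratio, 2)
#> [1] 1.14
```

So the odds of a shot not being on goal are roughly 14% higher for a defenseman than for the reference position.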

Just to be sure, what do the raw data say?

nhl_train %>%
  mutate(binned_x = ntile(coord_x, 15)) %>%
  group_by(binned_x, position) %>%
  summarize(
    on_goal_rate = mean(on_goal == "yes"),
    mean_x = mean(coord_x),
    .groups = "drop"
  ) %>%
  filter(position != "goalie") %>%
  ggplot(aes(mean_x, on_goal_rate, col = position)) +
  geom_line() +
  geom_point() +
  lims(y = 0:1)

I believe that the y-axis label of

"Predicted probability of not being on goal"

is correct.

However ... what is DALEX doing?

The slide has:

  library(DALEXtra)

  glm_explainer <- explain_tidymodels(
    final_glm_spline_wflow,
    data = dplyr::select(nhl_train, -on_goal),
    # DALEX required an integer for factors:
    y = as.integer(nhl_train$on_goal),
    verbose = FALSE
  )

  set.seed(123)
  pdp_coord_x <- model_profile(
    glm_explainer,
    variables = "coord_x",
    N = 500,
    groups = "position"
  )

Let's reformat the data to run glm() manually at first:

wflow_mold <- extract_mold(final_glm_spline_wflow)

train_data <-
  bind_cols(wflow_mold$predictors, wflow_mold$outcome) %>%
  mutate(
    # This re-encodes 1 = yes, 2 = no
    on_goal = as.integer(on_goal)
  )

If you were to run glm() on these data, it would fail:

int_glm <- glm(on_goal ~ ., data = train_data, family = binomial)
#> Error in eval(family$initialize): y values must be 0 <= y <= 1

So, for this type of model explainer, DALEX never has to fit the model itself, so it never calls glm(). The as.integer() encoding is wrong (it gives 1 and 2 rather than 0 and 1), but it doesn't produce an error since the outcome is never needed for fitting.
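
The encoding problem is easy to reproduce outside of DALEX. A minimal sketch with a made-up factor (the `toy_*` names are illustrative, not from the slides): `as.integer()` on a two-level factor returns 1 and 2, which a binomial `glm()` rejects.

```r
toy_on_goal <- factor(c("yes", "no", "yes", "no", "yes"), levels = c("yes", "no"))

# Factor levels become the integer codes 1 ("yes") and 2 ("no")
toy_codes <- as.integer(toy_on_goal)
toy_codes
#> [1] 1 2 1 2 1

# A binomial glm() needs a response between 0 and 1, so this call errors;
# tryCatch() captures the error message instead of stopping the script
toy_result <- tryCatch(
  glm(toy_codes ~ 1, family = binomial),
  error = function(e) conditionMessage(e)
)
toy_result
```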

I changed the slide to be

  library(DALEXtra)

  glm_explainer <- explain_tidymodels(
    final_glm_spline_wflow,
    data = dplyr::select(nhl_train, -on_goal),
    # DALEX required an integer for factors:
    y = as.integer(nhl_train$on_goal) - 1,
    verbose = FALSE
  )

This 0/1 encoding is more appropriate in case DALEX ever does need to run glm() for some other explainer.
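
To see that subtracting 1 keeps the same direction as the factor fit, here is a small sketch on invented data (`toy_data` and its columns are hypothetical). "no" becomes 1, so a `glm()` on the recoded outcome still models the probability of "no" and should give the same coefficients as the factor version:

```r
toy_data <- data.frame(
  x = 1:8,
  on_goal = factor(
    c("yes", "yes", "no", "yes", "no", "no", "yes", "no"),
    levels = c("yes", "no")
  )
)

# Recode so that 0 = "yes" and 1 = "no"
toy_data$on_goal_int <- as.integer(toy_data$on_goal) - 1

fit_factor <- glm(on_goal ~ x, data = toy_data, family = binomial)
fit_int <- glm(on_goal_int ~ x, data = toy_data, family = binomial)

# Both fits model Pr(not on goal), so the coefficients agree
all.equal(coef(fit_factor), coef(fit_int))
```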

topepo self-assigned this Jul 22, 2022

topepo closed this as completed Jul 22, 2022
