
Prediction for new group levels (rstanarm vs. lme4) #268

Closed · fweber144 opened this issue Jan 23, 2022 · 3 comments
fweber144 (Collaborator) commented:

For multilevel rstanarm reference models, there is an inconsistency between the reference model and the submodels with respect to how the linear predictor is calculated for new group levels in a K-fold CV: for the reference model, rstanarm::posterior_linpred.stanreg() is called with no arguments apart from object and newdata:

posterior_linpred(fit, newdata = newdata)
Thus, according to rstanarm's docs (see also the example below), this call marginalizes over the group-level effects (for the new levels only). In contrast, for the submodels, lme4::predict.merMod() is called with no arguments apart from object, newdata, and allow.new.levels = TRUE:
return(predict(fit, newdata = newdata, allow.new.levels = TRUE))
Thus, according to lme4's docs (see also the example below), this call sets group-level effects for new levels to zero.

My question now is: do you think this needs to be changed, and if so, how? The problem is that rstanarm's options for dealing with new levels are not quite the same as lme4's options. The only common possibility (as far as I understand these two packages) is to set the group-level effects for all levels (i.e., existing and new ones) to zero. This, however, has important implications for the evaluation of the models' predictive performance: adding a group-level effect to a submodel which previously included only population-level effects would then make no (or, in case there is a dispersion parameter, only a small) difference with respect to predictive performance measures like the ELPD, RMSE, etc. In other words, because the group-level effects of the existing levels would also be set to zero, the performance measures might look worse than the predictive performance actually is. Apart from that, I'm not sure whether, for families with a dispersion parameter, the original dispersion parameter estimates would then still be appropriate (similarly to this).
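
If I understand the two packages correctly, this common option corresponds to `re.form = NA` in both prediction functions. A minimal sketch, only for illustration and not a proposal for the final implementation (`fit` and `newdata` are placeholders as in the calls quoted above):

# rstanarm reference model: drop all group-level effects from the linear
# predictor, i.e., set them to zero for existing as well as new levels:
posterior_linpred(fit, newdata = newdata, re.form = NA)
# lme4 submodel: same, via predict.merMod()'s re.form argument:
predict(fit, newdata = newdata, re.form = NA)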

Needed for the following examples:

# Data --------------------------------------------------------------------

set.seed(457211)
dat <- matrix(rnorm(41 * 2, sd = 0.4),
              ncol = 2,
              dimnames = list(NULL, paste0("X", 1:2)))
dat <- data.frame(dat)
dat$group <- gl(n = 8, k = floor(nrow(dat) / 8), length = nrow(dat),
                labels = paste0("gr", seq_len(8)))
group_icpts_truth <- rnorm(nlevels(dat$group), sd = 6)
icpt <- -4.2
dat$y <- icpt + group_icpts_truth[dat$group]
dat$y <- rnorm(nrow(dat), mean = dat$y, sd = 4)

# New data ----------------------------------------------------------------

set.seed(457212)
dat_new <- matrix(rnorm(3 * 2, sd = 0.4),
                  ncol = 2,
                  dimnames = list(NULL, paste0("X", 1:2)))
dat_new <- data.frame(dat_new)
dat_new$group <- factor(paste0("gr", nlevels(dat$group) + 1L))

Example for how rstanarm deals with existing and new levels:

# Fit ---------------------------------------------------------------------

library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))
lmm_fit <- stan_lmer(y ~ X1 + X2 + (1 | group),
                     data = dat,
                     seed = 734572,
                     refresh = 0)

# Existing data -----------------------------------------------------------

lmm_pl <- posterior_linpred(lmm_fit)

lmm_drws <- as.matrix(lmm_fit)
lmm_pl_man <- lmm_drws[, "(Intercept)"] +
  lmm_drws[, paste0("X", 1:2), drop = FALSE] %*%
  t(dat[, paste0("X", 1:2), drop = FALSE]) +
  lmm_drws[, paste0("b[(Intercept) group:", dat$group, "]"), drop = FALSE]
stopifnot(isTRUE(all.equal(
  unname(lmm_pl), unname(lmm_pl_man), tolerance = 1e-15
)))

# New data ----------------------------------------------------------------

lmm_pl_new <- posterior_linpred(lmm_fit, newdata = dat_new)

lmm_pl_new_man <- lmm_drws[, "(Intercept)"] +
  lmm_drws[, paste0("X", 1:2), drop = FALSE] %*%
  t(dat_new[, paste0("X", 1:2), drop = FALSE]) +
  as.vector(rstan::extract(lmm_fit$stanfit, permuted = FALSE)[
    , , "b[(Intercept) group:_NEW_group]"
  ])
stopifnot(isTRUE(all.equal(
  unname(lmm_pl_new), lmm_pl_new_man, tolerance = 1e-15
)))

The same for lme4:

# Fit ---------------------------------------------------------------------

library(lme4)
lmm_fit <- lmer(y ~ X1 + X2 + (1 | group),
                data = dat)

# Existing data -----------------------------------------------------------

lmm_pr <- predict(lmm_fit)

lmm_fixef <- fixef(lmm_fit)
lmm_ranef <- ranef(lmm_fit)$group
lmm_pr_man <- lmm_fixef["(Intercept)"] +
  lmm_ranef[dat$group, "(Intercept)"] +
  drop(as.matrix(dat[, paste0("X", 1:2), drop = FALSE]) %*%
         lmm_fixef[paste0("X", 1:2)])
stopifnot(isTRUE(all.equal(
  unname(lmm_pr), lmm_pr_man, tolerance = 1e-15
)))

# New data ----------------------------------------------------------------

lmm_pr_new <- predict(lmm_fit, newdata = dat_new, allow.new.levels = TRUE)

lmm_pr_new_man <- lmm_fixef["(Intercept)"] +
  drop(as.matrix(dat_new[, paste0("X", 1:2), drop = FALSE]) %*%
         lmm_fixef[paste0("X", 1:2)])
stopifnot(isTRUE(all.equal(
  unname(lmm_pr_new), lmm_pr_new_man, tolerance = 1e-15
)))
fweber144 (Collaborator, Author) commented:

For brms reference models, there is a related (though slightly different) issue: paul-buerkner/brms#1286.

fweber144 (Collaborator, Author) commented:

After some internal correspondence, we came to the conclusion that this is indeed a bug and needs to be fixed (by making the lme4 prediction draw new group-level effects for new levels).
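
A minimal sketch of that fix, purely illustrative and not the actual projpred implementation: it assumes a single grouping factor called `group` with only a group-level intercept (as in the examples above) and produces a single point prediction rather than per-draw predictions:

predict_with_new_levels <- function(fit, newdata, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Standard lme4 prediction: group-level effects of new levels are set to zero.
  eta <- predict(fit, newdata = newdata, allow.new.levels = TRUE)
  # Estimated standard deviation of the group-level intercepts:
  sd_group <- attr(lme4::VarCorr(fit)$group, "stddev")["(Intercept)"]
  # Levels of `group` in `newdata` that were not present when fitting:
  lvls_new <- setdiff(unique(as.character(newdata$group)),
                      rownames(lme4::ranef(fit)$group))
  # Draw one intercept per new level from the estimated group-level
  # distribution and add it to the corresponding predictions:
  b_new <- setNames(rnorm(length(lvls_new), mean = 0, sd = sd_group), lvls_new)
  idx_new <- as.character(newdata$group) %in% lvls_new
  eta[idx_new] <- eta[idx_new] + b_new[as.character(newdata$group)[idx_new]]
  eta
}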

fweber144 added a commit to fweber144/projpred that referenced this issue Mar 4, 2022
downstream code on-the-fly.

The approach implemented now changes the default of the `seed` (and `.seed`) arguments from `NULL` to `sample.int(.Machine$integer.max, 1)`. For top-level functions which are only called by the user and not from within projpred (this should only concern `cv_varsel()`, `project()`, and `proj_predict()`), this is not strictly necessary, but it is done for consistency with the lower-level functions which are also called from within projpred.

The main advantage of generating seeds on-the-fly is that further seeds can easily be added and used in downstream code (which will be important for fixing issue stan-dev#268) without always having to add a new argument to the top-level functions. Apart from that, this approach makes it possible for users to set a seed once at the beginning of their script and then use the default `seed` (and `.seed`) arguments -- they will then get reproducible results, which would not have been the case with the former implementation.

However, due to the resetting of `.Random.seed`, this approach does not yet prevent the same seed from being re-used multiple times (which is probably bad practice from a theoretical point of view).
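
A minimal sketch of the pattern described here (`my_fun()` is just a hypothetical stand-in for a projpred function with a `seed` argument):

my_fun <- function(seed = sample.int(.Machine$integer.max, 1)) {
  # Save the current global RNG state (if any) and restore it on exit, so that
  # calling this function does not alter the user's RNG state:
  if (exists(".Random.seed", envir = .GlobalEnv)) {
    rng_state_old <- get(".Random.seed", envir = .GlobalEnv)
    on.exit(assign(".Random.seed", rng_state_old, envir = .GlobalEnv))
  }
  set.seed(seed)
  # ... RNG-dependent code, e.g., drawing new group-level effects ...
  rnorm(1L)
}

With this default, `set.seed(123); my_fun()` is reproducible without passing `seed` explicitly, because the on-the-fly default is itself drawn from the user's RNG state.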
fweber144 mentioned this issue Mar 4, 2022
fweber144 added a commit to fweber144/projpred that referenced this issue Mar 11, 2022
fweber144 (Collaborator, Author) commented:

Note that I was talking about lme4 above (i.e., GLMMs), but GAMMs are probably affected, too.
