
Prediction for new group levels (rstanarm vs. lme4) #268

Closed · fweber144 opened this issue Jan 23, 2022 · 3 comments
fweber144 (Collaborator) commented:

For multilevel rstanarm reference models, there is an inconsistency between the reference model and the submodels with respect to how the linear predictor is calculated for new group levels in a K-fold CV: for the reference model, rstanarm::posterior_linpred.stanreg() is called with no arguments apart from object and newdata:

posterior_linpred(fit, newdata = newdata)
Thus, according to rstanarm's docs (see also the example below), this call marginalizes over the group-level effects (for the new levels only). In contrast, for the submodels, lme4::predict.merMod() is called with no arguments apart from object, newdata, and allow.new.levels = TRUE:
return(predict(fit, newdata = newdata, allow.new.levels = TRUE))
Thus, according to lme4's docs (see also the example below), this call sets group-level effects for new levels to zero.

My question now is: do you think this needs to be changed, and if so, how? The problem is that rstanarm's options for dealing with new levels are not quite the same as lme4's options. The only common possibility (as far as I understand these two packages) is to set the group-level effects for all levels (i.e., existing and new ones) to zero. This, however, has important implications for the evaluation of the models' predictive performance: adding a group-level effect to a submodel which previously included only population-level effects would then make no (or, in case there is a dispersion parameter, only a small) difference with respect to predictive performance measures like the ELPD, RMSE, etc. In other words, because the group-level effects of the existing levels would also be set to zero, the performance measures might look worse than the predictive performance actually is. Apart from that, I'm not sure whether, for families with a dispersion parameter, the original dispersion parameter estimates would then still be appropriate (similarly to this).
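
If I understand the two packages correctly, this common option corresponds to `re.form = NA` in both prediction functions. A minimal sketch, only for illustration and not a proposal for the final implementation (`fit` and `newdata` are placeholders as in the calls quoted above):

# rstanarm reference model: drop all group-level effects from the linear
# predictor, i.e., set them to zero for existing as well as new levels:
posterior_linpred(fit, newdata = newdata, re.form = NA)
# lme4 submodel: same, via predict.merMod()'s re.form argument:
predict(fit, newdata = newdata, re.form = NA)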

Needed for the following examples:

# Data --------------------------------------------------------------------

set.seed(457211)
dat <- matrix(rnorm(41 * 2, sd = 0.4),
              ncol = 2,
              dimnames = list(NULL, paste0("X", 1:2)))
dat <- data.frame(dat)
dat$group <- gl(n = 8, k = floor(nrow(dat) / 8), length = nrow(dat),
                labels = paste0("gr", seq_len(8)))
group_icpts_truth <- rnorm(nlevels(dat$group), sd = 6)
icpt <- -4.2
dat$y <- icpt + group_icpts_truth[dat$group]
dat$y <- rnorm(nrow(dat), mean = dat$y, sd = 4)

# New data ----------------------------------------------------------------

set.seed(457212)
dat_new <- matrix(rnorm(3 * 2, sd = 0.4),
                  ncol = 2,
                  dimnames = list(NULL, paste0("X", 1:2)))
dat_new <- data.frame(dat_new)
dat_new$group <- factor(paste0("gr", nlevels(dat$group) + 1L))

Example for how rstanarm deals with existing and new levels:

# Fit ---------------------------------------------------------------------

library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))
lmm_fit <- stan_lmer(y ~ X1 + X2 + (1 | group),
                     data = dat,
                     seed = 734572,
                     refresh = 0)

# Existing data -----------------------------------------------------------

lmm_pl <- posterior_linpred(lmm_fit)

lmm_drws <- as.matrix(lmm_fit)
lmm_pl_man <- lmm_drws[, "(Intercept)"] +
  lmm_drws[, paste0("X", 1:2), drop = FALSE] %*%
  t(dat[, paste0("X", 1:2), drop = FALSE]) +
  lmm_drws[, paste0("b[(Intercept) group:", dat$group, "]"), drop = FALSE]
stopifnot(isTRUE(all.equal(
  unname(lmm_pl), unname(lmm_pl_man), tolerance = 1e-15
)))

# New data ----------------------------------------------------------------

lmm_pl_new <- posterior_linpred(lmm_fit, newdata = dat_new)

lmm_pl_new_man <- lmm_drws[, "(Intercept)"] +
  lmm_drws[, paste0("X", 1:2), drop = FALSE] %*%
  t(dat_new[, paste0("X", 1:2), drop = FALSE]) +
  as.vector(rstan::extract(lmm_fit$stanfit, permuted = FALSE)[
    , , "b[(Intercept) group:_NEW_group]"
  ])
stopifnot(isTRUE(all.equal(
  unname(lmm_pl_new), lmm_pl_new_man, tolerance = 1e-15
)))

The same for lme4:

# Fit ---------------------------------------------------------------------

library(lme4)
lmm_fit <- lmer(y ~ X1 + X2 + (1 | group),
                data = dat)

# Existing data -----------------------------------------------------------

lmm_pr <- predict(lmm_fit)

lmm_fixef <- fixef(lmm_fit)
lmm_ranef <- ranef(lmm_fit)$group
lmm_pr_man <- lmm_fixef["(Intercept)"] +
  lmm_ranef[dat$group, "(Intercept)"] +
  drop(as.matrix(dat[, paste0("X", 1:2), drop = FALSE]) %*%
         lmm_fixef[paste0("X", 1:2)])
stopifnot(isTRUE(all.equal(
  unname(lmm_pr), lmm_pr_man, tolerance = 1e-15
)))

# New data ----------------------------------------------------------------

lmm_pr_new <- predict(lmm_fit, newdata = dat_new, allow.new.levels = TRUE)

lmm_pr_new_man <- lmm_fixef["(Intercept)"] +
  drop(as.matrix(dat_new[, paste0("X", 1:2), drop = FALSE]) %*%
         lmm_fixef[paste0("X", 1:2)])
stopifnot(isTRUE(all.equal(
  unname(lmm_pr_new), lmm_pr_new_man, tolerance = 1e-15
)))
fweber144 (Collaborator, Author) commented:

For brms reference models, there is a related (though slightly different) issue: paul-buerkner/brms#1286.

fweber144 (Collaborator, Author) commented:

After some internal correspondence, we came to the conclusion that this is indeed a bug and needs to be fixed (by making the lme4 prediction draw new group-level effects for new levels).
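
A minimal sketch of that fix, purely illustrative and not the actual projpred implementation: it assumes a single grouping factor called `group` with only a group-level intercept (as in the examples above) and produces a single point prediction rather than per-draw predictions:

predict_with_new_levels <- function(fit, newdata, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Standard lme4 prediction: group-level effects of new levels are set to zero.
  eta <- predict(fit, newdata = newdata, allow.new.levels = TRUE)
  # Estimated standard deviation of the group-level intercepts:
  sd_group <- attr(lme4::VarCorr(fit)$group, "stddev")["(Intercept)"]
  # Levels of `group` in `newdata` that were not present when fitting:
  lvls_new <- setdiff(unique(as.character(newdata$group)),
                      rownames(lme4::ranef(fit)$group))
  # Draw one intercept per new level from the estimated group-level
  # distribution and add it to the corresponding predictions:
  b_new <- setNames(rnorm(length(lvls_new), mean = 0, sd = sd_group), lvls_new)
  idx_new <- as.character(newdata$group) %in% lvls_new
  eta[idx_new] <- eta[idx_new] + b_new[as.character(newdata$group)[idx_new]]
  eta
}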

fweber144 added a commit to fweber144/projpred that referenced this issue Mar 4, 2022
downstream code on-the-fly.

The approach implemented now changes the default of the `seed` (and `.seed`) arguments from `NULL` to `sample.int(.Machine$integer.max, 1)`. For top-level functions which are only called by the user and not from within projpred (this should only concern `cv_varsel()`, `project()`, and `proj_predict()`), this is not strictly necessary, but it is done for consistency with the lower-level functions which are also called from within projpred.

The main advantage of generating seeds on-the-fly is that further seeds can easily be added and used in downstream code (which will be important for fixing issue stan-dev#268) without always having to add a new argument to the top-level functions. Apart from that, this approach makes it possible for users to set a seed once at the beginning of their script and then use the default `seed` (and `.seed`) arguments -- they will then get reproducible results, which would not have been the case with the former implementation.

However, due to the resetting of `.Random.seed`, this approach does not yet prevent the same seed from being re-used multiple times (which is probably bad practice from a theoretical point of view).
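
A minimal sketch of the pattern described here (`my_fun()` is just a hypothetical stand-in for a projpred function with a `seed` argument):

my_fun <- function(seed = sample.int(.Machine$integer.max, 1)) {
  # Save the current global RNG state (if any) and restore it on exit, so that
  # calling this function does not alter the user's RNG state:
  if (exists(".Random.seed", envir = .GlobalEnv)) {
    rng_state_old <- get(".Random.seed", envir = .GlobalEnv)
    on.exit(assign(".Random.seed", rng_state_old, envir = .GlobalEnv))
  }
  set.seed(seed)
  # ... RNG-dependent code, e.g., drawing new group-level effects ...
  rnorm(1L)
}

With this default, `set.seed(123); my_fun()` is reproducible without passing `seed` explicitly, because the on-the-fly default is itself drawn from the user's RNG state.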
fweber144 mentioned this issue Mar 4, 2022
fweber144 added a commit to fweber144/projpred that referenced this issue Mar 11, 2022
fweber144 (Collaborator, Author) commented:

Note that I was talking about lme4 above (i.e., GLMMs), but GAMMs are probably affected, too.
