Fix `subfit` issues (mainly due to the lack of `xlev`) #403

fweber144 · 2023-04-05T10:19:26Z

This resolves several bugs related to submodel fits of class subfit, see the message of commits fcf1408 and d08920a. The main problem was that in predict.subfit(), argument xlev of model.matrix() was not used. Unfortunately, deriving these xlevels at fitting time is not really straightforward, which is the reason why this PR adds a bunch of code for deriving these at the beginning of fit_glm_ridge_callback() and search_L1().

All of the issues resolved by commit fcf1408 can be reproduced using the following code:

options(mc.cores = parallel::detectCores(logical = FALSE))
N <- 41L
K <- 5L
K_fac <- 4L
set.seed(457324)
dat <- data.frame(
  y = rnorm(N),
  xcat = gl(n = K, k = floor(N / K), length = N,
            labels = paste0("gr", seq_len(K))),
  xfac = sample(gl(n = K_fac, k = floor(N / K_fac), length = N,
                   labels = paste0("fgr", seq_len(K_fac)))),
  xlog = sample(rep_len(c(TRUE, FALSE), length.out = N))
)
levels(dat$xfac) <- c(levels(dat$xfac),
                      paste0("fgr", (K_fac + 1L):(K_fac + 2L)))
dat$xcat <- as.character(dat$xcat)

library(rstanarm)
rfit <- stan_glm(y ~ xcat + xfac + xlog,
                 data = dat,
                 seed = 1140350788,
                 chains = 1, iter = 500,
                 refresh = 0)

library(projpred)

# debug(fit_glm_ridge_callback)
prj <- project(rfit, solution_terms = c("xcat", "xfac", "xlog"),
               nclusters = 1)

dat_new <- data.frame(
  y = c(-1.8, 0.1),
  xcat = rep(paste0("gr", 2), 2),
  xfac = factor(rep(paste0("fgr", 2), 2),
                levels = c(rev(levels(dat$xfac)), paste0("fgr", K_fac + 3L))),
  xlog = rep(TRUE, 2)
)
# debug(predict.subfit)
prjlp <- proj_linpred(prj, newdata = dat_new)

# debug(search_L1)
cvvs <- cv_varsel(rfit,
                  nclusters = 1,
                  nclusters_pred = 1,
                  seed = 46782345)
cvvs <- cv_varsel(rfit,
                  nclusters = 1,
                  nclusters_pred = 1,
                  refit_prj = FALSE,
                  seed = 46782345)

(All of the calls following a commented debug() call failed before this PR and succeed afterwards.)

For the issue resolved by commit d08920a, see the corresponding commit message for reproducibility instructions.

… model matrix, not just the number of its columns.

level of a `factor`-coerced `character` variable. Also fix a bug causing a re-ordering of the levels of a `factor` in `newdata` (compared to that `factor` in the original dataset) not to get recognized by `predict.subfit()`, which in turn causes `x %*% rbind(alpha, beta)` to be incorrect in such cases. Also fix an error thrown by `fit_glm_ridge_callback()` in case of unused levels of a `factor`, namely: ``` warning: solve(): system is singular; attempting approx solution Error: solve(): solution not found ```

…ctor".

…glm_ridge_callback()`).

is not necessary (as `collapse_contrasts_solution_path()` would also work correctly without this change), but done for consistency.

…earch_L1()`. This is not necessary (as `search_L1()` would also work correctly without this change, thanks to its `match()` call and the following `indices <- indices[!is.na(indices)]` line), but done for consistency.

(which is the default only for `datafit`s, not for the reference model objects of class `refmodel` that are usually employed in practice) to lead to incorrect predictions from the L1-searched submodels (which are L1-penalized GLMs) if the solution path had a main effect ranked after an interaction term. This bug existed at least from version 2.0.2 on (perhaps even in earlier versions). The mentioned submodel predictions did not only affect the performance evaluation, but also the projected dispersion parameter and the returned Kullback-Leibler divergence (and the corresponding cross-entropy). This bug came up in the unit tests (with `run_more = TRUE`), file `test_datafit.R`, when creating `cvvss_datafit`, element `args_cvvs_i <- args_cvvs_datafit$rstanarm.glm.gauss.spclformul.with_wobs.with_offs.trad.default_meth.kfold.default_search_trms`.

…of the model matrix (for safety reasons).

with `refit_prj = FALSE` to lead to incorrect predictions from the L1-searched submodels if the solution path had a main effect ranked after an interaction term.

fweber144 added 14 commits April 4, 2023 16:13

Make predict.subfit() safer by checking the column names of the new…

6412d30

… model matrix, not just the number of its columns.

Ensure that subfits have matching names for element beta and x.

0686c1d

predict.subfit(): Avoid an unnecessary definition of x.

8608d89

Avoid model.frame.default()'s warning "variable '<...>' is not a fa…

f21e90b

…ctor".

Set row names for beta in search_L1() (for consistency with `fit_…

bf798b8

…glm_ridge_callback()`).

Tests: Enhance readability of an else if condition.

5a9e5ee

Adapt the tests.

e608d4c

Drop unused levels also in collapse_contrasts_solution_path(). This

b8b6bec

is not necessary (as `collapse_contrasts_solution_path()` would also work correctly without this change), but done for consistency.

predict.subfit(): Use identical() when checking the column names …

2b63cda

…of the model matrix (for safety reasons).

Add a NEWS.md entry for the first three subfit fixes.

636945d

Add a NEWS.md entry for the bug that caused an L1 search combined

505026d

with `refit_prj = FALSE` to lead to incorrect predictions from the L1-searched submodels if the solution path had a main effect ranked after an interaction term.

fweber144 merged commit a6ee4f9 into stan-dev:master Apr 5, 2023

fweber144 deleted the subfit_xlevels_etc branch April 5, 2023 10:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `subfit` issues (mainly due to the lack of `xlev`) #403

Fix `subfit` issues (mainly due to the lack of `xlev`) #403

fweber144 commented Apr 5, 2023

Fix subfit issues (mainly due to the lack of xlev) #403

Fix subfit issues (mainly due to the lack of xlev) #403

Conversation

fweber144 commented Apr 5, 2023

Fix `subfit` issues (mainly due to the lack of `xlev`) #403

Fix `subfit` issues (mainly due to the lack of `xlev`) #403