Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix subfit issues (mainly due to the lack of xlev) #403

Merged
merged 14 commits into from
Apr 5, 2023

Conversation

fweber144
Copy link
Collaborator

This resolves several bugs related to submodel fits of class subfit, see the message of commits fcf1408 and d08920a. The main problem was that in predict.subfit(), argument xlev of model.matrix() was not used. Unfortunately, deriving these xlevels at fitting time is not really straightforward, which is the reason why this PR adds a bunch of code for deriving these at the beginning of fit_glm_ridge_callback() and search_L1().

All of the issues resolved by commit fcf1408 can be reproduced using the following code:

options(mc.cores = parallel::detectCores(logical = FALSE))
N <- 41L
K <- 5L
K_fac <- 4L
set.seed(457324)
dat <- data.frame(
  y = rnorm(N),
  xcat = gl(n = K, k = floor(N / K), length = N,
            labels = paste0("gr", seq_len(K))),
  xfac = sample(gl(n = K_fac, k = floor(N / K_fac), length = N,
                   labels = paste0("fgr", seq_len(K_fac)))),
  xlog = sample(rep_len(c(TRUE, FALSE), length.out = N))
)
levels(dat$xfac) <- c(levels(dat$xfac),
                      paste0("fgr", (K_fac + 1L):(K_fac + 2L)))
dat$xcat <- as.character(dat$xcat)

library(rstanarm)
rfit <- stan_glm(y ~ xcat + xfac + xlog,
                 data = dat,
                 seed = 1140350788,
                 chains = 1, iter = 500,
                 refresh = 0)

library(projpred)

# debug(fit_glm_ridge_callback)
prj <- project(rfit, solution_terms = c("xcat", "xfac", "xlog"),
               nclusters = 1)

dat_new <- data.frame(
  y = c(-1.8, 0.1),
  xcat = rep(paste0("gr", 2), 2),
  xfac = factor(rep(paste0("fgr", 2), 2),
                levels = c(rev(levels(dat$xfac)), paste0("fgr", K_fac + 3L))),
  xlog = rep(TRUE, 2)
)
# debug(predict.subfit)
prjlp <- proj_linpred(prj, newdata = dat_new)

# debug(search_L1)
cvvs <- cv_varsel(rfit,
                  nclusters = 1,
                  nclusters_pred = 1,
                  seed = 46782345)
cvvs <- cv_varsel(rfit,
                  nclusters = 1,
                  nclusters_pred = 1,
                  refit_prj = FALSE,
                  seed = 46782345)

(All of the calls following a commented debug() call failed before this PR and succeed afterwards.)

For the issue resolved by commit d08920a, see the corresponding commit message for reproducibility instructions.

… model matrix,

not just the number of its columns.
level of a `factor`-coerced `character` variable.

Also fix a bug causing a re-ordering of the levels of a `factor` in
`newdata` (compared to that `factor` in the original dataset) not to get
recognized by `predict.subfit()`, which in turn causes
`x %*% rbind(alpha, beta)` to be incorrect in such cases.

Also fix an error thrown by `fit_glm_ridge_callback()` in case of unused levels
of a `factor`, namely:
```
warning: solve(): system is singular; attempting approx solution
Error: solve(): solution not found
```
is not necessary (as `collapse_contrasts_solution_path()` would also work
correctly without this change), but done for consistency.
…earch_L1()`. This

is not necessary (as `search_L1()` would also work correctly without this
change, thanks to its `match()` call and the following
`indices <- indices[!is.na(indices)]` line), but done for consistency.
(which is the default only for `datafit`s, not for the reference model objects
of class `refmodel` that are usually employed in practice) to lead to incorrect
predictions from the L1-searched submodels (which are L1-penalized GLMs) if the
solution path had a main effect ranked after an interaction term. This bug
existed at least from version 2.0.2 on (perhaps even in earlier versions). The
mentioned submodel predictions did not only affect the performance evaluation,
but also the projected dispersion parameter and the returned Kullback-Leibler
divergence (and the corresponding cross-entropy).

This bug came up in the unit tests (with `run_more = TRUE`), file
`test_datafit.R`, when creating `cvvss_datafit`, element
`args_cvvs_i <- args_cvvs_datafit$rstanarm.glm.gauss.spclformul.with_wobs.with_offs.trad.default_meth.kfold.default_search_trms`.
…of the model matrix (for safety reasons).
with `refit_prj = FALSE` to lead to incorrect predictions from the L1-searched
submodels if the solution path had a main effect ranked after an interaction
term.
@fweber144 fweber144 merged commit a6ee4f9 into stan-dev:master Apr 5, 2023
@fweber144 fweber144 deleted the subfit_xlevels_etc branch April 5, 2023 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant