Description
I'm currently working on the search_terms argument (fixing bugs and improving documentation). While doing so, I realized that there can be model sizes for which the forward search doesn't have any candidate models, for example:
options(mc.cores = parallel::detectCores(logical = FALSE))
data("df_gaussian", package = "projpred")
df_gaussian <- df_gaussian[1:41, ]
dat <- data.frame(y = df_gaussian$y, df_gaussian$x)
library(rstanarm)
rfit <- stan_glm(y ~ X1 + X2 + X3 + X4 + X5,
                 data = dat,
                 seed = 1140350788)
library(projpred)
vs <- varsel(rfit,
             nclusters = 3,
             nclusters_pred = 5,
             method = "forward",
             search_terms = c("X1 + X2"),
             seed = 46782345)
(tested with projpred 2.1.1). If you inspect the output of that varsel() call, you'll see that X1 + X2 is regarded as the solution term at model size 1:
print(vs)
gives
Family: gaussian
Link function: identity
Formula: y ~ X1 + X2 + X3 + X4 + X5
Observations: 41
Search method: forward, maximum number of terms 1
Number of clusters used for selection: 3
Number of clusters used for prediction: 5
Suggested Projection Size: NA
Selection Summary:
 size solution_terms   elpd  se  diff diff.se
    0           <NA> -101.6 2.9 -17.4     3.4
    1        X1 + X2  -93.9 2.8  -9.7     2.3
and plot(vs) behaves accordingly. Now my question (especially to @AlejandroCatalina) is whether this is intended or whether X1 + X2 should be regarded as the solution term at model size 2 because it consists of the 2 terms X1 and X2. The latter would probably require some larger changes because all functions downstream of search_forward() would have to be adapted to deal with "empty model sizes".
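To make the second interpretation concrete, here is a minimal base-R sketch of how one could count the individual terms inside each search_terms element and find the model sizes that would be left without candidates. The helper count_terms() is hypothetical and not part of projpred; it only illustrates the counting, not how search_forward() works internally.

```r
# Hypothetical helper (not part of projpred): count the individual terms
# contained in each element of a search_terms-style character vector.
count_terms <- function(search_terms) {
  vapply(search_terms, function(term) {
    # Build a one-sided formula and let base R expand it into term labels.
    tt <- stats::terms(stats::as.formula(paste("~", term)))
    length(attr(tt, "term.labels"))
  }, integer(1))
}

# Sizes implied by the search_terms from the example above.
sizes <- count_terms(c("X1 + X2"))

# Model sizes up to the maximum that no candidate would occupy
# ("empty model sizes" under the second interpretation).
empty_sizes <- setdiff(seq_len(max(sizes)), sizes)

print(sizes)
print(empty_sizes)
```

Under this counting, X1 + X2 has size 2 and size 1 is empty, which is exactly the case the downstream functions would have to handle.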