Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt summary() and print() output to pct_solution_terms_cv #289

Closed
fweber144 opened this issue Mar 17, 2022 · 2 comments · Fixed by #414
Closed

Adapt summary() and print() output to pct_solution_terms_cv #289

fweber144 opened this issue Mar 17, 2022 · 2 comments · Fixed by #414

Comments

@fweber144
Copy link
Collaborator

fweber144 commented Mar 17, 2022

I think the summary() and print() output for vsel objects from cv_varsel() needs to be adapted to properly account for pct_solution_terms_cv. I'm not sure yet about the best way to achieve this, so I'll leave this open for discussion. At the very end of this comment, I started some thoughts.

Currently, only the solution path from the full-data search is shown. However, the CV folds may have differing solution paths (this is why pct_solution_terms_cv exists), at least as long as we have the search included in the CV, i.e., LOO CV with validate_search = TRUE or K-fold CV (the latter currently only supports validate_search = TRUE). If pct_solution_terms_cv is not incorporated into the summary() and print() output (as is currently the case), users might get the false impression that there is no uncertainty with respect to the solution path and that the printed values of the performance measures (e.g., the ELPD values) are based on the full-data search (which they don't, because they are based on the cross-validated searches).

Example:

options(mc.cores = parallel::detectCores(logical = FALSE))
data("df_gaussian", package = "projpred")
df_gaussian <- df_gaussian[1:41, ]
dat <- data.frame(y = df_gaussian$y, df_gaussian$x)
library(rstanarm)
rfit <- stan_glm(y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
                 data = dat,
                 seed = 1140350788)

library(projpred)
cvvs <- cv_varsel(rfit,
                  cv_method = "kfold",
                  K = 2,
                  nclusters = 3,
                  nclusters_pred = 5,
                  seed = 963201)

In this case, the CV folds have different solution paths: cvvs$pct_solution_terms_cv gives

      size X1  X3  X6  X5  X9  X7  X2 X4 X10  X8
 [1,]    1  1 0.0 0.0 0.0 0.0 0.0 0.0  0   0 0.0
 [2,]    2  0 0.5 0.5 0.0 0.0 0.0 0.0  0   0 0.0
 [3,]    3  0 0.0 0.5 0.0 0.0 0.5 0.0  0   0 0.0
 [4,]    4  0 0.0 0.0 0.5 0.0 0.0 0.0  0   0 0.5
 [5,]    5  0 0.5 0.0 0.0 0.5 0.0 0.0  0   0 0.0
 [6,]    6  0 0.0 0.0 0.5 0.0 0.0 0.0  0   0 0.5
 [7,]    7  0 0.0 0.0 0.0 0.5 0.0 0.5  0   0 0.0
 [8,]    8  0 0.0 0.0 0.0 0.0 0.0 0.0  0   1 0.0
 [9,]    9  0 0.0 0.0 0.0 0.0 0.0 0.0  1   0 0.0
[10,]   10  0 0.0 0.0 0.0 0.0 0.5 0.5  0   0 0.0

To see the different solution paths more clearly, debug the cv_varsel() call above using debug(projpred:::kfold_varsel) until solution_terms_cv has been created and then inspect solution_terms_cv:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]  [,9] [,10]
[1,] "X1" "X3" "X6" "X5" "X9" "X8" "X2" "X10" "X4" "X7" 
[2,] "X1" "X6" "X7" "X8" "X3" "X5" "X9" "X10" "X4" "X2"

The print() output, however, does not point out the uncertainty with respect to the different solution paths: print(cvvs) gives:

Family: gaussian 
Link function: identity 

Formula: y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
Observations: 41
CV method: kfold search included 
Search method: l1, maximum number of terms 10
Number of clusters used for selection: 1
Number of clusters used for prediction: 5
Suggested Projection Size: 2

Selection Summary:
 size solution_terms elpd.kfold  se diff diff.se
    0           <NA>     -105.1 2.4 -9.0     3.8
    1             X1     -100.8 2.5 -4.6     3.2
    2             X3      -98.4 3.0 -2.2     3.6
    3             X6      -98.8 2.7 -2.6     2.9
    4             X5      -99.4 3.2 -3.3     2.2
    5             X9      -97.1 3.0 -0.9     1.7
    6             X7      -95.4 3.0  0.8     1.6
    7             X2      -94.2 3.0  2.0     1.4
    8             X4      -95.0 3.5  1.1     0.9
    9            X10      -94.9 3.5  1.2     0.8
   10             X8      -94.8 3.5  1.3     0.8

Perhaps a first step would be to remove column solution_terms from the Selection Summary: output table (see above). The solution_terms() accessor function can always be used to get the solution path from the full-data search. For accessing the possibly differing solution paths from the CV, a new prop_solution_terms_cv() (or similarly named) accessor function for <vsel_object_with_cv>$pct_solution_terms_cv could be created. Its documentation could say something like "For each model size size (rows) and each solution term solterm (columns), the returned matrix contains the proportion of the CV folds which have solterm at position size of the solution terms.". Alternatively, instead of creating a new prop_solution_terms_cv() accessor function, the existing solution_terms() accessor function could be extended.

@fweber144 fweber144 changed the title Incorporate pct_solution_terms_cv into the summary() and print() output Adapt summary() and print() output to pct_solution_terms_cv Mar 17, 2022
@awd97
Copy link

awd97 commented May 15, 2022

Thank you for explaining #316 and pointing me to this! For us the proposed solution would be excellent - access to solution_terms_cv would enable us to explore model selection frequency a la Heinze et al (https://avehtari.github.io/modelselection/bodyfat.html) and moreover allow us to perform further modelling on solution_terms_cv directly (rather than a single variable selection, typically our work involves variable reduction to inform expanded data collection, before a final variable selection).

@fweber144
Copy link
Collaborator Author

PR #406 introduced some major new features concerning this issue, so you can now use them via the GitHub version (branch master). We are currently continuing work on this (see especially the points at the end of #406 (comment)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants