Adapt `summary()` and `print()` output to `pct_solution_terms_cv` #289

fweber144 · 2022-03-17T14:23:55Z

I think the summary() and print() output for vsel objects from cv_varsel() needs to be adapted to properly account for pct_solution_terms_cv. I'm not sure yet about the best way to achieve this, so I'll leave this open for discussion. At the very end of this comment, I started some thoughts.

Currently, only the solution path from the full-data search is shown. However, the CV folds may have differing solution paths (this is why pct_solution_terms_cv exists), at least as long as we have the search included in the CV, i.e., LOO CV with validate_search = TRUE or K-fold CV (the latter currently only supports validate_search = TRUE). If pct_solution_terms_cv is not incorporated into the summary() and print() output (as is currently the case), users might get the false impression that there is no uncertainty with respect to the solution path and that the printed values of the performance measures (e.g., the ELPD values) are based on the full-data search (which they don't, because they are based on the cross-validated searches).

Example:

options(mc.cores = parallel::detectCores(logical = FALSE))
data("df_gaussian", package = "projpred")
df_gaussian <- df_gaussian[1:41, ]
dat <- data.frame(y = df_gaussian$y, df_gaussian$x)
library(rstanarm)
rfit <- stan_glm(y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
                 data = dat,
                 seed = 1140350788)

library(projpred)
cvvs <- cv_varsel(rfit,
                  cv_method = "kfold",
                  K = 2,
                  nclusters = 3,
                  nclusters_pred = 5,
                  seed = 963201)

In this case, the CV folds have different solution paths: cvvs$pct_solution_terms_cv gives

      size X1  X3  X6  X5  X9  X7  X2 X4 X10  X8
 [1,]    1  1 0.0 0.0 0.0 0.0 0.0 0.0  0   0 0.0
 [2,]    2  0 0.5 0.5 0.0 0.0 0.0 0.0  0   0 0.0
 [3,]    3  0 0.0 0.5 0.0 0.0 0.5 0.0  0   0 0.0
 [4,]    4  0 0.0 0.0 0.5 0.0 0.0 0.0  0   0 0.5
 [5,]    5  0 0.5 0.0 0.0 0.5 0.0 0.0  0   0 0.0
 [6,]    6  0 0.0 0.0 0.5 0.0 0.0 0.0  0   0 0.5
 [7,]    7  0 0.0 0.0 0.0 0.5 0.0 0.5  0   0 0.0
 [8,]    8  0 0.0 0.0 0.0 0.0 0.0 0.0  0   1 0.0
 [9,]    9  0 0.0 0.0 0.0 0.0 0.0 0.0  1   0 0.0
[10,]   10  0 0.0 0.0 0.0 0.0 0.5 0.5  0   0 0.0

To see the different solution paths more clearly, debug the cv_varsel() call above using debug(projpred:::kfold_varsel) until solution_terms_cv has been created and then inspect solution_terms_cv:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]  [,9] [,10]
[1,] "X1" "X3" "X6" "X5" "X9" "X8" "X2" "X10" "X4" "X7" 
[2,] "X1" "X6" "X7" "X8" "X3" "X5" "X9" "X10" "X4" "X2"

The print() output, however, does not point out the uncertainty with respect to the different solution paths: print(cvvs) gives:

Family: gaussian 
Link function: identity 

Formula: y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
Observations: 41
CV method: kfold search included 
Search method: l1, maximum number of terms 10
Number of clusters used for selection: 1
Number of clusters used for prediction: 5
Suggested Projection Size: 2

Selection Summary:
 size solution_terms elpd.kfold  se diff diff.se
    0           <NA>     -105.1 2.4 -9.0     3.8
    1             X1     -100.8 2.5 -4.6     3.2
    2             X3      -98.4 3.0 -2.2     3.6
    3             X6      -98.8 2.7 -2.6     2.9
    4             X5      -99.4 3.2 -3.3     2.2
    5             X9      -97.1 3.0 -0.9     1.7
    6             X7      -95.4 3.0  0.8     1.6
    7             X2      -94.2 3.0  2.0     1.4
    8             X4      -95.0 3.5  1.1     0.9
    9            X10      -94.9 3.5  1.2     0.8
   10             X8      -94.8 3.5  1.3     0.8

Perhaps a first step would be to remove column solution_terms from the Selection Summary: output table (see above). The solution_terms() accessor function can always be used to get the solution path from the full-data search. For accessing the possibly differing solution paths from the CV, a new prop_solution_terms_cv() (or similarly named) accessor function for <vsel_object_with_cv>$pct_solution_terms_cv could be created. Its documentation could say something like "For each model size size (rows) and each solution term solterm (columns), the returned matrix contains the proportion of the CV folds which have solterm at position size of the solution terms.". Alternatively, instead of creating a new prop_solution_terms_cv() accessor function, the existing solution_terms() accessor function could be extended.

The text was updated successfully, but these errors were encountered:

awd97 · 2022-05-15T18:12:50Z

Thank you for explaining #316 and pointing me to this! For us the proposed solution would be excellent - access to solution_terms_cv would enable us to explore model selection frequency a la Heinze et al (https://avehtari.github.io/modelselection/bodyfat.html) and moreover allow us to perform further modelling on solution_terms_cv directly (rather than a single variable selection, typically our work involves variable reduction to inform expanded data collection, before a final variable selection).

fweber144 · 2023-05-11T08:59:36Z

PR #406 introduced some major new features concerning this issue, so you can now use them via the GitHub version (branch master). We are currently continuing work on this (see especially the points at the end of #406 (comment)).

fweber144 changed the title ~~Incorporate pct_solution_terms_cv into the summary() and print() output~~ Adapt summary() and print() output to pct_solution_terms_cv Mar 17, 2022

fweber144 mentioned this issue May 11, 2022

Reproducibility issue (?), exhibited when using search_terms or penalty #316

Closed

fweber144 mentioned this issue Apr 19, 2023

New functions for investigating predictor rankings #406

Merged

fweber144 mentioned this issue May 22, 2023

Add full-data predictor ranking and basic ranking proportions to plot.vsel() #414

Merged

fweber144 closed this as completed in #414 May 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adapt `summary()` and `print()` output to `pct_solution_terms_cv` #289

Adapt `summary()` and `print()` output to `pct_solution_terms_cv` #289

fweber144 commented Mar 17, 2022 •

edited

Loading

awd97 commented May 15, 2022

fweber144 commented May 11, 2023

Adapt summary() and print() output to pct_solution_terms_cv #289

Adapt summary() and print() output to pct_solution_terms_cv #289

Comments

fweber144 commented Mar 17, 2022 • edited Loading

awd97 commented May 15, 2022

fweber144 commented May 11, 2023

Adapt `summary()` and `print()` output to `pct_solution_terms_cv` #289

Adapt `summary()` and `print()` output to `pct_solution_terms_cv` #289

fweber144 commented Mar 17, 2022 •

edited

Loading