You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I think the summary() and print() output for vsel objects from cv_varsel() needs to be adapted to properly account for pct_solution_terms_cv. I'm not sure yet about the best way to achieve this, so I'll leave this open for discussion. At the very end of this comment, I started some thoughts.
Currently, only the solution path from the full-data search is shown. However, the CV folds may have differing solution paths (this is why pct_solution_terms_cv exists), at least as long as we have the search included in the CV, i.e., LOO CV with validate_search = TRUE or K-fold CV (the latter currently only supports validate_search = TRUE). If pct_solution_terms_cv is not incorporated into the summary() and print() output (as is currently the case), users might get the false impression that there is no uncertainty with respect to the solution path and that the printed values of the performance measures (e.g., the ELPD values) are based on the full-data search (which they don't, because they are based on the cross-validated searches).
To see the different solution paths more clearly, debug the cv_varsel() call above using debug(projpred:::kfold_varsel) until solution_terms_cv has been created and then inspect solution_terms_cv:
The print() output, however, does not point out the uncertainty with respect to the different solution paths: print(cvvs) gives:
Family: gaussian
Link function: identity
Formula: y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
Observations: 41
CV method: kfold search included
Search method: l1, maximum number of terms 10
Number of clusters used for selection: 1
Number of clusters used for prediction: 5
Suggested Projection Size: 2
Selection Summary:
size solution_terms elpd.kfold se diff diff.se
0 <NA> -105.1 2.4 -9.0 3.8
1 X1 -100.8 2.5 -4.6 3.2
2 X3 -98.4 3.0 -2.2 3.6
3 X6 -98.8 2.7 -2.6 2.9
4 X5 -99.4 3.2 -3.3 2.2
5 X9 -97.1 3.0 -0.9 1.7
6 X7 -95.4 3.0 0.8 1.6
7 X2 -94.2 3.0 2.0 1.4
8 X4 -95.0 3.5 1.1 0.9
9 X10 -94.9 3.5 1.2 0.8
10 X8 -94.8 3.5 1.3 0.8
Perhaps a first step would be to remove column solution_terms from the Selection Summary: output table (see above). The solution_terms() accessor function can always be used to get the solution path from the full-data search. For accessing the possibly differing solution paths from the CV, a new prop_solution_terms_cv() (or similarly named) accessor function for <vsel_object_with_cv>$pct_solution_terms_cv could be created. Its documentation could say something like "For each model size size (rows) and each solution term solterm (columns), the returned matrix contains the proportion of the CV folds which have solterm at position size of the solution terms.". Alternatively, instead of creating a new prop_solution_terms_cv() accessor function, the existing solution_terms() accessor function could be extended.
The text was updated successfully, but these errors were encountered:
fweber144
changed the title
Incorporate pct_solution_terms_cv into the summary() and print() output
Adapt summary() and print() output to pct_solution_terms_cvMar 17, 2022
Thank you for explaining #316 and pointing me to this! For us the proposed solution would be excellent - access to solution_terms_cv would enable us to explore model selection frequency a la Heinze et al (https://avehtari.github.io/modelselection/bodyfat.html) and moreover allow us to perform further modelling on solution_terms_cv directly (rather than a single variable selection, typically our work involves variable reduction to inform expanded data collection, before a final variable selection).
PR #406 introduced some major new features concerning this issue, so you can now use them via the GitHub version (branch master). We are currently continuing work on this (see especially the points at the end of #406 (comment)).
I think the
summary()
andprint()
output forvsel
objects fromcv_varsel()
needs to be adapted to properly account forpct_solution_terms_cv
. I'm not sure yet about the best way to achieve this, so I'll leave this open for discussion. At the very end of this comment, I started some thoughts.Currently, only the solution path from the full-data search is shown. However, the CV folds may have differing solution paths (this is why
pct_solution_terms_cv
exists), at least as long as we have the search included in the CV, i.e., LOO CV withvalidate_search = TRUE
or K-fold CV (the latter currently only supportsvalidate_search = TRUE
). Ifpct_solution_terms_cv
is not incorporated into thesummary()
andprint()
output (as is currently the case), users might get the false impression that there is no uncertainty with respect to the solution path and that the printed values of the performance measures (e.g., the ELPD values) are based on the full-data search (which they don't, because they are based on the cross-validated searches).Example:
In this case, the CV folds have different solution paths:
cvvs$pct_solution_terms_cv
givesTo see the different solution paths more clearly, debug the
cv_varsel()
call above usingdebug(projpred:::kfold_varsel)
untilsolution_terms_cv
has been created and then inspectsolution_terms_cv
:The
print()
output, however, does not point out the uncertainty with respect to the different solution paths:print(cvvs)
gives:Perhaps a first step would be to remove column
solution_terms
from theSelection Summary:
output table (see above). Thesolution_terms()
accessor function can always be used to get the solution path from the full-data search. For accessing the possibly differing solution paths from the CV, a newprop_solution_terms_cv()
(or similarly named) accessor function for<vsel_object_with_cv>$pct_solution_terms_cv
could be created. Its documentation could say something like "For each model sizesize
(rows) and each solution termsolterm
(columns), the returned matrix contains the proportion of the CV folds which havesolterm
at positionsize
of the solution terms.". Alternatively, instead of creating a newprop_solution_terms_cv()
accessor function, the existingsolution_terms()
accessor function could be extended.The text was updated successfully, but these errors were encountered: