"harden" parsnip to potential dimensionality bugs #184
Comments
I think making this change requires passing the data to either |
I'm wondering if |
That might be a good place to do it. Here's an example:
library(parsnip)
rf_mod <-
rand_forest(mtry = 20) %>%
set_mode("regression") %>%
set_engine("ranger")
fit(rf_mod, mpg ~ ., data = mtcars)
#> Error in ranger::ranger(formula = formula, data = data, mtry = ~20, num.threads = 1, : User interrupt or internal error.
#> Timing stopped at: 0.007 0.001 0.008
# Error: mtry can not be larger than number of variables in data. Ranger will EXIT now.
# the code being executed:
rand_forest(mtry = 20) %>%
set_mode("regression") %>%
set_engine("ranger") %>%
translate()
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 20
#>
#> Computational engine: ranger
#>
#> Model fit template:
#> ranger::ranger(formula = missing_arg(), data = missing_arg(),
#> case.weights = missing_arg(), mtry = 20, num.threads = 1,
#> verbose = FALSE, seed = sample.int(10^5, 1))
Created on 2019-11-21 by the reprex package (v0.3.0)
( Since we only fit univariate models for random forest, we could substitute |
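For context on the error in the example above, here is a minimal base R sketch that counts the predictors implied by `mpg ~ .` on `mtcars` and compares that count with the requested `mtry`. The names `n_preds`, `requested_mtry`, and `capped_mtry` are illustrative only and are not part of parsnip or ranger.

```r
# Build the model frame for the formula used in the example above.
mf <- model.frame(mpg ~ ., data = mtcars)

# One column of the model frame is the outcome (mpg); the rest are predictors.
n_preds <- ncol(mf) - 1L

# The value requested in rand_forest(mtry = 20).
requested_mtry <- 20

# TRUE here means the engine would be asked for more variables than exist.
requested_mtry > n_preds

# A hypothetical pre-fit guard: never request more than the available predictors.
capped_mtry <- min(requested_mtry, n_preds)
capped_mtry
```

Whether such a guard belongs in the model specification, in `translate()`, or in the fit functions is the open question in this thread; the sketch only shows the arithmetic.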
I spent some time today (probably too much) looking at this, and passing the data to One idea I have been working up an example for is instead making use of the data descriptors, with something like this, in probably
mtry <- call2("min",
              rlang::eval_tidy(arg_vals$mtry),
              rlang::expr(.preds()))
I don't have the environments where these get evaluated all correct yet, but I feel like I am getting somewhere with it:
library(parsnip)
rf_mod <-
rand_forest(mtry = 20) %>%
set_mode("regression") %>%
set_engine("ranger")
rf_mod
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 20
#>
#> Computational engine: ranger
translate(rf_mod)
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#> mtry = 20
#>
#> Computational engine: ranger
#>
#> Model fit template:
#> ranger::ranger(formula = missing_arg(), data = missing_arg(),
#> case.weights = missing_arg(), mtry = min(20, .preds()), num.threads = 1,
#> verbose = FALSE, seed = sample.int(10^5, 1))
fit(rf_mod, mpg ~ ., data = mtcars)
#> Error: Descriptor context not set
#> Timing stopped at: 0.008 0 0.008
Created on 2020-03-13 by the reprex package (v0.3.0)
You can see that the context for the data descriptor isn't being set correctly yet, but I think this could be a good way to go, instead of having users pass data to the model specification. What are your thoughts @topepo? This will take some rlang digging on my part to really solve. |
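The deferred `min(mtry, .preds())` idea in this comment can be sketched outside parsnip with rlang alone. This is only an illustration of building the call early and evaluating it once the data is known. The `.preds` function defined below is a hypothetical stand-in placed in a hand-made environment; it is not parsnip's descriptor mechanism, and `fit_env` and `user_mtry` are made-up names.

```r
library(rlang)

# The value the user supplied in the model specification.
user_mtry <- 20

# Build the call `min(20, .preds())` without evaluating it,
# so no data is needed at specification time.
mtry_call <- call2("min", user_mtry, expr(.preds()))
mtry_call

# Hypothetical stand-in for a descriptor context: at fit time the data is
# known, so .preds() can report the number of predictors (all columns of
# mtcars except the outcome).
fit_env <- env(.preds = function() ncol(mtcars) - 1L)

# Evaluate the stored call in the environment where .preds() is defined.
mtry_value <- eval_bare(mtry_call, fit_env)
mtry_value
```

The "Descriptor context not set" error in the reprex suggests that the real descriptors need their context set before the call is evaluated, which this sketch sidesteps by defining `.preds` by hand.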
I feel that the data should not be seen by the specification, only when one of the fit functions is called. My original thoughts were along the lines of what you suggest using the descriptors. Is that |
I took out all the ill-considered stuff I tried and added the tiny example shown here in #270 so we can both check out where the call is being evaluated. |
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
Although the data descriptors can solve this, we should look at capping the value of any tuning parameters that are based on either nrow() or ncol(). For example, with mtry in random forests, we should, somewhere, use mtry = min(mtry, ncol(x)).
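As a rough illustration of the capping proposed here, the base R sketch below limits a parameter to a data dimension and warns when it does so. Capping means taking the smaller of the two values, so the helper behaves like `min()`. The function `cap_to_dim` is hypothetical and not part of parsnip, and treating `min_n` as a parameter bounded by `nrow()` is an assumption made only for the example.

```r
# Hypothetical helper: return `value`, or `limit` (with a warning) when
# `value` exceeds it.
cap_to_dim <- function(value, limit, name = "parameter") {
  if (value > limit) {
    warning(name, " = ", value, " exceeds ", limit, "; using ", limit,
            call. = FALSE)
    return(limit)
  }
  value
}

# Predictor matrix for the running example: every mtcars column except mpg.
x <- mtcars[, setdiff(names(mtcars), "mpg")]

# A column-based parameter: capped at ncol(x), with a warning.
mtry <- cap_to_dim(20, ncol(x), "mtry")

# A row-based parameter: already within nrow(x), returned unchanged.
min_n <- cap_to_dim(5, nrow(x), "min_n")
```

Warning instead of silently capping is a design choice in this sketch; the issue does not say which behaviour is preferred.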