Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"harden" parsnip to potential dimensionality bugs #184

Closed
topepo opened this issue Jun 27, 2019 · 7 comments · Fixed by #377
Closed

"harden" parsnip to potential dimensionality bugs #184

topepo opened this issue Jun 27, 2019 · 7 comments · Fixed by #377
Labels
feature a feature request or enhancement

Comments

@topepo
Copy link
Member

topepo commented Jun 27, 2019

Although the data descriptors can solve this, we should look at capping the value of any tuning parameters that are based on either nrow() or ncol().

For example, with mtry in random forests, we should, somewhere use mtry = max(mtry, ncol(x)).

@topepo topepo added the tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day label Jun 27, 2019
@patr1ckm
Copy link
Contributor

I think making this change requires passing the data to either check_args or translate. As check_args usually only throws errors, perhaps translate. Does that sound like the right refactor?

@patr1ckm
Copy link
Contributor

patr1ckm commented Oct 30, 2019

I'm wondering if data should be passed to both check_args and translate. Passing data to check_args would also allow throwing an error message for the situation reported in #206 in which a multinomial model is specified but constructed withlogistic_reg which assumes that the dependent variable only has 2 levels.

@topepo
Copy link
Member Author

topepo commented Nov 21, 2019

I think making this change requires passing the data to either check_args or translate. As check_args usually only throws errors, perhaps translate. Does that sound like the right refactor?

That might be a good place to do it.

Here's an example:

library(parsnip)

rf_mod <- 
  rand_forest(mtry = 20) %>% 
  set_mode("regression") %>% 
  set_engine("ranger")

fit(rf_mod, mpg ~ ., data = mtcars)
#> Error in ranger::ranger(formula = formula, data = data, mtry = ~20, num.threads = 1, : User interrupt or internal error.
#> Timing stopped at: 0.007 0.001 0.008
# Error: mtry can not be larger than number of variables in data. Ranger will EXIT now.


# the code being executed:
rand_forest(mtry = 20) %>% 
  set_mode("regression") %>% 
  set_engine("ranger") %>% 
  translate()
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 20
#> 
#> Computational engine: ranger 
#> 
#> Model fit template:
#> ranger::ranger(formula = missing_arg(), data = missing_arg(), 
#>     case.weights = missing_arg(), mtry = 20, num.threads = 1, 
#>     verbose = FALSE, seed = sample.int(10^5, 1))

Created on 2019-11-21 by the reprex package (v0.3.0)

(ranger should really be smarter than this).

Since we only fit univariate models for random forest, we could substitute mtry = 20 with the expression mtry = min(20, ncol(data) - 1) or similar.

@topepo topepo added next release 🚀 and removed tidy-dev-day 🤓 Tidyverse Developer Day rstd.io/tidy-dev-day labels Dec 2, 2019
@juliasilge
Copy link
Member

juliasilge commented Mar 13, 2020

I spent some time today (probably too much) looking at this, and passing the data to translate() or check_args() seems like a pretty drastic change from what we have been saying a model specification is. Users would then be passing the data to both their model specification and when they fit their model, which seems somewhat awkward based on how we have been framing tidymodels overall. It seems less than ideal to me, or at least would require some rethinking of how we have been talking about what a model specification is.

One idea I have been working up an example for is instead making use of the data descriptors, with something like this, in probably translate():

mtry <- call2("min",
              rlang::eval_tidy(arg_vals$mtry),
              rlang::expr(.preds()))

I don't have the environments where these get evaluated all correct yet, but I feel like I am getting somewhere with it:

library(parsnip)

rf_mod <- 
  rand_forest(mtry = 20) %>% 
  set_mode("regression") %>% 
  set_engine("ranger")

rf_mod
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 20
#> 
#> Computational engine: ranger

translate(rf_mod)
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 20
#> 
#> Computational engine: ranger 
#> 
#> Model fit template:
#> ranger::ranger(formula = missing_arg(), data = missing_arg(), 
#>     case.weights = missing_arg(), mtry = min(20, .preds()), num.threads = 1, 
#>     verbose = FALSE, seed = sample.int(10^5, 1))

fit(rf_mod, mpg ~ ., data = mtcars)
#> Error: Descriptor context not set
#> Timing stopped at: 0.008 0 0.008

Created on 2020-03-13 by the reprex package (v0.3.0)

You can see that the context for the data descriptor isn't being set correctly yet, but I think this could be a good way to go, instead of having users pass data to the model specification.

What are your thoughts @topepo? This will take some rlang digging on my part to really solve.

@topepo
Copy link
Member Author

topepo commented Mar 14, 2020

I feel that the data should not be seen by the specification; only when one of the fit functions are called.

My original thoughts was along the lines of what you suggest using the descriptors. Is that mtry example in a branch somewhere? I wonder where .preds() was being evaluated.

@juliasilge
Copy link
Member

I took out all the ill-considered stuff I tried and added the tiny example shown here in #270 so we can both check out where the call is being evaluated.

@github-actions
Copy link

github-actions bot commented Mar 6, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants