Make arguments robust to values outside of data dimensions #377

Merged: 7 commits merged on Oct 22, 2020

topepo (Member) commented Oct 10, 2020

closes #184

closes #270

Uses the translate() function to turn some arguments into expressions that are bounded by the data dimensions, instead of passing the specific values through unchanged. For example:

> nearest_neighbor(neighbors = 1000) %>%
+   set_engine("kknn") %>%
+   set_mode("regression") %>% 
+   translate()
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = 1000

Computational engine: kknn 

Model fit template:
kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
    ks = min(1000, nrow(data) - 5))

For C5.0 and xgboost models, the wrapper functions were used to bound the values based on the actual data dimensions.
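
To make the bounding idea concrete, here is a minimal base R sketch. The helper `clamp_neighbors()` and the toy data frame `small` are hypothetical names used only for illustration; this is not parsnip's internal code, just the same `min(value, nrow(data) - 5)` pattern that appears in the kknn template above.

```r
# Hypothetical helper, for illustration only; not parsnip's internal code.
# It mirrors the template expression `min(1000, nrow(data) - 5)`.
clamp_neighbors <- function(neighbors, data, offset = 5) {
  min(neighbors, nrow(data) - offset)
}

# The same bound kept as an unevaluated expression. Storing an expression
# rather than a number is what allows the cap to be computed later, once
# `data` is actually known.
ks_expr <- quote(min(1000, nrow(data) - 5))

# Toy data with 35 rows (assumed purely for the example).
small <- data.frame(x = 1:35, y = seq(0, 1, length.out = 35))

big_request   <- clamp_neighbors(1000, small)       # capped at 35 - 5
small_request <- clamp_neighbors(7, small)          # passes through unchanged
from_expr     <- eval(ks_expr, list(data = small))  # same cap, evaluated late
```

The second call shows that a value already inside the valid range is left alone; only out-of-range requests are pulled back.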

topepo requested a review from juliasilge on October 10, 2020 01:26
juliasilge (Member) commented Oct 13, 2020

I think the reason I was trying to use the data descriptors back in June, instead of just nrow() or ncol(), was that I was concerned about how dummy variable columns get counted. Do you think that is worth worrying about?

library(parsnip)

rf_mod <- 
  rand_forest(mtry = 20) %>% 
  set_mode("regression") %>% 
  set_engine("ranger")

translate(rf_mod)
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 20
#> 
#> Computational engine: ranger 
#> 
#> Model fit template:
#> ranger::ranger(x = missing_arg(), y = missing_arg(), case.weights = missing_arg(), 
#>     mtry = min(~20, ncol(x)), num.threads = 1, verbose = FALSE, 
#>     seed = sample.int(10^5, 1))

fit(rf_mod, circumference ~ ., data = Orange)
#> parsnip model object
#> 
#> Fit time:  17ms 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min(~20,      ncol(x)), num.threads = 1, verbose = FALSE, seed = sample.int(10^5,      1)) 
#> 
#> Type:                             Regression 
#> Number of trees:                  500 
#> Sample size:                      35 
#> Number of independent variables:  2 
#> Mtry:                             2 
#> Target node size:                 5 
#> Variable importance mode:         none 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       125.2415 
#> R squared (OOB):                  0.9621042

Created on 2020-10-13 by the reprex package (v0.3.0.9001)

The variable Tree is a factor so really it would be OK to go up to mtry = 5 here. It gets cut down to 2, though.

ranger was a bad example there, but I am still trying to figure out where the ncol() is evaluated.
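
The column-counting concern can be seen with base R alone. This is a rough sketch, not a description of how parsnip or ranger count predictors: it compares the number of predictor columns in the raw data frame with the number of columns after `model.matrix()` expands the factor `Tree`. The variable names `x`, `mm`, and `n_expanded` are just for this example.

```r
# Predictors as they appear in the data frame: Tree (a factor) and age.
x <- Orange[, c("Tree", "age")]
n_raw <- ncol(x)

# Predictors after factor expansion by model.matrix().
# The intercept column is dropped from the count.
mm <- model.matrix(circumference ~ ., data = Orange)
n_expanded <- ncol(mm) - 1

# The same requested mtry is capped differently depending on which
# column count is used as the upper bound.
mtry_raw      <- min(20, n_raw)
mtry_expanded <- min(20, n_expanded)
```

Here `mtry_raw` matches the `Mtry: 2` reported in the fit above, while `mtry_expanded` corresponds to the `mtry = 5` mentioned for the expanded encoding. Which bound is appropriate depends on whether the engine receives the factor as a single column or as indicator columns.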

juliasilge (Member) commented Oct 13, 2020

OK, I am convinced now! 😄 Moving so many models to the x/y format has really had some nice outcomes and made things like this more doable.

This behavior should probably be documented somewhere (not too obtrusive) so it is not a total surprise to people. I'll make a suggestion.

juliasilge (Member) commented

Actually, I changed my mind on adding anything to the documentation. I think from how these arguments are documented it is clear enough what they are supposed to be. ✅

topepo merged commit 8af904b into master on Oct 22, 2020
topepo deleted the check-arg-dimensions branch on October 22, 2020 22:46
github-actions bot commented Mar 6, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators on Mar 6, 2021

Successfully merging this pull request may close these issues.

"harden" parsnip to potential dimensionality bugs