
xgboost mtry parameter swap for #495 #499

Merged · 2 commits merged into master from xgb-mtry on May 21, 2021

Conversation

@topepo (Member) commented May 19, 2021

closes #495
closes #461

colsample_bytree remains an argument to xgb_train(). The mtry parameter in boost_tree() now points to colsample_bynode.
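
To make that concrete, here's a minimal sketch (illustrative values only, not code from this PR) of how the mapping looks from the user's side:

```r
# Illustrative sketch: after this change, the `mtry` main argument maps to
# xgboost's `colsample_bynode`, while `colsample_bytree` can still be passed
# through as an engine-specific argument.
library(parsnip)

spec <- boost_tree(mtry = 3, trees = 500) %>%        # mtry -> colsample_bynode
  set_engine("xgboost", colsample_bytree = 0.8) %>%  # still reaches xgb_train()
  set_mode("regression")
```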

We might want to add more engine-specific tunables to tune for this engine.

@mdancho84 I think that modeltime would need the same switch.

(edit for clarity)

@topepo requested a review from @juliasilge on May 19, 2021 at 14:50
@juliasilge (Member)

So now both of these parameters are going to have the problem outlined in #461

@juliasilge (Member)

I reran some analyses with xgboost, including this one, and the results are different in about the ways I think we'd expect:

[Image: new_mtry]

No big changes in overall performance after tuning, but it switched some things in the variable importance and such. People who come back to train with the same data after this change are going to notice.

@topepo merged commit 46a2018 into master on May 21, 2021
@topepo deleted the xgb-mtry branch on May 21, 2021 at 13:53
@mdancho84 (Contributor)

I'm here now - just seeing this. Yep, Modeltime will need the switch (I believe).

@mdancho84 (Contributor) commented May 25, 2021

Hey, I've reviewed it and I see one potential issue with backwards compatibility. Models that were specified with a value of 1, on the assumption that it means 100% of the columns, will now be converted to using a single column, which makes model performance very bad.

I recommend handling 1 as 100% of the columns, not as one column.

The case in which a user actually wants to use only 1 column should be rare, and handling 1 as 100% is consistent with the underlying xgboost::xgb.train() function.
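
For reference, a rough sketch (made-up data, illustrative values) of how plain xgboost treats that value:

```r
# Rough sketch with illustrative values: in xgboost itself, colsample_bynode
# is a proportion in (0, 1], so a value of 1 means "sample 100% of the columns
# at each split", not "use one column".
library(xgboost)

dtrain <- xgb.DMatrix(as.matrix(mtcars[, -1]), label = mtcars$mpg)

fit <- xgb.train(
  params  = list(objective = "reg:squarederror", colsample_bynode = 1),
  data    = dtrain,
  nrounds = 10
)
```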

@topepo (Member, Author) commented May 25, 2021

The previous behavior of "1.00 means 100%, but otherwise it is a count" was a big mistake on my part.

Treating 1 as 100% is consistent with xgboost, but completely inconsistent with randomForest, ranger, gbm, C50, and so on. We want to avoid people having to think "but for xgboost I have to do this instead of that". There are probably 4-5 things in the overall xgboost API that are irregular (and we fix them).
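
For comparison, a quick sketch (illustrative only) of the count-based convention those other engines use, which is what mtry now follows for xgboost as well:

```r
# Illustrative sketch: in count-based engines such as ranger, mtry = 1 means
# exactly one predictor is considered at each split, not 100% of them.
library(ranger)

ranger(mpg ~ ., data = mtcars, mtry = 1)
```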

@mdancho84 (Contributor)

We are good to go. Let me know when parsnip is accepted by CRAN, and I will update modeltime.

@github-actions (bot) commented Jun 9, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions bot locked and limited conversation to collaborators on Jun 9, 2021
Development

Successfully merging this pull request may close these issues.

- mtry maps to wrong parameter for XGBoost
- when mtry = 1 all predictors are used in xgboost engine