Add support for catboost into boost_tree() #117

Closed
jaredlander opened this issue Jan 8, 2019 · 9 comments
Labels
feature a feature request or enhancement

Comments

@jaredlander

catboost (@catboost) is growing in popularity and has an R package which easily supports GPUs out of the box. @annaveronika has some nice benchmarks comparing this to xgboost.

It requires a data.frame (will tbls work?) for the predictors (data) and a vector for the response (label), which can be both categorical and numeric, though categorical predictors need to be factors according to this notebook. Hopefully this can be worked out nicely with both recipes and the formula.

It also allows categorical predictors to be one-hot encoded or not, which should make @topepo happy.

@annaveronika

Sorry for the wrong comments; I thought this was an issue in our repo.

@dbakshee

Hello!
If you see any issues blocking support of @catboost in boost_tree(), let our dev team know so that we can fix them sooner. We see that the latest catboost, 0.12.2, does not support string labels in the R interface, so one needs to convert the labels to numeric. I mean this:

pool <- catboost.load_pool(data=state.x77, label=state.region)  # FAILS
pool <- catboost.load_pool(data=state.x77, label=sapply(state.region, hash))  # works

We'll have this fixed.
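
A minimal base R sketch of the workaround described above, assuming `hash` in the snippet stands for any string-to-number mapping. The helpers `encode_label()` and `decode_label()` are hypothetical and not part of catboost; they turn a factor or character outcome into integer codes and keep the levels so predictions can be mapped back. Whether catboost expects 0-based or 1-based class codes is not stated in this thread, so treat the 1-based codes here as an assumption.

```r
# Hypothetical helpers, not part of the catboost package.

# Convert a factor or character outcome to integer codes, keeping the levels.
encode_label <- function(y) {
  f <- if (is.factor(y)) y else factor(y)
  list(codes = as.integer(f), levels = levels(f))  # codes are 1-based
}

# Map integer codes back to the original labels.
decode_label <- function(codes, levels) {
  factor(levels[codes], levels = levels)
}

# state.region ships with base R's datasets package.
enc <- encode_label(state.region)
round_trip <- decode_label(enc$codes, enc$levels)

# With catboost installed, the numeric codes could be passed as the label:
# pool <- catboost.load_pool(data = state.x77, label = enc$codes)
```

Keeping the levels next to the codes is what lets predicted classes be reported under their original names.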

@topepo
Member

topepo commented Feb 25, 2019

Just getting to this now....

I'd love to include this; I've heard very good things about performance.

One general thought that I have is that the R interface isn't very R-like. Would it be possible to have a conventional S3 interface that looks something like:

catboost.formula(formula, data, <model arguments>)
catboost.default(x, y, <model arguments>)

The current interface is very flexible but the biggest frustration that I've heard about xgboost (which is similar in interface) is that people feel like they are working for xgboost instead of the other way around.
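
To make the proposed shape concrete, here is a rough sketch of an S3 generic with a formula method and an x/y method. Everything in it is an assumption for illustration: `cb_fit` is a hypothetical name, the training step is stubbed out so the code runs without catboost installed, and `iterations` is only passed through as an example model argument.

```r
# Hypothetical generic and methods; none of these names exist in catboost.
cb_fit <- function(x, ...) UseMethod("cb_fit")

# x/y method: accepts predictors plus a numeric, factor, or character outcome.
cb_fit.default <- function(x, y, ...) {
  x <- as.data.frame(x)
  if (is.character(y)) y <- factor(y)
  lvls <- if (is.factor(y)) levels(y) else NULL
  label <- if (is.factor(y)) as.integer(y) else as.numeric(y)
  # A real method would convert to the engine's format and train here,
  # e.g. via catboost.load_pool() and catboost.train(). This stub only
  # records what it was given.
  structure(
    list(predictors = names(x), label = label, levels = lvls, args = list(...)),
    class = "cb_fit"
  )
}

# Formula method: split the data frame, then delegate to the x/y method.
cb_fit.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data = data)
  cb_fit.default(mf[, -1, drop = FALSE], model.response(mf), ...)
}

# Both calls go through the same generic.
fit_xy <- cb_fit(state.x77, state.region, iterations = 50)
fit_f <- cb_fit(Species ~ ., data = iris, iterations = 50)
```

The formula method leaves factor predictors as factors instead of expanding them to dummy variables, so the choice of encoding stays with the engine.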

<rant>

parsnip aims to help this by providing a common set of R interfaces but sometimes the underlying packages make this very difficult. For example, not supporting string labels makes sense from the computational side but forces everyone to do something different for your package. If the hashing code above is the solution, why not implement that under the hood of your function if the outcome is a string (or factor?)? Similarly, the conversion of the data to a catboost specific format is putting the problems on the user.

We are trying to push developers to design separate functions/design for the user-interface and the computational module. A lot of that is encapsulated in these guidelines. In summary, we don't want people to be frustrated by your user interface.

</rant>

(small edit)

@topepo
Member

topepo commented Feb 25, 2019

I should have also said that we'd be happy to help with the R interface. We're building some foundational code that could help make those user interfaces easier to create.

@annaveronika

@topepo We are now trying to add CatBoost to CRAN. There is some delay from the moderators, but I think this should happen soon.
About your help with the R interfaces: we would very much appreciate that. We largely copied our R interfaces from XGBoost because we thought that is what people would like, since they are used to it.
We would love to do a better job. One option is that you open a PR adding the new interfaces and we add the implementations. Or you could take the implementation from other functions.
I don't think we can remove the existing interfaces, but we can add a pack of new ones.

We also have much more functionality in our Python package: cross-validation, a set of utils, and more. We would love to add all of this to R using interfaces that are clear for R users.
So we would also appreciate it if you would look into that.

I have created an issue on our gh: catboost/catboost#777
We can do the further work in this issue.

@juliasilge juliasilge added the feature a feature request or enhancement label Apr 3, 2020
@Athospd

Athospd commented Jul 31, 2020

Hello! We implemented lightgbm and catboost engines in this package: curso-r/treesnip.

@juliasilge
Member

Thank you for sharing, @Athospd! 🙌 We may open an issue on that new repo with some details you all may want to explore on engine-specific arguments.

I'm going to close this issue, since catboost is still not on CRAN and supporting R modeling engines not on CRAN is outside the scope of tidymodels goals.

@jaredlander
Author

LightGBM is getting close to being on CRAN.

microsoft/LightGBM#3188

Don't know how close catboost is.

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 16, 2021