
XGBoost Support #23

Closed
gaw89 opened this issue Apr 27, 2017 · 8 comments

Comments


gaw89 commented Apr 27, 2017

Is BorutaPy compatible with XGBoost? If not, would you be interested in a PR for that compatibility (assuming it's possible and I can figure it out)?

It seems to me that this is not currently supported since I got an error when I tried it with XGBClassifier, but I wanted to know if there's any official word.

Thanks!

@danielhomola (Collaborator)

Hi,

BorutaPy now works with all tree-based methods of scikit-learn, so it'll work with GradientBoostingClassifier, which is pretty close to XGBoost. XGBoost is super fast though, so adding it would be amazing.
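(For reference, a minimal sketch of that kind of usage with a scikit-learn gradient booster; the toy data and parameter values below are purely illustrative:)

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from boruta import BorutaPy

# Toy data: only the first two features actually carry signal.
X = np.random.randn(200, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# BorutaPy works with any estimator exposing fit() and feature_importances_.
selector = BorutaPy(GradientBoostingClassifier(), n_estimators=50, random_state=42)
selector.fit(X, y)
print(selector.support_)  # boolean mask of confirmed features
```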

I think there's a scikit-learn-like interface for XGBoost that has the fit method, which is great. The only other thing you need to use BorutaPy with XGBoost is a way to extract variable importances from the model. In BorutaPy this is done through scikit-learn's feature_importances_ property. Unfortunately the creators of XGBoost's scikit-learn-like interface did not comply with this and instead use a get_score() method to return the variable importances.

So I guess the easiest thing to do would be to take the scikit-learn-like interface of XGBoost and extend the class with a feature_importances_ property (which would just call get_score() under the hood).
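Something along those lines, roughly (a sketch only: the class name and the `_n_features` bookkeeping are mine, the booster is reached via `get_booster()` or `booster()` depending on the XGBoost version, and newer releases may already ship a `feature_importances_` property):

```python
import numpy as np
from xgboost import XGBClassifier

class XGBClassifierWithImportances(XGBClassifier):
    """XGBClassifier exposing the feature_importances_ property BorutaPy expects."""

    def fit(self, X, y, **kwargs):
        # Remember the column count so features that get_score() omits
        # (those never used in a split) can be filled in with zeros.
        self._n_features = X.shape[1]
        return super().fit(X, y, **kwargs)

    @property
    def feature_importances_(self):
        # get_score() returns a dict such as {'f0': 12.0, 'f3': 5.0, ...}
        scores = self.get_booster().get_score(importance_type='weight')
        importances = np.zeros(self._n_features)
        for name, value in scores.items():
            importances[int(name[1:])] = value
        total = importances.sum()
        return importances / total if total > 0 else importances
```

An instance of this could then be passed to BorutaPy like any scikit-learn estimator, e.g. `BorutaPy(XGBClassifierWithImportances(), n_estimators=100)`.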

I don't want to add this to BorutaPy itself because it's supposed to support scikit-learn and not other packages. Nonetheless, if you find a way to do this, let me know and I'll maybe include it in the README as an example.


gaw89 commented Apr 29, 2017

Hello Daniel,

Thanks for your response. I looked into this a little more, and it appears (as you mentioned) that the issue lies with XGBoost's implementation of the scikit-learn API. I made some alterations and got it working with Boruta. I am working with the XGBoost maintainers to see if they'll accept a PR for the updates to the API.

I will post back here if/when the PR gets through.

Great package by the way!

@danielhomola (Collaborator)

Sweet, thanks!


mbq commented Apr 30, 2017

@gaw89 Out of curiosity, does this even work? I would expect XGBoost scores to just degenerate Boruta into a minimal optimal method...

@danielhomola (Collaborator)

hm interesting point Miron.. Is that because of this?


gaw89 commented May 1, 2017

@mbq if you are referring to the link that Daniel posted, it seems you may be right, at least in the case of correlated features. However, I wonder - is the only difference between minimal optimal and all relevant feature selection that all relevant retains correlated features?

Also, since Boruta uses multiple runs of the wrapper algorithm, it seems pretty unlikely that, given correlated features A and B, it would choose A every time and never choose B (particularly when using a large number of runs of the algorithm). I suppose this also assumes a different random seed for each run of the underlying algorithm, but I am not sure about that.

I guess the real question here is, will there be any utility in using something other than Random Forest for the Boruta wrapper? Will I get a different/better set of features using XGBoost, decision tree, etc. than using Random Forest? My thought was, since I am using XGBoost for my final model, it would be better to use the same algorithm with the same parameters for my feature selection, but maybe this isn't the case. Do either of you have any thoughts on this?


gaw89 commented May 1, 2017

Cool algorithm/package by the way!


mbq commented May 1, 2017

Thanks!

This correlated features issue is a bit more complicated; imagine a set with features A, B and C such that both A and f(B, C) explain Y perfectly, and f is non-trivial in the sense that neither B nor C on its own is a good predictor. Now, no greedy CART-based method will ever touch B or C, regardless of the random seed; however, after removing A, they will happily build a 100% accurate model on them.
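(A quick synthetic illustration of that point; my own sketch with a plain scikit-learn decision tree, not something from the thread:)

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
B = rng.randint(0, 2, 1000)
C = rng.randint(0, 2, 1000)
y = B ^ C            # f(B, C) = XOR: neither B nor C alone predicts y
A = y.copy()         # A explains y perfectly on its own
X = np.column_stack([A, B, C])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.feature_importances_)   # all importance goes to A: [1. 0. 0.]

# Drop A and the same greedy learner happily reaches 100% accuracy with B and C.
tree_bc = DecisionTreeClassifier(random_state=0).fit(X[:, 1:], y)
print(tree_bc.score(X[:, 1:], y))  # 1.0
print(tree_bc.feature_importances_)
```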

One may ask who needs B or C when A does the job? The standard answer is that it allows for a better understanding of the problem, which is useless if you only need a good model. The less standard one is that it leads to more robust models (A might be clean in train but get noisier in test); also, in p >> n settings, spurious random correlations may easily become indistinguishable from the real ones, so AR (all relevant) selection is better for designing further studies / doing meta-analyses (A may be a lucky piece of nonsense). Formally, it is even crazier because there is no minimal optimal selection when there is a perfect duplicate of information, but I'll leave that for now.

Back to your main question. Boruta is mostly for drawing a line between weakly relevant features and noise; it basically assumes that the importance source scans the feature space more or less homogeneously. Greedy methods like canonical boosting go in the opposite direction, hence my concern that they are not the best importance sources; but I haven't tested that, especially the stochastic modifications, so I don't know for sure.

About the final model: theoretically, the AR set is not method-specific, so it won't depend on the method that produced it (provided that method works well), and using the same modelling in both places shouldn't be beneficial. What's more important, again theoretically, the AR set is totally redundant for model optimisation, while being way more expensive to obtain than the minimal optimal (MO) set. The only caveat is the aforementioned robustness thing, but it should be relevant only in pathological cases anyway.
