best boosting AUC? #15

Closed
szilard opened this Issue May 31, 2015 · 21 comments


@szilard
Owner
szilard commented May 31, 2015

@tqchen @hetong007 I'm trying to get a good AUC with boosting for the largest dataset (n = 10M). Would be nice to beat random forests :)

So far I did some basic grid search https://github.com/szilard/benchm-ml/blob/master/3-boosting/0-xgboost-init-grid.R for n = 1M (not the largest dataset), and it seems like deeper trees, min_child_weight = 1, and subsample = 0.5 work well.

I'm now running https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R with n = 10M by just looping over max_depth = c(2,5,10,20,50), but it's been running for a while.

Any suggestions?

The smallest learning rate I'm using is eta = 0.01; any experience with smaller values?

PS: See results so far here: https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines

@tqchen
tqchen commented May 31, 2015

One thing that might be interesting to try is using integer encoding for the dates (I seemed to get a far better result with simply depth = 6).

@tqchen
tqchen commented May 31, 2015

If no overfitting is happening, eta = 0.01 should be good enough. Another interesting thing to try, which I always do to optimize AUC, is to re-balance the weights. In particular, setting

scale_pos_weight = num_neg_example/num_pos_example

will re-balance the positive and negative weights, which usually works better for AUC. The effect is more significant on unbalanced datasets, though, so I am not sure what will happen here.
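The re-balancing suggestion above can be sketched as follows (a minimal Python illustration; the toy label list and the parameter dict contents are assumptions for the example, not from the thread):

```python
# Compute scale_pos_weight from binary labels (1 = positive, 0 = negative).
labels = [1, 0, 0, 0, 1, 0]  # toy example; in practice, the training labels

num_pos = sum(1 for y in labels if y == 1)
num_neg = sum(1 for y in labels if y == 0)
scale_pos_weight = num_neg / num_pos  # here: 4 / 2 = 2.0

# A parameter dict in the usual xgboost style, passed to training:
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight,  # up-weights positive examples
}
print(params["scale_pos_weight"])
```

With a roughly balanced dataset like this one, scale_pos_weight ends up close to 1, which is why the effect may be small here.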

@szilard
Owner
szilard commented May 31, 2015

Well, I deliberately considered that "forbidden" :) See some discussion here: #1

The reason is that I want the benchmark on a dataset with a mix of categorical and numeric features (with more categoricals), similar to an industry/business dataset. So I'm making the day of the week, month, etc. somewhat artificially categorical:

for (k in c("Month","DayofMonth","DayOfWeek")) {
  d_train[[k]] <- as.factor(d_train[[k]])
}

If I made them ordinal/numeric, those variables would dominate the prediction importance-wise and the dataset would have too many "numeric" features.

So, the game between RF and boosting is on, in the sense that those variables need to be categorical and no other feature engineering is allowed ;)

@szilard
Owner
szilard commented May 31, 2015

Re @tqchen 2nd comment:

This dataset is pretty well balanced. What's very handy in xgboost (and missing from the other tools) is early stopping :) And I use it with eval_metric = "auc" :)

I'll let you know when this run finishes: https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R

@szilard
Owner
szilard commented May 31, 2015

I might also try eta = 0.001, though eta = 0.01 is already painfully slow. Btw, is there any paper/result on decreasing learning rates during training (for boosting)? Or even some "adaptive" learning rate; see e.g. Vowpal Wabbit for the linear case.

@tqchen
tqchen commented May 31, 2015

I do not know if there is any theory on decreasing the learning rate. I see you set the subsample parameter; there is another one, colsample_bytree, which sub-samples columns. It usually makes the result less prone to overfitting and the running time faster (set it to 0.5 or 0.3).
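The column-subsampling idea is roughly this (a hedged sketch of the concept, not xgboost's internals; the helper name and seed are made up for illustration): before growing each tree, draw a random fraction of the columns and only consider those for splits.

```python
import random

def columns_for_tree(n_cols, colsample_bytree, rng):
    """Pick the subset of column indices one tree is allowed to split on."""
    k = max(1, int(n_cols * colsample_bytree))
    return sorted(rng.sample(range(n_cols), k))

rng = random.Random(42)
# With 8 features and colsample_bytree = 0.5, each tree sees 4 columns;
# a fresh subset is drawn per tree, which decorrelates the trees.
print(columns_for_tree(8, 0.5, rng))
print(columns_for_tree(8, 0.5, rng))
```

Fewer candidate columns per tree also means fewer split evaluations, which is where the speedup comes from.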

@tqchen
tqchen commented May 31, 2015

Although GBM usually wins on complicated cases with more features, where many features play an important role, maybe this dataset is not such a case. Since the variables are made explicitly categorical, there are currently too few integer features.

@szilard
Owner
szilard commented May 31, 2015

Hm... a quick Google search doesn't bring up anything. Besides VW, there is simulated annealing in various contexts (e.g. neural nets), etc. This might be useful for boosting...

Yeah, colsample_bytree is needed for RF. I'll try it out for boosting, thanks :)

@szilard
Owner
szilard commented May 31, 2015

Yeah, I wish I had chosen a dataset with more columns... Anyway, my main focus here is to see which tools can run on 10M rows in decent time and with decent AUC, and the AUC for boosting is pretty close to RF.

@hetong007
Contributor

Just for reference, I sometimes try this adaptive learning rate method: http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf . It is not implemented in xgboost.

@szilard
Owner
szilard commented May 31, 2015

Oh, now I remember reading this http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf which is implemented in H2O deep learning.

Btw, do you think this has been adapted for GBMs? @hetong007: when you say "sometimes I try this...", what tool are you using? (Is it boosting/GBM or something else, like linear models?)

@tqchen
tqchen commented May 31, 2015

The nature of boosting is very different from a linear solver, which means AdaGrad may not be directly applicable here. Actually, even for deep learning, AdaGrad is not the best choice for common convnets (though maybe a safe one).

However, it is actually straightforward to tweak the R/Python code of xgboost to implement a decaying learning rate without touching the C++ part, so maybe it is interesting to try.

@hetong007
Contributor

Sometimes in my research I write small prototypes to compare, but basically for matrix factorization.

@tqchen I think currently we cannot get the gradient of each update from R/Python, so at least the method I posted is not applicable.

@szilard
Owner
szilard commented May 31, 2015

Here are some options for decay (after a quick web search): http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40808.pdf I was doing some simulated-annealing-related stuff in physics 20 years ago ;)

Anyway, maybe it's easy to change at the C++ level in xgboost?

@tqchen
tqchen commented May 31, 2015

There is a feature called a customized loss function in xgboost, which should do the job in R. Adding a learning rate is equivalent to scaling the h statistics :)
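One way to read that remark: each boosting step sets a leaf weight of roughly -G/H (sum of gradients over sum of hessians, ignoring regularization), so multiplying the hessians by 1/eta_t shrinks the step by a factor of eta_t. A decaying schedule could then be emulated inside a custom objective. A toy sketch under those assumptions (the schedule, function names, and constants are all made up for illustration, not xgboost API):

```python
def eta_at_round(t, eta0=0.1, decay=0.99):
    """Exponentially decaying learning rate for boosting round t."""
    return eta0 * decay ** t

def scaled_objective(grad, hess, t):
    """Emulate a per-round learning rate eta_t inside a custom objective:
    scaling the hessians by 1/eta_t shrinks each leaf weight -G/H by
    roughly a factor of eta_t (ignoring regularization)."""
    eta_t = eta_at_round(t)
    return grad, [h / eta_t for h in hess]

# Leaf weight -G/H with the scaling, at round 0 (eta_t = 0.1):
grad, hess = [2.0], [4.0]
g2, h2 = scaled_objective(grad, hess, 0)
print(-g2[0] / h2[0])  # roughly 0.1 * (-2.0 / 4.0) = -0.05
```

Whether the approximation holds with min_child_weight and lambda in play is a separate question; this only illustrates the h-scaling trick.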

@tqchen
tqchen commented May 31, 2015

BTW, @szilard, maybe it is interesting to do some feature importance analysis on the trees learnt; see this example.

I guess the result will have a few very important features.

@szilard
Owner
szilard commented May 31, 2015

Yeah, I think I took a quick look at the variable importance for RF in H2O. There are only 8 variables though...

@szilard
Owner
szilard commented May 31, 2015

I'll have to look at custom loss functions, probably tomorrow...

@tqchen
tqchen commented Jun 2, 2015

great, thanks @szilard

@szilard
Owner
szilard commented Jun 2, 2015

Here is Time/AUC for a few settings, all with n = 10M (dataset size), nround = 5000, max_depth = 20, eta = 0.01, min_child_weight = 1:

| subsample | colsample_bytree | Time | AUC |
|-----------|------------------|---------|-------|
| 0.5 | 1 | 50000s | 0.811 |
| 1 | 1 | 49000s | 0.805 |
| 0.5 | 0.5 | 35000s | 0.810 |

@szilard szilard closed this Jun 9, 2015