
add xgboost to benchmark #2

Closed
tqchen opened this issue Apr 27, 2015 · 15 comments

@tqchen commented Apr 27, 2015

Just discovered this repo. Since you are comparing gradient boosting algorithms, it would be great if you could add https://github.com/dmlc/xgboost to the comparison.

It also has support for random forests.

Thanks!

@szilard (Owner) commented Apr 27, 2015

Yeah, I was planning to look at that via the R package when I get to boosting (in a few days). Right now I'm running linear models as a baseline (which naturally have poor accuracy).

@tqchen (Author) commented Apr 27, 2015

The relative accuracy of tree-based vs. linear models depends on the type of data you are working with. On low-dimensional categorical or continuous data (which seems to be your case), tree-based models almost always work better.

@szilard (Owner) commented Apr 27, 2015

Yes, I know. My focus is credit card fraud, hence the choice of data/benchmark as described in the README.

@tqchen (Author) commented Apr 27, 2015

Thanks for explaining this. In xgboost we handle categorical data via one-hot encoding. A sparse matrix format is supported and optimized for, so it can be fast and memory-efficient when the input comes from a sparse matrix.

If you are willing to create a libsvm-format version of the inputs, there is also an interesting new external-memory computing feature that could be tried.

I am looking forward to seeing the results.
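
A minimal sketch of the sparse one-hot workflow described above, using the xgboost R package. The toy data frame, its column names (loosely modeled on the airline data), and the file path in the external-memory example are illustrative assumptions, not the benchmark's actual code:

```r
library(xgboost)
library(Matrix)

# Tiny toy data frame so the sketch runs standalone (illustrative columns).
d_train <- data.frame(
  Month             = factor(sample(month.abb, 1000, replace = TRUE)),
  Carrier           = factor(sample(LETTERS[1:5], 1000, replace = TRUE)),
  Distance          = runif(1000, 100, 3000),
  dep_delayed_15min = factor(sample(c("N", "Y"), 1000, replace = TRUE))
)

# One-hot encode the categorical columns directly into a sparse matrix;
# sparse.model.matrix() never materializes the dense dummy columns.
X_train <- sparse.model.matrix(dep_delayed_15min ~ . - 1, data = d_train)
y_train <- as.numeric(d_train$dep_delayed_15min == "Y")
dtrain  <- xgb.DMatrix(data = X_train, label = y_train)

# External-memory variant: point xgb.DMatrix at a libsvm-format file and
# append "#<name>.cache" so xgboost streams the data through a disk cache
# instead of holding it all in RAM ("train.libsvm" is a placeholder path).
# dtrain_ext <- xgb.DMatrix("train.libsvm#dtrain.cache")
```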

@szilard (Owner) commented May 14, 2015

Ran xgboost to build random forests using the code provided by @hetong007. It looks good (see main README), but something weird happens at the largest data size (n = 10M): the trends for run time and AUC "break"; see the figures in the main README.
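
For reference, a sketch of how xgboost is typically run in random-forest mode. Apart from max_depth = 20, which the discussion below mentions, the parameter values are my own illustrative assumptions; @hetong007's actual code may differ:

```r
library(xgboost)

# Toy data so the sketch runs standalone.
X <- matrix(rnorm(1000 * 10), nrow = 1000)
y <- rbinom(1000, 1, 0.5)
dtrain <- xgb.DMatrix(data = X, label = y)

# Random-forest mode: a single boosting round (nrounds = 1) that grows
# num_parallel_tree bagged trees at once; subsample / colsample_bytree
# supply the row and column randomization, and eta = 1 disables shrinkage.
md <- xgb.train(
  params = list(
    objective         = "binary:logistic",
    max_depth         = 20,    # as discussed below
    num_parallel_tree = 500,   # number of trees; illustrative
    subsample         = 0.632, # bootstrap-like row sampling
    colsample_bytree  = 0.5,   # column sampling per tree
    eta               = 1      # no shrinkage in RF mode
  ),
  data    = dtrain,
  nrounds = 1
)
```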

@tqchen (Author) commented May 14, 2015

Thanks @szilard, I think I can explain that behavior. It is a parameter-setting issue: max_depth=20 is actually too complex for this case.
There is a parameter min_child_weight that restricts the number of instances per node, and as an indirect consequence it limits how deep a tree can grow. By default it is set to 1.

So as the dataset grows from 1M to 10M rows, the tool grows deeper trees with more leaves, which costs more time per tree and introduces overfitting. If you set min_child_weight to 10 for the 10M case, I guess things will return to normal.
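
Concretely, the suggested fix is a one-line addition to the parameter list (a sketch; every value other than min_child_weight repeats the illustrative random-forest settings above):

```r
params <- list(
  objective         = "binary:logistic",
  max_depth         = 20,
  num_parallel_tree = 500,
  subsample         = 0.632,
  colsample_bytree  = 0.5,
  eta               = 1,
  # Require a minimum total instance weight (hessian) of 10 per leaf;
  # the default of 1 lets trees on 10M rows grow very deep.
  min_child_weight  = 10
)
```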

@szilard (Owner) commented May 14, 2015

That sounds like a plausible explanation. However, I re-ran it with min_child_weight = 10 for all data sizes (all other tools used the same settings across the datasets), and the AUC actually decreased rather than increased for n = 10M. The trend lines have similar shapes. Here are the results:

n (M)   time (s)   AUC
0.01       3       70.3
0.1       17       73.1
1        150       75.0
10      4600       76.2

For comparison, without min_child_weight = 10 (same info as in the repo's README):

n (M)   time (s)   AUC
0.01       3       69.8
0.1       20       73.2
1        170       75.3
10      4800       76.3

@tqchen (Author) commented May 14, 2015

Thanks for running this. min_child_weight does have an implicit impact on the running time, but apparently not as much as I expected. I am less worried about the AUC drop, since it could be due to the randomness introduced in RF construction.

@szilard (Owner) commented May 14, 2015

Well, I think both run times and AUC matter.

However, my main goal in this project is just to see which commonly used, out-of-the-box tools can train complex models (RF, boosting, etc.) on ~10 million observations, i.e. finish in decent time (speed), not crash (not run out of memory), and provide reasonable accuracy. By those criteria, xgboost for RF satisfies all of them :)

Nevertheless, the shape of the trend line for AUC, and especially for run time, is weird compared to the other tools' trend lines; see:
https://github.com/szilard/benchm-ml/raw/master/2-rf/x-plot-time.png
https://raw.githubusercontent.com/szilard/benchm-ml/master/2-rf/x-plot-auc.png

Now, we can just accept that that's how it is, or try to find out why. Your guess sounded perfectly plausible; unfortunately, it does not seem to be the case.

Also notice the high AUC with H2O (for n = 10M). That's a bit weird too: I would a priori expect a curve more like xgboost's. Strange that the two are so different.

Anyway, thanks @hetong007 and @tqchen for contributing. If you guys have more ideas for RF, I'm happy to include/re-run them. And hopefully I'll get to evaluating boosting soon.

@tqchen (Author) commented May 14, 2015

I totally agree with you that both run time and AUC matter. What I really meant is that there could be various reasons for the difference in AUC, for example differences in default parameter settings, variance in RF construction, etc. :)

@szilard (Owner) commented May 14, 2015

Yes. As I said before, this project is not Kaggle :) I primarily want to see which tools pass a basic sanity check on 10M rows. I did little or no tuning, though in some ways it would be interesting to see how well the various methods can be tuned. But I acknowledge this is just a toy dataset, so it may not be worth the effort.

@tqchen (Author) commented May 14, 2015

Thanks for the clarification! BTW, do you know of any other datasets of this type for benchmarking, for example with more columns and rows?

One thing I noticed about this dataset is that the output seems to depend heavily on one variable (when features are randomly dropped at a rate of 50%, a single tree can be very bad). This might make the benchmark a singular case in which the trees simply split repeatedly on one feature.

@szilard (Owner) commented May 14, 2015

@tqchen I moved your last question to a new issue: #11

@szilard closed this as completed May 16, 2015
@tqchen (Author) commented May 16, 2015

I now think the bump in running time was due to cache effects: xgboost performs some non-consecutive memory accesses, and a larger number of rows can mean a lower cache hit rate. The impact should not be large, though, since this comes down to micro-level optimization.

I have pushed some optimizations that add prefetching, which should generally improve xgboost's speed. It would be great if you could run another round of tests.

@szilard (Owner) commented May 16, 2015

Moved this topic into a new issue here #14
