Random Forest Performance #1435
Cost of sampling w/ replacement: the costs stem from a) fancy indexing and b) re-computing X_argsorted.
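Roughly, the per-tree bootstrap pattern that causes this looks like the following (a minimal sketch of the two costs, not the actual scikit-learn code):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100000, 50).astype(np.float32)

# draw a bootstrap sample for one tree
indices = rng.randint(0, X.shape[0], X.shape[0])

# a) fancy indexing materialises a full copy of the data for every tree
X_boot = X[indices]

# b) the presorted index array used by the splitter then has to be
#    recomputed on that copy, again once per tree
X_argsorted = np.asfortranarray(np.argsort(X_boot, axis=0).astype(np.int32))
```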
Wow, that is a huge overhead.
@amueller I too think that there's lots to be gained - regarding Brian's PR: what is the idea behind his enhancement? AFAIK he sorts features only when they are needed (i.e. they fall into a feature sub-sample)?
Yes, exactly. Thinking about it, maybe that doesn't help as much... Also, speeding up ExtraTrees is probably even easier: they don't need any sorting at all!
Personally, I don't expect high gains with lazy pre-sorting either. The second idea is something different; that's the way R's "randomForest" does it: it basically re-orders samples such that partitions are consecutive regions in the data / auxiliary arrays.
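A minimal sketch of that consecutive-partition idea (purely illustrative; `partition_node` is a hypothetical helper, not R's or scikit-learn's actual code):

```python
import numpy as np

def partition_node(X, samples, start, end, feature, threshold):
    """Reorder samples[start:end] in place so that rows going to the left
    child (X[i, feature] <= threshold) occupy samples[start:pos] and the
    rest occupy samples[pos:end]."""
    pos = start
    for i in range(start, end):
        if X[samples[i], feature] <= threshold:
            samples[pos], samples[i] = samples[i], samples[pos]
            pos += 1
    return pos  # boundary between left and right child

# usage sketch: each child node is then a contiguous slice of `samples`
rng = np.random.RandomState(0)
X = rng.rand(10, 3)
samples = np.arange(10)
split = partition_node(X, samples, 0, 10, feature=1, threshold=0.5)
left, right = samples[:split], samples[split:]
```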
I did a quick hack on top of @ndawe's PR #522 (sample weights for trees); here is the result on the MNIST benchmark:
That's a 2-fold increase in performance (and probably more in memory efficiency)! (The slight difference in accuracy might be due to numerical issues.) The branch is here: https://github.com/pprett/scikit-learn/tree/rf-tree-weights .
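For reference, the resampling-as-weights trick boils down to something like this sketch (assuming the tree builder accepts a sample_weight array, which is what #522 adds):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000

# draw the bootstrap sample once, but encode it as integer weights
# instead of materialising X[indices] for every tree
indices = rng.randint(0, n_samples, n_samples)
sample_weight = np.bincount(indices, minlength=n_samples).astype(np.float64)

# a sample drawn k times gets weight k, an out-of-bag sample gets weight 0,
# so every tree can be fit on the original, shared X
```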
Nice! That's quite good news, Peter. Maybe it's time to finally review all of this.
Btw, I think this is a duplicate of a previous issue by @bdholt1. We should have a look at the data mining literature on building trees fast.
I agree - the point is: we should add sample weights ASAP and get rid of the sampling w/ replacement overhead in RF - then we can tackle the tree building itself. @glouppe we could split the work on reviewing #522 - are you busy at the moment? Regarding the literature pointed out by @amueller: the literature on scalable decision tree induction is vast. We should definitely start to collect the most interesting approaches (github wiki page?) and do a reading group. I did some investigation of different software implementations (focus on GBRT) - you can find it here: https://docs.google.com/spreadsheet/ccc?key=0AlBhwRZOwyxRdGo1V3A0eHYtNTY5TDVIa29pYWVjd1E (still work in progress though)
@pprett I have been more and more busy lately :-) But this is on my todo list. I plan to review the code at the end of the week. Your help is more than welcome though! I also agree that we should add sample weights ASAP such that we can get rid (for free) of the (huge) sampling with replacement overhead.
Btw, can we close either this one or #964?
I got access to wakari.io where one can use WiseRF. WiseRF is indeed faster, but the gap is not as bad (i.e., not "5x to 100x faster") as the benchmarks mentioned above indicate. I turned off bootstrap to see where we should be when it'll be properly reimplemented using sample weights. This is nothing very scientific though. It is just one test.
Great news - thanks for investigating. Btw: do you know whether WiseRF supports instance subsampling? (I cannot find anything in the Anaconda docs.)
@glouppe I recently checked the difference in test-time performance (both batches and single data points). I'm testing with 10 features and 100 trees. First, WiseRF:
Cannot believe my eyes - 2.69 seconds for a single data point - are you kidding me?! Looks like there is a huge overhead involved; performance-wise it doesn't make a difference whether you predict 10000 examples or just one. Now sklearn:
That's better, but still - 5 ms for one data point and 100 trees is pretty slow. IMHO we could do better (fewer function calls, faster input checks).
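For anyone who wants to reproduce the per-call overhead, a rough benchmark sketch (not the original script; dataset and parameters are illustrative, timings will vary):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# one call per row: pays input validation + Python call overhead every time
t0 = time.time()
for row in X[:1000]:
    clf.predict(row.reshape(1, -1))
print("one-by-one (1000 rows): %.2fs" % (time.time() - t0))

# a single batched call amortises that overhead over all rows
t0 = time.time()
clf.predict(X[:1000])
print("batched    (1000 rows): %.2fs" % (time.time() - t0))
```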
@pprett From what I have been able to discover, they actually store the forest as a string (!). (See the
I am also glad to see that we are "up to 544x faster" at prediction time (sic) ;)
Adding the content of #1532 here: currently the tree fitting procedure tries all possible splits between unique values of each feature. TMVA [1] implements both this same procedure and a mode that histograms each feature with a fixed number of bins [2].
[1] http://tmva.sourceforge.net/
I really like the idea of introducing an n_cuts variable with a sensible default value.
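To make the proposal concrete, here is a sketch of equal-width binning with an `n_cuts` knob (hypothetical helper and parameter name, not anything that exists in scikit-learn):

```python
import numpy as np

def bin_feature(values, n_cuts=20):
    """Histogram-style split candidates: instead of trying a cut between
    every pair of unique values, only consider n_cuts equally spaced
    bin edges for this feature."""
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, n_cuts + 1)[1:-1]   # interior edges only
    binned = np.digitize(values, edges)             # bin index per sample
    return edges, binned

rng = np.random.RandomState(0)
feature = rng.rand(100000)
edges, binned = bin_feature(feature)
# the split search now loops over at most n_cuts - 1 candidate thresholds
```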
I was just wondering about the in-place sorting of X_argsorted that we discussed. That would mean we need a copy of X_argsorted per tree, and sharing the arrays across processes (as @ogrisel is working on) would not work any more, right?
Correct. :/
I agree.
But we probably still want that, right?
One copy of the whole data per sub-estimator? That does not seem like a reasonable approach to me. Unless you pre-allocate one temporary buffer per computational worker and reuse those buffers sequentially for each new sub-estimator fit on the individual workers. However, if the original dataset is barely fitting in RAM, then the
So maybe we want to make that an option? Or think harder about how to implement it. But I don't think there is a way to avoid allocating something dataset-sized per estimator and still be fast. Also, it would be good to know how large the sample mask that we currently use is.
sample_mask is small, O(n).
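For scale, a rough back-of-the-envelope comparison of the two footprints (numbers purely illustrative):

```python
n_samples, n_features = 1000000, 100

# a per-tree copy of X_argsorted (int32) would cost this much per worker
argsorted_copy = n_samples * n_features * 4   # bytes, ~400 MB
# the current sample_mask is one byte per sample
sample_mask = n_samples * 1                   # bytes, ~1 MB

print("%.0f MB vs %.0f MB" % (argsorted_copy / 1e6, sample_mask / 1e6))
```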
Are predictions supposed to be slow? rf.predict(one_observation_with_ten_vars) takes ~1 second; that is a long time with millions of rows to predict.
I think you'll find that there is some per-call overhead, so it should be faster with batch predictions.
@jtoy can you please post your RandomForest arguments?
After meditating over it a bit, I think it would be easiest if we first try to speed up the extra-trees. They don't need sorting and could work by just storing lists of sample points in each node, if I am not mistaken. For Random Forests, it looks like there is a non-trivial memory / speed tradeoff.
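A sketch of why no sorting is needed for extra-trees: each candidate split just draws a random threshold between the node's min and max for a feature. This is purely illustrative Python (the real splitter lives in the Cython tree code; binary labels are assumed here):

```python
import numpy as np

def extra_tree_split(X_node, y_node, rng):
    """Pick the best of one random threshold per candidate feature."""
    n_features = X_node.shape[1]
    best = (None, None, np.inf)  # (feature, threshold, impurity)
    for f in rng.permutation(n_features):
        lo, hi = X_node[:, f].min(), X_node[:, f].max()
        if lo == hi:
            continue
        threshold = rng.uniform(lo, hi)
        left = X_node[:, f] <= threshold
        # size-weighted Gini-like impurity of the two children
        impurity = sum(p.size * p.mean() * (1 - p.mean())
                       for p in (y_node[left], y_node[~left]) if p.size)
        if impurity < best[2]:
            best = (f, threshold, impurity)
    return best

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)
print(extra_tree_split(X, y, rng))
```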
I've got a test implementation of the binning working and it's definitely faster.
Another benefit of binning is that, since the cuts are only ever placed at the bin edges, the tree is somewhat less prone to overfitting.
Is this still an issue? I have previously worked around it by doing multiple predictions at a time, but I believe the problem is still there.
@jtoy which issue in particular are you referring to? There is still room for improvement in the forest speed ;)
The issue I have seen is running predictions in scikit vs R on the same datasets: scikit is several orders of magnitude slower.
Several orders of magnitude? That doesn't seem right. I guess it would depend a lot on the parameters and dataset. But I think there shouldn't be more than a factor of, say, 2 or 3, AFAIK. @glouppe?
@jtoy can you please elaborate on the benchmark - dataset size, model parameters, etc. - and how do you test: batch prediction or prediction of single data points? One issue that often bites users is the fact that our forest estimators use
Just to put a note: despite the fact that our implementation of RFs and GBRTs is much more optimized than it used to be, it still does not implement binning / approximate histograms for speeding up the best split search, as mentioned in @ndawe's comment: #1435 (comment). xgboost (another very fast open source implementation of RFs and GBRTs written in C++) apparently uses approximate feature histograms implemented with a custom quantile sketch data structure: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_histmaker-inl.hpp This might be the primary reason for the improved performance of xgboost (even in single-threaded mode).
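The idea behind quantile-based candidate splits can be illustrated with exact quantiles (a simplification only; xgboost's actual implementation uses a streaming weighted quantile sketch):

```python
import numpy as np

def quantile_candidates(values, n_bins=32):
    """Place candidate thresholds at (roughly) equally populated bin
    boundaries instead of between every pair of unique values."""
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.unique(np.percentile(values, qs))

rng = np.random.RandomState(0)
feature = rng.lognormal(size=100000)
thresholds = quantile_candidates(feature)
# the split search only evaluates len(thresholds) cuts per feature
```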
I met the author of xgboost some months ago and he told me that binned splits are used only in the non-distributed case. I agree, however, that we should look a bit more into this implementation in order to better understand the performance gap. In the case of boosting, one thing that I know is that the trees built with XGBoost are strictly different from ours, because of different loss functions and different impurity criteria. In particular, both include regularization terms which prevent complex (and deep) trees from being constructed; in addition to generalizing better, such trees may also be faster to construct.
Random Forest is a popular classification technique; recent benchmarks [1][2] have shown that the performance of sklearn's RandomForestClassifier is inferior to competing software implementations.
The performance penalty most likely stems from the underlying tree building procedure; however, changes here require considerable effort. These changes include:
Some low-hanging fruit may be found in the forest module itself:
[1] http://continuum.io/blog/wiserf-use-cases-and-benchmarks
[2] http://wise.io/wiserf.html