Enh/tree (performance optimised) #310
Conversation
Thanks a lot for the hard work. I think @glouppe will be interested in this branch.
    "num labels is %s and num features is %s "
    % (len(labels), len(features)))

sample_dims = np.array(xrange(n_dims))
this should be:
sample_dims = np.arange(n_dims)
Timing differences:
In [2]: %timeit np.array(xrange(m))
1000 loops, best of 3: 1.6 ms per loop
In [3]: %timeit np.arange(m)
100000 loops, best of 3: 12 us per loop
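The equivalence of the two constructions (and the reason for the speedup) is easy to check; a small sketch, using `range` since `xrange` only exists on Python 2:

```python
import numpy as np

n_dims = 1000

# np.array(range(n)) makes NumPy walk a Python-level sequence element by
# element, while np.arange(n) fills the buffer directly in C.
slow = np.array(range(n_dims))
fast = np.arange(n_dims)

# Same contents either way -- only the construction cost differs.
assert np.array_equal(slow, fast)
```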
Fixed. Thanks @pprett
While I've been looking through neighbors.py, I've noticed a lot of
to support use within a pipeline. Should tree.py also support this?
We were just about to deprecate that.
Thanks for this work and for taking the time to profile and benchmark it against competing implementations. This looks very promising. I will try to secure some time in the coming days to give this PR a deeper look.
I experienced the following broken doctest:
You should probably use
To fix the previous breakage you should pass a
I am about to do a first pass at code style cleanup. Expect a pull request soon.
Thanks Olivier! I've just added some more functionality (bringing this into line with the standard labels accepted for binary classification). You will probably need to merge.
Hi, I did not have time to come up with a new PR during my short plane trip as I first intended to. I'll let you know when it's ready.
@pprett: The script I used to do the benchmarking is at https://github.com/bdholt1/ml-benchmarks/blob/enh%2Ftree/benchmarks/bench_tree.py The training sets are madelon and arcene (using the ml-benchmarks framework).
That's great! This implementation might even reach the same ballpark speed as Orange, but with a higher score!
Following @pprett's lead, I moved
A significant speed increase:

(arcene) Done, 100 samples with 10000 features loaded into memory
milk: mean 68.21, std 0.31
Orange: mean 10.26, std 0.02

All the tests still pass. This is a 10x speed increase, making it the fastest of the 3 implementations, while still easy to read and modify.
thanks @satra
nicely done.
@bdholt1 yesterday evening I played around with the second enhancement I told you about (updating the pm counts [or means in regression] vs. computing them from scratch). It turned out that this causes a massive improvement in performance (about an order of magnitude or even more):

Madelon:
Arcene:

The code is here:

When you look at _tree.pyx you see new extension types at the top (
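The incremental-update idea described here can be sketched in pure Python (a toy reconstruction for illustration only, not the actual `_tree.pyx` code; `best_split_counts` is a made-up name): when scanning a sorted feature, each candidate threshold moves exactly one sample from the right partition to the left, so the class counts can be updated in O(1) per step instead of recounted in O(n).

```python
import numpy as np

def best_split_counts(feature, labels, n_classes):
    """Scan candidate thresholds along one feature, maintaining left/right
    class counts incrementally (O(1) per step) instead of recounting them
    from scratch (O(n) per step) at every candidate threshold."""
    order = np.argsort(feature)
    f, y = feature[order], labels[order]

    left = np.zeros(n_classes, dtype=np.intp)
    right = np.bincount(y, minlength=n_classes).astype(np.intp)

    splits = []
    for i in range(len(y) - 1):
        left[y[i]] += 1        # sample i crosses from right to left
        right[y[i]] -= 1
        if f[i] != f[i + 1]:   # only split between distinct feature values
            splits.append(((f[i] + f[i + 1]) / 2.0, left.copy(), right.copy()))
    return splits

feature = np.array([0.1, 0.2, 0.8, 0.9])
labels = np.array([0, 0, 1, 1])
for threshold, l, r in best_split_counts(feature, labels, 2):
    print(threshold, l, r)
```

For this toy input the middle threshold (0.5) cleanly separates the two classes, which the maintained counts reflect without any rescanning.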
+1! That's a significant improvement! I suspect it was this very idea that led to the gain (but clearly not as much) in my previous attempt. The latest version is also quite a bit more readable. Given this major gain, I think it warrants some serious attention, so I'll try to work this into my implementation or just use yours straight.
What strikes me is the bad performance of the tree on the Arcene dataset compared to RBF SVMs (score about 0.73).
I'm not too surprised about the performance. Technically a decision tree is more of a non-parametric classifier than a non-linear one - it works by greedily partitioning the feature space perpendicular to the dimensions, which is quite different from casting the data into a higher-dimensional space and finding a separating hyperplane... The good news is that Orange's performance is in the same ballpark, and I've worked through the algorithm quite thoroughly so I'm confident it's doing it correctly. Hopefully we will see a significant performance gain when it comes to using these trees in a forest. Forests will compete with the best classifiers out there.
Where can I retrieve the Arcene dataset?
http://archive.ics.uci.edu/ml/datasets/Arcene
I use ml-benchmarks (which comes with the Arcene and Madelon UCI datasets). The benchmarking script I use for this code is https://github.com/bdholt1/ml-benchmarks/blob/enh%2Ftree/benchmarks/bench_tree.py which I hope will be accepted (scikit-learn/ml-benchmarks#6) when this PR is accepted.
@bdholt1: does the error of a non-terminal node have to be smaller than the error of the parent node? I cannot find anything in ESLII (page 308, 1st paragraph discusses something related). But if I turn the check off, accuracy on Arcene goes up to 0.7 (from 0.58).
@pprett Well, it depends how you define error. In the way it is used in your code (which I'm pilfering as we speak), it's the relative error (i.e. the amount of entropy in that node). What you would expect is that the overall entropy decreases as we go down the tree. In my code, I first compute the error at the node when all samples are on the right, and then test against that.
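As a toy illustration of the entropy "error" being discussed (a sketch, not the PR's code): a split is worth keeping when the weighted child entropy drops below the parent's.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy of a class-count vector (the node 'error' above)."""
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

parent = np.array([4, 4])                         # maximally impure node
left, right = np.array([3, 0]), np.array([1, 4])  # candidate children

n = parent.sum()
weighted_child = (left.sum() / n) * entropy(left) + (right.sum() / n) * entropy(right)

# A useful split lowers the (weighted) entropy relative to the parent.
assert weighted_child < entropy(parent)
```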
@pprett: is it any slower if you turn it off?
A bit (factor 2). BTW: if we set min_split to 5 (AFAIK recommended in ESLII) then the performance goes up too.
@pprett: I've taken your code and done a bit to make it easier to understand/modify and made sure it passes the unit tests (although the regression error on boston is still higher than the current version on this branch).

min_split = 1 Arcene:
min_split = 5 Arcene:

I've now got three versions (min_split = 1):

Since they all pass the unit tests (and I've gone through the internal workings fairly thoroughly), I'm going to focus on the version that uses sample masks and get that into the branch.
Great - thanks for your work! BTW: for the arcene results with 0.7 acc I turned off the error check (simply pass
These results are with
@bdholt1: thanks! BTW: does any of you know how one usually grows a tree with exactly J terminal nodes (== leaves)? In ESLII they use such trees for boosting but they don't describe exactly how they are grown; does one always expand the node with the lowest error, or the one with the highest decrease in error...
@pprett: my guess is that they use a pruning method after the tree has been grown to full size. That way, you know how many leaves there are and you can prune until you only have J of them.
Ok +1 for merge.
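Another common strategy is best-first growth: keep a priority queue of leaves keyed by the error decrease of their best split, and always expand the top one until exactly J leaves exist. A minimal sketch with entirely synthetic gain numbers (`grow_to_j_leaves` and its inputs are made up for illustration):

```python
import heapq

def grow_to_j_leaves(root_gain, child_gains, J):
    """Best-first growth: repeatedly split the leaf whose best split yields
    the largest error decrease, stopping once exactly J leaves exist.
    child_gains maps a node id to the gains of the two leaves its split creates."""
    heap = [(-root_gain, 0)]   # max-heap via negated gains; root has id 0
    next_id, n_leaves = 1, 1
    while n_leaves < J and heap:
        _, node = heapq.heappop(heap)  # most promising leaf
        for gain in child_gains.get(node, (0.0, 0.0)):
            heapq.heappush(heap, (-gain, next_id))
            next_id += 1
        n_leaves += 1                  # splitting removes one leaf, adds two
    return n_leaves
```

For example, `grow_to_j_leaves(0.5, {0: (0.3, 0.2)}, 3)` splits the root, then the better of its two children, and stops at 3 leaves.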
Actually the doc has a couple of links in markdown syntax that need to be translated to the reST syntax. Also the complexity of a balanced tree is stated as
This piece is the introduction where the basic idea of the tree is explained. Yes there is a dependency on
If after reading that section again you think it should be re-worded, I'd be happy to do so.
Yeah so the complexity should be stated as
I am +1 for this branch to be merged. Much work has already been done and we have now reached a pretty decent implementation of decision trees, both fairly efficient and well documented.
@bdholt1 I will send you a PR with fixes on the links for the doc, merge from current master and some pep8.
Merge ogrisel's doc and pep8 commits
Shouldn't the classes be called
I think we have to be careful because CART is a trademark.
I prefer to keep it
I am +0.
If CART is a trademark, then maybe we shouldn't use it. I'm no expert on trademark law, though; would it be illegal to use that name, or just to advertise with it?
I don't want to deal with trademark stuff, I don't like acronyms in class names, and the current implementation seems generic enough, hence I like the generic name. So +1 for deferring class-renaming discussions to the day we actually have some alternative implementation that makes sense to contribute to the scikit. Any further comments before the merge?
I'd like to review and merge tonight. I can dedicate a long sleepless
I'd appreciate it, unfortunately I won't be online tonight though. Gael, perhaps it would be best if you copy the branch and submit a PR, or if you're happy then merge into master and then make your own changes.
I was planning to do the merge myself tonight, and to make any changes I
Great!
Midnight local time. Just finished my other duties (namely reformatting
At a first glance, the test time of the scikit on my box went from 80s
Noise! I hadn't done the merge, just a checkout, and thus didn't benefit
I have merged this pull request. I haven't really changed much. I am too tired to think. Besides, I had already had a close look before. IMHO, there are no low-hanging fruits. I still don't like the pointer arithmetic, but I guess that I will have to live with it.
\o/ congrats to everybody involved, this is a great contrib for the project. I find it really amazing that so many people dived in to review, test, refactor and optimize this code.
This is awesome, since it opens up the doors to random forests. Any comparisons with R's randomForest? (Or am I jumping the gun here?)
Nope, but if you would like to script one, for instance using RPy2, please feel free to submit a pull request to https://github.com/scikit-learn/ml-benchmarks
@bdholt, pprett, louppe: thank you very much for taking what was a crude, initial stab at this and turning it into a nicely optimized piece of code. Great job! @scikits-learn community: for showing what nice distributed coding can pull off. @yang: random forests and other ensemble techniques are next in the queue to build on top of this.
Good job to everyone who contributed!
Thanks everyone! I am already using this implementation in my research. (And I am keen to try random forests too!)
As @satra commented in #288 (comment), here is a separate PR for a decision tree. This version is significantly faster than the alternatives (Orange and MILK). Orange and scikits.learn support multiclass classification and regression, while Milk only supports classification.
We would now welcome any comments with the aim of gaining acceptance for a merge into master.
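For readers trying the branch out, fitting follows the usual estimator conventions. This sketch uses the module path from present-day scikit-learn (`sklearn.tree`), which differs from the `scikits.learn` path used at the time of this PR:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a test split of the iris data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a decision tree and evaluate its held-out accuracy.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print("accuracy: %.2f" % clf.score(X_test, y_test))
```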
Performance and scores
$python bench_tree.py madelon
Tree benchmarks
Loading data ...
Done, 2000 samples with 500 features loaded into memory
scikits.learn (initial): mean 84.23, std 0.62
Score: 0.76
scikits.learn (now): mean 0.65, std 0.00
Score: 0.76
milk: mean 115.31, std 1.57
Score: 0.75
Orange: mean 25.82, std 0.02
Score: 0.50
$python bench_tree.py arcene
Tree benchmarks
Loading data ...
Done, 100 samples with 10000 features loaded into memory
scikits.learn (initial): mean 40.95, std 0.44
Score: 0.60
scikits.learn (now): mean 0.20, std 0.00
Score: 0.60
milk: mean 71.00, std 0.60
Score: 0.60
Orange: mean 10.78, std 0.20
Score: 0.51
TODO before merge
- increase test coverage to over 95%
- finish the documentation (fix broken example and plot links, add practical usage tips)
- demonstrate how to use a graphviz output in an example
- include a static graphviz output for the iris and boston datasets in the documentation
- add feature_names to GraphvizExporter
- extract the graphviz exporter code out of the tree classes (use visitor pattern), assign node numbers (not mem addresses)
- s/dimension/feature/g
- add a test for the pickling of a fitted tree
- cythonise prediction
- explain in the documentation and in the docstrings how these classes relate to ID3, C4.5 and CART

Future enhancements

- rpart
- mvpart
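The graphviz-exporter items above eventually surfaced in released scikit-learn as `sklearn.tree.export_graphviz`; a sketch with that modern API (which postdates this PR):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# feature_names corresponds to the "add feature_names" item above;
# out_file=None returns the dot source as a string.
dot = export_graphviz(clf, out_file=None, feature_names=iris.feature_names)
assert dot.startswith("digraph")
```

The resulting dot string can be rendered with the graphviz `dot` tool or saved via the `out_file` argument.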