Sometimes growforest runs for a long time in the last few trees #50

Closed
vdemario opened this issue Jun 9, 2015 · 2 comments

Comments

@vdemario
Contributor

vdemario commented Jun 9, 2015

I've noticed more than once that growforest tends to output the first trees relatively quickly and then slows down at the end, when there are around 5 or 6 trees left (out of 100).

What I believe is happening is that the recursion sometimes keeps going for a really long time regardless of the depth. I haven't seen it go into an infinite loop or a stack overflow, but I suppose that's possible if my interpretation is correct.

At one point last year I remember seeing this, and I made a change to my local copy that broke out of the recursion when the depth reached some high number that almost never happened (100 thousand or 1 million, I can't remember). It worked, even though it was very ugly. Applyforest was happy with the generated .sf file; nothing seemed to be wrong.
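
Roughly, the hack looked something like this (a minimal sketch from memory; growNode, split, and depthLimit are illustrative names, not CloudForest's actual internals):

```go
package main

// Illustrative sketch only: a hard depth guard bolted onto a recursive
// tree grower. Names and structure are hypothetical, not CloudForest's
// actual code.

const depthLimit = 1000000 // "some high number that almost never happened"

// growNode recursively splits cases until a stopping rule fires or the
// depth guard trips.
func growNode(cases []int, depth int, split func([]int) (left, right []int, ok bool)) {
	if depth > depthLimit {
		return // ugly, but it stops runaway recursion
	}
	left, right, ok := split(cases)
	if !ok || len(left) == 0 || len(right) == 0 {
		return // no useful split found; this node becomes a leaf
	}
	growNode(left, depth+1, split)
	growNode(right, depth+1, split)
}

func main() {
	// Toy usage: a splitter that never finds a split, so recursion stops
	// at the first node.
	growNode([]int{1, 2, 3}, 0, func(c []int) ([]int, []int, bool) {
		return nil, nil, false
	})
}
```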

This time around I'd like to understand better what's happening, to see if there is a better solution. I've only been experimenting with combinations of -oob, -progress and -vet so far, so there might already be flags to help with this; I'm not sure.

@ryanbressler
Owner

I have code to add a max depth parameter that I'll push soon as part of an overhaul to boosting (which is often done with "stumps" or other simple trees).

I've definitely noticed some straggler issues like the ones you're describing; there may be a few things going on.

A lot of it is because parallelism is done at the per-tree level, so as the number of trees left drops below the number of cores in use, the rate at which trees finish drops off. Moving to parallelism per feature evaluated in split searching would speed this up and allow parallel boosting, but it would require some sort of task queue, so I haven't done it (yet).
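
To illustrate (a toy sketch, not the actual growforest code), per-tree parallelism amounts to handing whole trees to a fixed pool of workers, so the last few slow trees leave most cores idle:

```go
package main

import "sync"

// Toy sketch of per-tree parallelism: each worker pulls whole trees off
// a channel, so once fewer trees remain than workers, the slowest trees
// dominate wall-clock time. Names here are illustrative only.
func growTrees(nTrees, nWorkers int, growTree func(i int)) {
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				growTree(i) // one whole tree per task; a straggler ties up its core
			}
		}()
	}
	for i := 0; i < nTrees; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
}

func main() {
	growTrees(100, 4, func(i int) { /* grow tree i */ })
}
```

A per-feature task queue would instead let all cores keep working on whichever trees are still unfinished.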

I usually use a relatively large leaf size to limit model complexity (tree depth can be used for the same thing, though the effect is slightly different), as this both combats overfitting and results in faster training. The default settings are probably best for data sets that are small by modern standards. Smaller values of mTry will also limit tree depth, since a tree stops growing when it can't find a good split.
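
As a rough example of what I mean (flag names here are from memory, so check growforest -h for the exact spelling and defaults):

```
growforest -train train.fm -target B:MyTarget -rfpred forest.sf \
  -nTrees 100 -leafSize 50 -mTry 10
```

The right -leafSize and -mTry values depend on the data set; larger leaves and smaller mTry both tend to produce shallower trees.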

Definitely let me know if you run across a case where you believe tree growth should stop but it isn't.

@vdemario
Contributor Author

vdemario commented Mar 9, 2016

-maxDepth has been on master since September, so I'm going to close this issue. Thanks.

vdemario closed this as completed Mar 9, 2016