
Memory Issue with Gradient Forest for Large Data Set (sample.fraction) #137

Closed

jasonhuang99 opened this issue Nov 11, 2017 · 2 comments

jasonhuang99 commented Nov 11, 2017

I have been applying causal forests to data sets with tens of millions of observations, and I've been running into memory issues. When building a forest with only 1,000 trees (fewer than the recommended number), the whole forest takes more than 200 GB of memory, even though the raw data is only 2 GB. I believe the reason is that each tree stores a copy of a fraction of the data (in both the "nodes" and "oob_samples" attributes).
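For reference, here is a minimal sketch of the kind of thing I'm measuring. The data here is synthetic and scaled down so it runs quickly; the real problem is at tens of millions of rows:

```r
library(grf)

# Synthetic stand-in for my data, scaled down for illustration.
n <- 1e5
p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] * W + rnorm(n)

forest <- causal_forest(X, Y, W, num.trees = 1000)

# The fitted forest is far larger than the raw data, since each tree keeps
# its own node structure and out-of-bag sample indices.
print(object.size(X), units = "MB")
print(object.size(forest), units = "MB")
```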

One strategy I tried for reducing the memory footprint was to use a small "sample.fraction" parameter. However, when I set a low value, the memory size of an individual tree actually went up. I believe this is because the size of tree$oob_samples grew, which makes sense: a smaller training fraction leaves more observations out of bag. Is there a way to avoid saving the out-of-bag observations for each tree?
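To make the arithmetic concrete, here is a back-of-the-envelope sketch (assuming each OOB index costs roughly 4 bytes; the exact representation inside grf may differ):

```r
# With n observations and sample.fraction s, each tree trains on s * n
# samples and records the remaining (1 - s) * n as out-of-bag, so a
# smaller sample.fraction means a longer oob_samples vector per tree.
n <- 1e7  # ten million observations

oob_mb_per_tree <- function(s) 4 * (1 - s) * n / 1e6  # ~4 bytes per index

oob_mb_per_tree(0.50)  # ~20 MB of OOB indices per tree
oob_mb_per_tree(0.05)  # ~38 MB per tree: lowering s made each tree bigger
# Across 1,000 trees, the OOB indices alone run to tens of gigabytes.
```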

jtibshirani (Member) commented Nov 27, 2017

Yes, this is a very reasonable suggestion. We're currently planning to add an 'optimized' mode for the prediction-only case in #122, which will avoid storing the OOB sample information. I've also filed #145, which suggests we avoid storing OOB samples altogether.

jtibshirani (Member) commented Nov 27, 2017

Closing in favor of #122 and #145.
