[MRG+2] Fix excessive memory usage in random forest prediction #8672

Merged: 5 commits merged into scikit-learn:master on Apr 3, 2017
Conversation

@mikebenfield (Contributor) commented Mar 30, 2017

Fix #8244.

That issue was about random forests using excessive memory during prediction. The problem was that the forest held onto each prediction from each estimator, only combining them at the end. It was suggested in the thread to solve this by parallelizing over the instances instead of the estimators. Instead, I still parallelize over the trees, but keep an array out into which each estimator's output is immediately summed. Python's GIL makes this thread safe.
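A minimal self-contained sketch of the idea (simplified; the helper mirrors the accumulate_prediction function the PR converges on below, and the toy forest of DecisionTreeRegressor objects stands in for the actual forest.py code):

    import numpy as np
    from joblib import Parallel, delayed  # vendored as sklearn.externals.joblib at the time
    from sklearn.tree import DecisionTreeRegressor

    def accumulate_prediction(predict, X, out):
        # `out` is a list of pre-allocated arrays, one per output. The in-place
        # += on a NumPy array runs while holding the GIL, so threads cannot
        # interleave partial updates; no per-estimator prediction is retained.
        prediction = predict(X, check_input=False)
        if len(out) == 1:
            out[0] += prediction
        else:
            for i in range(len(out)):
                out[i] += prediction[i]

    # Toy forest: a few trees fit on random data.
    rng = np.random.RandomState(0)
    X, y = rng.rand(100, 4), rng.rand(100)
    trees = [DecisionTreeRegressor(max_depth=3, random_state=i).fit(X, y)
             for i in range(10)]

    y_hat = np.zeros(X.shape[0], dtype=np.float64)  # one shared buffer
    Parallel(n_jobs=2, backend="threading")(
        delayed(accumulate_prediction)(t.predict, X, [y_hat]) for t in trees)
    y_hat /= len(trees)  # average the accumulated sums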

My testing indicates runtime performance is the same, but memory usage is substantially decreased. See this gist. Running `python prob_classification.py n` gives a memory increment (around line 587 of forest.py), in MiB, of:

n     master   fix2-forest-memory
100   13.3     1.6
200   25.5     2.0
300   38.1     2.2
400   50.1     2.3
500   62.6     2.4
600   75.7     2.2
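The gist itself is not reproduced here; a hypothetical reconstruction of such a benchmark, using memory_profiler (whose per-line "Increment" column reports MiB), might look like:

    import sys
    import numpy as np
    from memory_profiler import profile  # pip install memory_profiler
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical stand-in for prob_classification.py: n_estimators from argv.
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    rng = np.random.RandomState(0)
    X = rng.rand(10000, 20)
    y = rng.randint(0, 2, size=10000)
    clf = RandomForestClassifier(n_estimators=n, n_jobs=2).fit(X, y)

    @profile
    def run():
        # memory_profiler prints the MiB increment attributable to this line
        clf.predict_proba(X)

    run()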

I wasn't sure where you'd like the functions _run_estimator and _run_estimator2 to live. Let me know if I should move them.

@glemaitre (Member)

@mikebenfield Could you add the predict runtime of master and your PR for different sample sizes [1000, 10000, 100000, 1000000]?

@mikebenfield (Contributor, Author) commented Mar 31, 2017

@glemaitre Using the same script I posted before, with n_estimators=200, I get this (in seconds, on my little 2-core laptop):

n_samples   master   fix2-forest-memory
100          0.108    0.107
1000         0.107    0.107
10000        0.113    0.103
100000       0.990    0.932
1000000     16.175   14.363
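For reference, a rough way to reproduce such timings (assumed setup, not the original gist script):

    import time
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X_train = rng.rand(1000, 20)
    y_train = rng.randint(0, 2, size=1000)
    clf = RandomForestClassifier(n_estimators=200, n_jobs=2).fit(X_train, y_train)

    for n_samples in (100, 1000, 10000, 100000):
        X_test = rng.rand(n_samples, 20)
        tic = time.time()
        clf.predict(X_test)
        print("%8d samples: %.3f s" % (n_samples, time.time() - tic))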

    if self.n_outputs_ == 1:
        for j in range(1, len(all_proba)):
            proba += all_proba[j]
    out = np.zeros((X.shape[0], self.n_classes_), dtype=np.float64)
Review comment (Member):

I would keep the names proba and all_proba, rather than out.

@glemaitre (Member)

@ogrisel wdyt of this solution?

@jnothman (Member) left a review:

This looks like a very good thing. Thanks!

    out += f(X, check_input=False)


def _run_estimator2(f, X, out):
Review comment (Member):

Call this accumulate_prediction_multioutput, though I'd be interested in merging the single-output case in, to avoid lots of duplicated logic.

# They would be defined locally in ForestClassifier or ForestRegressor, but
# joblib complains that it cannot pickle them when placed there.

def _run_estimator(f, X, out):
Review comment (Member):

Call this accumulate_prediction. Please rename f to func or predict.

    for k in range(self.n_outputs_):
        proba[k] += all_proba[j][k]
    all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
                 for j in self.n_classes_]
Review comment (Member):

If you wrap self.n_classes_ in np.atleast_1d, then you can generalise this to the single-output case quite easily.
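Illustration of the suggestion: np.atleast_1d turns the scalar n_classes_ of the single-output case into a length-1 array, so one allocation loop covers both cases (toy values, not the forest.py code):

    import numpy as np

    n_samples = 10
    for n_classes_ in (3, [3, 2]):  # single-output forest vs. two-output forest
        all_proba = [np.zeros((n_samples, j), dtype=np.float64)
                     for j in np.atleast_1d(n_classes_)]
        print([p.shape for p in all_proba])  # [(10, 3)] then [(10, 3), (10, 2)]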

    # Parallel loop
    all_y_hat = Parallel(n_jobs=n_jobs, verbose=self.verbose,
                         backend="threading")(
        delayed(parallel_helper)(e, 'predict', X, check_input=False)
Review comment (Member):

Given that this now works without parallel_helper due to the threading backend, I suspect we can remove all uses of parallel_helper in random forests. Another PR is welcome.
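For context, parallel_helper looked up a method by name to work around bound methods not being picklable under the multiprocessing backend; with the threading backend nothing is pickled, so a bound method can be handed to delayed directly. A toy illustration (not the forest.py code):

    from joblib import Parallel, delayed  # sklearn.externals.joblib at the time

    class Estimator:
        def predict(self, X):
            return [2 * x for x in X]

    estimators = [Estimator(), Estimator()]
    # Bound methods are fine under the threading backend (no pickling involved):
    out = Parallel(n_jobs=2, backend="threading")(
        delayed(e.predict)([1, 2, 3]) for e in estimators)
    print(out)  # [[2, 4, 6], [2, 4, 6]]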


def accumulate_prediction(predict, X, out):
    prediction = predict(X, check_input=False)
    if len(out) == 1:
Review comment (Member):

This isn't the right condition, I think. Isn't it examining a 2-D array whose length is the number of samples? Rather, you can test for the presence of the shape attribute to distinguish an array from a list.
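The concern, illustrated: if out were a bare 2-D array rather than a list, len(out) would count samples, not outputs:

    import numpy as np

    out = np.zeros((5, 3))           # 5 samples, 3 classes
    print(len(out))                  # 5: len() of a 2-D array is its first axis
    print(hasattr(out, 'shape'))     # True for an ndarray, False for a plain list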

Review comment (Member):

The fact that the tests are passing makes me think I've missed something, or that the tests are too weak...

@mikebenfield (Contributor, Author) commented Apr 2, 2017:

In this revised function, out is always a list. It contains only one element in the single output case. The function could have either of these two interfaces:

  1. accept either a list of arrays or a single array, or
  2. only accept a list of arrays.

I figured (2) was simpler, but I can change to (1) if desired.

Review comment (Member):

I don't have a strong opinion on either solution.
However, using isinstance() is straightforward given the aim of the if statement.
With your solution, I think you would need a comment preceding the if statement for the sake of clarity.
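A sketch of what interface (1) with the suggested isinstance() test could look like (hypothetical; the merged code keeps interface (2) and always passes a list):

    import numpy as np

    def accumulate_prediction(predict, X, out):
        prediction = predict(X, check_input=False)
        if isinstance(out, np.ndarray):      # interface (1): a single shared array
            out += prediction
        else:                                # a list of arrays, one per output
            for i in range(len(out)):
                out[i] += prediction[i]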

@jnothman (Member) commented Apr 2, 2017 via email

@jnothman changed the title from [MRG] Fix excessive memory usage in random forest prediction to [MRG+1] Fix excessive memory usage in random forest prediction on Apr 2, 2017
@jnothman (Member) commented Apr 3, 2017 via email

-            backend="threading")(
-        delayed(parallel_helper)(e, 'predict', X, check_input=False)
+        Parallel(n_jobs=n_jobs, verbose=self.verbose, backend="threading")(
+            delayed(accumulate_prediction)(e.predict, X, [y_hat])
Review comment (Member):

I'm a bit confused about the [y_hat] aspect. It seems to indicate that, regardless of whether y_hat is multi-output or not, accumulate_prediction should treat it as single output. Can you clarify what's happening here?
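Judging from the merged code and the follow-up below, the explanation is that this line is in the regressor, where y_hat is one array even in the multi-output case (shape (n_samples, n_outputs)), so a one-element list always suffices; a small demonstration:

    import numpy as np

    y_hat = np.zeros((4, 2))       # 4 samples, 2 regression outputs, one array
    prediction = np.ones((4, 2))   # one tree's prediction, same shape
    out = [y_hat]                  # the [y_hat] wrapper from the diff above
    out[0] += prediction           # accumulates every output elementwise at once
    print(y_hat)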

@jmschrei (Member) commented Apr 3, 2017

Thanks for this! It looks great, except for one small confusion I had. Otherwise it LGTM.

@jnothman (Member) commented Apr 3, 2017 via email

@jmschrei (Member) commented Apr 3, 2017

Ah, got it; I didn't see that it was in RFRegressor.

@jmschrei changed the title from [MRG+1] Fix excessive memory usage in random forest prediction to [MRG+2] Fix excessive memory usage in random forest prediction on Apr 3, 2017
@jmschrei merged commit 9be0922 into scikit-learn:master on Apr 3, 2017
@jmschrei (Member) commented Apr 3, 2017

Thanks for the contribution!

@constantinpape

Cool, this will make the random forest useful for predicting on larger data again.
Is there an upcoming scikit-learn release, so this could be used in a portable manner (i.e., without having to build from master)?

@jnothman (Member) commented Apr 3, 2017 via email

@GaelVaroquaux (Member)

That's very nice work, @mikebenfield ! Thanks!

Commits referencing this pull request ("[MRG+2] Fix excessive memory usage in random forest prediction (scikit-learn#8672)") were later pushed to several forks:

massich/scikit-learn (Apr 26, 2017), Sundrique/scikit-learn (Jun 14, 2017), NelleV/scikit-learn (Aug 11, 2017), paulha/scikit-learn (Aug 19, 2017), maskani-moh/scikit-learn (Nov 15, 2017), jwjohnson314/scikit-learn (Dec 18, 2017)
Closes: Random Forest: Memory increases linearly with n_estimators (#8244)