
[MRG+2] Fix excessive memory usage in random forest prediction #8672

Merged: 5 commits into scikit-learn:master on Apr 3, 2017

Conversation

@mikebenfield (Contributor) commented Mar 30, 2017:

Fix #8244.

That issue was about random forests using excessive memory during prediction. The problem was that the forest held onto each estimator's prediction, only to combine them at the end. It was suggested in the thread to solve this by parallelizing over the samples instead of the estimators. Instead, I still parallelize over the trees, but keep an output array out into which I immediately sum each tree's prediction. Python's GIL makes this thread safe.
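The accumulate-in-place pattern described above can be sketched as follows, using the standard library's ThreadPoolExecutor in place of joblib's threading backend; make_predict and all the shapes here are illustrative stand-ins, not scikit-learn code:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def make_predict(value):
    # Hypothetical stand-in for a fitted tree's predict_proba:
    # returns an (n_samples, n_classes) array filled with `value`.
    def predict(X):
        return np.full((X.shape[0], 2), value, dtype=np.float64)
    return predict

def accumulate(predict, X, out):
    # Sum into the shared buffer instead of returning the array, so only
    # one per-tree result needs to be alive at a time. Under CPython's
    # GIL, the in-place += on a NumPy array does not interleave with
    # another thread's += for this workload.
    out += predict(X)

X = np.zeros((4, 3))
predicts = [make_predict(v) for v in (1.0, 2.0, 3.0)]
out = np.zeros((X.shape[0], 2), dtype=np.float64)
with ThreadPoolExecutor(max_workers=2) as pool:
    for p in predicts:
        pool.submit(accumulate, p, X, out)
out /= len(predicts)  # average over trees, as predict_proba does
```

The design choice is that each worker mutates a shared buffer rather than returning a fresh array, which is exactly what keeps the peak memory independent of the number of samples times the number of trees.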

My testing indicates runtime performance is the same, but memory usage is substantially decreased. See this gist. Running python prob_classification.py n gives a memory increment (around line 587 of forest.py), in MiB, of:

    n      master    fix2-forest-memory
    100      13.3       1.6
    200      25.5       2.0
    300      38.1       2.2
    400      50.1       2.3
    500      62.6       2.4
    600      75.7       2.2
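The gist referenced above uses memory_profiler; as a rough, self-contained illustration of why the peak drops, the standard library's tracemalloc can compare holding every per-tree array against summing immediately (the array sizes and tree count here are arbitrary):

```python
import numpy as np
import tracemalloc

n_trees, shape = 20, (10_000, 10)

# Old approach: hold all per-tree predictions, combine at the end.
tracemalloc.start()
all_preds = [np.ones(shape) for _ in range(n_trees)]
total_old = sum(all_preds)
peak_old = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

# New approach: sum each prediction into `out` as soon as it is ready,
# so at most one per-tree array is alive alongside the accumulator.
tracemalloc.start()
out = np.zeros(shape)
for _ in range(n_trees):
    out += np.ones(shape)
peak_new = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()
```

With these numbers, peak_old scales with n_trees while peak_new stays at roughly the size of the accumulator plus one prediction, mirroring the table above.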

I wasn't sure where you'd like the functions _run_estimator and _run_estimator2 placed. Let me know if I should move them.

@glemaitre (Contributor) commented Mar 31, 2017:

@mikebenfield Could you add the predict runtimes of master and of your PR for different sample sizes [1000, 10000, 100000, 1000000]?

@mikebenfield (Author, Contributor) commented Mar 31, 2017:

@glemaitre Using the same script I posted before, with n_estimators=200, I get this (in seconds, on my little 2-core laptop):

    n_samples    master    fix2-forest-memory
    100           0.108     0.107
    1000          0.107     0.107
    10000         0.113     0.103
    100000        0.990     0.932
    1000000      16.175    14.363
    if self.n_outputs_ == 1:
    -    for j in range(1, len(all_proba)):
    -        proba += all_proba[j]
    +    out = np.zeros((X.shape[0], self.n_classes_), dtype=np.float64)

glemaitre (Contributor) commented Apr 1, 2017:
I would keep proba and all_proba instead of out regarding the naming.

@glemaitre (Contributor) commented Apr 1, 2017:

@ogrisel wdyt of this solution?

@jnothman (Member) left a comment:

This looks like a very good thing. Thanks!

        out += f(X, check_input=False)


    def _run_estimator2(f, X, out):

jnothman (Member) commented Apr 2, 2017:
Call this accumulate_prediction_multioutput, though I'd be interested in merging in the single-output case to avoid lots of duplicated logic.

# they would be defined locally in ForestClassifier or ForestRegressor, but
# joblib complains that it cannot Pickle them when placed there.

def _run_estimator(f, X, out):

jnothman (Member) commented Apr 2, 2017:
Call this accumulate_prediction. Please rename f to func or predict.

    -    for k in range(self.n_outputs_):
    -        proba[k] += all_proba[j][k]
    +    all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
    +                 for j in self.n_classes_]

jnothman (Member) commented Apr 2, 2017:
If you do np.atleast_1d around self.n_classes_ then you can generalise this to the single-output case quite easily.
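The suggestion works because np.atleast_1d turns the single-output integer into a length-1 sequence, so one comprehension covers both cases. A small sketch (the values and the make_buffers helper are illustrative):

```python
import numpy as np

n_samples = 5

def make_buffers(n_classes_):
    # n_classes_ is a plain int for single output and a list for
    # multi-output; np.atleast_1d makes both iterable the same way.
    return [np.zeros((n_samples, j), dtype=np.float64)
            for j in np.atleast_1d(n_classes_)]

single = make_buffers(4)      # one (5, 4) buffer
multi = make_buffers([3, 2])  # a (5, 3) and a (5, 2) buffer
```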

        # Parallel loop
        all_y_hat = Parallel(n_jobs=n_jobs, verbose=self.verbose,
                             backend="threading")(
            delayed(parallel_helper)(e, 'predict', X, check_input=False)

jnothman (Member) commented Apr 2, 2017:
Given that this now works without parallel_helper due to the threading backend, I suspect we can remove all uses of parallel_helper in random forests. Another PR is welcome.


    def accumulate_prediction(predict, X, out):
        prediction = predict(X, check_input=False)
        if len(out) == 1:

jnothman (Member) commented Apr 2, 2017:
This isn't the right condition, I think. Isn't it examining a 2-d array whose length is the number of samples? Rather, you can test for the presence of the attribute shape to distinguish an array from a list.
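The distinction being raised here can be seen directly: len() of a 2-d array counts samples, while only the array carries a shape attribute (toy shapes below):

```python
import numpy as np

single = np.zeros((5, 2))                     # one output: a 2-d array
multi = [np.zeros((5, 2)), np.zeros((5, 3))]  # two outputs: a list

# len() is ambiguous: it counts samples for the array, outputs for the list.
lengths = (len(single), len(multi))           # (5, 2)

# hasattr(..., 'shape') distinguishes the two unambiguously.
flags = (hasattr(single, 'shape'), hasattr(multi, 'shape'))
```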

jnothman (Member) commented Apr 2, 2017:
The fact that tests are passing makes me think I've missed something or tests are too weak...

mikebenfield (Author, Contributor) commented Apr 2, 2017:
In this revised function, out is always a list. It contains only one element in the single output case. The function could have either of these two interfaces:

  1. accept either a list of arrays or a single array, or
  2. only accept a list of arrays.

I figured (2) was simpler, but I can change to (1) if desired.

glemaitre (Contributor) commented Apr 3, 2017:
I don't have a strong opinion on either solution.
However, using isinstance() is straightforward given the aim of the if statement.
With your solution, I think you would need a comment preceding the if statement for the sake of clarity.
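The list-of-arrays interface mikebenfield describes resolves the thread: out is always a list with one array per output, so len(out) counts outputs, not samples. A minimal sketch of that interface, with fake_predict as an illustrative stand-in for an estimator's predict method:

```python
import numpy as np

def accumulate_prediction(predict, X, out):
    # `out` is always a list of arrays, one per output; single-output
    # callers pass [y_hat], so len(out) == 1 means "one output" here,
    # never "one sample".
    prediction = predict(X, check_input=False)
    if len(out) == 1:
        out[0] += prediction
    else:
        for i in range(len(out)):
            out[i] += prediction[i]

def fake_predict(X, check_input=True):
    # Illustrative stand-in: a (n_samples, n_classes) prediction.
    return np.ones((X.shape[0], 2))

y_hat = np.zeros((3, 2))
accumulate_prediction(fake_predict, np.zeros((3, 5)), [y_hat])
```

Wrapping the single array as [y_hat] at the call site is what lets one function serve both the regressor and the multi-output classifier.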

@jnothman jnothman changed the title from [MRG] Fix excessive memory usage in random forest prediction to [MRG+1] Fix excessive memory usage in random forest prediction on Apr 2, 2017
@jnothman (Member) commented Apr 3, 2017:

    -            backend="threading")(
    -            delayed(parallel_helper)(e, 'predict', X, check_input=False)
    +        Parallel(n_jobs=n_jobs, verbose=self.verbose, backend="threading")(
    +            delayed(accumulate_prediction)(e.predict, X, [y_hat])

jmschrei (Member) commented Apr 3, 2017:
I'm a bit confused about the [y_hat] aspect. It seems to indicate that, regardless of whether y_hat is multi-output or not, accumulate_prediction should treat it as single output. Can you clarify what's happening here?

@jmschrei (Member) commented Apr 3, 2017:

Thanks for this! It looks great, except for one small confusion I had. Otherwise it LGTM.

@jnothman (Member) commented Apr 3, 2017:
@jmschrei (Member) commented Apr 3, 2017:

Ah, got it, I didn't see that it was in RandomForestRegressor.

@jmschrei jmschrei changed the title [MRG+1] Fix excessive memory usage in random forest prediction [MRG+2] Fix excessive memory usage in random forest prediction Apr 3, 2017
@jmschrei jmschrei merged commit 9be0922 into scikit-learn:master Apr 3, 2017
5 checks passed:
    ci/circleci: Your tests passed on CircleCI!
    codecov/patch: 100% of diff hit (target 95.5%)
    codecov/project: 95.5% (+<.01%) compared to fb5a498
    continuous-integration/appveyor/pr: AppVeyor build succeeded
    continuous-integration/travis-ci/pr: The Travis CI build passed
@jmschrei (Member) commented Apr 3, 2017:

Thanks for the contribution!

@constantinpape commented Apr 3, 2017:

Cool, this will make random forests useful for predicting on larger data again.
Is there an upcoming scikit-learn release, so this could be used in a portable manner (i.e., not having to build from master)?

@jnothman (Member) commented Apr 3, 2017:
@GaelVaroquaux (Member) commented Apr 4, 2017:

That's very nice work, @mikebenfield ! Thanks!

massich added a commit to massich/scikit-learn that referenced this pull request on Apr 26, 2017: [MRG+2] Fix excessive memory usage in random forest prediction (…t-learn#8672)
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request on Jun 14, 2017: [MRG+2] Fix excessive memory usage in random forest prediction (…t-learn#8672)
NelleV added a commit to NelleV/scikit-learn that referenced this pull request on Aug 11, 2017: [MRG+2] Fix excessive memory usage in random forest prediction (…t-learn#8672)
paulha added a commit to paulha/scikit-learn that referenced this pull request on Aug 19, 2017: [MRG+2] Fix excessive memory usage in random forest prediction (…t-learn#8672)
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request on Nov 15, 2017: [MRG+2] Fix excessive memory usage in random forest prediction (…t-learn#8672)
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request on Dec 18, 2017: [MRG+2] Fix excessive memory usage in random forest prediction (…t-learn#8672)
6 participants