New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Fix excessive memory usage in random forest prediction #8672

Merged
merged 5 commits into from Apr 3, 2017

Conversation

Projects
None yet
6 participants
@mikebenfield
Contributor

mikebenfield commented Mar 30, 2017

Fix #8244.

That issue was about random forests using excessive memory in predicting. The problem was that the forest held onto each prediction from each estimator, only to combine them at the end. It was suggested in the thread to solve this by parallelizing over the instances instead of the estimators. Instead, I still parallelize over the trees, but keep an array out into which I immediately sum each output. Python's GIL makes this thread safe.

My testing indicates runtime performance is the same, but memory usage is substantially decreased. See this gist. Running python prob_classification.py n gives a memory increment (around line 587 of forest.py) in MiB of

n master fix2-forest-memory
100 13.3 1.6
200 25.5 2.0
300 38.1 2.2
400 50.1 2.3
500 62.6 2.4
600 75.7 2.2

I wasn't sure where you'd like the functions _run_estimator_ and _run_estimator2. Let me know if I should move them.

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Mar 31, 2017

Contributor

@mikebenfield Could you add the predict runtime of master and your PR for different sample size [1000, 10000, 100000, 1000000].

Contributor

glemaitre commented Mar 31, 2017

@mikebenfield Could you add the predict runtime of master and your PR for different sample size [1000, 10000, 100000, 1000000].

@mikebenfield

This comment has been minimized.

Show comment
Hide comment
@mikebenfield

mikebenfield Mar 31, 2017

Contributor

@glemaitre Using the same script I posted before, with n_estimators=200, I get this (in seconds, on my little 2-core laptop):

n_samples master fix2-forest-memory
100 0.108 0.107
1000 0.107 0.107
10000 0.113 0.103
100000 0.990 0.932
1000000 16.175 14.363
Contributor

mikebenfield commented Mar 31, 2017

@glemaitre Using the same script I posted before, with n_estimators=200, I get this (in seconds, on my little 2-core laptop):

n_samples master fix2-forest-memory
100 0.108 0.107
1000 0.107 0.107
10000 0.113 0.103
100000 0.990 0.932
1000000 16.175 14.363
@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Apr 1, 2017

Contributor

@ogrisel wdyt of this solution?

Contributor

glemaitre commented Apr 1, 2017

@ogrisel wdyt of this solution?

@jnothman

This looks like a very good thing. Thanks!

Show outdated Hide outdated sklearn/ensemble/forest.py
Show outdated Hide outdated sklearn/ensemble/forest.py
Show outdated Hide outdated sklearn/ensemble/forest.py
# Parallel loop
all_y_hat = Parallel(n_jobs=n_jobs, verbose=self.verbose,
backend="threading")(
delayed(parallel_helper)(e, 'predict', X, check_input=False)

This comment has been minimized.

@jnothman

jnothman Apr 2, 2017

Member

Given that this now works without parallel_helper due to the threading backend, I suspect we can remove all uses of parallel_helper in random forests. Another PR is welcome.

@jnothman

jnothman Apr 2, 2017

Member

Given that this now works without parallel_helper due to the threading backend, I suspect we can remove all uses of parallel_helper in random forests. Another PR is welcome.

def accumulate_prediction(predict, X, out):
prediction = predict(X, check_input=False)
if len(out) == 1:

This comment has been minimized.

@jnothman

jnothman Apr 2, 2017

Member

this isn't the right condition, I think. Isn't it examining a 2d array whose length is the number of samples. Rather you can test for the presence of attribute shape to distinguish array from list.

@jnothman

jnothman Apr 2, 2017

Member

this isn't the right condition, I think. Isn't it examining a 2d array whose length is the number of samples. Rather you can test for the presence of attribute shape to distinguish array from list.

This comment has been minimized.

@jnothman

jnothman Apr 2, 2017

Member

The fact that tests are passing makes me think I've missed something or tests are too weak...

@jnothman

jnothman Apr 2, 2017

Member

The fact that tests are passing makes me think I've missed something or tests are too weak...

This comment has been minimized.

@mikebenfield

mikebenfield Apr 2, 2017

Contributor

In this revised function, out is always a list. It contains only one element in the single output case. The function could have either of these two interfaces:

  1. accept either a list of arrays or a single array, or
  2. only accept a list of arrays.

I figured (2) was simpler, but I can change to (1) if desired.

@mikebenfield

mikebenfield Apr 2, 2017

Contributor

In this revised function, out is always a list. It contains only one element in the single output case. The function could have either of these two interfaces:

  1. accept either a list of arrays or a single array, or
  2. only accept a list of arrays.

I figured (2) was simpler, but I can change to (1) if desired.

This comment has been minimized.

@glemaitre

glemaitre Apr 3, 2017

Contributor

I don't have a strong opinion on any of the solution.
However, using if isinstance() is straight forward regarding the aim of the if statement.
With your solution, I think that you would need a comment preceding the if statement for the sake of clarity.

@glemaitre

glemaitre Apr 3, 2017

Contributor

I don't have a strong opinion on any of the solution.
However, using if isinstance() is straight forward regarding the aim of the if statement.
With your solution, I think that you would need a comment preceding the if statement for the sake of clarity.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Apr 2, 2017

Member
Member

jnothman commented Apr 2, 2017

@jnothman jnothman changed the title from [MRG] Fix excessive memory usage in random forest prediction to [MRG+1] Fix excessive memory usage in random forest prediction Apr 2, 2017

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Apr 3, 2017

Member
Member

jnothman commented Apr 3, 2017

backend="threading")(
delayed(parallel_helper)(e, 'predict', X, check_input=False)
Parallel(n_jobs=n_jobs, verbose=self.verbose, backend="threading")(
delayed(accumulate_prediction)(e.predict, X, [y_hat])

This comment has been minimized.

@jmschrei

jmschrei Apr 3, 2017

Member

I'm a bit confused about the [y_hat] aspect. It seems to indicate that regardless of if y_hat is multi-output or not, that accumulate_prediction should treat it as if it's single output? Can you clarify what's happening here?

@jmschrei

jmschrei Apr 3, 2017

Member

I'm a bit confused about the [y_hat] aspect. It seems to indicate that regardless of if y_hat is multi-output or not, that accumulate_prediction should treat it as if it's single output? Can you clarify what's happening here?

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Apr 3, 2017

Member

Thanks for this! It looks great, except for one small confusion I had. Otherwise it LGTM.

Member

jmschrei commented Apr 3, 2017

Thanks for this! It looks great, except for one small confusion I had. Otherwise it LGTM.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Apr 3, 2017

Member
Member

jnothman commented Apr 3, 2017

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Apr 3, 2017

Member

Ah, got it, didn't see that it was in RFRegresor.

Member

jmschrei commented Apr 3, 2017

Ah, got it, didn't see that it was in RFRegresor.

@jmschrei jmschrei changed the title from [MRG+1] Fix excessive memory usage in random forest prediction to [MRG+2] Fix excessive memory usage in random forest prediction Apr 3, 2017

@jmschrei jmschrei merged commit 9be0922 into scikit-learn:master Apr 3, 2017

5 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 95.5%)
Details
codecov/project 95.5% (+<.01%) compared to fb5a498
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Apr 3, 2017

Member

Thanks for the contribution!

Member

jmschrei commented Apr 3, 2017

Thanks for the contribution!

@constantinpape

This comment has been minimized.

Show comment
Hide comment
@constantinpape

constantinpape Apr 3, 2017

Cool, this will make the random-forest useful for predicting larger data again.
Is there an upcoming release for sklearn, so this could be used in a portable manner (i,e, not having to build from master)?

constantinpape commented Apr 3, 2017

Cool, this will make the random-forest useful for predicting larger data again.
Is there an upcoming release for sklearn, so this could be used in a portable manner (i,e, not having to build from master)?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Apr 3, 2017

Member
Member

jnothman commented Apr 3, 2017

@GaelVaroquaux

This comment has been minimized.

Show comment
Hide comment
@GaelVaroquaux

GaelVaroquaux Apr 4, 2017

Member

That's very nice work, @mikebenfield ! Thanks!

Member

GaelVaroquaux commented Apr 4, 2017

That's very nice work, @mikebenfield ! Thanks!

massich added a commit to massich/scikit-learn that referenced this pull request Apr 26, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (#8672)
[MRG+2] Fix excessive memory usage in random forest prediction

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (#8672)
[MRG+2] Fix excessive memory usage in random forest prediction

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (#8672)
[MRG+2] Fix excessive memory usage in random forest prediction

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (#8672)
[MRG+2] Fix excessive memory usage in random forest prediction

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (#8672)
[MRG+2] Fix excessive memory usage in random forest prediction

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (#8672)
[MRG+2] Fix excessive memory usage in random forest prediction
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment