[MRG+2] Fix excessive memory usage in random forest prediction #8672

mikebenfield · 2017-03-30T18:50:56Z

That issue was about random forests using excessive memory in predicting. The problem was that the forest held onto each prediction from each estimator, only to combine them at the end. It was suggested in the thread to solve this by parallelizing over the instances instead of the estimators. Instead, I still parallelize over the trees, but keep an array out into which I immediately sum each output. Python's GIL makes this thread safe.

My testing indicates runtime performance is the same, but memory usage is substantially decreased. See this gist. Running python prob_classification.py n gives a memory increment (around line 587 of forest.py) in MiB of

n	master	fix2-forest-memory
100	13.3	1.6
200	25.5	2.0
300	38.1	2.2
400	50.1	2.3
500	62.6	2.4
600	75.7	2.2

I wasn't sure where you'd like the functions _run_estimator_ and _run_estimator2. Let me know if I should move them.

glemaitre · 2017-03-31T07:51:53Z

@mikebenfield Could you add the predict runtime of master and your PR for different sample size [1000, 10000, 100000, 1000000].

mikebenfield · 2017-03-31T23:38:19Z

@glemaitre Using the same script I posted before, with n_estimators=200, I get this (in seconds, on my little 2-core laptop):

n_samples	master	fix2-forest-memory
100	0.108	0.107
1000	0.107	0.107
10000	0.113	0.103
100000	0.990	0.932
`1000000`	16.175	14.363

glemaitre · 2017-04-01T05:47:59Z

sklearn/ensemble/forest.py

        if self.n_outputs_ == 1:
-            for j in range(1, len(all_proba)):
-                proba += all_proba[j]
+            out = np.zeros((X.shape[0], self.n_classes_), dtype=np.float64)


I would keep proba and all_proba instead of out regarding the naming.

glemaitre · 2017-04-01T05:48:44Z

@ogrisel wdyt of this solution?

jnothman

This looks like a very good thing. Thanks!

jnothman · 2017-04-02T03:33:24Z

sklearn/ensemble/forest.py

+    out += f(X, check_input=False)
+
+
+def _run_estimator2(f, X, out):


call this accumulate_prediction_multioutput, though I'd be interested in merging the single output case in to avoid lots of duplicated logic.

jnothman · 2017-04-02T03:33:26Z

sklearn/ensemble/forest.py

+# they would be defined locally in ForestClassifier or ForestRegressor, but
+# joblib complains that it cannot Pickle them when placed there.
+
+def _run_estimator(f, X, out):


call this accumulate_prediction. Please call f func or predict.

jnothman · 2017-04-02T03:36:48Z

sklearn/ensemble/forest.py

-                for k in range(self.n_outputs_):
-                    proba[k] += all_proba[j][k]
+            all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
+                         for j in self.n_classes_]


if you do np.atleast_1d around self.n_classes_ then you can generalise this to the 1 output case quite easily

jnothman · 2017-04-02T03:39:55Z

sklearn/ensemble/forest.py

        # Parallel loop
-        all_y_hat = Parallel(n_jobs=n_jobs, verbose=self.verbose,
-                             backend="threading")(
-            delayed(parallel_helper)(e, 'predict', X, check_input=False)


Given that this now works without parallel_helper due to the threading backend, I suspect we can remove all uses of parallel_helper in random forests. Another PR is welcome.

jnothman · 2017-04-02T21:59:35Z

sklearn/ensemble/forest.py

+
+def accumulate_prediction(predict, X, out):
+    prediction = predict(X, check_input=False)
+    if len(out) == 1:


this isn't the right condition, I think. Isn't it examining a 2d array whose length is the number of samples. Rather you can test for the presence of attribute shape to distinguish array from list.

The fact that tests are passing makes me think I've missed something or tests are too weak...

In this revised function, out is always a list. It contains only one element in the single output case. The function could have either of these two interfaces:

accept either a list of arrays or a single array, or

only accept a list of arrays.

I figured (2) was simpler, but I can change to (1) if desired.

I don't have a strong opinion on any of the solution.
However, using if isinstance() is straight forward regarding the aim of the if statement.
With your solution, I think that you would need a comment preceding the if statement for the sake of clarity.

jnothman · 2017-04-02T22:54:34Z

oh of course. you're testing out, not the prediction. LGTM!

…

On 3 Apr 2017 8:46 am, "mikebenfield" ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In sklearn/ensemble/forest.py <#8672 (comment)> : > - -def _run_estimator(f, X, out): - out += f(X, check_input=False) - - -def _run_estimator2(f, X, out): - all_proba = f(X, check_input=False) - for i in range(len(out)): - out[i] += all_proba[i] +# This is a utility function for joblib's Parallel. It can't go locally in +# ForestClassifier or ForestRegressor, because joblib complains that it cannot +# pickle it when placed there. + +def accumulate_prediction(predict, X, out): + prediction = predict(X, check_input=False) + if len(out) == 1: In this revised function, out is always a list. It contains only one element in the single output case. The function could 1. accept either a list of arrays or a single array, or 2. only accept a list of arrays. I figured (2) was simpler, but I can change to (1) if desired. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#8672 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6wohebeB_ySN_ixEZlH-9UPmaPIYks5rsCUzgaJpZM4MuxbP> .

jnothman · 2017-04-03T00:39:37Z

you may wish to add an entry in what's new

…

On 3 Apr 2017 8:54 am, "Joel Nothman" ***@***.***> wrote: oh of course. you're testing out, not the prediction. LGTM! On 3 Apr 2017 8:46 am, "mikebenfield" ***@***.***> wrote: > ***@***.**** commented on this pull request. > ------------------------------ > > In sklearn/ensemble/forest.py > <#8672 (comment)> > : > > > - > -def _run_estimator(f, X, out): > - out += f(X, check_input=False) > - > - > -def _run_estimator2(f, X, out): > - all_proba = f(X, check_input=False) > - for i in range(len(out)): > - out[i] += all_proba[i] > +# This is a utility function for joblib's Parallel. It can't go locally in > +# ForestClassifier or ForestRegressor, because joblib complains that it cannot > +# pickle it when placed there. > + > +def accumulate_prediction(predict, X, out): > + prediction = predict(X, check_input=False) > + if len(out) == 1: > > In this revised function, out is always a list. It contains only one > element in the single output case. The function could > > 1. accept either a list of arrays or a single array, or > 2. only accept a list of arrays. > > I figured (2) was simpler, but I can change to (1) if desired. > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#8672 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AAEz6wohebeB_ySN_ixEZlH-9UPmaPIYks5rsCUzgaJpZM4MuxbP> > . >

jmschrei · 2017-04-03T06:22:08Z

sklearn/ensemble/forest.py

-                             backend="threading")(
-            delayed(parallel_helper)(e, 'predict', X, check_input=False)
+        Parallel(n_jobs=n_jobs, verbose=self.verbose, backend="threading")(
+            delayed(accumulate_prediction)(e.predict, X, [y_hat])


I'm a bit confused about the [y_hat] aspect. It seems to indicate that regardless of if y_hat is multi-output or not, that accumulate_prediction should treat it as if it's single output? Can you clarify what's happening here?

jmschrei · 2017-04-03T06:22:28Z

Thanks for this! It looks great, except for one small confusion I had. Otherwise it LGTM.

jnothman · 2017-04-03T09:39:20Z

That's just to make the regression predicting look more like predict_proba

…

On 3 Apr 2017 4:22 pm, "Jacob Schreiber" ***@***.***> wrote: Thanks for this! It looks great, except for one small confusion I had. Otherwise it LGTM. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#8672 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6641s4SjiSMtvVhqrOtB_T8WNtA6ks5rsJAmgaJpZM4MuxbP> .

jmschrei · 2017-04-03T16:32:24Z

Ah, got it, didn't see that it was in RFRegresor.

jmschrei · 2017-04-03T16:33:08Z

Thanks for the contribution!

constantinpape · 2017-04-03T23:08:24Z

Cool, this will make the random-forest useful for predicting larger data again.
Is there an upcoming release for sklearn, so this could be used in a portable manner (i,e, not having to build from master)?

jnothman · 2017-04-03T23:34:18Z

We're aiming for a release before the end of May, right, @amueller?

…

On 4 April 2017 at 09:08, Constantin Pape ***@***.***> wrote: Cool, this will make the random-forest useful for predicting larger data again. Is there an upcoming release for sklearn, so this could be used in a portable manner (i,e, not having to build from master)? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#8672 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz627n1rFmjoPtgy3Ie0_l-j8jRCY4ks5rsXvqgaJpZM4MuxbP> .

GaelVaroquaux · 2017-04-04T05:46:06Z

That's very nice work, @mikebenfield ! Thanks!

…t-learn#8672) [MRG+2] Fix excessive memory usage in random forest prediction

mikebenfield added 3 commits March 30, 2017 11:27

Profiling.

379ff2d

Initial fix.

5e28248

Move run_estimator; remove memory profiling.

88c0720

glemaitre reviewed Apr 1, 2017

View reviewed changes

Rename out variable.

806ed48

jnothman reviewed Apr 2, 2017

View reviewed changes

More renames, merge single and multioutput case.

27aad4d

jnothman reviewed Apr 2, 2017

View reviewed changes

jnothman changed the title ~~[MRG] Fix excessive memory usage in random forest prediction~~ [MRG+1] Fix excessive memory usage in random forest prediction Apr 2, 2017

jmschrei reviewed Apr 3, 2017

View reviewed changes

jmschrei changed the title ~~[MRG+1] Fix excessive memory usage in random forest prediction~~ [MRG+2] Fix excessive memory usage in random forest prediction Apr 3, 2017

jmschrei merged commit 9be0922 into scikit-learn:master Apr 3, 2017

mikebenfield mentioned this pull request Apr 21, 2017

[MRG] Remove now-unnecessary uses of parallel_helper in forest.py. #8772

Closed

massich pushed a commit to massich/scikit-learn that referenced this pull request Apr 26, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (sciki…

b5580de

…t-learn#8672) [MRG+2] Fix excessive memory usage in random forest prediction

lesteve mentioned this pull request Apr 28, 2017

Add ability to override joblib backend for scikit-learn estimators #8804

Closed

lesteve mentioned this pull request Jun 2, 2017

Add allow_override kwarg to Parallel joblib/joblib#524

Closed

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (sciki…

83bc543

…t-learn#8672) [MRG+2] Fix excessive memory usage in random forest prediction

NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (sciki…

c4beab7

…t-learn#8672) [MRG+2] Fix excessive memory usage in random forest prediction

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (sciki…

5445899

…t-learn#8672) [MRG+2] Fix excessive memory usage in random forest prediction

lesteve mentioned this pull request Sep 28, 2017

[MRG] FIX Avoid accumulating forest predictions in non-threadsafe manner #9830

Merged

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (sciki…

d20cc5b

…t-learn#8672) [MRG+2] Fix excessive memory usage in random forest prediction

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG+1] Fix excessive memory usage in random forest prediction (sciki…

5f556d0

…t-learn#8672) [MRG+2] Fix excessive memory usage in random forest prediction

		out += f(X, check_input=False)


		def _run_estimator2(f, X, out):

Uh oh!

[MRG+2] Fix excessive memory usage in random forest prediction #8672

[MRG+2] Fix excessive memory usage in random forest prediction #8672

Uh oh!

Conversation

mikebenfield commented Mar 30, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

glemaitre commented Mar 31, 2017

Uh oh!

mikebenfield commented Mar 31, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

glemaitre commented Apr 1, 2017

Uh oh!

jnothman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mikebenfield Apr 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jnothman commented Apr 2, 2017 via email

Uh oh!

jnothman commented Apr 3, 2017 via email

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmschrei commented Apr 3, 2017

Uh oh!

jnothman commented Apr 3, 2017 via email

Uh oh!

jmschrei commented Apr 3, 2017

Uh oh!

jmschrei commented Apr 3, 2017

Uh oh!

constantinpape commented Apr 3, 2017

Uh oh!

jnothman commented Apr 3, 2017 via email

Uh oh!

GaelVaroquaux commented Apr 4, 2017

Uh oh!

Uh oh!

mikebenfield commented Mar 30, 2017 •

edited

Loading

mikebenfield commented Mar 31, 2017 •

edited

Loading

mikebenfield Apr 2, 2017 •

edited

Loading