[MRG+2] Fix excessive memory usage in random forest prediction #8672
Conversation
@mikebenfield Could you add the …
@glemaitre Using the same script I posted before, with …
if self.n_outputs_ == 1:
    for j in range(1, len(all_proba)):
        proba += all_proba[j]
out = np.zeros((X.shape[0], self.n_classes_), dtype=np.float64)
glemaitre
Apr 1, 2017
Contributor
I would keep proba and all_proba instead of out regarding the naming.
@ogrisel wdyt of this solution?
This looks like a very good thing. Thanks!
    out += f(X, check_input=False)


def _run_estimator2(f, X, out):
jnothman
Apr 2, 2017
Member
call this accumulate_prediction_multioutput, though I'd be interested in merging the single-output case in to avoid lots of duplicated logic.
# they would be defined locally in ForestClassifier or ForestRegressor, but
# joblib complains that it cannot Pickle them when placed there.


def _run_estimator(f, X, out):
jnothman
Apr 2, 2017
Member
call this accumulate_prediction. Please call f either func or predict.
    for k in range(self.n_outputs_):
        proba[k] += all_proba[j][k]
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in self.n_classes_]
jnothman
Apr 2, 2017
Member
if you do np.atleast_1d around self.n_classes_ then you can generalise this to the 1 output case quite easily.
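A minimal sketch of what that generalisation could look like, using a hypothetical helper name and toy shapes rather than the code that was actually merged:

```python
import numpy as np

def allocate_proba(n_samples, n_classes):
    # Hypothetical helper: np.atleast_1d turns the scalar n_classes of the
    # single-output case into a length-1 array, so one comprehension covers
    # both the single-output and multi-output cases.
    return [np.zeros((n_samples, k), dtype=np.float64)
            for k in np.atleast_1d(n_classes)]

single = allocate_proba(10, 3)        # one (10, 3) array
multi = allocate_proba(10, [3, 5])    # one (10, 3) and one (10, 5) array
```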
# Parallel loop
all_y_hat = Parallel(n_jobs=n_jobs, verbose=self.verbose,
                     backend="threading")(
    delayed(parallel_helper)(e, 'predict', X, check_input=False)
jnothman
Apr 2, 2017
Member
Given that this now works without parallel_helper due to the threading backend, I suspect we can remove all uses of parallel_helper in random forests. Another PR is welcome.
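For context, a small sketch of why parallel_helper becomes unnecessary with the threading backend: nothing is pickled, so a bound method can be passed to delayed directly. The ToyTree class and sizes are made up for illustration, and joblib is imported directly rather than through scikit-learn's vendored copy:

```python
import numpy as np
from joblib import Parallel, delayed

class ToyTree:
    # Stand-in for a fitted tree; only the calling pattern matters here.
    def predict(self, X):
        return X.sum(axis=1)

X = np.ones((4, 2))
trees = [ToyTree() for _ in range(3)]

# With backend="threading" nothing is pickled, so the bound method
# tree.predict can be handed to delayed() directly and the
# parallel_helper indirection is not needed.
preds = Parallel(n_jobs=2, backend="threading")(
    delayed(tree.predict)(X) for tree in trees)
```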
def accumulate_prediction(predict, X, out):
    prediction = predict(X, check_input=False)
    if len(out) == 1:
jnothman
Apr 2, 2017
Member
this isn't the right condition, I think. Isn't it examining a 2d array whose length is the number of samples? Rather, you can test for the presence of the attribute shape to distinguish array from list.
jnothman
Apr 2, 2017
Member
The fact that tests are passing makes me think I've missed something or tests are too weak...
mikebenfield
Apr 2, 2017
Author, Contributor
In this revised function, out is always a list. It contains only one element in the single output case. The function could have either of these two interfaces:
1. accept either a list of arrays or a single array, or
2. only accept a list of arrays.
I figured (2) was simpler, but I can change to (1) if desired.
glemaitre
Apr 3, 2017
Contributor
I don't have a strong opinion on either solution. However, using if isinstance() is straightforward regarding the aim of the if statement. With your solution, I think you would need a comment preceding the if statement for the sake of clarity.
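A minimal sketch of interface (2), where out is always a list of per-output accumulator arrays and the single-output case is just a one-element list. This is a simplified standalone version (the real helper also passes check_input=False), not the merged code verbatim:

```python
import numpy as np

def accumulate_prediction(predict, X, out):
    # `out` is always a list with one accumulator array per output, so the
    # single-output case is simply a one-element list and len(out) really is
    # the number of outputs.
    prediction = predict(X)
    if len(out) == 1:
        out[0] += prediction
    else:
        for i in range(len(out)):
            out[i] += prediction[i]

# Single-output usage: wrap the one accumulator array in a list.
X = np.ones((5, 2))
y_hat = np.zeros(5)
accumulate_prediction(lambda X: X.sum(axis=1), X, [y_hat])

# Multi-output usage: one accumulator array per output.
all_proba = [np.zeros((5, 3)), np.zeros((5, 2))]
accumulate_prediction(lambda X: [np.ones((5, 3)), np.ones((5, 2))], X, all_proba)
```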
oh of course. you're testing out, not the prediction.
LGTM!
you may wish to add an entry in what's new
backend="threading")( | ||
delayed(parallel_helper)(e, 'predict', X, check_input=False) | ||
Parallel(n_jobs=n_jobs, verbose=self.verbose, backend="threading")( | ||
delayed(accumulate_prediction)(e.predict, X, [y_hat]) |
jmschrei
Apr 3, 2017
Member
I'm a bit confused about the [y_hat] aspect. It seems to indicate that, regardless of whether y_hat is multi-output or not, accumulate_prediction should treat it as if it's single output? Can you clarify what's happening here?
Thanks for this! It looks great, except for one small confusion I had. Otherwise it LGTM.
That's just to make the regression predicting look more like predict_proba.
Ah, got it, didn't see that it was in RFRegressor.
Thanks for the contribution!
Cool, this will make the random-forest useful for predicting larger data again. Is there an upcoming release for sklearn, so this could be used in a portable manner (i.e. not having to build from master)?
We're aiming for a release before the end of May, right, @amueller?
That's very nice work, @mikebenfield! Thanks!
Fix #8244.
That issue was about random forests using excessive memory in predicting. The problem was that the forest held onto each prediction from each estimator, only to combine them at the end. It was suggested in the thread to solve this by parallelizing over the instances instead of the estimators. Instead, I still parallelize over the trees, but keep an array out into which I immediately sum each output. Python's GIL makes this thread safe.
My testing indicates runtime performance is the same, but memory usage is substantially decreased. See this gist. Running python prob_classification.py n gives the memory increment (around line 587 of forest.py) in MiB.
I wasn't sure where you'd like the functions _run_estimator and _run_estimator2. Let me know if I should move them.
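For readers skimming the thread, a self-contained sketch of the accumulation pattern described above, reusing the simplified accumulate_prediction sketched earlier. The ToyTree estimator, sizes, and constant predictions are made up for illustration; the real implementation lives in sklearn/ensemble/forest.py and also passes check_input=False to each tree's predict:

```python
import numpy as np
from joblib import Parallel, delayed

def accumulate_prediction(predict, X, out):
    # Sum one estimator's prediction into the shared accumulator list.
    prediction = predict(X)
    if len(out) == 1:
        out[0] += prediction
    else:
        for i in range(len(out)):
            out[i] += prediction[i]

class ToyTree:
    # Stand-in for a fitted tree: predicts a constant value per sample.
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(X.shape[0], self.value, dtype=np.float64)

X = np.ones((1000, 5))
trees = [ToyTree(v) for v in (1.0, 2.0, 3.0)]

# One shared accumulator instead of one array per tree: each worker adds its
# tree's prediction in place, relying (as the description above says) on the
# threading backend and the GIL, so peak memory stays O(n_samples) rather
# than O(n_samples * n_estimators).
y_hat = np.zeros(X.shape[0], dtype=np.float64)
Parallel(n_jobs=2, backend="threading")(
    delayed(accumulate_prediction)(tree.predict, X, [y_hat]) for tree in trees)
y_hat /= len(trees)  # average over estimators, as the forest does
print(y_hat[:3])     # [2. 2. 2.]
```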