[MRG+2] Fix excessive memory usage in random forest prediction #8672
Conversation
@mikebenfield Could you add the …

@glemaitre Using the same script I posted before, with …
sklearn/ensemble/forest.py (outdated):

```python
if self.n_outputs_ == 1:
    for j in range(1, len(all_proba)):
        proba += all_proba[j]
out = np.zeros((X.shape[0], self.n_classes_), dtype=np.float64)
```
I would keep `proba` and `all_proba` instead of `out` regarding the naming.
@ogrisel wdyt of this solution?
This looks like a very good thing. Thanks!
sklearn/ensemble/forest.py (outdated):

```python
    out += f(X, check_input=False)


def _run_estimator2(f, X, out):
```
call this `accumulate_prediction_multioutput`, though I'd be interested in merging the single output case in to avoid lots of duplicated logic.
sklearn/ensemble/forest.py (outdated):

```python
# they would be defined locally in ForestClassifier or ForestRegressor, but
# joblib complains that it cannot Pickle them when placed there.


def _run_estimator(f, X, out):
```
call this `accumulate_prediction`. Please call `f` `func` or `predict`.
sklearn/ensemble/forest.py (outdated):

```python
for k in range(self.n_outputs_):
    proba[k] += all_proba[j][k]
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in self.n_classes_]
```
if you do `np.atleast_1d` around `self.n_classes_` then you can generalise this to the 1 output case quite easily
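For illustration, a minimal sketch of that suggestion, with hypothetical stand-in values for the estimator attributes:

```python
import numpy as np

# Hypothetical stand-ins: a single-output classifier stores an int in
# n_classes_, a multi-output one stores a list of ints.
n_samples = 10
for n_classes_ in (3, [3, 2]):
    # np.atleast_1d turns the scalar into a length-1 array, so the same
    # comprehension covers both the single- and multi-output case.
    all_proba = [np.zeros((n_samples, j), dtype=np.float64)
                 for j in np.atleast_1d(n_classes_)]
    print([a.shape for a in all_proba])  # [(10, 3)] then [(10, 3), (10, 2)]
```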
```python
# Parallel loop
all_y_hat = Parallel(n_jobs=n_jobs, verbose=self.verbose,
                     backend="threading")(
    delayed(parallel_helper)(e, 'predict', X, check_input=False)
```
Given that this now works without `parallel_helper` due to the threading backend, I suspect we can remove all uses of `parallel_helper` in random forests. Another PR is welcome.
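For illustration, a minimal sketch of that point with a toy forest (using the standalone joblib package; not the PR's code):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(100, 4)
forest = RandomForestRegressor(n_estimators=10).fit(X, np.random.rand(100))

# The threading backend pickles nothing, so a bound method like e.predict
# can be passed to delayed() directly; parallel_helper's getattr
# indirection was only needed when arguments had to be picklable.
preds = Parallel(n_jobs=2, backend="threading")(
    delayed(e.predict)(X, check_input=False) for e in forest.estimators_)
```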
```python
def accumulate_prediction(predict, X, out):
    prediction = predict(X, check_input=False)
    if len(out) == 1:
```
this isn't the right condition, I think. Isn't it examining a 2d array whose length is the number of samples? Rather, you can test for the presence of the attribute `shape` to distinguish an array from a list.
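For illustration, the distinction being suggested here, with hypothetical arrays:

```python
import numpy as np

single = np.zeros((5, 3))                      # one 2-d probability array
multi = [np.zeros((5, 3)), np.zeros((5, 2))]   # list of per-output arrays

# len() cannot tell these apart reliably (len(single) is the number of
# samples), but only the ndarray has a `shape` attribute.
for pred in (single, multi):
    print(hasattr(pred, 'shape'))  # True for the array, False for the list
```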
The fact that tests are passing makes me think I've missed something or tests are too weak...
In this revised function, `out` is always a list. It contains only one element in the single output case. The function could have either of these two interfaces:

1. accept either a list of arrays or a single array, or
2. only accept a list of arrays.

I figured (2) was simpler, but I can change to (1) if desired.
I don't have a strong opinion on either solution. However, using `isinstance()` is straightforward regarding the aim of the `if` statement. With your solution, I think that you would need a comment preceding the `if` statement for the sake of clarity.
oh of course. you're testing `out`, not the prediction. LGTM!
you may wish to add an entry in what's new
backend="threading")( | ||
delayed(parallel_helper)(e, 'predict', X, check_input=False) | ||
Parallel(n_jobs=n_jobs, verbose=self.verbose, backend="threading")( | ||
delayed(accumulate_prediction)(e.predict, X, [y_hat]) |
I'm a bit confused about the `[y_hat]` aspect. It seems to indicate that regardless of whether y_hat is multi-output or not, `accumulate_prediction` should treat it as if it's single output? Can you clarify what's happening here?

Thanks for this! It looks great, except for one small confusion I had. Otherwise it LGTM.
That's just to make the regression prediction look more like `predict_proba`.
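For illustration, a schematic version of that symmetry, completing the helper from the excerpt above with dummy predict functions in place of real trees (simplified signature, hypothetical shapes):

```python
import numpy as np


def accumulate_prediction(predict, X, out):
    # Simplified sketch of the PR's helper: `out` is always a list of
    # accumulator arrays, with a single element in the single-output case.
    prediction = predict(X)
    if len(out) == 1:
        out[0] += prediction
    else:
        for i in range(len(out)):
            out[i] += prediction[i]


X = np.zeros((8, 4))

# Regression: the lone accumulator is wrapped in a list at the call site,
# so it takes the same path as single-output predict_proba.
y_hat = np.zeros(8)
accumulate_prediction(lambda X: np.ones(len(X)), X, [y_hat])

# Multi-output predict_proba: one accumulator per output.
all_proba = [np.zeros((8, 3)), np.zeros((8, 2))]
accumulate_prediction(
    lambda X: [np.ones((len(X), 3)), np.ones((len(X), 2))], X, all_proba)
```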
Ah, got it, didn't see that it was in RFRegressor.
Thanks for the contribution!
Cool, this will make the random forest useful for predicting larger data again. Is there an upcoming release for sklearn, so this could be used in a portable manner (i.e. not having to build from master)?
We're aiming for a release before the end of May, right, @amueller?
That's very nice work, @mikebenfield! Thanks!
Fix #8244.

That issue was about random forests using excessive memory in prediction. The problem was that the forest held onto each prediction from each estimator, only to combine them at the end. It was suggested in the thread to solve this by parallelizing over the instances instead of the estimators. Instead, I still parallelize over the trees, but keep an array `out` into which I immediately sum each output. Python's GIL makes this thread safe.

My testing indicates runtime performance is the same, but memory usage is substantially decreased. See this gist. Running `python prob_classification.py n` gives a memory increment (around line 587 of forest.py) in MiB.

I wasn't sure where you'd like the functions `_run_estimator` and `_run_estimator2`. Let me know if I should move them.
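For illustration, a minimal self-contained sketch of this accumulation pattern (a toy forest and the standalone joblib package, not the exact merged code):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(1000, 5)
forest = RandomForestRegressor(n_estimators=50).fit(X, np.random.rand(1000))


def accumulate_prediction(predict, X, out):
    # Sum each tree's prediction into a shared buffer as soon as it is
    # computed, instead of holding n_estimators predictions in memory;
    # the PR relies on CPython's GIL to make the in-place += thread safe.
    prediction = predict(X, check_input=False)
    if len(out) == 1:
        out[0] += prediction
    else:
        for i in range(len(out)):
            out[i] += prediction[i]


y_hat = np.zeros(X.shape[0], dtype=np.float64)
Parallel(n_jobs=2, backend="threading")(
    delayed(accumulate_prediction)(e.predict, X, [y_hat])
    for e in forest.estimators_)
y_hat /= len(forest.estimators_)
```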