[MRG+2] parallelized VotingClassifier #5805
Conversation
In case someone wants to test it with nested multithreading: e.g. in the code below we call the parallelized version of VotingClassifier from cross_val_score. In both cases everything works fine and gives correct results.
Could you please add tests showing that? Same for the
Ok, I'll add a couple of tests with multithreading/multiprocessing. I've already tested it on the existing tests from test_voting_classifier, with different default threading/processing parameters.
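A test for the nested-parallelism scenario described above might look roughly like the sketch below. It uses the current public scikit-learn API (`VotingClassifier` with `n_jobs`, `cross_val_score`); the estimator mix and parameter values are illustrative, not from the PR.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

def make_voter(n_jobs):
    # Soft-voting ensemble; n_jobs controls how many estimators fit at once.
    return VotingClassifier(
        estimators=[('lr', LogisticRegression(max_iter=1000)),
                    ('gnb', GaussianNB())],
        voting='soft', n_jobs=n_jobs)

# Nested parallelism: parallel CV around a parallel VotingClassifier.
scores_nested = cross_val_score(make_voter(2), X, y, cv=3, n_jobs=2)
scores_serial = cross_val_score(make_voter(1), X, y, cv=3, n_jobs=1)

# Both runs are deterministic, so the scores must agree exactly.
np.testing.assert_array_almost_equal(scores_nested, scores_serial)
```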
assert_array_equal(eclf1.predict(X), eclf2.predict(X))
assert_array_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))
Two empty lines between functions please (PEP8). There are a few more instances.
The number of jobs to run in parallel for both `fit` and `predict`.
If -1, then the number of jobs is set to the number of cores.

backend : str, {'multiprocessing', 'threading'} (default='multiprocessing')
I don't think we should allow the user to choose this. AFAIK, the threading backend is useful only if the code runs in a nogil block, and the user is highly unlikely to know in which estimators that is done. We should probably just allow the default "multiprocessing" backend, as in cross_val_score and other places.
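For reference, the two backends under discussion differ only in how workers run, not in results. A minimal illustration using the standalone joblib package (at the time of this PR joblib was vendored as sklearn.externals.joblib, but the API is the same):

```python
from joblib import Parallel, delayed

def square(i):
    return i * i

# Default (process-based) backend vs the explicit threading backend; the
# results are identical, but threading only pays off when the workers
# release the GIL (e.g. inside compiled, nogil code).
by_default = Parallel(n_jobs=2)(delayed(square)(i) for i in range(5))
by_threads = Parallel(n_jobs=2, backend='threading')(
    delayed(square)(i) for i in range(5))
assert by_default == by_threads == [0, 1, 4, 9, 16]
```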
@olologin Could you rebase?
Force-pushed from 955733c to 4c00284.
assert_array_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))

def test_parallel_majority_label_iris():
Is this test not redundant? Isn't the previous test sufficient?
It tests "hard" voting stability (so that the 1-threaded and 2-threaded versions give the same results) on the iris dataset, while the previous one tests the 'soft' version. Anyway, the first test is fast (small artificial dataset), so it shouldn't increase testing time noticeably.
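The stability check being described could be sketched like this (estimator choices here are illustrative; the actual test in the PR may use different ones):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

def fit_voter(n_jobs):
    # 'hard' voting: the ensemble predicts the majority class label.
    return VotingClassifier(
        estimators=[('lr', LogisticRegression(max_iter=1000)),
                    ('gnb', GaussianNB())],
        voting='hard', n_jobs=n_jobs).fit(X, y)

# Fitting with 1 or 2 jobs must not change the predictions.
pred_1 = fit_voter(1).predict(X)
pred_2 = fit_voter(2).predict(X)
np.testing.assert_array_equal(pred_1, pred_2)
```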
LGTM pending nitpick @olologin cc: @agramfort. @TomDLT for a quick second pass.
And sorry for the delays.
@MechCoder, actually I almost forgot about this PR; I thought it was someone else's :) Could you take a look at this one too: #6116? I think it's a good PR too.
I can try after the semester break, if no one gets to it by then.
self.estimators_ = Parallel(n_jobs=self.n_jobs)(
    delayed(_parallel_fit_estimator)(
        clone(clf),
A few cosmetic issues here.
- Let's avoid this level of nesting.
- closing parentheses conventionally appear right after what they're closing, not on a new line.
Could you sort this minor issue out?
For many estimators (e.g. linear models), predicting in parallel will degrade performance, particularly with the multiprocessing backend. Perhaps we should be using the threading backend at predict time, for which overhead is much smaller, or remove it altogether. You're welcome to benchmark, but I'm not sure how you'd come up with a realistic set of voters in the ensemble.
@jnothman, yes, you are right about this. I made some benchmarks with this code:
The version with the multiprocessing backend takes ~2.1 s, the threading backend ~1.3 s, and without any parallelization ~1.9 s for predict_proba and predict. I think I could either revert the parallelization in the predict methods completely (though the threading backend is faster, the difference is not that big) or change it to the threading backend. What do you think is the best solution?
I don't feel like I have the expertise to judge this, but would lean conservatively (i.e. no parallelism) since we can't know the makeup of the ensemble, and prediction time is small relative to fit time.
Also, all prediction can be parallelised over samples, so slow predictors can deal with that, I suppose.
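A rough benchmark along the lines discussed above could be set up like this. It times predict_proba over a small bank of fitted estimators under no parallelism, the threading backend, and the multiprocessing backend; the dataset size and estimator mix are arbitrary, which is exactly the difficulty raised about getting realistic numbers:

```python
import time
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clfs = [LogisticRegression(max_iter=1000).fit(X, y) for _ in range(4)]

def timed(backend, n_jobs):
    # Time one round of predict_proba over all estimators.
    start = time.perf_counter()
    if n_jobs == 1:
        [clf.predict_proba(X) for clf in clfs]
    else:
        Parallel(n_jobs=n_jobs, backend=backend)(
            delayed(clf.predict_proba)(X) for clf in clfs)
    return time.perf_counter() - start

for label, backend, n_jobs in [('serial', None, 1),
                               ('threading', 'threading', 2),
                               ('multiprocessing', 'multiprocessing', 2)]:
    print('%-16s %.3fs' % (label, timed(backend, n_jobs)))
```

For cheap per-estimator work like this, the process backend's serialization overhead typically dominates, which is why the thread timings in the comment above come out ahead.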
Force-pushed from 904a172 to 520089c.
We can merge after rebase.
Force-pushed from e950b2f to 1c2d6b4.
Force-pushed from 1c2d6b4 to 86d02f1.
Uh, @MechCoder, I've not actually given this my +1.
(nor has anyone but you)
assert_array_equal(eclf1.predict(X), eclf2.predict(X))
assert_array_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))
sample_weight_ = np.random.RandomState(123).uniform(size=(len(y),)) |
I don't get why there's an `_` after `sample_weight` here.
('lr', clf1), ('svc', clf3), ('knn', clf4)],
voting='soft')
msg = ('Underlying estimator \'knn\' does not support sample weights.')
assert_raise_message(ValueError, msg, eclf3.fit, X, y, sample_weight)
PEP8 requires newline at end of file
LGTM!
@jnothman, thanks. Sorry for having so many style problems in such a small PR :) I work with C++ at my job, so a different code style sometimes kicks in unintentionally.
No big deal. A commit hook running flake8 on changed files can help... |
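Such a hook could be sketched as a small Python script saved as `.git/hooks/pre-commit` (and made executable). This is a hypothetical example, not part of the PR; it assumes flake8 is on PATH and blocks the commit when any staged .py file has violations:

```python
#!/usr/bin/env python
"""Pre-commit hook: run flake8 on staged Python files only."""
import subprocess
import sys

def staged_python_files():
    # List staged (added/copied/modified) files and keep only .py ones.
    out = subprocess.run(
        ['git', 'diff', '--cached', '--name-only', '--diff-filter=ACM'],
        capture_output=True, text=True).stdout
    return [f for f in out.splitlines() if f.endswith('.py')]

def main():
    files = staged_python_files()
    if files:
        # Non-zero exit aborts the commit until the violations are fixed.
        sys.exit(subprocess.call(['flake8'] + files))

if __name__ == '__main__':
    main()
```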
Thanks! |
…it-learn#5805)

* parallelized VotingClassifier
* rename list to list_ to avoid problems
* Added new tests for sample_weight, multithreading and multiprocessing
* assert_equal -> assert_array_equal
* Fixed sample_weight existence check and test_sample_weight
* Code is clearer now
* Tests refactoring, 'backend' parameter removed
* Tests indentation fix
* reverted parallel predict and predict_proba to single threaded version
* what's new section added
* minor fixes
* check for sample_weight support in underlying estimators added
* newline at the end of test
First version, looks like it's working :)
Also, I added a sample_weight parameter to the fit method; I don't know if it's appropriate.