
[MRG+2] parallelized VotingClassifier #5805

Merged
merged 13 commits into from Aug 26, 2016

Conversation

olologin
Contributor

First version, looks like it's working :)
Also, I added a sample_weight parameter to the fit method; I don't know whether that's appropriate.

@olologin
Contributor Author

In case someone wants to test it with nested parallelism: in the code below we call the parallelized VotingClassifier from within cross_val_score.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import datasets
from sklearn.model_selection import cross_val_score

# Load the iris dataset and keep two of its features
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

# Check classification by majority label on the iris dataset
clf1 = LogisticRegression(random_state=123)
clf2 = RandomForestClassifier(random_state=123)
clf3 = GaussianNB()
eclf = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='hard', n_jobs=-1, backend='multiprocessing')
scores = cross_val_score(eclf, X, y, cv=5, scoring='accuracy', n_jobs=-1)

With backend='multiprocessing' it shows the message UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1, so the nested VotingClassifier just falls back to a single process while cross_val_score keeps its n_jobs=-1 workers. With backend='threading' it doesn't show any warning.

In both cases everything works fine and provides correct results.
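As a minimal illustration of the backend difference (plain joblib, not code from this PR): the threading backend nests without triggering the fallback warning that the multiprocessing backend produces.

```python
from joblib import Parallel, delayed


def double(v):
    return 2 * v


def inner(x):
    # inner parallel loop, like the one inside VotingClassifier.fit;
    # with the threading backend it nests without a fallback warning
    return sum(Parallel(n_jobs=2, backend="threading")(
        delayed(double)(v) for v in range(x)))


# outer parallel loop, like the one inside cross_val_score
results = Parallel(n_jobs=2, backend="threading")(
    delayed(inner)(i) for i in range(4))
print(results)  # [0, 0, 2, 6]
```

Swapping the outer backend to "multiprocessing" is what makes joblib emit the nesting warning and force the inner loop to n_jobs=1.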

@giorgiop
Contributor

In both cases everything works fine and provides correct results.

Could you please add tests showing that? Same for sample_weight.

@olologin
Contributor Author

Could you please add tests showing that? Same for the sample_weight

OK, I'll add a couple of tests for multithreading/multiprocessing. I've already run the existing tests from test_voting_classifier with different default threading/processing parameters.


assert_array_equal(eclf1.predict(X), eclf2.predict(X))
assert_array_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))

Member

Two empty lines between functions, please (PEP 8).

@betatim
Copy link
Member

betatim commented Nov 16, 2015

There are a few more PEP 8-related things to fix in the code (`param = 42` should be `param=42`, etc.). You can install the pep8 tool to have it point them out.

@olologin olologin changed the title parallelized VotingClassifier [MRG] parallelized VotingClassifier Jan 8, 2016
The number of jobs to run in parallel for both `fit` and `predict`.
If -1, then the number of jobs is set to the number of cores.

backend: str, {'multiprocessing', 'threading'} (default='multiprocessing')
Member

I don't think we should allow the user to choose this. AFAIK, the threading backend is useful only if the underlying code releases the GIL, and the user is highly unlikely to know in which estimators that is the case.

We should maybe just use the default "multiprocessing" backend, as in cross_val_score and other places.

@MechCoder
Member

@olologin Could you rebase?

@olologin olologin force-pushed the votingclassifier_multithreading branch from 955733c to 4c00284 Compare April 30, 2016 05:49
assert_array_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))


def test_parallel_majority_label_iris():
Member

Is this test not redundant? Isn't the previous test sufficient?

Contributor Author

It tests 'hard' voting stability on the iris dataset (so that the single-threaded and two-threaded versions give the same results), while the previous one tests the 'soft' version.
Anyway, the first test is fast (small artificial dataset), so it shouldn't increase testing time noticeably.

@MechCoder
Member

MechCoder commented Apr 30, 2016

LGTM pending nitpick @olologin

cc: @agramfort | @TomDLT for a quick second pass.

@MechCoder MechCoder changed the title [MRG] parallelized VotingClassifier [MRG+1] parallelized VotingClassifier Apr 30, 2016
@MechCoder
Member

And sorry for the delays…

@olologin
Contributor Author

olologin commented May 1, 2016

@MechCoder, actually I had almost forgotten about this PR; I thought it was someone else’s :)

Could you take a look at this one too: #6116? I think it's a good PR as well.

@MechCoder
Member

I can try after the semester break, if no one gets to it by then.


self.estimators_ = Parallel(n_jobs=self.n_jobs)(
    delayed(_parallel_fit_estimator)(
        clone(clf),
Member

@jnothman jnothman Jun 21, 2016

A few cosmetic issues here.

  1. Let's avoid this level of nesting.
  2. closing parentheses conventionally appear right after what they're closing, not on a new line.
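A flatter layout along the lines of this comment might look as follows (a sketch with a stand-in for the PR's private `_parallel_fit_estimator` helper, not the merged code):

```python
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def _parallel_fit_estimator(estimator, X, y):
    # stand-in for the PR's helper: fit the (already cloned) estimator
    return estimator.fit(X, y)


X, y = load_iris(return_X_y=True)
estimators = [LogisticRegression(max_iter=200),
              LogisticRegression(C=0.5, max_iter=200)]

# shallow nesting; closing parentheses hug what they close
fitted = Parallel(n_jobs=2)(
    delayed(_parallel_fit_estimator)(clone(clf), X, y)
    for clf in estimators)
print(len(fitted))  # 2
```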

Member

Could you sort this minor issue out?

@jnothman
Member

For many estimators (e.g. linear models), predicting in parallel will degrade performance, particularly with the multiprocessing backend. Perhaps we should be using the threading backend at predict time, for which overhead is much smaller, or remove it altogether. You're welcome to benchmark, but I'm not sure how you'd come up with a realistic set of voters in the ensemble.

@olologin
Contributor Author

olologin commented Jul 31, 2016

@jnothman, yes, you are right about this. I ran some benchmarks with this code:

import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.datasets import load_digits

digits = load_digits()

lr = LogisticRegression(C=1000)
svc = SVC(gamma=.001, C=100, kernel='rbf', probability=True)
rf = RandomForestClassifier(n_estimators=400)
extra_gini = ExtraTreesClassifier(n_estimators=400, criterion='gini')
extra_entropy = ExtraTreesClassifier(n_estimators=400, criterion='entropy')
xgb_model = xgb.XGBClassifier(nthread=2, silent=True, colsample_bytree=.4,
                              learning_rate=0.05, max_depth=4, gamma=0,
                              n_estimators=800)

# Set up the ensemble classifier
eclf = VotingClassifier(estimators=[
    ('lr', lr),
    ('svc', svc),
    ('rf', rf),
    ('extra_gini', extra_gini),
    ('extra_entropy', extra_entropy),
    ('xgb_model', xgb_model)
], voting='soft', n_jobs=4)

eclf.fit(digits.data, digits.target)
%timeit eclf.predict(digits.data)
%timeit eclf.predict_proba(digits.data)

With the multiprocessing backend it takes ~2.1 s, with the threading backend ~1.3 s, and without any parallelization ~1.9 s for predict_proba and predict.

I think I could either revert the parallelization of the predict methods completely (the threading backend is faster, but the difference is not that big) or switch them to the threading backend. What do you think is the best solution?

@jnothman
Member

I don't feel like I have the expertise to judge this, but would lean conservatively (i.e. no parallelism) since we can't know the makeup of the ensemble, and prediction time is small relative to fit time.

@jnothman
Member

jnothman commented Aug 1, 2016

Also, all prediction can be parallelised over samples, so slow predictors can deal with that, I suppose.
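A sketch of that idea (not part of this PR): split the samples into chunks and predict each chunk in a thread, then stitch the results back together.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict chunk-by-chunk in threads, then concatenate the pieces
chunks = np.array_split(X, 4)
parts = Parallel(n_jobs=2, backend="threading")(
    delayed(clf.predict)(chunk) for chunk in chunks)
pred = np.concatenate(parts)

print(np.array_equal(pred, clf.predict(X)))  # True
```

Whether this pays off depends on the estimator: for cheap predictors the chunking overhead dominates, which is the concern raised above.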

@olologin olologin force-pushed the votingclassifier_multithreading branch from 904a172 to 520089c Compare August 1, 2016 02:48
@MechCoder
Member

We can merge after a rebase.

@olologin olologin force-pushed the votingclassifier_multithreading branch from e950b2f to 1c2d6b4 Compare August 24, 2016 06:36
@olologin olologin force-pushed the votingclassifier_multithreading branch from 1c2d6b4 to 86d02f1 Compare August 24, 2016 07:20
@jnothman
Member

Uh, @MechCoder I've not actually given this my +1.

@jnothman
Member

(nor has anyone but you)

assert_array_equal(eclf1.predict(X), eclf2.predict(X))
assert_array_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))

sample_weight_ = np.random.RandomState(123).uniform(size=(len(y),))
Member

I don't get why there's an _ after sample_weight here.

    ('lr', clf1), ('svc', clf3), ('knn', clf4)],
    voting='soft')
msg = "Underlying estimator 'knn' does not support sample weights."
assert_raise_message(ValueError, msg, eclf3.fit, X, y, sample_weight)
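The support check this test exercises can be performed with scikit-learn's `has_fit_parameter` helper, which inspects the signature of an estimator's `fit`. (Shown standalone; whether the PR uses exactly this helper isn't visible in the excerpt.)

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.utils.validation import has_fit_parameter

# SVC.fit accepts sample_weight; KNeighborsClassifier.fit does not
print(has_fit_parameter(SVC(), "sample_weight"))                   # True
print(has_fit_parameter(KNeighborsClassifier(), "sample_weight"))  # False
```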
Member

PEP8 requires newline at end of file

@jnothman
Member

LGTM!

@jnothman jnothman changed the title [MRG+1] parallelized VotingClassifier [MRG+2] parallelized VotingClassifier Aug 26, 2016
@olologin
Contributor Author

@jnothman, thanks. Sorry for having so many style problems in such a small PR :) I work with C++ at my job, so sometimes a different code style kicks in unintentionally.

@jnothman
Member

No big deal. A commit hook running flake8 on changed files can help...
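One possible shape for such a hook, as a `.git/hooks/pre-commit` config sketch (not an official scikit-learn hook): run flake8 only on the Python files staged for the commit.

```shell
#!/bin/sh
# Sketch of a pre-commit hook: lint only the staged .py files.
files=$(git diff --cached --name-only --diff-filter=ACM | grep '\.py$')
if [ -n "$files" ]; then
    flake8 $files || exit 1
fi
```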

@jnothman jnothman merged commit b3e122a into scikit-learn:master Aug 26, 2016
@jnothman
Member

Thanks!

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
…it-learn#5805)

* parallelized VotingClassifier

* rename list to list_ to avoid problems

* Added new tests for sample_weight, multithreading and multiprocessing

* assert_equal -> assert_array_equal

* Fixed sample_weight existence check and test_sample_weight

* Code is clearer now

* Tests refactoring, 'backend' parameter removed

* Tests indentation fix

* reverted parallel predict and predict_proba to single threaded version

* what's new section added

* minor fixes

* check for sample_weight support in underlying estimators added

* newline at the end of test