Parallelize the feature selection loop in RFECV #2073

Closed · wants to merge 5 commits

Conversation

djv (Contributor) commented Jun 16, 2013

No description provided.

jnothman and 1 other commented on an outdated diff Jun 16, 2013

sklearn/feature_selection/rfe.py
- for k in range(0, max(ranking_)):
- mask = np.where(ranking_ <= k + 1)[0]
- estimator = clone(self.estimator)
- estimator.fit(X[train][:, mask], y[train])
-
- if self.loss_func is None:
- loss_k = 1.0 - estimator.score(X[test][:, mask], y[test])
- else:
- loss_k = self.loss_func(
- y[test], estimator.predict(X[test][:, mask]))
-
- if self.verbose > 0:
- print("Finished fold with %d / %d feature ranks, loss=%f"
- % (k, max(ranking_), loss_k))
- scores[k] += loss_k
+ scores += Parallel(n_jobs=self.n_jobs)(
jnothman (Owner) Jun 16, 2013

Parallel also has a verbose option, but I'm not sure it's necessary here.

djv (Contributor) Jun 16, 2013

I wasn't sure either, since RFECV has its own verbose printing. I'll pass it through to Parallel; it's nice to be able to see the remaining time.
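
A rough sketch of what wiring this through might look like, based on the truncated scores += Parallel(...) line in the diff above (the exact call is an assumption; _cv is the helper introduced in this PR, and joblib was vendored as sklearn.externals.joblib at the time):

from sklearn.externals.joblib import Parallel, delayed

# Inside RFECV.fit(), for one (train, test) split: forward RFECV's own
# verbose flag to Parallel so joblib reports progress and remaining time.
scores += Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(
    delayed(_cv)(self.estimator, self.loss_func, X, y,
                 train, test, ranking_, self.verbose, k)
    for k in range(max(ranking_)))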

jnothman commented on an outdated diff Jun 16, 2013

sklearn/feature_selection/rfe.py
@@ -245,6 +267,9 @@ class RFECV(RFE, MetaEstimatorMixin):
verbose : int, default=0
Controls verbosity of output.
+ n_jobs : int, optional
+ Number of jobs to run in parallel (default 1).
jnothman (Owner) Jun 16, 2013

Perhaps note that -1 uses all processors.
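
A revised docstring entry along those lines might read (the wording here is a suggestion, not the merged text):

n_jobs : int, optional (default=1)
    Number of jobs to run in parallel during the cross-validation loop.
    If -1, all available processors are used.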

jnothman (Owner) commented Jun 16, 2013

Thanks! Other than those minor comments, LGTM.

jnothman and 1 other commented on an outdated diff Jun 16, 2013

sklearn/feature_selection/rfe.py
@@ -207,6 +208,27 @@ def predict_proba(self, X):
return self.estimator_.predict_proba(self.transform(X))
+def _cv(estimator, loss_func, X, y, train, test, ranking_, verbose, k):
+ """Score a set of features with ranking less than k. Used inside the
jnothman (Owner) Jun 16, 2013

perhaps "Score" should be "Cross-validate"...

djv (Contributor) Jun 16, 2013

The cross-validation happens in the outer loop in RFECV.fit(). This function scores growing subsets of features that were ranked by RFE at the beginning of fit().

jnothman (Owner) Jun 16, 2013

Hmm. I guess we mean different things by cross-validation. In its most basic sense, cross-validation means taking a model trained on some data and evaluating it on other data.

What I want to get across here is that while the rankings were produced on the basis of training data alone, this is an evaluation on the test data.

djv (Contributor) Jun 16, 2013

After looking at the function, I see it does indeed perform cross-validation. Fixed in the next commit.
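
For reference, a self-contained sketch of the _cv helper under discussion, reconstructed from the removed loop body shown in the first diff (the docstring wording is an assumption; the signature matches the diff fragment above):

import numpy as np
from sklearn.base import clone

def _cv(estimator, loss_func, X, y, train, test, ranking_, verbose, k):
    """Cross-validate the subset of features with ranking <= k + 1
    on one train/test split."""
    # Keep only features whose RFE ranking is at most k + 1.
    mask = np.where(ranking_ <= k + 1)[0]
    estimator = clone(estimator)
    estimator.fit(X[train][:, mask], y[train])
    # Evaluate on the held-out split, expressed as a loss (lower is better).
    if loss_func is None:
        loss_k = 1.0 - estimator.score(X[test][:, mask], y[test])
    else:
        loss_k = loss_func(y[test], estimator.predict(X[test][:, mask]))
    if verbose > 0:
        print("Finished fold with %d / %d feature ranks, loss=%f"
              % (k, max(ranking_), loss_k))
    return loss_k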

larsmans (Owner) commented Jun 30, 2013

The code is clean and simple, but is this really worth it? I just tried it on the Friedman dataset:

>>> X, y = make_friedman1(n_samples=1000, n_features=30, random_state=0)
>>> selector = RFECV(estimator, step=1, cv=5, n_jobs=4)
>>> %timeit selector.fit(X, y)
1 loops, best of 3: 25.9 s per loop
>>> selector = RFECV(estimator, step=1, cv=5, n_jobs=1)
>>> %timeit selector.fit(X, y)
1 loops, best of 3: 36.4 s per loop

OK, that's 30% off the running time, but nowhere near the optimal four-fold speedup from my quad-core processor. Did you find better speedups for different problems?
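
For anyone reproducing the timing, a complete version of that snippet might look as follows. The choice of estimator is an assumption (larsmans doesn't say which one was used); a linear-kernel SVR is one plausible choice, since RFE needs an estimator exposing coef_ or feature importances. The n_jobs parameter of RFECV is what this PR adds:

import time
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=1000, n_features=30, random_state=0)
estimator = SVR(kernel="linear")  # assumed; not stated in the comment

for n_jobs in (1, 4):
    selector = RFECV(estimator, step=1, cv=5, n_jobs=n_jobs)
    start = time.time()
    selector.fit(X, y)
    print("n_jobs=%d: %.1f s" % (n_jobs, time.time() - start))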

djv (Contributor) commented Jun 30, 2013

Parallelizing estimator.fit would give a 4× improvement in your case. The problem is that the inner loop depends on the ranking that estimator.fit produced, and I can't really see how to modify it. Any ideas?

larsmans (Owner) commented Jun 30, 2013

Actually, estimator.fit is done 4× in parallel in my example, right?

jnothman (Owner) commented Jun 30, 2013

RFE.fit is what's taking most of the time: it's not parallelised, and it runs estimator.fit up to X.shape[1] / rfecv.step times. The part being parallelised here is a second pass with at most that many runs, and generally fewer features in each run... It can still be slow, but usually not nearly as slow as the first pass.

djv (Contributor) commented Jul 1, 2013

Sorry, I meant rfe.fit in my last comment.

djv (Contributor) commented Jul 1, 2013

I rewrote the code so that the outer cross-validation loop gets parallelized. Testing it with different numbers of jobs gives:

In [2]: selector.n_jobs=1

In [3]: %timeit selector.fit(X, y)
1 loops, best of 3: 21 s per loop

In [4]: selector.n_jobs=2

In [5]: %timeit selector.fit(X, y)
1 loops, best of 3: 14.1 s per loop

In [6]: selector.n_jobs=4

In [7]: %timeit selector.fit(X, y)
1 loops, best of 3: 10.7 s per loop
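
Roughly, the rewrite moves Parallel out to the fold loop, along these lines (a sketch; the _cv signature follows the later diff, which now receives the whole rfe object and redoes the ranking inside each job):

import numpy as np
from sklearn.externals.joblib import Parallel, delayed

# Inside RFECV.fit(): one job per CV fold. Each job fits RFE on the
# fold's training data, then scores every nested feature subset.
fold_losses = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(
    delayed(_cv)(self.estimator, self.loss_func, X, y,
                 train, test, rfe, self.verbose)
    for train, test in cv)
# Each job returns one loss per candidate subset size; sum over folds.
scores = np.sum(fold_losses, axis=0)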

jnothman (Owner) commented Jul 1, 2013

Of course you need to parallelise over the folds; I should have noticed that, as it's how parallel CV is done everywhere else, and it's why the number of fits in the second pass is usually much greater than in the first.

But you're not quite doing it right: you should be able to do each fit in a separate job (or as many as necessary). See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/grid_search.py#L492. With that approach you end up with a flat list of scores in which the results for the folds of the same rank are contiguous, so you can do scores = out.reshape(-1, len(cv)).mean(axis=1).
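
For concreteness, the layout being suggested is the grid_search one: a single flattened Parallel call with one job per (rank, fold) pair (a sketch only; as the next comment points out, it cannot work here, because ranking_ only exists after each fold's RFE pass):

# Hypothetical grid_search-style flattening: one job per (rank, fold)
# pair, so fold results for the same rank are contiguous in out.
out = Parallel(n_jobs=self.n_jobs)(
    delayed(_cv)(self.estimator, self.loss_func, X, y,
                 train, test, ranking_, self.verbose, k)
    for k in range(max(ranking_))
    for train, test in cv)
scores = np.array(out).reshape(-1, len(cv)).mean(axis=1)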

djv (Contributor) commented Jul 1, 2013

The problem is that the inner loop depends on ranking_, which is computed in the outer loop, and there's no way to know it in advance.

amueller commented on the diff Jul 1, 2013

sklearn/feature_selection/rfe.py
@@ -207,6 +208,33 @@ def predict_proba(self, X):
return self.estimator_.predict_proba(self.transform(X))
+def _cv(estimator, loss_func, X, y, train, test, rfe, verbose):
amueller (Owner) Jul 1, 2013

I would maybe rename this to _fit_one or _fit_fold or something like that, so that the relation to fit is obvious.

amueller (Owner) commented Jul 1, 2013

LGTM. I hope we can get rid of this in the foreseeable future or reuse some of the code that @jnothman wrote. For now this seems a good way to parallelize it, though, and I'm +1 for merge.

jnothman (Owner) commented Jul 1, 2013

> The problem is that the inner loop depends on ranking_, which is computed in the outer loop, and there's no way to know it in advance.

Ah right. I see now. And there's no way to separately parallelise the inner loop under joblib.

No, @amueller, I don't think this fits nicely into the CVScorer refactoring.

amueller (Owner) commented Jul 1, 2013

Not yet ;)
My long-term goal is to have efficient grid searches without any nested cross-validation. That is not possible with the current implementation. Granted, it is not possible with your refactoring either, but it is a step in the right direction.

jnothman (Owner) commented Jul 1, 2013

If what you mean is a queue-based asynchronous system, sure...

MechCoder (Owner) commented Apr 28, 2016

Done in #5784

MechCoder closed this Apr 28, 2016
