Parallelize the feature selection loop in RFECV #2073

Open
wants to merge 5 commits

4 participants

@djv

No description provided.

sklearn/feature_selection/rfe.py
((4 lines not shown))
-            for k in range(0, max(ranking_)):
-                mask = np.where(ranking_ <= k + 1)[0]
-                estimator = clone(self.estimator)
-                estimator.fit(X[train][:, mask], y[train])
-
-                if self.loss_func is None:
-                    loss_k = 1.0 - estimator.score(X[test][:, mask], y[test])
-                else:
-                    loss_k = self.loss_func(
-                        y[test], estimator.predict(X[test][:, mask]))
-
-                if self.verbose > 0:
-                    print("Finished fold with %d / %d feature ranks, loss=%f"
-                          % (k, max(ranking_), loss_k))
-                scores[k] += loss_k
+        scores += Parallel(n_jobs=self.n_jobs)(
@jnothman Owner

Parallel also has a verbose option, but I'm not sure it's necessary here.

@djv
djv added a note

I wasn't sure either since RFECV has its own verbose printing. I'll pass it through to Parallel; it's nice to be able to see the remaining time.
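
For context, joblib's Parallel reports progress and an estimated remaining time when its verbose argument is positive. A minimal sketch, independent of this PR:

from math import sqrt
from sklearn.externals.joblib import Parallel, delayed

# With verbose > 0, Parallel prints batch progress and an ETA to stderr.
results = Parallel(n_jobs=2, verbose=5)(
    delayed(sqrt)(i) for i in range(100))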

sklearn/feature_selection/rfe.py
@@ -245,6 +267,9 @@ class RFECV(RFE, MetaEstimatorMixin):
    verbose : int, default=0
        Controls verbosity of output.
+    n_jobs : int, optional
+        Number of jobs to run in parallel (default 1).
@jnothman Owner

Perhaps note that -1 uses all processors.

@jnothman
Owner

Thanks! Other than those minor comments, LGTM.

sklearn/feature_selection/rfe.py
@@ -207,6 +208,27 @@ def predict_proba(self, X):
        return self.estimator_.predict_proba(self.transform(X))
+def _cv(estimator, loss_func, X, y, train, test, ranking_, verbose, k):
+    """Score a set of features with ranking less than k. Used inside the
@jnothman Owner

perhaps "Score" should be "Cross-validate"...

@djv
djv added a note

The cross-validation happens in the outer loop in RFECV.fit(). This function scores growing subsets of features that were ranked by RFE at the beginning of fit().
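
To make the growing-subset idea concrete, here is a minimal sketch (the ranking values below are made up):

import numpy as np

# Hypothetical RFE ranking: 1 = best; higher ranks were eliminated earlier.
ranking_ = np.array([2, 1, 3, 1, 2])
for k in range(ranking_.max()):
    mask = np.where(ranking_ <= k + 1)[0]
    print(k + 1, mask)
# 1 [1 3]        -> only the top-ranked features
# 2 [0 1 3 4]    -> plus the rank-2 features
# 3 [0 1 2 3 4]  -> all features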

@jnothman Owner

Hmm. I guess we mean different things by cross-validation. In its finest meaning, cross-validation means taking a model trained on some data and evaluating it on some other data.

What I want to get across here is that while the rankings were produced on the basis of training data alone, this is an evaluation on the test data.

@djv
djv added a note

After looking at the function, it does indeed perform cross-validation. Fixed in the next commit.

@larsmans
Owner

The code is clean and simple, but is this really worth it? I just tried it on the Friedman dataset:

>>> X, y = make_friedman1(n_samples=1000, n_features=30, random_state=0)
>>> selector = RFECV(estimator, step=1, cv=5, n_jobs=4)
>>> %timeit selector.fit(X, y)
1 loops, best of 3: 25.9 s per loop
>>> selector = RFECV(estimator, step=1, cv=5, n_jobs=1)
>>> %timeit selector.fit(X, y)
1 loops, best of 3: 36.4 s per loop

OK, 30% off the running time, but that's nowhere near the four-fold optimal speedup from my quad-core processor. Did you find better speedups for different problems?
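
For reference, a self-contained version of this benchmark might look like the sketch below. The estimator isn't shown in the comment, so SVR with a linear kernel is an assumption, and the n_jobs parameter on RFECV exists only with this patch applied:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=1000, n_features=30, random_state=0)
# The estimator choice is an assumption; RFE needs one that exposes coef_.
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=5, n_jobs=4)
selector.fit(X, y)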

@djv

Parallelizing estimator.fit would give a 4x improvement in your case. The problem is that the inner loop depends on the ranking that estimator.fit produced and I can't really see how to modify it. Any ideas?

@larsmans
Owner

Actually, estimator.fit is done 4× in parallel in my example, right?

@jnothman
Owner

RFE.fit is what's taking most of the time: it's not parallelised and runs estimator.fit up to X.shape[1] / rfecv.step times. The part that's being parallelised here is a second pass with at most that many runs, and generally fewer features in each run... It still could be slow, but not usually nearly as slow as the first pass.
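
As a back-of-envelope check of that claim, assuming the Friedman benchmark above (30 features, step=1, 5 folds, eliminating down to one feature):

n_features, step, n_folds = 30, 1, 5
first_pass = n_folds * ((n_features - 1) // step)  # serial RFE.fit eliminations
second_pass = n_folds * (n_features // step)       # scoring fits, parallelised here
print(first_pass, second_pass)                     # 145 150

The two passes run a similar number of fits, but only the second is spread across jobs, which helps explain the ~30% speedup rather than the ideal 4x.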

@djv
djv commented

Sorry, I meant rfe.fit in my last comment.

@djv
djv commented

I rewrote the code so that the outer cross-validation loop gets parallelized. Testing it with different numbers of jobs gives:

In [2]: selector.n_jobs=1

In [3]: %timeit selector.fit(X, y)
1 loops, best of 3: 21 s per loop

In [4]: selector.n_jobs=2

In [5]: %timeit selector.fit(X, y)
1 loops, best of 3: 14.1 s per loop

In [6]: selector.n_jobs=4

In [7]: %timeit selector.fit(X, y)
1 loops, best of 3: 10.7 s per loop
@jnothman
Owner

Of course you need to parallelise over the folds; I should have noticed that as it's how parallel CV is done everywhere else, and it's why the number of fits in the second pass is usually much greater than in the first.

But you're not quite doing it right: you should be able to do each fit in a separate job (or batched as necessary). See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/grid_search.py#L492. With that approach you end up with a list of scores in which the fold results for each rank are contiguous. So you can do scores = out.reshape(-1, len(cv)).mean(axis=1).
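
A tiny numpy illustration of that reshape, with hypothetical numbers (3 ranks, len(cv) == 2 folds):

import numpy as np

# Flat job output, fold results contiguous for each rank:
out = np.array([0.1, 0.3,   # rank 1: fold 1, fold 2
                0.2, 0.4,   # rank 2: fold 1, fold 2
                0.5, 0.7])  # rank 3: fold 1, fold 2
scores = out.reshape(-1, 2).mean(axis=1)
print(scores)  # [0.2 0.3 0.6]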

@djv
djv commented

The problem is that the inner loop depends on ranking_ which is computed in the outer loop and there's no way to know it in advance.

@amueller amueller commented on the diff
sklearn/feature_selection/rfe.py
@@ -207,6 +208,33 @@ def predict_proba(self, X):
        return self.estimator_.predict_proba(self.transform(X))
+def _cv(estimator, loss_func, X, y, train, test, rfe, verbose):
@amueller Owner
amueller added a note

I would maybe rename this into _fit_one or _fit_fold or something like that, so that the relation to fit is obvious.

@amueller
Owner

LGTM. I hope we can get rid of this in the foreseeable future or reuse some of the code that @jnothman wrote. For now this seems a good way to parallelize it, though, and I'm +1 for merge.

@jnothman
Owner

The problem is that the inner loop depends on ranking_ which is computed in the outer loop and there's no way to know it in advance.

Ah right. I see now. And there's no way to separately parallelise the inner loop under joblib.

No, @amueller, I don't think this fits nicely into the CVScorer refactoring.

@amueller
Owner

Not yet ;)
My long-term goal is to have efficient grid searches without any nested cross-validation. That is not possible with the current implementation. Granted, it is not possible with your refactoring either, but it is a step in the right direction.

@jnothman
Owner

If what you mean is a queue-based asynchronous system, sure...

Commits on Jun 16, 2013
  1. Parallelize the feature selection loop in RFECV.

    Daniel Velkov authored
  2. Pass verbose argument to Parallel.

    Daniel Velkov authored
  3. Fix the weird single quotes.

    Daniel Velkov authored
  4. Improve docstring for _cv()

    Daniel Velkov authored
Commits on Jul 1, 2013
  1. Parallelize the cross-validation loop in RFECV.

    Daniel Velkov authored
Showing with 38 additions and 26 deletions.
  1. +38 −26 sklearn/feature_selection/rfe.py
sklearn/feature_selection/rfe.py
@@ -14,6 +14,7 @@
from ..base import is_classifier
from ..cross_validation import check_cv
from .base import SelectorMixin
+from ..externals.joblib import Parallel, delayed
class RFE(BaseEstimator, MetaEstimatorMixin, SelectorMixin):
@@ -207,6 +208,33 @@ def predict_proba(self, X):
        return self.estimator_.predict_proba(self.transform(X))
+def _cv(estimator, loss_func, X, y, train, test, rfe, verbose):
@amueller Owner
amueller added a note

I would maybe rename this into _fit_one or _fit_fold or something like that, so that the relation to fit is obvious.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
+ """Cross-validate a subset of all features with ranking less than k.
+ Used inside the cross validation loop in RFECV.
+ """
+ # Compute a full ranking of the features
+ ranking_ = rfe.fit(X[train], y[train]).ranking_
+
+ # Score each subset of features
+ loss = np.zeros(max(ranking_))
+ for k in range(0, max(ranking_)):
+ mask = np.where(ranking_ <= k + 1)[0]
+ estimator = clone(estimator)
+ estimator.fit(X[train][:, mask], y[train])
+
+ if loss_func is None:
+ loss[k] = 1.0 - estimator.score(X[test][:, mask], y[test])
+ else:
+ loss[k] = loss_func(
+ y[test], estimator.predict(X[test][:, mask]))
+
+ if verbose > 0:
+ print("Finished fold with %d / %d feature ranks, loss=%f"
+ % (k, max(ranking_), loss_k))
+
+ return loss
+
+
class RFECV(RFE, MetaEstimatorMixin):
"""Feature ranking with recursive feature elimination and cross-validated
selection of the best number of features.
@@ -245,6 +273,9 @@ class RFECV(RFE, MetaEstimatorMixin):
    verbose : int, default=0
        Controls verbosity of output.
+    n_jobs : int, optional
+        Number of jobs to run in parallel (default 1). -1 means 'all CPUs'.
+
    Attributes
    ----------
    `n_features_` : int
@@ -293,13 +324,14 @@ class RFECV(RFE, MetaEstimatorMixin):
Mach. Learn., 46(1-3), 389--422, 2002.
"""
    def __init__(self, estimator, step=1, cv=None, loss_func=None,
-                 estimator_params={}, verbose=0):
+                 estimator_params={}, verbose=0, n_jobs=1):
        self.estimator = estimator
        self.step = step
        self.cv = cv
        self.loss_func = loss_func
        self.estimator_params = estimator_params
        self.verbose = verbose
+        self.n_jobs = n_jobs
    def fit(self, X, y):
        """Fit the RFE model and automatically tune the number of selected
@@ -322,32 +354,12 @@ def fit(self, X, y):
                  verbose=self.verbose - 1)
        cv = check_cv(self.cv, X, y, is_classifier(self.estimator))
-        scores = np.zeros(X.shape[1])
        # Cross-validation
-        n = 0
-
-        for train, test in cv:
-            # Compute a full ranking of the features
-            ranking_ = rfe.fit(X[train], y[train]).ranking_
-            # Score each subset of features
-            for k in range(0, max(ranking_)):
-                mask = np.where(ranking_ <= k + 1)[0]
-                estimator = clone(self.estimator)
-                estimator.fit(X[train][:, mask], y[train])
-
-                if self.loss_func is None:
-                    loss_k = 1.0 - estimator.score(X[test][:, mask], y[test])
-                else:
-                    loss_k = self.loss_func(
-                        y[test], estimator.predict(X[test][:, mask]))
-
-                if self.verbose > 0:
-                    print("Finished fold with %d / %d feature ranks, loss=%f"
-                          % (k, max(ranking_), loss_k))
-                scores[k] += loss_k
-
-            n += 1
+        scores = sum(Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(
+            delayed(_cv)(self.estimator, self.loss_func, X, y,
+                         train, test, rfe, self.verbose)
+            for train, test in cv))
# Pick the best number of features on average
best_score = np.inf
@@ -373,5 +385,5 @@ def fit(self, X, y):
        self.estimator_.set_params(**self.estimator_params)
        self.estimator_.fit(self.transform(X), y)
-        self.cv_scores_ = scores / n
+        self.cv_scores_ = scores / len(cv)
        return self
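
For reference, a minimal usage sketch of the patched RFECV (assuming this PR's 0.13-era API, where loss_func is still accepted and cv_scores_ holds the fold-averaged losses):

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=200, n_features=10, random_state=0)
# n_jobs=-1 means 'all CPUs', as noted in the docstring above.
selector = RFECV(SVR(kernel="linear"), step=1, cv=5, n_jobs=-1)
selector.fit(X, y)
print(selector.n_features_)  # best number of features found
print(selector.cv_scores_)   # mean loss for each feature-subset size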