
Setting search parameters on estimators #5082

Open
jnothman opened this issue Aug 4, 2015 · 10 comments


@jnothman jnothman commented Aug 4, 2015

The underscore notation for specifying grid search parameters is unwieldy, because adding a layer of indirection in the model (e.g. a Pipeline wrapping an estimator you want to search parameters on) means prefixing all corresponding parameters.
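For concreteness, the unwieldiness looks like this with the ordinary scikit-learn API: wrapping estimators in a Pipeline forces every search key to grow a step-name prefix (the step names below are the ones make_pipeline generates automatically).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pipe = make_pipeline(CountVectorizer(), LinearSVC())

# Searching ngram_range directly on CountVectorizer would use the key
# "ngram_range"; inside the pipeline it must be prefixed with the step name:
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "linearsvc__C": [0.1, 1, 10],
}

# Every key must match a parameter exposed by pipe.get_params():
assert set(param_grid) <= set(pipe.get_params())
```

Adding another wrapper layer (say, a FeatureUnion around the vectorizer) would invalidate every one of these keys, which is the indirection problem described above.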

We should be able to specify parameter searches using the estimator instances. The interface proposed by @amueller at #4949 (comment) (and elsewhere) suggests a syntax like:

char_vec = CountVectorizer(analyzer="char").search_params(ngram_range=[(3, 3), (3, 5), (5, 5)])
word_vec = CountVectorizer().search_params(ngram_range=[(1, 1), (1, 2), (2, 2)])
svc = LinearSVC().search_params(C=[0.001, 0.1, 10, 100])
GridSearchCV(make_pipeline(make_union(char_vec, word_vec), svc), cv=..., scoring=...).fit(X, y)

Calling search_params would presumably set an instance attribute on the estimator to record the search information.
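As a minimal sketch of that idea (this is not scikit-learn API; the mixin, attribute name, and semantics here are assumptions for illustration), search_params could record the requested space on the instance and return self for chaining:

```python
# Hypothetical sketch: SearchParamsMixin and _search_params are not real
# scikit-learn names.
class SearchParamsMixin:
    def search_params(self, **distributions):
        # One possible semantics: a later call overwrites previous settings.
        self._search_params = distributions
        return self

class FakeEstimator(SearchParamsMixin):
    """Stand-in for a scikit-learn estimator."""

est = FakeEstimator().search_params(C=[0.001, 0.1, 10, 100])
assert est._search_params == {"C": [0.001, 0.1, 10, 100]}
```

A search meta-estimator would then collect these attributes from the nested estimators instead of taking a param_grid argument.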

Questions of fine semantics that need to be clarified for this approach include:

  1. Does a call to search_params overwrite all previous settings for that estimator?
  2. Does clone maintain the prior search_params?
  3. Should this affect the search space of specialised CV objects (e.g. LassoCV)?

Questions of functionality include:

a) is RandomizedSearchCV supported merely by making one of the search spaces a scipy.stats rv, at the cost of making some searches GridSearchCV-incompatible?
b) is there any way to support multiple grids, as is currently allowed in GridSearchCV?
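For context on (a), this is how the existing RandomizedSearchCV API takes a frozen scipy.stats distribution; the same specification is invalid for GridSearchCV, which is exactly the incompatibility in question:

```python
from scipy.stats import expon
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC

iris = load_iris()

# A frozen scipy.stats distribution is sampled by RandomizedSearchCV,
# whereas GridSearchCV requires a finite list of values.
search = RandomizedSearchCV(LinearSVC(), {"C": expon(scale=10)},
                            n_iter=5, cv=3, random_state=0)
search.fit(iris.data, iris.target)
assert len(search.cv_results_["params"]) == 5
```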

I have proposed an alternative syntax that still avoids problems with underscore notation, and does not have the above issues, but is less user-friendly than the syntax above:

char_vec = CountVectorizer(analyzer="char")
word_vec = CountVectorizer()
svc = LinearSVC()
param_grid = {(char_vec, 'ngram_range'): [(3, 3), (3, 5), (5, 5)],
              (word_vec, 'ngram_range'): [(1, 1), (1, 2), (2, 2)],
              (svc, 'C'): [0.001, 0.1, 10, 100]}
GridSearchCV(make_pipeline(make_union(char_vec, word_vec), svc),
             param_grid,
             cv=..., scoring=...).fit(X, y)

Here, each parameter is specified as an (estimator, parameter name) pair, but the grids are constructed directly and passed to GridSearchCV/RandomizedSearchCV.
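Such (estimator, parameter name) pairs can be resolved to today's dunder keys by searching the root estimator's get_params(deep=True) for the identical sub-estimator object. The helper below (resolve_param_grid is a name invented here, not scikit-learn API) sketches this under the assumption that each searched estimator appears exactly once in the model:

```python
# Hypothetical helper: translate {(estimator, name): values} into the
# existing {"path__to__name": values} form understood by GridSearchCV.
def resolve_param_grid(root, grid):
    params = root.get_params(deep=True)
    resolved = {}
    for (est, name), values in grid.items():
        # Find the path under which this exact instance is exposed.
        prefix = next(path for path, value in params.items() if value is est)
        resolved[prefix + "__" + name] = values
    return resolved

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

char_vec = CountVectorizer(analyzer="char")
word_vec = CountVectorizer()
svc = LinearSVC()
model = make_pipeline(make_union(char_vec, word_vec), svc)

grid = resolve_param_grid(model, {
    (char_vec, "ngram_range"): [(3, 3), (3, 5)],
    (svc, "C"): [0.001, 0.1, 10, 100],
})
assert grid["featureunion__countvectorizer-1__ngram_range"] == [(3, 3), (3, 5)]
assert grid["linearsvc__C"] == [0.001, 0.1, 10, 100]
```

Identity comparison (`is`) distinguishes the two CountVectorizer instances even though they share a class, which is precisely what the underscore notation cannot express without knowing the auto-generated names.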

@amueller amueller added the API label Aug 4, 2015

@amueller amueller commented Aug 4, 2015

Thank you for opening this issue and for the great summary.

Sorry, I misunderstood your approach of using the estimator instance to search in the parameter tree.
It is indeed at least as general as the one I mentioned.

Attaching the parameters to the estimator partially gets rid of the "parallel dictionaries" that you lamented elsewhere.

I took the liberty of numbering my answers to your questions.

  1. I would say yes.
  2. Uh oh. This is tricky. Some might try to nest one GridSearchCV inside another. Do we want to support this?

For the curious, this actually runs:

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in releases before 0.18
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris

iris = load_iris()

grid1 = {"C": [0.01, 0.1, 1, 100]}
grid2 = {"selectkbest__k": [1, 2, 3]}
because_I_can = GridSearchCV(make_pipeline(SelectKBest(), GridSearchCV(LinearSVC(), grid1, verbose=10)),
                             grid2, verbose=10)

because_I_can.fit(iris.data, iris.target)

  3. Do you mean a LassoCV inside a GridSearchCV, or just running LassoCV? Ideally, I'd say both, but the current interface for defining the parameters in LassoCV is not a list :-/

a) I'd say yes. Trying out both is a bit awkward then :-/
b) You could bend the syntax by passing a list of dictionaries? I thought about passing a dict, but as the **kwargs dict is constructed anyhow, it seems a bit unnecessary.


@amueller amueller commented Aug 8, 2015

Not entirely related, but something that would be really cool is to create a dask graph out of the pipeline and the parameters. If only we had an interface that would support this...


@jnothman jnothman commented Aug 15, 2015

I think we should not rule out needing to support nested grid search. Trying to think of cases, I could imagine OvR or similar having a grid search within it so that hyperparams are not tied across labels in a multilabel problem, but potentially a grid search outside of that too...?

I'm not sure how to proceed on this.


@amueller amueller commented Aug 18, 2015

I've been speaking to the auto-sklearn folks. They really want nested grids. I'm not saying we should do that now, but we should keep it in mind.
Oh wait, you meant something else by nested grids: the thing I wrote above. Never mind, this was irrelevant.

@majidaldo majidaldo commented Feb 8, 2017

Isn't this discussion getting into ideas like this: http://optunity.readthedocs.io/en/latest/notebooks/notebooks/sklearn-automated-classification.html, where you have nested search spaces?

search = {'algorithm': {'k-nn': {'n_neighbors': [1, 5]},
                        'SVM': {'kernel': {'linear': {'C': [0, 2]},
                                           'rbf': {'gamma': [0, 1], 'C': [0, 10]},
                                           'poly': {'degree': [2, 5], 'C': [0, 50], 'coef0': [0, 1]}
                                           }
                                },
                        'naive-bayes': None,
                        'random-forest': {'n_estimators': [10, 30],
                                          'max_features': [5, 20]}
                        }
         }
@jnothman jnothman commented Feb 13, 2017

Not really the same issue, no. Not sure what you're saying about such structures either.

@emanjavacas emanjavacas commented Jan 9, 2018

I just realized that RandomizedSearchCV only accepts a dictionary as input to param_distributions, making it impossible to have conditions in the grid (e.g. if kernel is "rbf", also explore the gamma parameter with a given distribution). A similar setup is possible in GridSearchCV by passing a list of dictionaries as the parameter grid. I was wondering whether there is an actual technical reason for the asymmetry.
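For illustration, the GridSearchCV side of that asymmetry looks like this; gamma is searched only in the rbf sub-grid. (Later scikit-learn releases also accept a list of dicts in RandomizedSearchCV.)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()

# A list of dicts makes the grid conditional: gamma is only varied
# when kernel="rbf".
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
]
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(iris.data, iris.target)

# 3 linear settings + 3 * 3 rbf settings = 12 candidates in total.
assert len(search.cv_results_["params"]) == 12
```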

@jnothman jnothman commented Jan 9, 2018

@emanjavacas emanjavacas commented Jan 10, 2018

It would make sense for it to be uniformly sampled, the same as what happens when you pass a list as the value of a parameter key in the RandomizedSearchCV parameters argument.

@jnothman jnothman commented Jan 10, 2018
