grid search keeps all estimators in memory #565

Closed
mblondel opened this issue Jan 17, 2012 · 12 comments

Comments
@mblondel
Member

I mentioned this issue on the ML a while back, but since it hasn't been addressed yet, I'm creating this report to track it.

Basically, grid search uses Parallel to compute out, which is a list of triplets (score, estimator, n_test_samples). Unfortunately, some estimators take a lot of space in memory (e.g., SVCs because the support vectors are copied in each estimator). It would be better to return only the best estimator directly and delete the other ones as they are computed.
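
For illustration, a rough sketch of the situation (written against the current scikit-learn module layout, not the 2012 one; this is not the actual GridSearchCV code):

# Rough sketch of the problem described above: `out` keeps one fitted
# estimator per (parameter setting, fold), and each SVC carries its own
# copy of the support vectors.
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

out = []
for C in (0.1, 1.0, 10.0):
    for train, test in KFold(n_splits=3).split(X):
        est = clone(SVC(C=C)).fit(X[train], y[train])
        out.append((est.score(X[test], y[test]), est, len(test)))

# All 9 fitted SVCs stay alive until the search is done; each stores its own
# support_vectors_ array, which is the memory cost this report wants to avoid.
print(len(out), out[0][1].support_vectors_.shape)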

@ogrisel
Member

ogrisel commented Jan 17, 2012

+1 or maybe even having an option keep_estimators which accepts integer values, 'all' or None. When an integer value n is passed, only the top n best estimators are kept in memory and returned to the caller (all others are set to None).
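
A minimal sketch of what that bookkeeping could look like; keep_estimators and keep_top_n are hypothetical names, not existing scikit-learn API:

# Hypothetical helper for the keep_estimators=n idea: retain only the n best
# (score, estimator) pairs, dropping the rest as results stream in.
import heapq
import itertools

def keep_top_n(results, n):
    counter = itertools.count()  # tie-breaker so estimators themselves are never compared
    heap = []                    # min-heap of (score, tie_breaker, estimator)
    for score, estimator in results:
        item = (score, next(counter), estimator)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)  # evict the worst of the currently kept n
    return [(score, est) for score, _, est in sorted(heap, reverse=True)]

# toy usage with placeholder "estimators"
print(keep_top_n([(0.80, "svc_a"), (0.95, "svc_b"), (0.91, "svc_c")], 2))
# -> [(0.95, 'svc_b'), (0.91, 'svc_c')]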

@GaelVaroquaux
Member

Ideally, Parallel would be an iterator so that we could consume it on the fly in a map-reduce way. For technical reasons (asynchronous programming), it is hard to do. We could sprint on that in joblib, but I think I would need a good hand, and to free up a couple of days (not easy).

I am wondering if returning the clf_params, rather than the classifier, would be enough? It seems so to me.

Gael
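
For reference, the map-reduce style consumption described here can be sketched as follows; return_as="generator" is an assumption about a recent joblib (1.3+), while the joblib of the time only returned a fully materialized list:

# Sketch of consuming Parallel results on the fly and reducing to the best one.
# Assumes joblib >= 1.3, where Parallel accepts return_as="generator"; older
# versions (including the one discussed here) only return a complete list.
from joblib import Parallel, delayed

def fit_and_score(c):
    # stand-in for fitting one grid point; returns (score, "estimator")
    return 1.0 / (1.0 + c), "model(C=%r)" % c

results = Parallel(n_jobs=2, return_as="generator")(
    delayed(fit_and_score)(c) for c in (0.1, 1.0, 10.0)
)

best = None
for score, model in results:      # consumed lazily, one result at a time
    if best is None or score > best[0]:
        best = (score, model)
print(best)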

@mblondel
Member Author

I think we can fix the problem by changing the order of the loops:

for param in params:
    for train, test in cv:

instead of

for train, test in cv:
    for param in params:

This way, we can compute the score expectation with respect to a given parameter setting and throw away the estimator right away if the score expectation is worse than the best so far. We need to be able to restart the cv iterator though.
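
A sketch of that reordering (not the actual implementation; the splits are materialized in a list so the CV can be restarted):

# Sketch of the reordered loops: params outside, folds inside, discarding
# everything but the best estimator as soon as a setting is fully evaluated.
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
splits = list(KFold(n_splits=3).split(X))   # materialized so the CV can be restarted

best_mean, best_estimator, best_params = -float("inf"), None, None
for params in ({"C": c} for c in (0.1, 1.0, 10.0)):
    fold_scores, last_est = [], None
    for train, test in splits:
        last_est = clone(SVC()).set_params(**params).fit(X[train], y[train])
        fold_scores.append(last_est.score(X[test], y[test]))
    mean_score = sum(fold_scores) / len(fold_scores)
    if mean_score > best_mean:
        best_mean, best_estimator, best_params = mean_score, last_est, params
    # estimators for worse parameter settings simply go out of scope here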

@GaelVaroquaux
Member

On Tue, Jan 17, 2012 at 08:44:59AM -0800, Mathieu Blondel wrote:

I think we can fix the problem by changing the order of the loops:

for param in params:
    for train, test in cv:

instead of

for train, test in cv:
    for param in params:

Actually, we are doing neither: the two loops are sent to the Parallel object to improve the parallelism. This helps a lot, as quite often most of the time is spent fitting a few grid points, e.g. SVMs with large C. In addition, I tend to have more CPUs than folds when I do 3-fold (or 5-fold) CV.

Gaël
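
Roughly, the scheme described above: every (parameter setting, fold) pair becomes one task, so up to len(param_grid) * n_folds jobs can run concurrently instead of only n_folds (a sketch using the current module layout, not the exact scikit-learn code):

# Sketch of flattening both loops into one list of parallel tasks.
from itertools import product
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = [{"C": c} for c in (0.1, 1.0, 10.0)]
cv = KFold(n_splits=3)

def fit_one(base, params, X, y, train, test):
    est = clone(base).set_params(**params).fit(X[train], y[train])
    return est.score(X[test], y[test]), params

# one task per (parameter setting, fold): 3 params x 3 folds = 9 jobs in flight
out = Parallel(n_jobs=2)(
    delayed(fit_one)(SVC(), params, X, y, train, test)
    for params, (train, test) in product(param_grid, cv.split(X))
)
print(out)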

@mblondel
Member Author

Returning just clf_params and, at the end, refitting with the best params would indeed work (the option refit=False wouldn't make any sense though).
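
A sketch of that lightweight mode (hypothetical code, not GridSearchCV itself): only (mean score, params) pairs are kept during the search, and one estimator is refitted on the full data at the end; with refit=False the final fit would simply be skipped.

# Sketch of the "params only" mode: keep (mean_score, params) during the search
# and refit one estimator with the best parameters at the end.
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = [{"C": c} for c in (0.1, 1.0, 10.0)]

scores = [(cross_val_score(SVC(**p), X, y, cv=3).mean(), p) for p in param_grid]
best_score, best_params = max(scores, key=lambda t: t[0])

# refit=True behaviour: one final fit with the winning parameters
best_estimator = clone(SVC()).set_params(**best_params).fit(X, y)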

@GaelVaroquaux
Member

Returning just clf_params and, at the end, refitting with the best params would indeed work (the option refit=False wouldn't make any sense though).

I think it would still make sense: if we are interested in the grid_search only for model selection and not for prediction.

I am +1 for such a modification, but I think it should be discussed on the mailing list, because I don't remember if there was a good reason to keep the estimators.

Gael

@ogrisel
Member

ogrisel commented Jan 17, 2012

+1 for the lightweight params only mode.

@amueller
Member

Also +1 for keeping parameters and scores.

Maybe the reason was that previously the score was not saved? @ogrisel fixed that a while ago, AFAIK.

@ogrisel
Member

ogrisel commented Jan 19, 2012

I had an old branch that kept the scores of each CV fold of the params grid, which is interesting for estimating the variance. However, that branch also had other unrelated stuff and I did not maintain it.

@GaelVaroquaux
Member

----- Original message -----

I had an old branch that kept the scores of each CV fold of the params
grid.

I think that that part has been merged. The rest hasn't.

@mfergie

mfergie commented Mar 13, 2012

Apparently (thanks @amueller), keeping the estimators also affects pickling GridSearchCV. See the mailing list post and below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/<ipython-input-7-687becba8c7c> in <module>()
----> 1 cPickle.dump(gs, te)

/opt/epd-7.2/lib/python2.7/copy_reg.pyc in _reduce_ex(self, proto)
    68     else:
    69         if base is self.__class__:
---> 70             raise TypeError, "can't pickle %s objects" % base.__name__
    71         state = base(self)
    72     args = (self.__class__, base, state)

TypeError: can't pickle instancemethod objects

The method causing the problem:

> /opt/epd-7.2/lib/python2.7/copy_reg.py(70)_reduce_ex()
    69         if base is self.__class__:
---> 70             raise TypeError, "can't pickle %s objects" % base.__name__
    71         state = base(self)

ipdb> self
<bound method LinearSVC.predict of LinearSVC(C=1, dual=True,
fit_intercept=True, intercept_scaling=1, loss='l2',
    multi_class=False, penalty='l2', tol=0.0001)>

and some info about the GridSearchCV itself:

print GridSearchCV

GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[ 1.
1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.], k=10),
      estimator=LinearSVC(C=1.0, dual=True, fit_intercept=True,
intercept_scaling=1,
    loss='l2', multi_class=False, penalty='l2', tol=0.0001),
      fit_params={}, iid=True, loss_func=None, n_jobs=4,
      param_grid={'C': [1, 10, 100, 1000, 10000, 100000]},
      pre_dispatch='2*n_jobs', refit=True, score_func=None, verbose=1)

@amueller
Member

Addressed in #770.
