grid search keeps all estimators in memory #565

Closed
mblondel opened this issue Jan 17, 2012 · 12 comments

Comments
@mblondel
Member

I mentioned this issue on the ML a while back, but since it hasn't been addressed yet, I'm creating this report to track it.

Basically, grid search uses Parallel to compute out, which is a list of triplets (score, estimator, n_test_samples). Unfortunately, some estimators take a lot of space in memory (e.g., SVCs because the support vectors are copied in each estimator). It would be better to return only the best estimator directly and delete the other ones as they are computed.
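
For illustration, a rough sketch of the situation (written against the current scikit-learn module layout, not the 2012 one; this is not the actual GridSearchCV code):

# Rough sketch of the problem described above: `out` keeps one fitted
# estimator per (parameter setting, fold), and each SVC carries its own
# copy of the support vectors.
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

out = []
for C in (0.1, 1.0, 10.0):
    for train, test in KFold(n_splits=3).split(X):
        est = clone(SVC(C=C)).fit(X[train], y[train])
        out.append((est.score(X[test], y[test]), est, len(test)))

# All 9 fitted SVCs stay alive until the search is done; each stores its own
# support_vectors_ array, which is the memory cost this report wants to avoid.
print(len(out), out[0][1].support_vectors_.shape)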

@ogrisel
Member

ogrisel commented Jan 17, 2012

+1 or maybe even having an option keep_estimators which accepts integer values, 'all' or None. When an integer value n is passed, only the top n best estimators are kept in memory and returned to the caller (all others are set to None).
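
A minimal sketch of what that bookkeeping could look like; keep_estimators and keep_top_n are hypothetical names, not existing scikit-learn API:

# Hypothetical helper for the keep_estimators=n idea: retain only the n best
# (score, estimator) pairs, dropping the rest as results stream in.
import heapq
import itertools

def keep_top_n(results, n):
    counter = itertools.count()  # tie-breaker so estimators themselves are never compared
    heap = []                    # min-heap of (score, tie_breaker, estimator)
    for score, estimator in results:
        item = (score, next(counter), estimator)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)  # evict the worst of the currently kept n
    return [(score, est) for score, _, est in sorted(heap, reverse=True)]

# toy usage with placeholder "estimators"
print(keep_top_n([(0.80, "svc_a"), (0.95, "svc_b"), (0.91, "svc_c")], 2))
# -> [(0.95, 'svc_b'), (0.91, 'svc_c')]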

@GaelVaroquaux
Member

Ideally, Parallel would be an iterator so that we could consume it on the fly in a map-reduce way. For technical reasons (asynchronous programming), it is hard to do. We could sprint on that in joblib, but I think I would need a good hand, and to free up a couple of days (not easy).

I am wondering if returning the clf_params, rather than the classifier, would be enough? It seems so to me.

Gael
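
For reference, the map-reduce style consumption described here can be sketched as follows; return_as="generator" is an assumption about a recent joblib (1.3+), while the joblib of the time only returned a fully materialized list:

# Sketch of consuming Parallel results on the fly and reducing to the best one.
# Assumes joblib >= 1.3, where Parallel accepts return_as="generator"; older
# versions (including the one discussed here) only return a complete list.
from joblib import Parallel, delayed

def fit_and_score(c):
    # stand-in for fitting one grid point; returns (score, "estimator")
    return 1.0 / (1.0 + c), "model(C=%r)" % c

results = Parallel(n_jobs=2, return_as="generator")(
    delayed(fit_and_score)(c) for c in (0.1, 1.0, 10.0)
)

best = None
for score, model in results:      # consumed lazily, one result at a time
    if best is None or score > best[0]:
        best = (score, model)
print(best)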

@mblondel
Member Author

I think we can fix the problem by changing the order of the loops:

for param in params:
    for train, test in cv:

instead of

for train, test in cv:
    for param in params:

This way, we can compute the score expectation with respect to a given parameter setting and throw away the estimator right away if the score expectation is worse than the best so far. We need to be able to restart the cv iterator though.
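
A sketch of that reordering (not the actual implementation; the splits are materialized in a list so the CV can be restarted):

# Sketch of the reordered loops: params outside, folds inside, discarding
# everything but the best estimator as soon as a setting is fully evaluated.
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
splits = list(KFold(n_splits=3).split(X))   # materialized so the CV can be restarted

best_mean, best_estimator, best_params = -float("inf"), None, None
for params in ({"C": c} for c in (0.1, 1.0, 10.0)):
    fold_scores, last_est = [], None
    for train, test in splits:
        last_est = clone(SVC()).set_params(**params).fit(X[train], y[train])
        fold_scores.append(last_est.score(X[test], y[test]))
    mean_score = sum(fold_scores) / len(fold_scores)
    if mean_score > best_mean:
        best_mean, best_estimator, best_params = mean_score, last_est, params
    # estimators for worse parameter settings simply go out of scope here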

@GaelVaroquaux
Member

On Tue, Jan 17, 2012 at 08:44:59AM -0800, Mathieu Blondel wrote:

I think we can fix the problem by changing the order of the loops:

for param in params:
    for train, test in cv:

instead of

for train, test in cv:
    for param in params:

Actually, we are doing neither: the two loops are sent to the Parallel object to improve the parallelism. This helps a lot, as quite often most of the time is spent fitting a few grid points, e.g. SVMs with large C. In addition, I tend to have more CPUs than folds when I do 3-fold (or 5-fold) CV.

Gaël
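
Roughly, the scheme described above: every (parameter setting, fold) pair becomes one task, so up to len(param_grid) * n_folds jobs can run concurrently instead of only n_folds (a sketch using the current module layout, not the exact scikit-learn code):

# Sketch of flattening both loops into one list of parallel tasks.
from itertools import product
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = [{"C": c} for c in (0.1, 1.0, 10.0)]
cv = KFold(n_splits=3)

def fit_one(base, params, X, y, train, test):
    est = clone(base).set_params(**params).fit(X[train], y[train])
    return est.score(X[test], y[test]), params

# one task per (parameter setting, fold): 3 params x 3 folds = 9 jobs in flight
out = Parallel(n_jobs=2)(
    delayed(fit_one)(SVC(), params, X, y, train, test)
    for params, (train, test) in product(param_grid, cv.split(X))
)
print(out)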

@mblondel
Member Author

Returning just clf_params and, at the end, refitting with the best params would indeed work (the option refit=False wouldn't make any sense though).
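
A sketch of that lightweight mode (hypothetical code, not GridSearchCV itself): only (mean score, params) pairs are kept during the search, and one estimator is refitted on the full data at the end; with refit=False the final fit would simply be skipped.

# Sketch of the "params only" mode: keep (mean_score, params) during the search
# and refit one estimator with the best parameters at the end.
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = [{"C": c} for c in (0.1, 1.0, 10.0)]

scores = [(cross_val_score(SVC(**p), X, y, cv=3).mean(), p) for p in param_grid]
best_score, best_params = max(scores, key=lambda t: t[0])

# refit=True behaviour: one final fit with the winning parameters
best_estimator = clone(SVC()).set_params(**best_params).fit(X, y)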

@GaelVaroquaux
Member

Returning just clf_params and, at the end, refitting with the best params would indeed work (the option refit=False wouldn't make any sense though).

I think it would still make sense: if we are interested in the grid_search only for model selection and not for prediction.

I am +1 for such a modification, but I think it should be discussed on the mailing list, because I don't remember if there was a good reason to keep the estimators.

Gael

@ogrisel
Member

ogrisel commented Jan 17, 2012

+1 for the lightweight params only mode.

@amueller
Member

Also +1 for keeping parameters and scores.

Maybe the reason was that previously the score was not saved? @ogrisel fixed that a while ago, AFAIK.

@ogrisel
Member

ogrisel commented Jan 19, 2012

I had an old branch that kept the scores of each CV fold of the params grid, which is interesting for estimating the variance. However, that branch also had other unrelated stuff and I did not maintain it.

@GaelVaroquaux
Member

----- Original message -----

I had an old branch that kept the scores of each CV fold of the params
grid.

I think that that part has been merged. The rest hasn't.

@mfergie

mfergie commented Mar 13, 2012

Apparently (thanks @amueller), keeping the estimators also affects pickling GridSearchCV. See the mailing list post and below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/<ipython-input-7-687becba8c7c> in <module>()
----> 1 cPickle.dump(gs, te)

/opt/epd-7.2/lib/python2.7/copy_reg.pyc in _reduce_ex(self, proto)
    68     else:
    69         if base is self.__class__:
---> 70             raise TypeError, "can't pickle %s objects" % base.__name__
    71         state = base(self)
    72     args = (self.__class__, base, state)

TypeError: can't pickle instancemethod objects

The method causing the problem:

> /opt/epd-7.2/lib/python2.7/copy_reg.py(70)_reduce_ex()
    69         if base is self.__class__:
---> 70             raise TypeError, "can't pickle %s objects" % base.__name__
    71         state = base(self)

ipdb> self
<bound method LinearSVC.predict of LinearSVC(C=1, dual=True,
fit_intercept=True, intercept_scaling=1, loss='l2',
    multi_class=False, penalty='l2', tol=0.0001)>

and some info about the GridSearchCV itself:

print GridSearchCV

GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[ 1.
1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.], k=10),
      estimator=LinearSVC(C=1.0, dual=True, fit_intercept=True,
intercept_scaling=1,
    loss='l2', multi_class=False, penalty='l2', tol=0.0001),
      fit_params={}, iid=True, loss_func=None, n_jobs=4,
      param_grid={'C': [1, 10, 100, 1000, 10000, 100000]},
      pre_dispatch='2*n_jobs', refit=True, score_func=None, verbose=1)

@amueller
Member

Addressed in #770.
