warm_start in random forests #3364

Closed
mblondel opened this Issue Jul 11, 2014 · 18 comments

@mblondel
Member

It would be nice to have the warm_start option, as in gradient boosting.

from sklearn.ensemble import RandomForestClassifier

clf = None
scores = []
for n_estimators in (5, 10, 15):
    if clf is None:
        clf = RandomForestClassifier(n_estimators=n_estimators, warm_start=True)
    else:
        clf.n_estimators = n_estimators
    clf.fit(X, y)
    scores.append(clf.score(X, y))

This will of course require storing the random state.

@mjbommar
Contributor

Do we want to match the BaseGradientBoosting behavior exactly? RFs support the n_jobs parameter, which would complicate the random_state management, right?

Talking this through in the case where we have M existing estimators and pass warm_start=True with n_estimators=N.

  • do we warn if M >= N? yes if matching GB, but I can see value in allowing the user to drop trees to support dynamic hyperparameter optimization
  • how do we handle random_state in the case where n_jobs != 1 ? only allow warm_start with n_jobs=1?
  • .ensemble.forest.{_parallel_build_trees, _parallel_predict_proba}: do you see warm_start getting passed through here, or would we manage this outside the helpers? tied to n_jobs too I believe
@mjbommar
Contributor

Another use-case side-note: I can see using this with sample weights as a version of online learning with RFs.

@ldirer
Contributor
ldirer commented Jul 15, 2014

I would like to try and work on this one.
I am a beginner so feel free to correct me where needed.

From my understanding this would imply:

  1. Storing the random state (np.random.RandomState object) as a new attribute of BaseForest.
  2. Adding a _fit_stages method in BaseForest similar to the one in BaseGradientBoosting.
  3. Adding a begin_at_stage parameter to _partition_estimators (defaulting to 0) to allow partitioning only estimators that haven't yet been fit.

I would see this kept out of .ensemble.forest.{_parallel_build_trees, _parallel_predict_proba}.
This means it would not be as fine-grained as for gradient boosting, since we would not get access to information like decision_function or oob_improvement after each individual stage.

@ogrisel
Member
ogrisel commented Jul 15, 2014
  1. random_state is a constructor param, hence the self.random_state attribute should not be mutated in the fit method, to be consistent with project convention. Instead we should still use the local-variable pattern random_state = check_random_state(self.random_state) in fit and use that to build new integer random_state seeds for each tree. However, the seeds generated when warm_start=True should be offset in some way to reflect the fact that self.estimators_ is not None or empty: we want new seeds for the new trees, but we also want a fully reproducible outcome if you redo the same sequence of calls on a new forest instance with a fixed integer random_state seed as constructor param (for the forest itself). For instance this could be done as:
    # somewhere in the fit method
    random_state = check_random_state(self.random_state)
    n_old_estimators = len(self.estimators_)
    if self.warm_start and n_old_estimators > 0:
        warm_seed = random_state.randint(np.iinfo(np.uint64).max, size=n_old_estimators)[-1]
        random_state = np.random.RandomState(warm_seed)

    # ... then use this `random_state` instance to seed the trees as usual.

Edit: I forgot a [-1] in the warm_seed definition.

  2. _fit_stages is not a good name, as trees are trained independently in a forest while they are trained sequentially in gradient boosting. The forest base classes already have the necessary machinery to train trees in parallel.
  3. No need for the begin_at_stage stuff: just append the trees to the existing estimators_ list.
@arjoly
Member
arjoly commented Jul 15, 2014

Adding a _fit_stages method in BaseForest similar to the one in BaseGradientBoosting.
Adding a begin_at_stage parameter to _partition_estimators (defaulting to 0) to allow partitioning only estimators that haven't yet been fit.

At first, I would preserve the interface as is and add only self.n_estimators - len(self.estimators_) trees if warm_start and self.estimators_ is not None; otherwise I would preserve the current behavior.
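For concreteness, here is a rough sketch of how fit could handle this (hypothetical snippet only; the _make_estimator(append=False) helper call is an assumption about BaseEnsemble, not a settled API):

    # Hypothetical warm-start handling in BaseForest.fit (sketch only):
    if not self.warm_start or not hasattr(self, "estimators_"):
        # cold start, or first fit: forget any previously built trees
        self.estimators_ = []

    n_more_estimators = self.n_estimators - len(self.estimators_)
    if n_more_estimators > 0:
        trees = [self._make_estimator(append=False)
                 for _ in range(n_more_estimators)]
        # ... fit the new trees in parallel as usual, then keep the old ones:
        self.estimators_.extend(trees)
    # if n_more_estimators <= 0, simply keep the already fitted trees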

do we warn if M >= N? yes if matching GB, but I can see value in allowing the user to drop trees to support dynamic hyperparameter optimization

Personally, I would keep the 'extra' trees, as ensembles often benefit from a higher number of trees.

how do we handle random_state in the case where n_jobs != 1 ? only allow warm_start with n_jobs=1?

As the code is written, caching the random state should be enough.

.ensemble.forest.{_parallel_build_trees, _parallel_predict_proba}: do you see warm_start getting passed through here, or would we manage this outside the helpers? tied to n_jobs too I believe

I wouldn't add a warm_start parameter to those functions. I think this should be managed in the fit of BaseForest.

My question would be: what do we do if the forest parameters are not equal between two fits with warm_start=True?

@mblondel mblondel added Moderate and removed Easy labels Jul 15, 2014
@mblondel
Member

Following the comments above, I changed the difficulty of this task from easy to moderate.

@ogrisel
Member
ogrisel commented Jul 15, 2014

As the code is written caching the random state should be enough.

We could do that (e.g. by adding a self._random_state attribute), but I find that pickling models with random state instances can lead to surprises. An np.random.RandomState is quite heavy to pickle in practice, though arguably much smaller than a big forest of trees.

@arjoly
Member
arjoly commented Jul 15, 2014

I think that the slow pickling of random_state has been fixed in numpy/numpy#4763

@ogrisel
Member
ogrisel commented Jul 15, 2014

Good to know. Also I did not know that you could seed an np.random.RandomState instance with a list of integers. That means that my random_state local-variable snippet can be simplified to:

    # somewhere in the fit method
    random_state = check_random_state(self.random_state)
    n_old_estimators = len(self.estimators_)
    if self.warm_start and n_old_estimators > 0:
        warm_seed = random_state.randint(np.iinfo(np.uint64).max)
        random_state = np.random.RandomState([warm_seed, n_old_estimators])
@ogrisel
Member
ogrisel commented Jul 15, 2014

With the previous snippet, we should not need to store the random_state instance on the forest object.

@ldirer
Contributor
ldirer commented Jul 17, 2014

The above snippet raises an error with np.uint64; however, in forest.py MAX_INT is defined using np.int32, which does not raise any error.
I could not get the random_state = np.random.RandomState([warm_seed, n_old_estimators]) part to work, but drawing the right number of elements seems to be sufficient:
balancing_draw = random_state.randint(MAX_INT, size=n_old_estimators)
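For what it's worth, a small standalone sketch of that idea (tree_seeds is a hypothetical helper; MAX_INT mirrors the np.int32 bound in forest.py): drawing and discarding n_old_estimators seeds from a freshly seeded generator should give the new trees the same seeds they would have received in a single cold-start fit.

    # Standalone illustration (hypothetical helper, not library code): skipping
    # n_old draws from a freshly seeded RandomState reproduces the tail of a
    # single cold-start draw.
    import numpy as np

    MAX_INT = np.iinfo(np.int32).max  # mirrors the bound used in forest.py

    def tree_seeds(seed, n_old, n_new):
        rs = np.random.RandomState(seed)
        rs.randint(MAX_INT, size=n_old)         # seeds already consumed by existing trees
        return rs.randint(MAX_INT, size=n_new)  # seeds for the newly added trees

    cold = np.random.RandomState(0).randint(MAX_INT, size=15)  # 15 trees in one fit
    warm = tree_seeds(0, n_old=10, n_new=5)                    # 10 trees, then 5 more
    assert (cold[10:] == warm).all()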

I think it is necessary to modify the _partition_estimators method so that it can partition only estimators starting from the n_old_estimators'th, as these are the ones we want to dispatch to the threads.

@ogrisel
Member
ogrisel commented Jul 17, 2014

I was wrong with the np.uint64. It's indeed a 32 bit seed.

@arjoly
Member
arjoly commented Jul 17, 2014

It would be nice to also have this in the bagging module.

@GaelVaroquaux
Member

It would be nice to also have this in the bagging module.

Let's get this guy merged first :)

@arjoly
Member
arjoly commented Jul 17, 2014

sure :-)

@ogrisel
Member
ogrisel commented Jul 17, 2014

It would be nice to also have this in the bagging module.

+1.

@mblondel
Member

It would be nice to also have this in the bagging module.

+1 too. Can you create a separate issue?

@arjoly
Member
arjoly commented Jul 18, 2014

I have opened an issue :-)

@ogrisel ogrisel closed this Jul 25, 2014