
[MRG+2] Early stopping for Gradient Boosting Classifier/Regressor #7071

Merged
merged 69 commits on Aug 9, 2017

Conversation

@raghavrv (Member) commented Jul 24, 2016

Finishing up @vighneshbirodkar's #5689 (Also refer #1036)

Enables early stopping for gradient boosted models via the new parameters n_iter_no_change, validation_fraction, and tol.

(This takes inspiration from our MLPClassifier)

This has been rewritten after IRL discussions with @agramfort and @ogrisel.

@amueller @agramfort @MechCoder @vighneshbirodkar @ogrisel @glouppe @pprett
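The merged parameters (n_iter_no_change, validation_fraction, tol) are available on GradientBoostingClassifier itself; a minimal usage sketch with illustrative values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Stop adding stages once the score on the internal validation split has
# not improved by at least `tol` for `n_iter_no_change` consecutive stages.
gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound; early stopping may fit fewer
    validation_fraction=0.1,  # fraction held out for the stopping check
    n_iter_no_change=5,
    tol=1e-4,
    random_state=0,
)
gb.fit(X, y)
print(gb.n_estimators_)  # number of stages actually fitted
```

With early stopping enabled, `n_estimators` becomes a budget rather than a fixed model size; `n_estimators_` reports how many stages were actually used.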

sklearn/ensemble/gradient_boosting_cv.py
mean = np.mean(scores)
params['n_estimators'] = int(np.mean(n_est_list))
score_tuple = _CVScoreTuple(params, mean, scores)
grid_scores.append(_CVScoreTuple(params, mean, scores))

@raghavrv (Member) Jul 24, 2016

@amueller @jnothman Should we use GridSearchCV like results_ here?

@amueller (Member) Jul 26, 2016

Maybe. We should not use _CVScoreTuple, we should try to get rid of that one. The other EstimatorCV models implement scores_, right? I feel like they should all have a consistent interface.

@raghavrv (Member) Aug 16, 2016

A consistent interface that is like GridSearchCV's results_?

@jnothman (Member) Aug 22, 2016

Yes, I'd like to see a consistent interface like GridSearchCV's results, and always had that in mind. I don't suppose we'll see it in 0.18. Sorry I wasn't fastidious enough to post this as a new issue months ago.

@raghavrv (Member) commented Jul 25, 2016

Also any advice on what kind of tests I should add here?

@raghavrv (Member) commented Jul 25, 2016

And finally, I would like to know:

  • Do we need the monitor callable at fit of GradientBoosting*CV too? Would that be redundant with the stop_rounds param?
  • Should we port stop_rounds to GradientBoostingClassifier/Regressor too?
@amueller (Member) commented Jul 26, 2016

tests: that this does the same as naive grid-search and that it stops early when it should.

doc/modules/classes.rst
@@ -383,6 +383,7 @@ Samples generator
ensemble.ExtraTreesRegressor
ensemble.GradientBoostingClassifier
ensemble.GradientBoostingRegressor
ensemble.GradientBoostingClassifierCV

@amueller (Member) Jul 26, 2016

RegressorCV also

@raghavrv (Member) Aug 5, 2016

Added below

examples/ensemble/plot_gradient_boosting_cv.py
Gradient boosting works by fitting a regression tree at each iteration to
improve its prediction over the training data. The improvement in prediction
succesively reduces as more regression trees are added. The

@amueller (Member) Jul 26, 2016

successively

@amueller (Member) Jul 26, 2016

Also that sentence reads somewhat convoluted.

examples/ensemble/plot_gradient_boosting_cv.py
:class:`~sklearn.ensemble.GradientBoostingClassifierCV` class halts the
training when there is no significant improvement in accuracy on the validation
data during the last few iterations. This example illustrates how the
GradientBoostingClassifierCV class can acheive almost the same accuracy as

@amueller (Member) Jul 26, 2016

ticks, :class:

@amueller (Member) Jul 26, 2016

achieve

examples/ensemble/plot_gradient_boosting_cv.py
training when there is no significant improvement in accuracy on the validation
data during the last few iterations. This example illustrates how the
GradientBoostingClassifierCV class can acheive almost the same accuracy as
compared to the GradientBoostingClassifier class with significantly lesser

@amueller (Member) Jul 26, 2016

the phrasing is odd here. Also, it should achieve the same accuracy, shouldn't it?

@raghavrv (Member) Aug 5, 2016

Since we do it a tad differently than grid search as I explained before, no...

sklearn/ensemble/gradient_boosting_cv.py
class GradientBoostingClassifierCV(BaseEstimator):
""" Gradient Boosting Classification with Cross Validation
This class keeps adding estimators to the underlying GB estimator till

@amueller (Member) Jul 26, 2016

I would rephrase this paragraph to first describe what it does, not how it works. This is efficient cross-validation for GradientBoosting.

sklearn/ensemble/gradient_boosting_cv.py
This class keeps adding estimators to the underlying GB estimator till
adding a new estimators does not improve the score by much. The goal of
this class is to determine the best choice of `n_estimators` for each set

@amueller (Member) Jul 26, 2016

double backticks, right?

sklearn/ensemble/gradient_boosting_cv.py
places does not change for `n_stop_rounds` iterations, the gradient
boosting is halted.
score_precision : int, optional, default=2

@amueller (Member) Jul 26, 2016

Is this a good way to parametrize? It seems a bit odd to me. How does XGBoost do it? Maybe just "if it doesn't improve by epsilon within n_stop_rounds"? The way it is currently done is weirdly tied to the decimal system.

sklearn/ensemble/gradient_boosting_cv.py
deviance (= logistic regression) for classification
with probabilistic outputs. For loss 'exponential' gradient
boosting recovers the AdaBoost algorithm. If a list is given, all
loss functions are evaluated to choose the best one.

@amueller (Member) Jul 26, 2016

I don't understand this. What does that mean?

@raghavrv (Member) Aug 8, 2016

loss is considered like any other hyper-parameter. This is the loss of the underlying GB model.

sklearn/ensemble/gradient_boosting_cv.py
loss functions are evaluated to choose the best one.
learning_rate : float, list of floats, optional (default=0.1)
learning rate shrinks the contribution of each tree by `learning_rate`.

@amueller (Member) Jul 26, 2016

Either capitalize the first word or use learning_rate.

@amueller (Member) Jul 26, 2016

double backticks.

sklearn/ensemble/gradient_boosting_cv.py
The minimum number of samples required to split an internal node:
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` are the minimum

@amueller (Member) Jul 26, 2016

double backticks

sklearn/ensemble/gradient_boosting_cv.py
- If "sqrt", then `max_features=sqrt(n_features)`.
- If "log2", then `max_features=log2(n_features)`.
- If None, then `max_features=n_features`.
Choosing `max_features < n_features` leads to a reduction of variance

@amueller (Member) Jul 26, 2016

Just double check the number of backticks everywhere, and proof-read the docstrings ;)

sklearn/ensemble/gradient_boosting_cv.py
random_state : int, RandomState instance or None,
list of int or RandomState, optional (default=None)
If int, random_state is the seed used by the random number generator;

@amueller (Member) Jul 26, 2016

random_state

sklearn/ensemble/gradient_boosting_cv.py
random_state : int, RandomState instance or None,
list of int or RandomState, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;

@amueller (Member) Jul 26, 2016

...

sklearn/ensemble/gradient_boosting_cv.py
self.presort = presort
def fit(self, X, y):
"""Run fit with all sets of parameters till convergence.

@amueller (Member) Jul 26, 2016

Check the docstring of the other CV estimators. Shouldn't this mention cross-validation?

@amueller (Member) commented Jul 26, 2016

n_iter_combo is not documented. I'd really not implement randomized search here and wait until we have the right API with warm starts to do that with arbitrary search methods.

sklearn/ensemble/gradient_boosting_cv.py
for train, validation in cv.split(X)
for params in param_iter)
n_splits = int(len(out)/len(param_iter))

@amueller (Member) Jul 26, 2016

pep8, also //

sklearn/ensemble/gradient_boosting_cv.py
params : dict
The parameters to instantiate the Gradient Boosting model with
stop_rounds : int

@amueller (Member) Jul 26, 2016

maybe rename that to n_iter_no_change? We have similar parameters in other places, right? What are they called?

@amueller (Member) Jul 26, 2016

there is convergence_iter in affinity propagation, but I'm not sure that's better.

@TomDLT (Member) Aug 22, 2016

at least be consistent with n_iter_no_change in your class GradientBoostingRegressorCV

sklearn/ensemble/gradient_boosting_cv.py
Returns
-------
score: float

@amueller (Member) Jul 26, 2016

whitespace before :

sklearn/ensemble/gradient_boosting_cv.py
scores = np.empty((stop_rounds,))
scores.fill(-np.inf)
X_train = X[train]

@amueller (Member) Jul 26, 2016

wow the github highlighting is really f*ed

@raghavrv (Member) commented Jul 26, 2016

@amueller could you enter captcha before reviewing? :p

@amueller (Member) commented Jul 26, 2016

I remember distinctly that you asked for a review yesterday ;)

@raghavrv (Member) commented Jul 26, 2016

Thanks a tonne for looking into the PR. I'll address them and update you!

sklearn/ensemble/gradient_boosting_cv.py
gb.n_estimators = i
gb.fit(X_train, y_train)
scores = np.roll(scores, shift=-1)

@amueller (Member) Jul 26, 2016

this seems an odd way of doing this. Why not append to a list?

@raghavrv (Member) Jul 26, 2016

We need a way to retain the last 10 scores, hence a shift-register kind of setup.
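The "shift register" described above can be sketched with np.roll, as in the PR's snippet, or more idiomatically with a fixed-length deque (illustrative values; this is not the PR's actual code):

```python
from collections import deque

import numpy as np

# np.roll version: keep the last 10 scores in a fixed-size array.
scores = np.full(10, -np.inf)
for new_score in [0.70, 0.75, 0.80, 0.82]:
    scores = np.roll(scores, shift=-1)  # shift left, dropping the oldest
    scores[-1] = new_score              # newest score goes in the last slot

# Equivalent with a deque, which discards the oldest item automatically.
recent = deque([-np.inf] * 10, maxlen=10)
for new_score in [0.70, 0.75, 0.80, 0.82]:
    recent.append(new_score)

assert list(scores[-4:]) == list(recent)[-4:] == [0.70, 0.75, 0.80, 0.82]
```

A deque avoids copying the whole buffer on every iteration, which is presumably why amueller suggested a list-like structure.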

@raghavrv (Member) commented Jul 26, 2016

> I remember distinctly that you asked for a review yesterday ;)

Indeed. I was just commenting on your speed ;)

@raghavrv (Member) commented Jul 26, 2016

BTW what do you think about the redundancy of the monitor in fit method of GradientBoostingClassifier?

sklearn/ensemble/gradient_boosting_cv.py
scores[-1] = scorer(gb, X_validation, y_validation)
rounded_scores = np.round(scores, score_precision)
if np.all(rounded_scores[-1] <= rounded_scores):

@amueller (Member) Jul 26, 2016

Using all scores seems odd. This is not how we want to implement multiple scores, right? I would stop based just on the first one.

@raghavrv (Member) Jul 29, 2016

But that assumes that all the scores are strictly decreasing or the same, right? Would there be no case where one score drops spuriously?

Say

  0    1    2     3     4    5      6     7     8     9    10
0.91, 0.8, 0.9, 0.89, 0.88, 0.87, 0.86, 0.85, 0.84, 0.83, 0.82

In this case we don't want to stop at 10 because 10 is not better than 1, correct?
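The rule being debated (stop only when the newest rounded score fails to beat every score in the window, as in the PR's `np.all(rounded_scores[-1] <= rounded_scores)` snippet) can be sketched on the sequence above:

```python
import numpy as np

def should_stop(window, precision=2):
    """Stop when the newest (rounded) score is <= every score in the window."""
    rounded = np.round(window, precision)
    return bool(np.all(rounded[-1] <= rounded))

scores = [0.91, 0.8, 0.9, 0.89, 0.88, 0.87, 0.86, 0.85, 0.84, 0.83, 0.82]
# Index 1 (0.8) is below the newest score (0.82), so the criterion does
# NOT fire -- matching raghavrv's point that one spurious dip in the
# window keeps training going.
print(should_stop(scores))  # False
```

This makes the trade-off concrete: requiring the newest score to be worse than all windowed scores is robust to a single spurious dip, but can delay stopping on a long slow decline.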

@amueller (Member) commented Jul 26, 2016

I don't see where warm_start is set ?!

@amueller (Member) commented Jul 26, 2016

Not sure about monitor. I'd not touch it for the moment ;) cc @pprett

@raghavrv (Member) commented Jul 28, 2016

I tried time-profiling GradientBoostingRegressorCV on different datasets with n_jobs=4 and n_jobs=8, comparing the threading and multiprocessing approaches to parallelism, and found multiprocessing to be faster than threading when the dataset is big.

# With the threading

* boston dataset with 506 samples, 13 features run with n_jobs=4
 took 0.26s

* covtype dataset with 581012 samples, 54 features run with n_jobs=4
 took 110.35s

* california_housing dataset with 20640 samples, 8 features run with n_jobs=4
 took 3.30s

* toy dataset with 1000000 samples, 20 features run with n_jobs=4
 took 520.50s

* iris dataset with 150 samples, 4 features run with n_jobs=4
 took 0.15s

* boston dataset with 506 samples, 13 features run with n_jobs=8
 took 0.23s

* covtype dataset with 581012 samples, 54 features run with n_jobs=8
 took 108.83s

* california_housing dataset with 20640 samples, 8 features run with n_jobs=8
 took 3.20s

* toy dataset with 1000000 samples, 20 features run with n_jobs=8
 took 511.35s

* iris dataset with 150 samples, 4 features run with n_jobs=8
 took 0.14s


# Multiprocessing

* boston dataset with 506 samples, 13 features run with n_jobs=4
 took 0.31s

* covtype dataset with 581012 samples, 54 features run with n_jobs=4
 took 108.38s

* california_housing dataset with 20640 samples, 8 features run with n_jobs=4
 took 3.02s

* toy dataset with 1000000 samples, 20 features run with n_jobs=4
 took 445.69s

* iris dataset with 150 samples, 4 features run with n_jobs=4
 took 0.18s

* boston dataset with 506 samples, 13 features run with n_jobs=8
 took 0.31s

* covtype dataset with 581012 samples, 54 features run with n_jobs=8
 took 113.60s

* california_housing dataset with 20640 samples, 8 features run with n_jobs=8
 took 3.11s

* toy dataset with 1000000 samples, 20 features run with n_jobs=8
 took 457.10s

* iris dataset with 150 samples, 4 features run with n_jobs=8
 took 0.18s

Should we retain the multiprocessing approach?

@ogrisel @agramfort @amueller @vighneshbirodkar
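A comparison like the one above can be reproduced with joblib's backend switch; a minimal sketch, with an illustrative dataset and worker count (note "loky" is joblib's current process-based backend, which postdates this PR's use of "multiprocessing"):

```python
import time

from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

def fit_fold(train, test):
    """Fit one CV fold and return its validation accuracy."""
    gb = GradientBoostingClassifier(n_estimators=50, random_state=0)
    return gb.fit(X[train], y[train]).score(X[test], y[test])

for backend in ("threading", "loky"):
    tic = time.perf_counter()
    scores = Parallel(n_jobs=4, backend=backend)(
        delayed(fit_fold)(tr, te) for tr, te in KFold(n_splits=5).split(X))
    print(backend, round(time.perf_counter() - tic, 2), "s")
```

On a toy dataset like iris, process startup dominates and threading usually wins; the crossover in favor of processes shows up only on large data, consistent with the numbers above.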


@raghavrv (Member) commented Jul 28, 2016

I fired up a terminal with htop and found similar memory consumption for both approaches. But I am not sure how to memory-profile this code to get exact quantitative measures.

Should I use @fabianp's memory profiler?

I have no idea what I was looking at at the time, but the recent benchmarks with memory_profiler confirm @ogrisel's remarks...

@vighneshbirodkar (Contributor) commented Jul 28, 2016

@raghavrv I am not following this discussion very closely, but did you find that the values from predict_proba as produced by GridSearchCV and your class were the same?

@raghavrv (Member) commented Jul 29, 2016

Currently no. I think this needs to be fixed.

Currently this PR's approach is along the lines of:

for param in param_combinations:
    for split in all_cv_splits:
        for n_est in range(1, max_iterations + 1):
            ...
        find the best n_est, best score for THIS split.

    find mean n_estimators, mean best scores for THIS param setting.

Grid Search CV's approach of finding the best n_estimators is something like

for param in param_combinations:
    for n_est in range(1, max_iterations + 1):
        for split in all_cv_splits:
            ...
        find the mean cv score for THIS n_est
    find best n_est and its score for THIS param

find the best param, its best n_est and its score

This is the reason why they find different values for n_estimators even after I provisioned disabling early stopping...

I think we should modify this PR to make it work along the lines of Grid Search. The former seems okay if you assume n_estimators to be like a fit param which is 'learnt' but the latter would be more in concordance with the notion that n_estimators is a hyper-param.

Suggestions?
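The inner loop of the first approach relies on warm_start, so each extra stage reuses the stages already fitted; a minimal sketch for one split (names and limits are illustrative, not the PR's code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# warm_start=True keeps already-fitted stages, so raising n_estimators
# by one and refitting only trains the newly added tree.
gb = GradientBoostingClassifier(warm_start=True, random_state=0)
val_scores = []
for n_est in range(1, 51):
    gb.n_estimators = n_est
    gb.fit(X_tr, y_tr)
    val_scores.append(gb.score(X_val, y_val))

best_n = int(np.argmax(val_scores)) + 1  # best n_estimators for THIS split
```

This is why the grid-search-style second approach is awkward here: to grow all splits in lockstep over n_est you must keep one warm-started model alive per split, as vighneshbirodkar notes below.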


@amueller (Member) commented Jul 29, 2016

Hm, what does XGBoost do? A single split?
I prefer the first approach, but the question is how to do the final fit where we don't have a validation set to stop on? Or should we always keep a validation set for early stopping, even in the final fit?
@pprett probably knows about this (and we can try to spam him until he sees it ;)

@amueller (Member) commented Jul 29, 2016

travis errors seem relevant

@vighneshbirodkar (Contributor) commented Jul 29, 2016

@raghavrv In the second approach, to take advantage of warm_start, won't you have to store len(all_cv_splits) GradientBoosting models?
@amueller xgboost uses a validation set given by the user

@jnothman (Member)

Otherwise LGTM

examples/ensemble/plot_gradient_boosting_early_stopping.py
iterative fashion.
Early stopping support in Gradient Boosting enables us to find the least number
of iterations which is sufficient enough to build a model that generalizes well

@jnothman (Member) Jul 24, 2017

drop 'enough'

examples/ensemble/plot_gradient_boosting_early_stopping.py
Early stopping support in Gradient Boosting enables us to find the least number
of iterations which is sufficient enough to build a model that generalizes well
on unseen data.

@jnothman (Member) Jul 24, 2017

on -> to

examples/ensemble/plot_gradient_boosting_early_stopping.py
of iterations which is sufficient enough to build a model that generalizes well
on unseen data.
The concept of early stopping is simple. We specify a `validation_fraction`

@jnothman (Member) Jul 24, 2017

double backticks

examples/ensemble/plot_gradient_boosting_early_stopping.py
model is trained using the training set and evaluated using the validation set.
When each additional stage of regression tree is added, the validation set is
used to score the model. This is continued until the scores of the model in
the last `n_iter_no_change` stages do not improve by atleast `tol`. After that

@jnothman (Member) Jul 24, 2017

double backticks
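The criterion quoted from the example (the scores of the last n_iter_no_change stages must improve by at least tol) can be sketched as follows; this is illustrative, and the merged implementation tracks per-stage scores slightly differently:

```python
def no_improvement(val_scores, n_iter_no_change=5, tol=1e-4):
    """True when none of the last `n_iter_no_change` validation scores
    improves on the best score seen before that window by at least `tol`."""
    if len(val_scores) <= n_iter_no_change:
        return False  # not enough history to judge yet
    best_before = max(val_scores[:-n_iter_no_change])
    return all(s < best_before + tol for s in val_scores[-n_iter_no_change:])

# A plateau triggers stopping; steady improvement does not.
assert no_improvement([0.7, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8], n_iter_no_change=5)
assert not no_improvement([0.7, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85],
                          n_iter_no_change=5)
```

The tol threshold replaces the earlier score_precision/rounding scheme that reviewers found "weirdly tied to the decimal system".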

examples/ensemble/plot_gradient_boosting_early_stopping.py
"stopped early".
The number of stages of the final model is available at the attribute
`n_estimators_`.

@raghavrv (Member) Jul 24, 2017

you meant double back ticks right?

sklearn/ensemble/gradient_boosting.py
# We also provide an early stopping based on the score from
# validation set (X_val, y_val), if n_iter_no_change is set
if self.n_iter_no_change is None:
continue

@jnothman (Member) Jul 24, 2017

continue in the middle of a block is often hard to find. I wonder whether the code would be slightly more readable if you made this is not None and indented the rest. Definitely a cosmetic nitpick, though.

@jnothman (Member) commented Jul 24, 2017

@raghavrv

raghavrv commented Jul 24, 2017

Member

I used clone instead of looping over params. Can you see if it looks more readable that way?
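For readers unfamiliar with the idiom, a hedged sketch of what `clone` replaces (the PR's real code differs; the estimator and parameter values here are arbitrary):

```python
# clone() builds a fresh, unfitted estimator with the same constructor
# parameters, avoiding a manual round-trip through get_params().
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=50, learning_rate=0.05)

# By hand: read the params back and re-pass them to the constructor.
manual_copy = GradientBoostingClassifier(**gbc.get_params())

# With clone(): one call, and it works for any estimator class.
cloned = clone(gbc)

assert cloned.get_params() == manual_copy.get_params()
```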

@jnothman

jnothman commented Jul 24, 2017

@raghavrv

raghavrv commented Jul 26, 2017

Member

@jnothman Do you want to give this a final review / Merge? :)

@jnothman jnothman changed the title from [MRG+1] Early stopping for Gradient Boosting Classifier/Regressor to [MRG+2] Early stopping for Gradient Boosting Classifier/Regressor Jul 26, 2017

@jnothman

jnothman commented Jul 26, 2017

Member

yes, it looks good to me. With the exception of what's new which lists this under 0.19 (I think we need to ask @amueller or @ogrisel whether they consent to including an extra feature), and that the what's new entry has the spurious indentation of a bygone era!

Show outdated Hide outdated doc/whats_new.rst
via ``n_iter_no_change``, ``validation_fraction`` and `tol`. :issue:`7071`
by `Raghav RV`_
- Added :class:`multioutput.ClassifierChain` for multi-label

@jnothman

jnothman Jul 26, 2017

Member

merge error

@raghavrv

raghavrv Jul 27, 2017

Member

No, this is in master. I don't know why it shows in the diff.

Show outdated Hide outdated doc/whats_new.rst
@@ -81,6 +81,13 @@ New features
Classifiers and regressors
- :class:`ensemble.GradientBoostingClassifier` and
:class:`ensemble.GradientBoostingRegressor` now support early stopping
via ``n_iter_no_change``, ``validation_fraction`` and `tol`. :issue:`7071`

@amueller

amueller Jul 26, 2017

Member

why did tol only get single backticks? ;)

@jnothman

jnothman commented Jul 27, 2017

@raghavrv

raghavrv commented Aug 9, 2017

Member

@jnothman Have moved the whatsnew entry into a new section - 0.20

Show outdated Hide outdated examples/ensemble/plot_gradient_boosting_early_stopping.py
This example illustrates how early stopping can be used in the
:class:`sklearn.ensemble.GradientBoostingClassifier` model to achieve
almost the same accuracy as compared to a model built without early stopping
using many fewer estimators. This can save memory and prediction time.

@ogrisel

ogrisel Aug 9, 2017

Member

This can significantly reduce training time, memory usage and prediction latency.

Show outdated Hide outdated sklearn/ensemble/gradient_boosting.py
early stopping. Must be between 0 and 1.
Only used if ``n_iter_no_change`` is set to an integer.
.. versionadded:: 0.19

@ogrisel

ogrisel Aug 9, 2017

Member

This should be changed to 0.20.

Show outdated Hide outdated sklearn/ensemble/gradient_boosting.py
early stopping. Must be between 0 and 1.
Only used if early_stopping is True
.. versionadded:: 0.19

@ogrisel

ogrisel Aug 9, 2017

Member

same comment for this class docstring: 0.20.

@ogrisel

ogrisel approved these changes Aug 9, 2017

LGTM besides small comments.

def test_gradient_boosting_validation_fraction():
    X, y = make_classification(n_samples=1000, random_state=0)
    gbc = GradientBoostingClassifier(n_estimators=100,

@ogrisel

ogrisel Aug 9, 2017

Member

This comment by @jnothman has not been addressed.

def test_gradient_boosting_validation_fraction():
    X, y = make_classification(n_samples=1000, random_state=0)
    gbc = GradientBoostingClassifier(n_estimators=100,

@ogrisel

ogrisel Aug 9, 2017

Member

We still use nosetest in travis and appveyor at this time though.
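The quoted test hunk above is truncated mid-call; a hedged sketch of what such a `validation_fraction` test might exercise (parameter values assumed here, this is not the PR's actual test body):

```python
# Sketch only: fit the same model with two different validation-set
# sizes and check the invariants early stopping guarantees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

gbc_small_val = GradientBoostingClassifier(n_estimators=100,
                                           n_iter_no_change=5,
                                           validation_fraction=0.1,
                                           random_state=42)
gbc_large_val = GradientBoostingClassifier(n_estimators=100,
                                           n_iter_no_change=5,
                                           validation_fraction=0.4,
                                           random_state=42)
gbc_small_val.fit(X, y)
gbc_large_val.fit(X, y)

# n_estimators_ can never exceed n_estimators; the two runs may stop at
# different stages because they score against different validation sets.
print(gbc_small_val.n_estimators_, gbc_large_val.n_estimators_)
```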

@raghavrv

raghavrv commented Aug 9, 2017

Member

[image]

@ogrisel ogrisel merged commit 312b1df into scikit-learn:master Aug 9, 2017

5 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 98.41% of diff hit (target 96.16%)
Details
codecov/project 96.17% (+<.01%) compared to afb3ee7
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ogrisel

ogrisel commented Aug 9, 2017

Member

Merged. Thanks for this nice contribution @raghavrv!

@raghavrv raghavrv deleted the raghavrv:gbcv branch Aug 9, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

nnadeau added a commit to nnadeau/scikit-learn that referenced this pull request Oct 17, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
