
Conversation


@TomDLT TomDLT commented Jun 7, 2017

In both cases, the stopping criterion (and the API) is identical to the one in the MLP classes:
After each epoch, we compute the validation score or the training loss. The optimization stops if there is no improvement twice in a row (i.e. the patience is hard-coded to 2).

To match the MLP classes' API, I added two parameters:

  • early_stopping (default False), perhaps not the best name, which selects whether we monitor the validation score (early_stopping=True) or the training loss (early_stopping=False)
  • validation_fraction (default 0.1), which sets the fraction of the training data held out as the validation set.

I also added a new learning rate strategy, learning_rate='adaptive', as found in the MLP classes:
The learning rate is kept constant, and is divided by 5 when there is no improvement twice in a row. The optimization stops when the learning rate becomes too small.
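
For illustration, here is a minimal usage sketch combining both additions (the dataset and parameter values are arbitrary examples, not part of the PR itself):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Monitor the score on a held-out 10% validation split, and divide the
# learning rate by 5 whenever that score stops improving.
clf = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                    learning_rate='adaptive', eta0=0.01,
                    max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.n_iter_)  # number of epochs actually run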

TODO:


TomDLT commented Jun 7, 2017

Here is a benchmark script to check the effect of acquiring the GIL at each epoch.
The GIL is acquired to compute the prediction score on the validation set when early_stopping=True.

On my desktop, with n_jobs=6:

# sequential runs with single thread
9.53 s ± 49.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs without GIL access at each epoch
1.88 s ± 45.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs with GIL access at each epoch (verbose > 0)
1.91 s ± 176 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs with GIL access at each epoch (_validation_score)
2.01 s ± 786 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

On a small cluster, with n_jobs=16:

# sequential runs with single thread
32.2 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs without GIL access at each epoch
2.47 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs with GIL access at each epoch (verbose > 0)
2.68 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs with GIL access at each epoch (_validation_score)
3.66 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

from IPython import get_ipython
import contextlib

import numpy as np

from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression
from sklearn.externals.joblib import Parallel, delayed

ipython = get_ipython()

X, y = make_regression(n_samples=10000, n_features=500, n_informative=50,
                       n_targets=1, bias=10, noise=3., random_state=42)

# use a single-sample validation set, so that the scoring cost itself is
# negligible and only the per-epoch GIL acquisition is measured
validation_fraction = 1. / X.shape[0]


@contextlib.contextmanager
def capture():
    """Temporarily silence stdout (discards verbose SGD output)."""
    import sys
    from io import StringIO
    oldout = sys.stdout
    try:
        sys.stdout = StringIO()
        yield None
    finally:
        sys.stdout = oldout


# fit a single SGDRegressor for the full max_iter epochs
# (tol=-inf disables the tolerance-based stopping criterion)
def one_run(early_stopping, verbose):
    est = SGDRegressor(validation_fraction=validation_fraction,
                       early_stopping=early_stopping,
                       max_iter=100, tol=-np.inf, shuffle=False,
                       random_state=0, verbose=verbose)
    est.fit(X, y)


n_jobs = 16


def single_thread():
    print('single_thread')
    for _ in range(n_jobs):
        one_run(False, 0)


def multi_thread(early_stopping, verbose):
    with capture():
        delayed_one_run = delayed(one_run)
        Parallel(n_jobs=n_jobs, backend='threading')(
            delayed_one_run(early_stopping, verbose)
            for _ in range(n_jobs))


ipython.magic("timeit single_thread()")
ipython.magic("timeit multi_thread(False, 0)")
ipython.magic("timeit multi_thread(False, 1)")
ipython.magic("timeit multi_thread(True, 0)")

@TomDLT TomDLT force-pushed the sgd_validation branch 3 times, most recently from cfb7bc7 to dfc29b1 Compare June 14, 2017 15:02
@TomDLT TomDLT changed the title [WIP] Add a stopping criterion in SGD, based on the score on a validation set [MRG] Add a stopping criterion in SGD, based on the score on a validation set Jun 14, 2017

TomDLT commented Jun 26, 2017

Now that #5036 is merged, is this planned to be in v0.19? @ogrisel


TomDLT commented Jul 27, 2017

  • Add n_iter_no_change parameter, to match GradientBoosting API

@amueller

related #9456

@amueller

I say 👎 for 0.19


jnothman commented Jul 28, 2017 via email

@amueller

Indeed, though we still haven't released a conda-forge package for the RC. Though I guess we can move forward with the release without that.


TomDLT commented Oct 12, 2017

Current estimators with early stopping:

  • GradientBoosting(validation_fraction=0.1, n_iter_no_change=None, tol=1e-4)

    • n_iter_no_change=None leads to no stopping criterion.
    • n_iter_no_change!=None enables early stopping based on validation score.
  • MLPClassifier(validation_fraction=0.1, n_iter_no_change=10, early_stopping=False, tol=1e-4)

    • early_stopping=True enables early stopping based on validation score.
    • early_stopping=False uses a stopping criterion based on training loss.
    • n_iter_no_change is a parameter since #9457 ([MRG+1] MLPRegressor quits fitting too soon due to self._no_improvement_count). It has to be an integer, or inf to disable all stopping criteria.
    • To disable all stopping criteria and force max_iter, you can also use tol=-inf or tol=inf, depending on the stopping strategy.
  • SGDClassifier(tol=1e-4)

    • tol=None leads to no stopping criterion.
    • tol!=None uses a stopping criterion based on training loss.
    • n_iter_no_change is not a parameter. The equivalent value is hard-coded and equal to 1.

In this PR (a short sketch of the resulting API follows the list):

  • SGDClassifier(validation_fraction=0.1, early_stopping=False, n_iter_no_change=2, tol=1e-4)
    • tol=None leads to no stopping criterion.
    • n_iter_no_change is used for both stopping strategies.
    • early_stopping=True enables early stopping based on validation score.
    • early_stopping=False uses a stopping criterion based on training loss.
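
To make the proposed semantics concrete, a rough sketch (the parameter values are only examples; the final defaults were still under discussion):

from sklearn.linear_model import SGDClassifier

# Stop on training loss: no improvement of more than tol for
# n_iter_no_change consecutive epochs.
clf_loss = SGDClassifier(early_stopping=False, tol=1e-4,
                         n_iter_no_change=2, max_iter=1000)

# Stop on validation score: hold out validation_fraction of the training
# data and monitor its score instead of the training loss.
clf_score = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                          tol=1e-4, n_iter_no_change=2, max_iter=1000)

# No stopping criterion at all: always run max_iter epochs.
clf_none = SGDClassifier(tol=None, max_iter=1000)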

Conflicts:
	doc/whats_new/v0.20.rst
	sklearn/linear_model/stochastic_gradient.py

@jnothman jnothman left a comment


This is nice work.

Is it worth adding or modifying an example to show it in action? I know we have an example for gradient boosting's early stopping.

It might be worth adding these frequent parameters to the Glossary. The myriad definitions and implementations of early stopping may also deserve a separate entry as a term.

The classes :class:`SGDClassifier` and :class:`SGDRegressor` provide two
criteria to stop the algorithm when a given level of convergence is reached:

* With ``early_stopping=True``, the input data is splitted into a training

splitted -> split

:class:`linear_model.PassiveAggressiveRegressor` and
:class:`linear_model.Perceptron` now expose a ``early_stopping`` and
``validation_fraction`` parameters, to stop optimization monitoring the
score on a validation set. :issue:`9043` by `Tom Dupre la Tour`_.

Add another entry for adaptive learning rate, or put it here. I'm not sure if each estimator needs to be listed. You can reference the user guide instead...?

validation score is not improving by at least tol for
n_iter_no_change consecutive epochs.

.. versionadded:: 0.20

How diligent of you to add this :)

@@ -585,6 +627,8 @@ def _plain_sgd(np.ndarray[double, ndim=1, mode='c'] weights,
cdef double max_change = 0.0
cdef double max_weight = 0.0

cdef short * validation_set_ptr = <short *> validation_set.data

I think you can just as well use a typed memoryview above..?

X_train, X_val, y_train, y_val = tmp[:4]
idx_train, idx_val, sample_weight_train, sample_weight_val = tmp[4:8]

self._X_val = X_val

should we delattr these at the end of fitting?


Done

@@ -282,6 +325,8 @@ def fit_binary(est, i, X, y, alpha, C, learning_rate, max_iter,
penalty_type = est._get_penalty_type(est.penalty)
learning_rate_type = est._get_learning_rate_type(learning_rate)

validation_set = est._train_validation_split(X, y, sample_weight)

perhaps validation_mask?

clf1 = self.factory(early_stopping=True, random_state=random_state,
validation_fraction=validation_fraction,
learning_rate='constant', eta0=0.01,
tol=None, max_iter=1000, shuffle=shuffle)

I don't think it's clear from the documentation what early_stopping=True should do when tol=None

def test_loss_function_epsilon(self):
clf = self.factory(epsilon=0.9)
clf.set_params(epsilon=0.1)
assert clf.loss_functions['huber'][1] == 0.1

def test_early_stopping(self):

Should we be using inheritance to share these?

@@ -7,6 +7,7 @@ cimport numpy as np

cdef class SequentialDataset:
cdef int current_index

There are three things called index here. I hope they are documented somewhere.

Also, I'm not sure if we should consider this public interface that's problematic to change... Can't we use index_data_ptr[current_index] directly?


Right, I simplified the call to use index_data_ptr[current_index] directly.
I also added a bit of documentation there.

@jnothman

if I understand correctly, this currently will run an extra epoch relative to the current early_stopping=False behaviour?


TomDLT commented Feb 2, 2018

if I understand correctly, this currently will run an extra epoch relative to the current early_stopping=False behaviour?

Yes, the default is now n_iter_no_change=2, whereas the previous behavior corresponds to n_iter_no_change=1. To avoid breaking users' code, we could set the default to 1 and change it in the future.
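
A user who wants the previous stopping behavior back could do something like the following sketch (assuming the parameter ships as described here):

from sklearn.linear_model import SGDClassifier

# n_iter_no_change=1 reproduces the old criterion: stop after a single epoch
# without sufficient improvement of the training loss.
clf = SGDClassifier(tol=1e-3, n_iter_no_change=1, max_iter=1000)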


ogrisel commented Jun 27, 2018

Because this PR fixes the bug of previous_loss vs best_loss, I think there is no need to add a FutureWarning for n_iter_no_change: we are already changing the estimator's stopping condition by fixing this bug. We just need to document the bug fix on the stopping criterion in the change log.


@jnothman jnothman dismissed their stale review June 27, 2018 23:57

Invalidated by subsequent work


TomDLT commented Jun 28, 2018

The bug of previous_loss vs best_loss is not present in master, since master is equivalent to n_iter_no_change=1, so the best loss is also the previous loss.
This was only a mistake introduced in this PR.

@glemaitre

@ogrisel any other comment?


ogrisel commented Jul 4, 2018

I still have the feeling that the current default behavior (both on master and with the choice of n_iter_no_change=1 in this PR) is a bug: it can very often lead to premature stopping, especially on small datasets.

If we consider that this is a bug, we can change the default to n_iter_no_change=5 instead of issuing a FutureWarning.

I would be interested in the opinion of others (maybe @jnothman @amueller ?).


@ogrisel ogrisel left a comment


Other than the decision on FutureWarning and the default value of the patience parameter, LGTM.

doc/glossary.rst Outdated
``n_iter_no_change``
Number of iterations with no improvement to wait before stopping the
iterative procedure. It is typically used with :term:`early stopping` to
avoid stopping too early.

For googlability, we should mention that this parameter is also named "patience" in other libraries.

.. versionadded:: 0.20

n_iter_no_change : int, default=1
Number of iterations with no improvement to wait before early stopping.

For googlability, we should mention that this parameter is also named "patience" in other libraries.


We should also recommend setting it to a large enough value, such as 5 or 10, to avoid premature stopping.


jnothman commented Jul 4, 2018 via email


jnothman commented Jul 4, 2018 via email


ogrisel commented Jul 4, 2018

I am afraid that would make the code too complicated. I think I would rather stick with the FutureWarning, which is easier to understand.


jnothman commented Jul 4, 2018 via email


ogrisel commented Jul 4, 2018

Here is a test script on a toy dataset:

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_digits
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split


digits = load_digits()

for seed in range(5):
    print(f"random seed: {seed}")
    for n_iter_no_change in [1, 5]:
        model = make_pipeline(
            MinMaxScaler(),
            SGDClassifier(max_iter=1000, tol=1e-3,
                          n_iter_no_change=n_iter_no_change, random_state=seed)
        )
        X_train, X_test, y_train, y_test = train_test_split(
            digits.data, digits.target, test_size=0.2, random_state=seed)
        model.fit(X_train, y_train)
        test_acc = model.score(X_test, y_test)
        print(f"n_iter_no_change: {n_iter_no_change}, "
              f" n_iter: {model.steps[-1][1].n_iter_},"
              f" test acc: {test_acc:0.3f}")

results:

random seed: 0
n_iter_no_change: 1,  n_iter: 11, test acc: 0.922
n_iter_no_change: 5,  n_iter: 46, test acc: 0.958
random seed: 1
n_iter_no_change: 1,  n_iter: 12, test acc: 0.967
n_iter_no_change: 5,  n_iter: 64, test acc: 0.972
random seed: 2
n_iter_no_change: 1,  n_iter: 11, test acc: 0.925
n_iter_no_change: 5,  n_iter: 34, test acc: 0.922
random seed: 3
n_iter_no_change: 1,  n_iter: 12, test acc: 0.900
n_iter_no_change: 5,  n_iter: 48, test acc: 0.950
random seed: 4
n_iter_no_change: 1,  n_iter: 12, test acc: 0.953
n_iter_no_change: 5,  n_iter: 50, test acc: 0.969

As you can see, n_iter_no_change=1 results in detrimental premature stopping most of the time.


ogrisel commented Jul 4, 2018

The effect is even stronger on a small dataset such as iris:

random seed: 0
n_iter_no_change: 1,  n_iter: 3, test acc: 0.600
n_iter_no_change: 5,  n_iter: 19, test acc: 0.767
random seed: 1
n_iter_no_change: 1,  n_iter: 5, test acc: 0.900
n_iter_no_change: 5,  n_iter: 18, test acc: 1.000
random seed: 2
n_iter_no_change: 1,  n_iter: 4, test acc: 0.700
n_iter_no_change: 5,  n_iter: 34, test acc: 0.833
random seed: 3
n_iter_no_change: 1,  n_iter: 5, test acc: 0.700
n_iter_no_change: 5,  n_iter: 22, test acc: 0.933
random seed: 4
n_iter_no_change: 1,  n_iter: 5, test acc: 0.833
n_iter_no_change: 5,  n_iter: 24, test acc: 0.867

Arguably, iris is probably too small for serious machine learning, especially with stochastic solvers, but still.


ogrisel commented Jul 4, 2018

For completeness I have also tried a larger dataset (covertype); while, as expected, a large n_iter_no_change is not necessary in that case, it does not seem to hurt test accuracy:

random seed: 0
n_iter_no_change: 1,  n_iter: 6, test acc: 0.712
n_iter_no_change: 5,  n_iter: 10, test acc: 0.709
random seed: 1
n_iter_no_change: 1,  n_iter: 6, test acc: 0.708
n_iter_no_change: 5,  n_iter: 10, test acc: 0.712
random seed: 2
n_iter_no_change: 1,  n_iter: 6, test acc: 0.708
n_iter_no_change: 5,  n_iter: 10, test acc: 0.711
random seed: 3
n_iter_no_change: 1,  n_iter: 6, test acc: 0.710
n_iter_no_change: 5,  n_iter: 10, test acc: 0.714
random seed: 4
n_iter_no_change: 1,  n_iter: 6, test acc: 0.707
n_iter_no_change: 5,  n_iter: 10, test acc: 0.709


ogrisel commented Jul 4, 2018

Same run on covertype but using early stopping on a 10% validation split:

random seed: 0
n_iter_no_change: 1,  n_iter: 5, test acc: 0.710
n_iter_no_change: 5,  n_iter: 13, test acc: 0.711
random seed: 1
n_iter_no_change: 1,  n_iter: 6, test acc: 0.709
n_iter_no_change: 5,  n_iter: 10, test acc: 0.710
random seed: 2
n_iter_no_change: 1,  n_iter: 5, test acc: 0.710
n_iter_no_change: 5,  n_iter: 19, test acc: 0.711
random seed: 3
n_iter_no_change: 1,  n_iter: 4, test acc: 0.708
n_iter_no_change: 5,  n_iter: 22, test acc: 0.712
random seed: 4
n_iter_no_change: 1,  n_iter: 4, test acc: 0.700
n_iter_no_change: 5,  n_iter: 8, test acc: 0.708

In this case, n_iter_no_change=5 is consistently better than n_iter_no_change=1 despite the size of the dataset.


ogrisel commented Jul 4, 2018

@jnothman @TomDLT I will go offline before appveyor has completed. Feel free to merge when green. Based on the runs I made, I am confident that n_iter_no_change=5 by default is the good/safe choice.


jnothman commented Jul 4, 2018 via email


@jnothman jnothman left a comment


@glemaitre, does this still have your +1?

:class:`linear_model.PassiveAggressiveClassifier`,
:class:`linear_model.PassiveAggressiveRegressor` and
:class:`linear_model.Perceptron`, where the stopping criterion was stopping
the algorithm too early. A parameter `n_iter_no_change` was added and set by

Perhaps say "before convergence"


ogrisel commented Jul 5, 2018

I reported the appveyor failure here: #11438. I believe it's unrelated to this PR.

@ogrisel ogrisel merged commit 0fc7ce6 into scikit-learn:master Jul 5, 2018

ogrisel commented Jul 5, 2018

Merged! Thank you very much @TomDLT!

@ogrisel ogrisel deleted the sgd_validation branch July 5, 2018 14:25