[MRG+1] Add a stopping criterion in SGD, based on the score on a validation set #9043
Conversation
Here is a benchmark script to check the effect of acquiring the GIL at each epoch.

On my desktop:

```
# sequential runs with single thread
9.53 s ± 49.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs without GIL access at each epoch
1.88 s ± 45.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs with GIL access at each epoch (verbose > 0)
1.91 s ± 176 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs with GIL access at each epoch (_validation_score)
2.01 s ± 786 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

On a small cluster:

```
# sequential runs with single thread
32.2 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs without GIL access at each epoch
2.47 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs with GIL access at each epoch (verbose > 0)
2.68 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# parallel runs with GIL access at each epoch (_validation_score)
3.66 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

The script:

```python
from IPython import get_ipython
import contextlib
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression
from sklearn.externals.joblib import Parallel, delayed

ipython = get_ipython()

X, y = make_regression(n_samples=10000, n_features=500, n_informative=50,
                       n_targets=1, bias=10, noise=3., random_state=42)
# single-sample validation set: keeps scoring cost negligible, so the
# benchmark measures GIL acquisition overhead rather than scoring time
validation_fraction = 1. / X.shape[0]


@contextlib.contextmanager
def capture():
    # silence the verbose output of the estimators
    import sys
    from io import StringIO
    oldout = sys.stdout
    try:
        sys.stdout = StringIO()
        yield None
    finally:
        sys.stdout = oldout


def one_run(early_stopping, verbose):
    # tol=-inf: the convergence check never triggers, so all
    # max_iter epochs are always run
    est = SGDRegressor(validation_fraction=validation_fraction,
                       early_stopping=early_stopping,
                       max_iter=100, tol=-np.inf, shuffle=False,
                       random_state=0, verbose=verbose)
    est.fit(X, y)


n_jobs = 16


def single_thread():
    print('single_thread')
    for _ in range(n_jobs):
        one_run(False, 0)


def multi_thread(early_stopping, verbose):
    with capture():
        delayed_one_run = delayed(one_run)
        Parallel(n_jobs=n_jobs, backend='threading')(
            delayed_one_run(early_stopping, verbose)
            for _ in range(n_jobs))


ipython.magic("timeit single_thread()")
ipython.magic("timeit multi_thread(False, 0)")
ipython.magic("timeit multi_thread(False, 1)")
ipython.magic("timeit multi_thread(True, 0)")
```
Related: #9456
I say 👎 for 0.19
I'm operating under the assumption that any new features (except perhaps deprecation) are 👎 for 0.19. We should really be getting 0.19 wrapped up.
Indeed, though we still haven't released a conda-forge package for the RC. Though I guess we can move forward with the release without that.
Current estimators with early stopping:
In this PR:
Conflicts:
    doc/whats_new/v0.20.rst
    sklearn/linear_model/stochastic_gradient.py
This is nice work.
Is it worth adding or modifying an example to show it in action? I know we have an example for gradient boosting's early stopping.
It might be worth adding these frequent parameters to the Glossary. The myriad definitions and implementations of early stopping may also deserve a separate entry as a term.
doc/modules/sgd.rst (outdated):

```rst
The classes :class:`SGDClassifier` and :class:`SGDRegressor` provide two
criteria to stop the algorithm when a given level of convergence is reached:

* With ``early_stopping=True``, the input data is splitted into a training
```
splitted -> split
doc/whats_new/v0.20.rst (outdated):

```rst
:class:`linear_model.PassiveAggressiveRegressor` and
:class:`linear_model.Perceptron` now expose a ``early_stopping`` and
``validation_fraction`` parameters, to stop optimization monitoring the
score on a validation set. :issue:`9043` by `Tom Dupre la Tour`_.
```
Add another entry for adaptive learning rate, or put it here. I'm not sure if each estimator needs to be listed. You can reference the user guide instead...?
```rst
validation score is not improving by at least tol for
n_iter_no_change consecutive epochs.

.. versionadded:: 0.20
```
How diligent of you to add this :)
sklearn/linear_model/sgd_fast.pyx (outdated):

```cython
@@ -585,6 +627,8 @@ def _plain_sgd(np.ndarray[double, ndim=1, mode='c'] weights,
    cdef double max_change = 0.0
    cdef double max_weight = 0.0

    cdef short * validation_set_ptr = <short *> validation_set.data
```
I think you can just as well use a typed memoryview above...?
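For illustration, a sketch of what that suggestion could look like in Cython (the function name and loop are illustrative, not PR code):

```cython
cimport numpy as np

def _plain_sgd_excerpt(np.ndarray[short, ndim=1, mode='c'] validation_set):
    # A typed memoryview over the mask replaces the raw pointer cast
    # `<short *> validation_set.data` while keeping C-speed indexing.
    cdef short[::1] validation_mask = validation_set
    cdef int i
    for i in range(validation_mask.shape[0]):
        if validation_mask[i]:
            pass  # sample i is held out for validation
```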
```python
X_train, X_val, y_train, y_val = tmp[:4]
idx_train, idx_val, sample_weight_train, sample_weight_val = tmp[4:8]

self._X_val = X_val
```
should we delattr these at the end of fitting?
Done
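For reference, a minimal sketch of such a cleanup (`_X_val` appears in the diff above; `_y_val` and the helper name are assumptions):

```python
def _cleanup_validation_split(self):
    # Drop the cached validation split once fit() is done, so the
    # arrays are neither pickled with the estimator nor kept alive.
    for attr in ('_X_val', '_y_val'):
        if hasattr(self, attr):
            delattr(self, attr)
```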
```python
@@ -282,6 +325,8 @@ def fit_binary(est, i, X, y, alpha, C, learning_rate, max_iter,
    penalty_type = est._get_penalty_type(est.penalty)
    learning_rate_type = est._get_learning_rate_type(learning_rate)

    validation_set = est._train_validation_split(X, y, sample_weight)
```
perhaps `validation_mask`?
```python
clf1 = self.factory(early_stopping=True, random_state=random_state,
                    validation_fraction=validation_fraction,
                    learning_rate='constant', eta0=0.01,
                    tol=None, max_iter=1000, shuffle=shuffle)
```
I don't think it's clear from the documentation what early_stopping=True should do when tol=None
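To make the question concrete, this is the configuration whose behavior seems underspecified (a hypothetical usage snippet):

```python
from sklearn.linear_model import SGDClassifier

# early stopping is enabled, but tol is None: it is unclear whether the
# validation-score check still applies or training runs for max_iter epochs
clf = SGDClassifier(early_stopping=True, tol=None, max_iter=1000)
```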
```python
def test_loss_function_epsilon(self):
    clf = self.factory(epsilon=0.9)
    clf.set_params(epsilon=0.1)
    assert clf.loss_functions['huber'][1] == 0.1

def test_early_stopping(self):
```
Should we be using inheritance to share these?
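For illustration, sharing the tests through a mixin might look like this (class and test names are hypothetical, not the PR's actual layout):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier, SGDRegressor


class EarlyStoppingTestsMixin:
    # Shared early-stopping tests; each concrete class supplies `factory`.
    def test_early_stopping_sets_n_iter(self):
        rng = np.random.RandomState(0)
        X = rng.randn(100, 5)
        y = (X[:, 0] > 0).astype(np.float64)
        est = self.factory(early_stopping=True, validation_fraction=0.2,
                           max_iter=1000, random_state=0)
        est.fit(X, y)
        # sanity check only; the real shared tests would assert behaviour
        assert est.n_iter_ <= 1000


class TestSGDClassifierEarlyStopping(EarlyStoppingTestsMixin):
    factory = staticmethod(SGDClassifier)


class TestSGDRegressorEarlyStopping(EarlyStoppingTestsMixin):
    factory = staticmethod(SGDRegressor)
```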
sklearn/utils/seq_dataset.pxd (outdated):

```cython
@@ -7,6 +7,7 @@ cimport numpy as np

cdef class SequentialDataset:
    cdef int current_index
```
There are three things called index here. I hope they are documented somewhere.
Also, I'm not sure if we should consider this public interface that's problematic to change... Can't we use `index_data_ptr[current_index]` directly?
Right, I simplified the call to simply `index_data_ptr[current_index]`. I also added a bit of documentation there.
if I understand correctly, this currently will run an extra epoch relative to the current early_stopping=False behaviour?
Yes, the default is now
Because this PR fixes the bug of previous_loss vs best_loss, I think there is no need to add a FutureWarning for
The bug of
@ogrisel any other comment?
I still have the feeling that the current default behavior (on master, and with the choice of `n_iter_no_change=1` in this PR) stops too early.

If we consider that this is a bug, we can change the default to `n_iter_no_change=5` directly.

I would be interested in the opinion of others (maybe @jnothman @amueller?).
Other than the decision on FutureWarning and the default value of the patience parameter, LGTM.
doc/glossary.rst (outdated):

```rst
``n_iter_no_change``
    Number of iterations with no improvement to wait before stopping the
    iterative procedure. It is typically used with :term:`early stopping` to
    avoid stopping too early.
```
For googlability, we should mention that this parameter is also named "patience" in other libraries.
```rst
.. versionadded:: 0.20

n_iter_no_change : int, default=1
    Number of iterations with no improvement to wait before early stopping.
```
For googlability, we should mention that this parameter is also named "patience" in other libraries.
We should also recommend setting it to a large enough value, such as 5 or 10, to avoid premature stopping.
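For instance (a hypothetical snippet, not the PR's docstring):

```python
from sklearn.linear_model import SGDClassifier

# A patience of 5 epochs: a single noisy epoch on the validation set
# will not stop training prematurely.
clf = SGDClassifier(early_stopping=True, n_iter_no_change=5,
                    validation_fraction=0.1, max_iter=1000, tol=1e-3)
```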
Optimisation isn't my expertise, but I can offer a software solution: change the default as a bug fix, but issue a ChangedBehaviorWarning if the previous default would have stopped earlier than the new default. Too complicated?
I suppose what I mean is: warn if there is more than 1 iteration with no change but fewer than 5.
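A rough sketch of that check (function and argument names are hypothetical; `no_improvement_counts` would be the per-epoch counts of consecutive non-improving epochs tracked by the training loop):

```python
import warnings

from sklearn.exceptions import ChangedBehaviorWarning


def warn_if_old_default_stopped_earlier(no_improvement_counts):
    # "more than 1 iteration with no change but fewer than 5": the old
    # default (n_iter_no_change=1) would already have stopped, while the
    # new default (n_iter_no_change=5) keeps training.
    if any(1 < c < 5 for c in no_improvement_counts):
        warnings.warn(
            "Training continued past the point where the previous default "
            "n_iter_no_change=1 would have stopped.",
            ChangedBehaviorWarning)
```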
I am afraid that would make the code too complicated. I think I would rather stick with the FutureWarning, which is easier to understand.
Well, I'm not entirely against just changing it as a bug; I just don't have a great idea of how problematic the variation might be...
Here is a test script on a toy dataset:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_digits
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

digits = load_digits()
for seed in range(5):
    print(f"random seed: {seed}")
    for n_iter_no_change in [1, 5]:
        model = make_pipeline(
            MinMaxScaler(),
            SGDClassifier(max_iter=1000, tol=1e-3,
                          n_iter_no_change=n_iter_no_change,
                          random_state=seed)
        )
        X_train, X_test, y_train, y_test = train_test_split(
            digits.data, digits.target, test_size=0.2, random_state=seed)
        model.fit(X_train, y_train)
        test_acc = model.score(X_test, y_test)
        print(f"n_iter_no_change: {n_iter_no_change}, "
              f" n_iter: {model.steps[-1][1].n_iter_},"
              f" test acc: {test_acc:0.3f}")
```

Results:
As you can see
The effect is even stronger on a small dataset such as iris:
Arguably, iris is probably too small for serious machine learning, especially with stochastic solvers, but still.
For completeness I have also tried on a larger dataset (covertype), and while it's expected that a large
Same run on covertype but using early stopping on a 10% validation split:
In this case,
very helpful, Olivier!
@glemaitre, does this still have your +1?
doc/whats_new/v0.20.rst (outdated):

```rst
:class:`linear_model.PassiveAggressiveClassifier`,
:class:`linear_model.PassiveAggressiveRegressor` and
:class:`linear_model.Perceptron`, where the stopping criterion was stopping
the algorithm too early. A parameter `n_iter_no_change` was added and set by
```
Perhaps say "before convergence"
I reported the AppVeyor failure here: #11438. I believe it's unrelated to this PR.
Merged! Thank you very much @TomDLT!
In both cases, the stopping criterion (and the API) is identical to the one in the MLP classes: after each epoch, we compute the validation score or the training loss. The optimization stops if there is no improvement twice in a row (i.e. the patience is hard-coded to 2).

To match the MLP classes' API, I added two parameters:

- `early_stopping` (default False), not well named, which selects whether we monitor the validation score (`early_stopping=True`) or the training loss (`early_stopping=False`);
- `validation_fraction` (default 0.1), which selects the split size between the training set and the validation set.

I also added a new learning rate strategy, `learning_rate='adaptive'`, as found in the MLP classes: the learning rate is kept constant, and is divided by 5 when there is no improvement twice in a row. The optimization stops when the learning rate is too small.
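For illustration, usage of the new parameters might look like this (a sketch against the API described above, not a test from the PR):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(500, 20)
y = (X[:, 0] + 0.1 * rng.randn(500) > 0).astype(int)

# Monitor the score on a held-out 10% validation split and stop when it
# no longer improves.
clf = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                    max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)

# 'adaptive': keep eta0 constant while progress is made, divide it by 5
# when the stopping criterion stalls, and stop once it becomes too small.
clf = SGDClassifier(learning_rate='adaptive', eta0=0.01,
                    max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
```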
TODO: