[MRG+2] Deprecate n_iter in SGDClassifier and implement max_iter #5036
Conversation
(force-pushed from 8a5f890 to 43d931b)
@@ -700,7 +728,7 @@ def _plain_sgd(np.ndarray[double, ndim=1, mode='c'] weights,

w.reset_wscale()
ogrisel
Aug 17, 2015
Member
Please raise a ConvergenceWarning with an informative message if max_iter == epoch + 1.
TomDLT
Sep 7, 2015
Author
Member
It would raise a warning for each partial_fit (which has max_iter=1). Instead, I suggest raising it in the _fit method.
ogrisel
Sep 8, 2015
Member
You could disable the convergence warning by passing tol=0 as a local variable only when called from partial_fit, while passing tol=self.tol when called from fit.
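A minimal sketch of how that plumbing could look (illustrative only, not the actual scikit-learn source; the class and its method bodies below are assumptions):

import warnings
from sklearn.exceptions import ConvergenceWarning

class SGDLikeEstimator:
    """Toy estimator illustrating the fit/partial_fit tol plumbing."""

    def __init__(self, max_iter=100, tol=1e-3):
        self.max_iter = max_iter
        self.tol = tol

    def partial_fit(self, X, y):
        # a single epoch, with the stopping criterion (and its warning) disabled
        return self._fit(X, y, max_iter=1, tol=0)

    def fit(self, X, y):
        return self._fit(X, y, max_iter=self.max_iter, tol=self.tol)

    def _fit(self, X, y, max_iter, tol):
        # _run_epochs stands in for the Cython _plain_sgd loop
        self.n_iter_ = self._run_epochs(X, y, max_iter, tol)
        if tol > 0 and self.n_iter_ == max_iter:
            warnings.warn("Maximum number of iterations reached without "
                          "convergence; consider increasing max_iter.",
                          ConvergenceWarning)
        return self

    def _run_epochs(self, X, y, max_iter, tol):
        return max_iter  # placeholder: pretend we always exhaust max_iter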
About your remark in #5022, you suggested that we could avoid deprecating n_iter. I am +0 for this convenience feature. @amueller do you have an opinion in this regard?
I'm -0 ;) Do we want a validation set for SGDClassifier in the future? And if so, how do we introduce it?
I thought of it as a deprecation, a temporary way not to break any user code before removing it completely.
Do you mean giving a validation set with a performance goal as a stopping criterion, directly in the SGD solver?
Classic deprecation is what you did in this PR: raise a DeprecationWarning now while still behaving the same if the user explicitly passes n_iter.
Yes, early stopping based on the lack of improvement as measured on a validation set. The validation set can be specified as a number between 0 and 1 (typically 0.1 by default) and the model extracts it internally in the fit method by randomly splitting the user-provided data into train and validation folds. But this is outside the scope of this PR.
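A rough sketch of what such validation-based early stopping could look like (an illustration only, not part of this PR; the helper name and the 0.1 default simply follow the comment above):

import numpy as np
from sklearn.model_selection import train_test_split

def fit_with_validation_stopping(model, X, y, validation_fraction=0.1,
                                 max_epochs=100, tol=1e-3, random_state=0):
    """Run partial_fit epochs, stop when the validation score stops improving."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=validation_fraction, random_state=random_state)
    best_score = -np.inf
    for epoch in range(max_epochs):
        # for classifiers, the first partial_fit call would also need classes=
        model.partial_fit(X_train, y_train)
        score = model.score(X_val, y_val)
        if score < best_score + tol:
            break  # no sufficient improvement on the held-out fold
        best_score = score
    return model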
verbose=0, loss="hinge", n_jobs=1, random_state=None,
warm_start=False, class_weight=None):

def __init__(self, C=1.0, fit_intercept=True, max_iter=5, tol=1e-4,
ogrisel
Sep 8, 2015
Member
Now that we have a good stopping criterion, I think we should set max_iter=100 by default and expect the stopping criterion to kick in before that in 99% of the cases.
amueller
Sep 8, 2015
Member
On 09/08/2015 02:33 AM, Olivier Grisel wrote:
> Now that we have a good stopping criterion, I think we should set
> |max_iter=100| by default and expect the stopping criterion to kick in
> before that in 99% of the cases.

Did someone do experiments on how well that works in practice?
The number of passes over the training data (aka epochs).
max_iter : int, optional
    The maximum number of passes over the training data (aka epochs).
    The maximum number of iterations is set to 1 if using partial_fit.
ogrisel
Sep 8, 2015
Member
I would rather say that this parameter only impacts the behavior of the fit method, not the partial_fit method.
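A small illustration of that docstring point (a sketch assuming a scikit-learn version where max_iter exists, i.e. >= 0.19): partial_fit always performs a single epoch, so the number of weight updates is independent of max_iter.

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X, y = rng.randn(100, 5), rng.randn(100)

for max_iter in (1, 5, 50):
    m = SGDRegressor(max_iter=max_iter, random_state=0)
    m.partial_fit(X, y)      # one epoch, whatever max_iter is
    print(max_iter, m.t_)    # t_ is 101.0 (100 updates + 1) in every case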
def __init__(self, C=1.0, fit_intercept=True, n_iter=5, shuffle=True,
             verbose=0, loss="epsilon_insensitive",
             epsilon=DEFAULT_EPSILON, random_state=None, warm_start=False):
def __init__(self, C=1.0, fit_intercept=True, max_iter=5, tol=1e-4,
ogrisel
Sep 8, 2015
Member
max_iter=100 as well here.
@@ -73,6 +76,13 @@ def __init__(self, loss, penalty='l2', alpha=0.0001, C=1.0,
self.warm_start = warm_start
self.average = average

if n_iter is not None:
    warnings.warn("n_iter parameter is deprecated and will be removed"
ogrisel
Sep 8, 2015
Member
It's better to be very explicit: "n_iter parameter is deprecated in 0.17 and will be removed in 0.19. ..."
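For instance, the resolution of the deprecated parameter could be written like this (a hedged sketch; the helper name is hypothetical, only the message wording follows the suggestion above):

import warnings

def _resolve_n_iter(n_iter, max_iter, tol):
    """Map the deprecated n_iter onto (max_iter, tol)."""
    if n_iter is not None:
        warnings.warn("n_iter parameter is deprecated in 0.17 and will be "
                      "removed in 0.19. Use max_iter and tol instead.",
                      DeprecationWarning)
        return n_iter, 0  # tol=0 reproduces the previous behavior exactly
    return max_iter, tol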
classes, sample_weight, coef_init, intercept_init)

if self.n_iter_ == self.max_iter:
ogrisel
Sep 8, 2015
Member
Please change this test to:
if self.tol > 0 and self.n_iter_ == self.max_iter:
so that the user can disable the ConvergenceWarning when he/she decides to always perform max_iter iterations intentionally (effectively disabling the stopping criterion).
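Users who intentionally exhaust max_iter can therefore silence the warning with tol=0; alternatively it can be filtered explicitly (a usage sketch, assuming a scikit-learn version where SGDClassifier accepts max_iter and tol):

import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, random_state=0)

with warnings.catch_warnings():
    # suppress the warning when running exactly max_iter epochs on purpose
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    clf = SGDClassifier(max_iter=1, tol=1e-3, random_state=0).fit(X, y)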
The handling of partial_fit does not look right to me:

>>> from sklearn.datasets import load_boston
>>> from sklearn.linear_model import SGDRegressor
>>> from sklearn.utils import gen_batches
>>> boston = load_boston()
>>> n_samples, n_features = boston.data.shape
>>> n_samples, n_features
(506, 13)
>>> all_batches = list(gen_batches(n_samples, 100))
>>> m = SGDRegressor(max_iter=2)
>>> for batch in all_batches:
...     m.fit(boston.data[batch], boston.target[batch])
...
>>> m.t_
13.0

In particular, calling:

m = SGDRegressor(max_iter=1, tol=0, shuffle=False, random_state=0)
m.fit(boston.data, boston.target)

should be equivalent (same t_ and coef_) to:

# max_iter should not impact incremental fitting at all
m = SGDRegressor(max_iter=42, shuffle=False, random_state=0)
for batch in all_batches:
    m.partial_fit(boston.data[batch], boston.target[batch])

Furthermore:

m = SGDRegressor(max_iter=10, tol=0, shuffle=False, random_state=0)
m.fit(boston.data, boston.target)

should be equivalent to:

m = SGDRegressor(max_iter=42, shuffle=False, random_state=0)
for i in range(10):
    for batch in all_batches:
        m.partial_fit(boston.data[batch], boston.target[batch])

The fact that the tests do not fail means that this equivalence is not covered by the test suite.
Also, for some reason I don't get the expected convergence warning when I do:

>>> SGDRegressor(max_iter=1, tol=1e-15).fit(boston.data, boston.target)
On the other hand, I would expect the following model to converge earlier (well before max_iter):

>>> SGDRegressor(max_iter=10000, tol=1e-2).fit(boston.data, boston.target).n_iter_
10000
Edit: solved by scaling the data (#5036 (comment))

Thanks for the review. About your last comment, it comes from the fact that SGD converges quite slowly on the boston dataset, as you can see in the following plot (tested on master):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import SGDRegressor

boston = load_boston()
X, y = boston.data, boston.target
n_features = X.shape[1]

iter_range = np.arange(1, 11) * 1000
coefs = np.zeros((n_features, iter_range.size))
for i, n_iter in enumerate(iter_range):
    reg = SGDRegressor(n_iter=n_iter).fit(X, y)
    coefs[:, i] = reg.coef_

for i in range(n_features):
    plt.plot(iter_range, coefs[i, :])
plt.xlabel("n_iter")
plt.ylabel("coefs")
plt.show()

I changed the default
Actually I think the code is OK regarding partial_fit:

>>> all_batches
[slice(0, 100, None),
 slice(100, 200, None),
 slice(200, 300, None),
 slice(300, 400, None),
 slice(400, 500, None),
 slice(500, 506, None)]
>>> for batch in all_batches:
...     m.fit(boston.data[batch], boston.target[batch])
...
>>> m.t_
13.0

You call fit and not partial_fit, so each call refits the model from scratch on the given batch, hence t_ == 13 (two epochs over the last batch of 6 samples, plus one). With partial_fit, the following script:

from sklearn.datasets import load_boston
from sklearn.linear_model import SGDRegressor
from sklearn.utils import gen_batches

boston = load_boston()
n_samples, n_features = boston.data.shape
all_batches = list(gen_batches(n_samples, 100))

for max_iter in range(1, 11):
    # one full pass with fit
    m1 = SGDRegressor(max_iter=max_iter, tol=0, shuffle=False, random_state=0)
    m1.fit(boston.data, boston.target)

    # batches with partial_fit
    m2 = SGDRegressor(max_iter=42, shuffle=False, random_state=0)
    for _ in range(max_iter):
        for batch in all_batches:
            m2.partial_fit(boston.data[batch], boston.target[batch])

    print("%d, %f" % (m1.t_, m1.coef_[1]))
    print("%d, %f" % (m2.t_, m2.coef_[1]))

gives matching values for m1 and m2.

About
(force-pushed from ebd2a42 to 7e30cff)
For the default parameters of SGD to work, the data needs to be scaled.
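For example (a sketch, not part of this PR; it reuses load_boston from the snippets above, which has been removed from recent scikit-learn versions), standardizing the features makes the stopping criterion kick in much earlier:

from sklearn.datasets import load_boston
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

boston = load_boston()
X_scaled = StandardScaler().fit_transform(boston.data)

reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
reg.fit(X_scaled, boston.target)
print(reg.n_iter_)   # typically far below max_iter once the data is scaled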
Tag 0.17, @ogrisel? Not sure if it's ready.
I'm not sure whether we should do this change simultaneously with a default scaling change.
Indeed, a default scaling change would be a separate discussion.
The problem of scaling is rather not linked to this PR. I think this PR is OK, except if I missed your point @ogrisel
Hm, this is breaking behavior, right?
It might be better to warn about a future change.
# When n_iter=None, and at least one of tol and max_iter is specified
assert_no_warnings(init, 100, None, None)
assert_no_warnings(init, None, 1e-3, None)
assert_no_warnings(init, 100, 1e-3, None)
ogrisel
Jun 22, 2017
Member
Please add assertions for the resulting values of clf.max_iter and clf.tol for each of these cases, e.g. something like:

clf = assert_no_warnings(SGDClassifier, max_iter=100, tol=1e-3, n_iter=None)
assert clf.max_iter == 100
assert clf.tol == 1e-3
TomDLT
Jun 22, 2017
Author
Member
cf. test_tol_and_max_iter_default_values?
ogrisel
Jun 23, 2017
Member
Fine.
Good point, the loss accumulator was not reset after each epoch, and was not scaled by
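In plain Python, the per-epoch bookkeeping being discussed might look roughly like this (an illustrative sketch, not the actual sgd_fast.pyx code; the callback and the exact tol comparison are assumptions):

import numpy as np

def run_epochs(update_one_sample, X, y, max_iter, tol):
    """Toy epoch loop: reset the loss accumulator each epoch and compare
    the improvement against tol scaled by the number of samples."""
    n_samples = X.shape[0]
    best_loss = np.inf
    for epoch in range(max_iter):
        sumloss = 0.0                        # reset the accumulator every epoch
        for i in range(n_samples):
            sumloss += update_one_sample(X[i], y[i])
        if tol > 0 and sumloss > best_loss - tol * n_samples:
            return epoch + 1                 # stop: no sufficient improvement
        best_loss = min(best_loss, sumloss)
    return max_iter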
@TomDLT I have pushed two small improvements to my ogrisel/sgd_maxiter branch. Could you please include them in your PR to get CI to run on them? Other than that I think I am +1.
@TomDLT please allow other scikit-learn devs to push into the branches of your next PRs in the future :)
+1 once ogrisel/sgd_maxiter is included in this PR.
Done and all green.
Merged edeb3af into scikit-learn:master
well done! it's been a long haul!
Wohooo!!
Solves #5022

In SGDClassifier, SGDRegressor, Perceptron, PassiveAggressive:

- Deprecate n_iter. Default is now None. If not None, it warns and sets max_iter = n_iter and tol = 0, to have the exact previous behavior.
- Add max_iter and tol. The stopping criterion in sgd_fast._plain_sgd() is identical to the one in the new SAG solver for Ridge and LogisticRegression.
- Add self.n_iter_ after the fit. For multiclass classifiers, we keep the maximum n_iter_ over all binary (OvA) fits.
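A short usage sketch of the resulting API (assuming scikit-learn 0.19, where both the deprecated n_iter and the new max_iter/tol coexist):

import warnings
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Old API: passing n_iter still works but emits a DeprecationWarning,
# and internally maps to max_iter=n_iter, tol=0 (exact previous behavior).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    SGDClassifier(n_iter=10).fit(X, y)
print([str(w.message) for w in caught])

# New API: max_iter + tol, with the number of epochs actually run in n_iter_.
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)
print(clf.n_iter_)   # <= max_iter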