Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Adding max_fun parameter to MLP for use in lbfgs optimization #9274

Merged
merged 28 commits into from Jul 2, 2019

Conversation

daniel-perry
Copy link
Contributor

@daniel-perry daniel-perry commented Jul 3, 2017

Reference Issue

Fixes #9273
Closes #10724

What does this implement/fix? Explain your changes.

Running l-bfgs to optimize the MLP regressor/classifier is limited to 15000 iterations because that is the default value for the 'maxiter' parameter in l-bfgs, and MLP doesn't allow over-riding it.
To address this, I've set 'maxiter' equal to the 'max_iter' MLP argument

Any other comments?

Ideally you would want to pass in both a 'max_iter' and 'max_fun' argument to MLP, however 'max_fun' doesn't make sense for anything but l-bfgs, so using 'max_iter' to control both seems a reasonable compromise.

Edit: see #9274 (comment) for the motivation for adding max_fun

@jnothman
Copy link
Member

jnothman commented Jul 3, 2017

Ideally you would want to pass in both a 'max_iter' and 'max_fun' argument to MLP, however 'max_fun' doesn't make sense for anything but l-bfgs, so using 'max_iter' to control both seems a reasonable compromise.

You say this, but then you seem to be setting both.

I assume there's no straightforward way to write a non-regression test for this. I think the premise is correct, although more generally, what's the relationship between the number of function evaluations and the number of iterations? Is the number of function evaluations bounded from above by the number of iterations?

@daniel-perry
Copy link
Contributor Author

daniel-perry commented Jul 3, 2017 via email

@jnothman
Copy link
Member

jnothman commented Jul 3, 2017 via email

@daniel-perry
Copy link
Contributor Author

While num fun >= num iter, there is apparently a test to stop if either 'maxiter' or 'maxfun' is reached.
So even if you set 'maxfun' to a very large number, and even though num fun >= num iter, if num iter reaches 'maxiter' the l-bfgs optimization ends (essentially maxfun is meaningless if maxiter is much smaller).

You can see these separate tests in a few lines in the l-bfgs code:
https://github.com/scipy/scipy/blob/v0.14.0/scipy/optimize/lbfgsb.py#L308
https://github.com/scipy/scipy/blob/v0.14.0/scipy/optimize/lbfgsb.py#L317

One way to address this is to set both 'maxiter' and 'maxfun' to the MLP parameter 'max_iters' (what I've done here). Another way is to provide MLP parameters for both.
I chose the first option because it didn't change the interface for MLP, but in retrospect the second is probably a more "correct" solution, since you can imagine someone wanting to set them to separate values (as the l-bfgs interface allows).

But what do you think, is it worth adding another parameter to MLP for this case?

@daniel-perry
Copy link
Contributor Author

I've added an example of this happening to the bug #9273

@ngoix
Copy link
Contributor

ngoix commented Aug 3, 2017

So if I understand correctly, what you've done is equivalent to set maxiter to infinity, as the control is only done through maxfun? Do you know situations where we may want to control the process through the number of iterations, and not the number of calls? Isn't the number of iterations indirectly controled by pgtol

Another option could be to set maxfun to infinity, and maxiter to self.max_iter, but if we can't control the number of calls per iteration then we can't control the time complexity.

@daniel-perry
Copy link
Contributor Author

I think the problem here may be subtle, so I'm going to back up and redefine things in a clear way:

Let bfgs_maxfun and bfgs_maxiter be the parameters passed to the BFGS solver as stopping criteria based on the number of functions calls and number of iterations, respectively.
Let bfgs_numfun and bfgs_numiter be the counters counting the number of function calls and the number of iterations, respectively, in the bfgs solver.
The scipy bfgs solver essentially has this as a termination test:

if bfgs_numfun > bfgs_maxfun  or  bfgs_numiter > bfgs_maxiter:
    return

It's written differently (separate tests) but this is the essence of the test. Note that if either is satisfied it will return.

Now, let mlp_maxiter be the parameter the MLP interface currently exposes to control the number of iterations (it does not currently expose a mlp_maxfun parameter).

The current code (before this patch), does the following when calling the bfgs solver:

bfgs_maxfun = mlp_maxiter
bfgs_maxiter = 10000

This is because the default bfgs_maxiter value is 10000, and MLP does not currently provide a value for it, so it uses the default.

This results in the following test:

if bfgs_numfun > mlp_maxiter  or  bfgs_numiter > 10000:
    return

Hopefully, you can see the problem now. This interface works great as long as for your problem the solver converges in under 10,000 iterations (admittedly this is probably fine 95% of the time).
I've provided a real example where this is not the case (requires > 10,000 iterations).

The patch I've provided changes the original code slightly to do the following:

if bfgs_numfun > mlp_maxiter  or  bfgs_numiter > mlp_maxiter:
    return

So that with this patch, if I provide a value for mlp_maxiter = 50000 the solver will continue until either bfgs_numiter or bfgs_numfun reaches that value.
The current code (before patch), will stop at bfgs_numiter = 10000 (ie it ignores the mlp parameter entirely).

Your comment about this being the same as setting bfgs_maxiter = infinity, I think, is valid, since bfgs_numfun >= bfgs_numiter should be true (each iteration should use at least one function call, if nothing else than to test for convergence).

Alternatively, we can modify the MLP interface to allow for both a "mlp_maxiter" and a mlp_maxfun parameter, with the stipulation that mlp_maxfun is only applicable when solver="lbfgs" (similar to how the beta1 and beta2 parameters are only applicable for solver="adam").

Personally, I think the current patch is sufficient, but if others disagree, it would not take me long to put together an alternative patch with the new parameter.
I just haven't heard any other opinions on that point, possibly because it is a confusing/subtle bug, and probably not at the top of the priority list.

Thanks for any more comments/discussion, hoping we can settle on the right fix for this subtle bug in MLP soon.

@ngoix
Copy link
Contributor

ngoix commented Aug 4, 2017

Thanks for the in-depth explanation. Could you please answer the last two points in my post? ie are you in knowledge of situations where we want to control the maximum number of iterations maxiter (bfgs_maxiter in your post), additionnally to controling the number of calls maxfun (bfgs_numfun in your post) and the tolerance pgtol? If such a situation exists, it could be useful to add both parameters to MLP.

The last point was that, since right now we can only control one of those params (as MLP has only a self.max_iter parameter), why did you chose to control the number of calls maxfun and not the number of iteration maxiter ?
IE : you chose to set both maxfun and maxiter equal to parameter self.max_iter , which is the same as setting maxiter to infinity since maxfun >= maxiter. Is there any reason not doing it in the opposite way, setting maxfun to infinity, and maxiter to self.max_iter ? For instance one valuable reason not to do so would be that the number of calls may explode with a fixed number of iterations?

@daniel-perry
Copy link
Contributor Author

Great, so here are a few answers to your questions:

  • yes pgtol also controls the number of iterations and my caricature of the stopping criteria was incomplete.. each statement should also include something like or max_pgerr < pgtol: In actually using the solver, this is really the termination criteria you want to be used in exiting the function, since this is indicative of convergence. Luckily, the pgtol parameter has been set to the same value as MLP's tol so this should work as expected (except in cases where the number of iterations needs to go beyond 10000, as pointed out already).
  • the choice between setting a limit to maxfun vs maxiter is an interesting tradeoff, and I think you've described the dangers in only limiting the iterations fairly -- it's possible you could be in a situation where at each iteration you would call the function a lot, this is typically true of situations where a finite-difference approximation is being used instead of an explicit gradient function (you can provide a gradient function to the bfgs parameter fprime, otherwise it should default to a finite difference). Since the MLP code doesn't provide fprime we are in this situation, and so the number of function calls can become quite high (BFGS is a second order method, so it's not only trying to approximate the gradient but also the Hessian, etc). Consequently, I think it's better to limit the function calls rather than the iterations here for the reason you provided.
  • one reason for not doing it this way is that we are misleading: the user specifies they want 10000 iterations and BFGS may only run for a fraction of that because bfgs_numfun reaches that number before bfgs_numiter.

Ok, so after answering your questions I now think I should add a mlp_maxfun parameter to the MLP interface, and perhaps have the default behavior for mlp_maxfun be to set it equal to mlp_maxiter (or maybe some multiple of it), so that if the user only sets mlp_maxiter.

Let me know what you think... I'll wait for a reply (or some time for a chance to reply) before I make the changes, but I'm thinking that would probably be the better way to go with this patch.

Thanks for your comments and questions, etc.

@ngoix
Copy link
Contributor

ngoix commented Aug 8, 2017

To summarize, there are two options:

  1. adding a new MLP param only used when the solver is bfgs (max_fun).
  2. having a MPL max_iter parameter controlling the number of calls instead of the number of iterations when the solver is bfgs.

I'd be more for 1. since in 2., the MLP doc would not be exact when the solver is bfgs. Also, there already are specific parameters for sgd and adam solvers.
@jnothman what do you think?

@ngoix
Copy link
Contributor

ngoix commented Aug 8, 2017

Also @daniel-perry, any idea why we don't need to distinguish maxfun from maxiter in the other solvers?

@jnothman
Copy link
Member

jnothman commented Aug 8, 2017 via email

@ngoix
Copy link
Contributor

ngoix commented Aug 9, 2017

@daniel-perry if you think that controlling both maxfun and maxiter is better than controlling only maxfun for lbfgs, then let's go with option 1. !

PS: For adam and sgd solvers, I think that we have nb of calls = batch_size * nb of iteration (L.509), that's why we only need to control max_iter.

@daniel-perry daniel-perry changed the title Setting both maxiter and maxfun in call to lbfgs. [MRG] Adding max_fun parameter to MLP for use in lbfgs optimization Sep 7, 2017
@daniel-perry
Copy link
Contributor Author

Hi, sorry for the delay, had other things come up.

I've just added some commits which add a "max_fun" paramter to the MLP classifier and regressor. As this is my first time committing to sklearn, I walked through the steps on http://scikit-learn.org/stable/developers/contributing.html to make sure everything as in order, but let me know if you notice anything is awry.

Some notes:

  • I changed the way the method reports n_iters_, instead of incrementing whenever the function is called (which would really be the number of function calls, not iterations), it now reports the number of iters that lbfgs reports when it returns.
  • I added a test to make sure the number of iters is capped when max_fun is specified. I added it because the contribution docs indicated adding tests were good, but if it seems superfulous I can remove it no problem. The point of the test is jus to make sure when using lbfgs, max fun is doing something expected.
  • I've indicated in the doc string that 'max_fun' is only applicable to the lbfgs solver.

@daniel-perry
Copy link
Contributor Author

@ngoix I believe the above changes address the issues we discussed. Do I need to change the status of the ticket somehow?

Copy link
Contributor

@ngoix ngoix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to add a test which fails before your fix and passes now.

sklearn/neural_network/multilayer_perceptron.py Outdated Show resolved Hide resolved
sklearn/neural_network/multilayer_perceptron.py Outdated Show resolved Hide resolved
sklearn/neural_network/multilayer_perceptron.py Outdated Show resolved Hide resolved
sklearn/neural_network/tests/test_mlp.py Show resolved Hide resolved
sklearn/neural_network/tests/test_mlp.py Outdated Show resolved Hide resolved
sklearn/neural_network/tests/test_mlp.py Outdated Show resolved Hide resolved
@daniel-perry
Copy link
Contributor Author

Ok, I think the comments address your points now? I can modify further if not.
I increased the default max_fun so default behavior would match previous behavior, as discussed above.

As far as providing a test that failed before and passes now, the only way I can get it to fail is to:

  • include a new dataset I have (uploaded already to the corresponding bug page) in the tests, because I have yet to reproduce this public datasets (they converge too quickly)
  • or maybe artificially set the default max_fun low so that the same thing happens on the datasets already available

The first option requires adding the dataset somewhere so it's available, where is that typically done? Does this change merit that kind of an addition?
The second option is not really what you're asking for because I would use the "new" max_fun parameter to do it, but it would be at least showing what the problem was in a clear way?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with not including a regression test.

sklearn/neural_network/tests/test_mlp.py Outdated Show resolved Hide resolved
sklearn/neural_network/multilayer_perceptron.py Outdated Show resolved Hide resolved
sklearn/neural_network/multilayer_perceptron.py Outdated Show resolved Hide resolved
sklearn/neural_network/tests/test_mlp.py Outdated Show resolved Hide resolved
sklearn/neural_network/tests/test_mlp.py Outdated Show resolved Hide resolved
@daniel-perry
Copy link
Contributor Author

Apologies again for the slow turnaround, life got busy.

I believe I've addressed the review comments, any other suggestions to change?

@ngoix
Copy link
Contributor

ngoix commented Oct 29, 2017

@daniel-perry any idea why travis is not happy?

@jnothman
Copy link
Member

Issue of mismatched repr in doctests:

    -MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
    -       beta_1=0.9, beta_2=0.999, early_stopping=False,
    -       epsilon=1e-08, hidden_layer_sizes=(5, 2), learning_rate='constant',
    -       learning_rate_init=0.001, max_iter=200, momentum=0.9,
    +MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
    +       beta_2=0.999, early_stopping=False, epsilon=1e-08,
    +       hidden_layer_sizes=(5, 2), learning_rate='constant',
    +       learning_rate_init=0.001, max_fun=15000, max_iter=200, momentum=0.9,
            nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
            solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,

need to add max_fun to existing doctests in doc/modules/neural_networks_supervised.rst

@daniel-perry
Copy link
Contributor Author

Thanks for the tip @thomasjpfan that fixed the failures.

@daniel-perry
Copy link
Contributor Author

@adrinjalali I believe I've addressed the concerns you brought up above, please take a look to see if everything is in order.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM, sorry for the late review.

@adrinjalali
Copy link
Member

you also need to merge master and solve the conflict.

@daniel-perry
Copy link
Contributor Author

Ok, changes made, merged with mainline, and resolved conflicts.

@adrinjalali Once the unit tests finish, let me know if you need anything else before merging the PR.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few last comments.

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
sklearn/neural_network/tests/test_mlp.py Show resolved Hide resolved
@daniel-perry
Copy link
Contributor Author

@rth if you can take a look after the unit tests pass and let me know if you have any concerns before merging the PR?

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, merging !

I think the way we check for convergence in lbgfs can be be done more consistently between different estimators; I'll look into it in a separate PR.

@rth rth merged commit 370b642 into scikit-learn:master Jul 2, 2019
Sprint Paris 2019 automation moved this from Needs review to Done Jul 2, 2019
@jnothman
Copy link
Member

jnothman commented Jul 2, 2019 via email

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
)

Since MLP estimators doesn't provide the derivative of the objective function to `fmin_l_bfgs_b` it needs to be manually evaluated, and so the number of function calls can be significantly larger than the number of iterations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Training MLP using l-bfgs limited to default l-bfgs maxiter value
8 participants