
Conversation

fchollet
Contributor

The feedback phase will be open for two weeks until 2018-10-30.

Optimizer unification in TensorFlow 2.0

Status Proposed
Author(s) Francois Chollet (fchollet@google.com)
Sponsor Martin Wicke (wicke@google.com)
Updated 2018-10-16

Summary

Keras has its own set of optimizers, living in tf.keras.optimizers. TensorFlow also has its own set of optimizers, living in tf.train.

We will merge them into a single set of optimizers, which will be based on the existing TensorFlow optimizers, but with some added features. These optimizers will replace the Keras optimizers. In the process, there will be some signature changes.

This RFC describes all planned API changes.

@ewilderj ewilderj added RFC: Proposed RFC Design Document 2.0 TensorFlow 2.0 development labels Oct 19, 2018
### VI - Make the following work on every optimizer, in both eager and graph modes:

```python
optimizer = Adadelta(learning_rate=0.2)
```


Needs some clarification on whether the old method of using a tensor as the learning rate is still supported,
e.g. Adadelta(learning_rate=tf.train.linear_cosine_decay(...)).
In that case, optimizer.learning_rate = 0.1 should throw an error.
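A rough sketch of the two cases being contrasted in this comment (the import path and exact behavior are assumptions pending the RFC's final API; the error in the second case is the behavior being requested, not a documented one):

```python
import tensorflow as tf
from tensorflow.keras.optimizers import Adadelta  # assumed final location per this RFC

# Case 1: plain float learning rate -- reassignable after construction.
opt = Adadelta(learning_rate=0.2)
opt.learning_rate = 0.1  # fine

# Case 2: tensor-valued learning rate produced by a TF 1.x decay schedule.
global_step = tf.train.get_or_create_global_step()
lr = tf.train.linear_cosine_decay(0.2, global_step, decay_steps=1000)
opt = Adadelta(learning_rate=lr)
opt.learning_rate = 0.1  # per the comment above, this assignment should raise an error
```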

Contributor

It is

```python
Adadelta(clip_norm=0.)
Adadelta(clip_value=0.)
```


I'm against adding this as a universal API because of how vaguely the clipping is defined and because there is no standard convention for how clipping should be done.

  1. What to clip: do you clip the "gradients w.r.t variables", or do you clip the "updates made to variables"? In most optimizers they are different.
  2. How to clip: in tensorflow there are currently tf.clip_by_{norm, value, global_norm, average_norm}. Imagine more to come.

From the above, I can already see a combination of 2x4 = 8 ways to do clipping. AFAIK there is no convention and no common wisdom on which ones are preferable. Therefore it's probably not a good idea to add it as an API for all optimizers.
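To make the ambiguity concrete, here is a minimal sketch of just the "how to clip" dimension using the existing TF 1.x APIs (the toy loss and optimizer are illustrative only, not part of the proposal):

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [0.5]])
loss = tf.reduce_sum(tf.matmul(x, w))
opt = tf.train.GradientDescentOptimizer(0.1)

grads_and_vars = opt.compute_gradients(loss, var_list=[w])
grads, variables = zip(*grads_and_vars)

# Per-tensor clipping: each gradient is clipped by its own L2 norm.
clipped_per_tensor = [tf.clip_by_norm(g, 1.0) for g in grads]

# Global-norm clipping: all gradients are rescaled jointly by their combined norm.
clipped_global, _ = tf.clip_by_global_norm(list(grads), 1.0)

# The two choices generally produce different updates, and neither is "the" convention.
train_op = opt.apply_gradients(list(zip(clipped_global, variables)))
```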

@mdanatg mdanatg Oct 20, 2018

A more elegant solution would be to expose finer-grained functions as an alternative to apply_gradients/minimize/etc., so that the user may perform any kind of manipulation:

get_updates (optionally allowing to pass my own gradient values)
apply_updates

Example flows:

```python
updates = opt.get_updates()
# clip, log, mask, etc.
opt.apply_updates(updates)
```

```python
grads = opt.get_gradients()  # or tf.gradients
# clip, etc.
updates = opt.get_updates(grads=grads)
opt.apply_updates(updates)
```

I don't know if this would be at odds with any existing designs.


There are already such APIs: opt.compute_gradients and opt.apply_gradients. Everything is good as long as they are not removed in 2.0.
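For reference, the existing TF 1.x flow being referred to, with an arbitrary user-side manipulation of the gradients in the middle (the toy loss and the 0.5 scaling are illustrative only):

```python
import tensorflow as tf

w = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(tf.square(w))
opt = tf.train.AdamOptimizer(1e-3)

grads_and_vars = opt.compute_gradients(loss)
# Any user-side manipulation of the *gradients* can go here (scaling, masking, logging, ...).
manipulated = [(0.5 * g, v) for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(manipulated)
```

As the next comment notes, this only covers manipulating the gradients, not the per-variable updates the optimizer derives from them.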


The compute_gradients/apply_gradients pair only works if you need to manipulate the gradients. If you need to modify the updates, you're out of luck.


## Questions and Discussion Topics

- Do you have a use case where you need to reuse an optimizer across different sets of weights? (note: this will still be doable with this proposal) Describe your use case.


A common use case: each GPU has its own set of weights, and the optimizer is reused for every GPU, e.g.:

```python
for k in range(8):
    optimizer.apply_gradients(grads_and_vars[k])
```

Sure, it's possible to create 8 optimizers, but then there will be 8 learning_rates to set.


---

## Detailed Design

API design notes:

  • Could you clarify the behavior of get_weights and set_weights? It's unclear what the effect is when TF variables are involved. An example for an optimizer that tracks multiple shadow variables, like Adam, would be great. A link to an existing reference would work too.
  • A get_updates method would be extremely useful for tasks such as meta learning and imperative execution. It would be a pity to let this opportunity go by without adding one.

This method is already present on the Model class and every Layer class, and it is required for Keras compatibility.

**Args:**
- weights: A flat list of Numpy arrays, in deterministic order, same as returned by get_weights. Note that since the optimizer creates its internal weights to match the set of weights it is trying to optimize, set_weights would only match get_weights when the set of weights being optimized is equivalent. E.g.:
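The RFC's own example is not reproduced in this excerpt. As a rough illustration of the constraint described above (assuming the Keras-style get_weights/set_weights API and Adam's slot variables; the model and data are placeholders):

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(16, 8).astype("float32")
y = np.random.rand(16, 4).astype("float32")

model_a = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
opt_a = tf.keras.optimizers.Adam(1e-3)
model_a.compile(optimizer=opt_a, loss="mse")
model_a.fit(x, y, epochs=1, verbose=0)   # Adam creates its m/v slot variables here

weights = opt_a.get_weights()            # flat list of NumPy arrays (iteration count + slots)

# set_weights only lines up with get_weights when the second optimizer
# is tracking an equivalent set of model weights:
model_b = tf.keras.models.clone_model(model_a)
opt_b = tf.keras.optimizers.Adam(1e-3)
model_b.compile(optimizer=opt_b, loss="mse")
model_b.fit(x, y, epochs=1, verbose=0)   # forces opt_b to create matching slots
opt_b.set_weights(weights)
```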

It seems this interface does not support the case of taking model A, extracting part of it as model B, and importing optimizer weights from model A into model B.

@erfannoury

Will something like tf.contrib.layers.optimize_loss remain part of TF 2.0? It looks like this function can handle most of the operations that can be performed on the gradients in an easy-to-use way. It can also easily handle gradient clipping, without the issues mentioned by @ppwwyyxx.
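For context, a sketch of the tf.contrib.layers.optimize_loss usage being referred to (TF 1.x contrib API; the argument names are from memory, so treat them as assumptions):

```python
import tensorflow as tf

w = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(tf.square(w))
global_step = tf.train.get_or_create_global_step()

train_op = tf.contrib.layers.optimize_loss(
    loss=loss,
    global_step=global_step,
    learning_rate=1e-3,
    optimizer='Adam',        # also accepts an Optimizer instance or class
    clip_gradients=5.0,      # global-norm gradient clipping handled internally
)
```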

The known breaking changes are:
- Due to name changes, old checkpoints would not be loadable with the new optimizers. This is opt-in: your checkpoint won't break until you start using the new optimizers in your code (you can always import the old optimizers from tf.compat.v1).
- Some arguments are getting renamed.
- The `use_locking` argument is removed.


In practice, we always keep use_locking at its default. But if the argument is removed, will the implementation still use a lock?

- Nadam (not yet in TF)

We will remove `ProximalGradientDescent` and `ProximalAdagrad` (they will stay in `tf.compat.v1`). They do not appear to be used by a critical mass of users.


Are there any changes to wrapper optimizers such as sync_replicas_optimizer?

@omoindrot

Is there some reason for putting the optimizers under the keras namespace (tf.keras.optimizers) instead of a shorter and more generic tf.optimizers (or keep tf.train)?

I've seen the same thing with tf.keras.layers vs. tf.layers. As a user of TensorFlow I would rather not have to include an additional keras everywhere in my code.

- FTRL (not yet in Keras)
- RMSProp
- Adamax (not yet in TF)
- Nadam (not yet in TF)
Member


```python
Adadelta(clip_norm=0.)
Adadelta(clip_value=0.)
```

If gradient clipping is merged into the optimizer, when will it be called? Currently, gradient clipping is done between compute_gradients and apply_gradients. In a distributed implementation, this leads to clipping the gradients on each tower before gradient reduction, which is inconsistent with the non-distributed case. Currently, both SyncReplicasOptimizer and MirroredStrategy have this problem. Is it possible to insert the gradient clipping logic into apply_gradients, before the gradient reduction is done?

@PatWie

PatWie commented Nov 17, 2018

> Is there some reason for putting the optimizers under the keras namespace (tf.keras.optimizers) instead of a shorter and more generic tf.optimizers (or keep tf.train)?
>
> I've seen the same thing with tf.keras.layers vs. tf.layers. As a user of TensorFlow I would rather not have to include an additional keras everywhere in my code.

@omoindrot did you get any answer?

Seems we already lost that battle:
tensorflow/tensorflow@8efd178

It drives me crazy that they decided to add the keras prefix everywhere. There are so many neat decisions in these RFC proposals; this is obviously not one of them. The keras namespace is like a virus. This is the first and most important problem with the namespace: it spreads! Please, please, please, at least consider using more common/generic names. Otherwise I am just waiting for the time when I am forced to write "tf.keras.variable" and "tf.keras.placeholder".

@omoindrot

@PatWie: I tried to fight for tf.layers here but apparently it wasn’t enough :’(

I completely agree with you, I don’t understand why they keep pushing stuff to tf.keras. If they want unification it would be simpler to progressively remove keras and keep the working things like tf.layers. It might be internal politics...

@seanpmorgan
Member

Kind of ironic that a library designed for user experience is now causing this annoyance by branding its name into one of its backends.

If the rationale is that it is a hat tip to the interface that made TF immensely more appealing, I get it to some degree.... but I think the community should at least get an answer as to why.

@ewilderj
Contributor

cc @karmel

@martinwicke
Member

For 2.0, I have been erring on the side of removing duplication across the API. This was a big problem in the past, and "there are five ways of doing X" was a very valid criticism of TensorFlow.

This was the case even where modules started out as exact equivalents of each other. In the 2.0 API, wherever possible, we have only a single place for functionality of any given type. For instance, there should only be one place for metrics. Since we do want to offer a complete Keras (i.e., tf.keras should contain all of Keras), this is the natural place for things to live.

We can add module aliases, basically adding statements like losses = keras.losses to the main TensorFlow module, but I am not quite convinced that the burden of the extra typing is bad enough to outweigh the clarity benefit of having a single place for this functionality (wherever possible; we are making some compromises).

@ppwwyyxx

ppwwyyxx commented Nov 19, 2018

IMHO, "five ways of doing X" is bad, mainly because the five ways are all slightly different in functionalities. So this argument does not actually apply to aliases.

@omoindrot

@martinwicke: I understand the need for clarity and removing duplication; however, it doesn't feel normal to let Keras prevail over TensorFlow.

You say you want to offer the full Keras experience, but what about users switching from TensorFlow 1.x to TensorFlow 2.0? A user used to writing estimators with tf.layers and tf.metrics will have to change his or her code to tf.keras.*.

It's weird to be changing the TensorFlow API for the needs of Keras.

I agree with @ppwwyyxx that an alias to tf.layers, tf.metrics and tf.train / tf.optimizer(s) would be an acceptable solution.

@yegord

yegord commented Nov 20, 2018

@ppwwyyxx Five ways to import the same optimizer are still five ways to import the same optimizer. I cannot say I liked that there were (and still are) three ways of importing add_loss(): one from tf.contrib.slim.losses, one from tf.contrib.losses, and one from tf.losses. I never knew (or always forgot) whether they were the same thing and/or compatible with each other, and had to refer to the docs or even the source code to find out. Let's put an end to this, please.

@PatWie

PatWie commented Nov 20, 2018

User Experience A

> there are five ways of doing X

User: I want to use a convolution layer, that is probably tf.layers.conv2d.
TF2.0: Sorry, it doesn't exist anymore. You have to be more specific since v2.0. Which option do you want to use?
User: Which options do I have?
TF2.0: Since 2.0 there are the following options: [tf.keras.layers.conv2d]
User: Mhm, difficult decision. Then I'll take the Keras-one.
User: Now please use the ADAM optimizer. All tutorials say this is tf.train.AdamOptimizer, which seems obvious.
TF2.0: Sorry, it doesn't exist anymore. You have to be more specific since v2.0. Which ADAM optimizer do you want to use?
User: Which options do I have?
TF2.0: Since 2.0 there are the following options: [tf.keras.optimizers.adam]
....

If there is only one option, why do I need to specify which one I want to use?
In compliance with the Zen of Python: "Flat is better than nested.".

User Experience B

Further, from semver:

> Major version X (X.y.z | X > 0) MUST be incremented if any backwards incompatible changes are introduced to the public API.

Do not read it as:

> Major version X (X.y.z | X > 0) MUST include backwards incompatible changes to the public API.

It will break all existing code and documentation, and will eradicate the helpfulness of the tutorials spread across the internet. (I only care about the first point, but the others seem valid as well.)

This would be OK if the entire implementation were changing. But currently it is really just a rebranding, and, to be clear, a bad one: a longer and less generic name.

The usage of keras is itself flawed.

When using Keras in combination with TensorFlow:
Why do I need to specify compute_output_shape in Python in addition to SetShapeFn([](InferenceContext* c) (in C++)? SetShapeFn is one of the main advantages of computation graphs.

In the past, I smiled when I looked at the implementation of PyTorch. But this input-output specification also seems to be conquering TF. The smile is gone.

Hopefully, TF 3.0 will remove this kind of code duplication. It reminds me of empty .Doc() in C++ ops alongside documentation in Python functions.

@omoindrot

I've created a reddit discussion to draw attention to this pull request.

@dillondaudert

If the intent is to unify all optimizers into a single location, then in my view it makes little sense for that location to fall anywhere other than tf.optimizers, or a similar name. Locating them under tf.keras.optimizers, as others have been saying, is problematic for a number of reasons.

From a branding perspective, why should Keras (a back-end agnostic high-level API) take priority over TensorFlow within TF itself? If the interest were truly to create a unified API, then tf.keras wouldn't even exist. All core TensorFlow functionality should be branded as such, and if there's some desire to work within the Keras framework, then users can install and use Keras itself. Why else does it have its own repo?

@chrisdonahue

No, no, no, please do not do this. I wasn't a fan of the integration of Keras into TF (would we want pandas to be merged into NumPy?) but was fine with it as long as it was kept within a namespace. Don't compromise the entire low-level TF API by shoehorning in the high-level one.

@kimbayashanologatiniburgettoilalalourvr

kimbayashanologatiniburgettoilalalourvr commented Nov 20, 2018

In my experience with TensorFlow, high level APIs such as Keras have been really nice for quickly prototyping things, at the price of some inflexibility, and I'm all for unification. But if that kind of API becomes the sole way of doing things, per @PatWie's comment, then it just adds yet another level of fragmentation to documentation + existing codebases. That kind of move seems like it would cause the very problems that Keras' creation was originally intended to solve.

@JossWhittle

tf.optimizers for the low-level implementations; tf.keras.optimizers for wrapped versions that provide a lot of defaults that make sense in a Keras model, given that all the other layers in that model will use defaults based on the currently understood best practices encoded in Keras.

When I am building low-level code in TensorFlow, it is because I want tighter control over performance or because I am doing something novel where the current best practices are not necessarily applicable. And with respect, at that point I would like everyone to just get out of the way and let me control it.

If I am doing Keras, yes, please make my life easier in X, Y, and Z ways. If I am using TensorFlow just for TensorFlow, go away and let me just use TensorFlow.

@seanpmorgan
Member

Hmmm, I think this has gotten a little away from the topic. This RFC will unify the optimizers into one location (a core idea of TF 2.0), and for the most part all of the functionality of the old tf.train optimizers is still there (and more, such as serialization). You can still inherit from the base optimizer and create a new optimizer with any "low level" changes you want; the discussion before was about why it lives under the tf.keras namespace instead of tf.optimizers.

The advantage that I see would be that the user would understand that these optimizer implementations are the same (almost the same?) as what they would find in Keras. The downside is having to write out tf.keras, which is an inconvenience, but not as dire as some of these posts make it sound.

Regarding updating code from 1.x, I'm pretty sure there will be a converter util that can pretty easily replace the old tf.train optimizers with tf.keras.optimizers. I still favor tf.optimizers, but some of these posts seem to be a bit misguided.

@inoryy

inoryy commented Nov 20, 2018

As a long-time active user of Keras (and now tf.keras), I am adamantly against merging the optimizers into the tf.keras namespace. The changes so far (e.g. tf.layers -> tf.keras.layers) made sense to me because the Keras model-building API was already the go-to approach for the majority of TensorFlow users.

However, I would say it is the opposite when it comes to tf.optimizers. When I think of optimizers, I think of the lower-level functionality of TensorFlow, e.g. the ability to have fine-grained control over gradients at will.
I do not think the tf.keras namespace fits this.

@fchollet
Contributor Author

The comment period ended 20 days ago (closing the thread now).

As explained in the document, this proposal does not "replace TensorFlow optimizers with Keras optimizers" (a perspective which seems unnecessarily adversarial since the Keras API spec is just the high-level interface of TensorFlow 2.0; it would be a bit like opposing "TensorFlow" and "tf.train").

This proposal simply takes the existing TensorFlow optimizers, improves and simplifies their signatures in a few minor places, and removes redundant sets of optimizers so that there is a single optimizer API. This is a very conservative change that results in a clear improvement of the overall optimizer API (simpler, unified, more feature-complete, more user-friendly).

As Martin mentioned, if the cognitive overhead of typing out:

```python
from tensorflow.keras import optimizers
optimizers.Adam(1e-4)
```

instead of

```python
from tensorflow import train
train.AdamOptimizer(1e-4)
```

is too much, we could e.g. alias tf.keras.optimizers to tf.optimizers. But it seems to me this would just add a bit of confusion. And in the big picture, this is not particularly important one way or another, since the underlying objects are the same.

Also: please no adversarial brigading.

@fchollet fchollet closed this Nov 20, 2018
@ewilderj ewilderj reopened this Nov 20, 2018
@ewilderj
Contributor

(reopening this as it will need merging)

@ewilderj ewilderj closed this Nov 20, 2018
@ewilderj ewilderj reopened this Nov 20, 2018
@ewilderj ewilderj merged commit a72b0f8 into tensorflow:master Nov 20, 2018
@ewilderj
Contributor

Thank you to everyone who commented, both here and on the Reddit thread. We have heard the comments and various team members are reviewing them. Please note that this PR isn't the only avenue for communication: you can discuss the general direction of TensorFlow on the discuss@tensorflow.org forum - https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss

@martinwicke
Member

Yes, thank you for the feedback!

@JossWhittle That is correct: there is often lower-level functionality accessible in TF, usually via tf.nn (though some simpler losses and metrics are in tf.math). These will of course stay. And as @seanpmorgan said, this design is about adding some functionality to the Optimizer class and its descendants (and @fchollet is right, this is a very conservative change from the existing API), and the behavior changes proposed do not at all prevent you from using it in the way you always have.

I do want to make sure that we clean up the divergent (and in the worst cases, just slightly different) modules providing essentially the same functionality, and this proposal is a step in that direction, unifying the currently slightly different tf.optimizers and tf.keras.optimizers.

I am fine with keeping module aliases, such that the following modules will continue to exist: tf.optimizers, tf.layers, tf.initializers, tf.losses, tf.metrics. The content of these will be identical to the corresponding modules in tf.keras, which should minimize confusion.
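If such aliases are added, the user-visible effect would simply be that both spellings resolve to the same objects; a hypothetical check of what that could look like (assuming the aliases described above exist):

```python
import tensorflow as tf

# With the proposed aliasing, both names would refer to the same implementation.
assert tf.optimizers.Adam is tf.keras.optimizers.Adam
opt = tf.optimizers.Adam(1e-3)  # identical to tf.keras.optimizers.Adam(1e-3)
```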

@inoryy

inoryy commented Nov 20, 2018

@martinwicke I don't think anyone argued against module consolidation; the issue is with the namespace being inconsistent with what is offered (and causing major BC issues for no good reason).
So as long as tf.optimizers, tf.losses, and so on remain intact it is a good change overall.

@PatWie

PatWie commented Nov 20, 2018

Unfortunately, the attention from Reddit started an emoji party and added too much noise to the real concerns listed above, and I regret that it gives the impression of a biased trial.

> I am fine with keeping module aliases, such that the following modules will continue to exist: tf.optimizers, tf.layers, tf.initializers, tf.losses, tf.metrics.

Please note that this is the first definite statement in that direction (which is inconsistent with facts like tensorflow/tensorflow@8efd178).

I appreciate the definite promise not to break too much code as a very positive outcome, compared to previous claims reducing this issue to a problem of "cognitive overhead".

Obviously, refusing to keep the naming generic and neutral has been exactly the cause of the last part of this discussion.

@fchollet
Contributor Author

> I appreciate the definite promise not to break too much code as a very positive outcome, compared to previous claims reducing this issue to a problem of "cognitive overhead".

@PatWie Please do note that this is not a backwards compatibility concern, because there is currently no tf.optimizers module; it would be a new addition. Likewise, the contents of the tf.losses and tf.metrics modules (which do currently exist) would be wholly different. The question is really a cosmetic one, not a backwards compatibility one.

@tranhungnghiep

Because Keras is not, and will not be, enough for a serious research-oriented framework.
How could TF have both a high-level and a low-level API, harmoniously coexisting?

Other frameworks are free to copy the best designs and evolve their APIs.

@omoindrot

@fchollet : I mostly agree with you on the details of this RFC, the changes to optimizers or the fact that we want API consolidation. But this is not the point.

I think you are missing the forest for the trees. People are not complaining about breaking changes to tf.layers or about the additional cognitive overhead of typing tf.keras everywhere. A lot of users are worried about the future of TensorFlow and what the merge with Keras means. There needs to be a clear explanation from the TensorFlow team about how TensorFlow and Keras will interact, whether users will have to use tf.keras in their workflow, and why the team thinks this is the best solution.

In any case, this is a discussion worth having and there should be a more appropriate place to hold it (maybe in a separate issue here in tensorflow/community).

@tranhungnghiep

tranhungnghiep commented Nov 21, 2018

I also strongly support API consolidation.

However, the current trend seems to lead to a fragmented and confusing API: partly due to the non-semantic namespace (one reason to use tf.nn.*), and partly due to the high-level/low-level division (one reason to copy tf.keras.* into tf.nn.* and extend it beyond that; of course, the independent Keras would still exist).

That would be for the best of the framework. It is nothing new; other frameworks are already on that track.

@martinwicke
Member

martinwicke commented Nov 21, 2018

@inoryy @fchollet

Please have this discussion elsewhere. I've deleted the comments that do not relate to this RFC.

@ewilderj FYI.

@tensorflow tensorflow deleted a comment from inoryy Nov 21, 2018
@tensorflow tensorflow deleted a comment from fchollet Nov 21, 2018
@tensorflow tensorflow deleted a comment from inoryy Nov 21, 2018
@tensorflow tensorflow deleted a comment from inoryy Nov 21, 2018
@ewilderj ewilderj added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Nov 21, 2018
@ewilderj
Contributor

Hey folks, please ensure that follow-up discussion not directly relevant to the content of the RFC is directed to the discuss@ list, or a GitHub issue, as appropriate. We greatly appreciate everyone's energetic interest; please continue the discussion civilly and in good faith.

@yoavz

yoavz commented Nov 29, 2018

Can we make sure the unified optimizers do not face the same memory leak issues as in tensorflow/tensorflow#24047? E.g. the optimizer permanently storing a reference to the tf.Graph, which prevents the Python GC from collecting it even after the tf.Graph instance goes out of scope.

@chientranse

Guys, don't worry too much. It's just that there are too many Keras users who don't want to install Keras, only TensorFlow.
Instead of:
import tensorflow as tf
we should write:
import tensorflow.compat.v1 as tf
And do not forget to turn off eager execution.
Brace yourselves, the TensorFlow winter is coming!

@cgarciae

cgarciae commented Mar 25, 2019 via email

@ewilderj
Contributor

Folks, please keep the scope of your comments limited to responding to the technical details of this RFC. Thank you!

The comment period has already closed, so I am going to lock this discussion now.

@tensorflow tensorflow locked as resolved and limited conversation to collaborators Mar 25, 2019