RFC: Optimizer unification in TensorFlow 2.0 #24
Conversation
### VI - Make the following work on every optimizer, in both eager and graph modes:

```python
optimizer = Adadelta(learning_rate=0.2)
```
Needs some clarification on whether the old way of using a tensor as the learning rate is still supported, e.g. `Adadelta(learning_rate=tf.train.linear_cosine_decay(...))`. In that case, `optimizer.learning_rate = 0.1` should throw an error.
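For reference, the TF 1.x pattern being asked about looks roughly like this (a sketch using `tf.train` names; the decay arguments are made up for illustration):

```python
import tensorflow as tf  # TF 1.x graph mode

global_step = tf.train.get_or_create_global_step()
# A tensor-valued learning rate that decays over training steps.
lr = tf.train.linear_cosine_decay(
    learning_rate=0.2, global_step=global_step, decay_steps=10000)
optimizer = tf.train.AdadeltaOptimizer(learning_rate=lr)
```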
It is still supported.
```python
Adadelta(clip_norm=0.)
Adadelta(clip_value=0.)
```
I'm against adding this as a universal API because of how vaguely the clipping is defined, and because there is no standard convention on how clipping should be done.
- What to clip: do you clip the "gradients w.r.t. variables", or do you clip the "updates made to variables"? In most optimizers they are different.
- How to clip: in TensorFlow there are currently `tf.clip_by_{norm, value, global_norm, average_norm}`, and you can imagine more to come.

For now I can see from the above a combination of 2x4 = 8 ways to do clipping. AFAIK there is no convention on this and no common wisdom on which ones are preferable. Therefore it's probably not a good idea to add it as an API for all optimizers.
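To make the 2x4 point concrete, here is a minimal sketch (TF 1.x names, toy loss) of the four existing clipping functions applied to the same gradients:

```python
import tensorflow as tf  # TF 1.x graph mode

w = tf.Variable([3.0, 4.0])
loss = tf.reduce_sum(tf.square(w))  # toy loss, just to have a gradient
grads = tf.gradients(loss, [w])

# Four existing notions of "clip"; each could in principle be applied either
# to the raw gradients (as here) or to the per-variable updates an optimizer
# derives from them, giving the 2x4 combinations mentioned above.
by_norm      = [tf.clip_by_norm(g, 1.0) for g in grads]
by_value     = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]
by_global, _ = tf.clip_by_global_norm(grads, 1.0)
by_avg_norm  = [tf.clip_by_average_norm(g, 1.0) for g in grads]
```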
A more elegant solution would be to expose finer-grained functions as an alternative to `apply_gradients`/`minimize`/etc., so that the user may perform any kind of manipulation:
- `get_updates` (optionally allowing the user to pass their own gradient values)
- `apply_updates`
Example flows:

```python
updates = opt.get_updates()
# clip, log, mask, etc.
opt.apply_updates(updates)

grads = opt.get_gradients()  # or tf.gradients
# clip, etc.
updates = opt.get_updates(grads=grads)
opt.apply_updates(updates)
```
I don't know if this would be at odds with any existing designs.
There are already such APIs: `opt.compute_gradients` and `opt.apply_gradients`. Everything is good as long as they are not removed in 2.0.
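For reference, a minimal sketch of that existing two-step flow with manual gradient manipulation in between (TF 1.x API, toy loss):

```python
import tensorflow as tf  # TF 1.x graph mode

w = tf.Variable([3.0, 4.0])
loss = tf.reduce_sum(tf.square(w))  # toy loss

opt = tf.train.AdadeltaOptimizer(learning_rate=0.2)
grads_and_vars = opt.compute_gradients(loss)
# Arbitrary gradient manipulation goes here, e.g. per-tensor norm clipping:
clipped = [(tf.clip_by_norm(g, 1.0), v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(clipped)
```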
The `compute_gradients`/`apply_gradients` pair only works if you need to manipulate the gradient. If you need to modify the updates, you're out of luck.
## Questions and Discussion Topics

- Do you have a use case where you need to reuse an optimizer across different sets of weights? (Note: this will still be doable with this proposal.) Describe your use case.
A common use case: each GPU has its own set of weights, and the optimizer is reused for every GPU, e.g.:

```python
for k in range(8):
    optimizer.apply_gradients(grads_and_vars[k])
```

Sure, it's possible to create 8 optimizers, but then there will be 8 `learning_rate`s to set.
---

## Detailed Design
API design notes:
- Could you clarify the behavior of `get_weights` and `set_weights`? It's unclear what the effect is when TF variables are involved. An example for an optimizer that tracks multiple shadow variables, like Adam, would be great; a link to an existing reference would work too.
- A `get_updates` method would be extremely useful for tasks such as meta-learning and imperative execution. It would be a pity to let this opportunity go by without adding one.
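For context, a rough sketch of how the existing `tf.keras` optimizer weight accessors behave for Adam (a sketch based on current `tf.keras` behavior; the exact ordering of the slot arrays is an implementation detail and an assumption here):

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([keras.layers.Dense(4, input_shape=(8,))])
opt = keras.optimizers.Adam(lr=1e-3)
model.compile(optimizer=opt, loss='mse')
model.train_on_batch(np.zeros((2, 8)), np.zeros((2, 4)))  # creates the slots

# A flat list of NumPy arrays: the iteration count followed by the shadow
# variables (Adam's m and v slots, one pair per model weight).
weights = opt.get_weights()
opt.set_weights(weights)  # only round-trips for the same set of model weights
```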
This method is already present on the Model class and every Layer class. This method is required for Keras compatibility.

**Args:**
- weights: A flat list of Numpy arrays, in deterministic order, same as returned by get_weights. Note that since the optimizer creates its internal weights to match the set of weights it is trying to optimize, set_weights would only match get_weights when the set of weights being optimized is equivalent. E.g.:
It seems this interface does not support the case of taking model A, taking part of it as model B, and importing the optimizer weights from model A into model B.
Will something like…
The known breaking changes are:
- Due to name changes, old checkpoints would not be loadable with the new optimizers. This is opt-in: your checkpoint won't break until you start using the new optimizers in your code (you can always import the old optimizers from tf.compat.v1).
- Some arguments are getting renamed.
- The `use_locking` argument is removed.
In practice, we always keep `use_locking` at its default. But if the argument is removed, will the implementation use a lock or not?
- Nadam (not yet in TF)

We will remove `ProximalGradientDescent` and `ProximalAdagrad` (they will stay in `tf.compat.v1`). They do not appear to be used by a critical mass of users.
Are there any changes to wrapper optimizers such as `sync_replicas_optimizer`?
Is there some reason for putting the optimizers under the keras namespace? (I've seen the same thing with…
- FTRL (not yet in Keras)
- RMSProp
- Adamax (not yet in TF)
- Nadam (not yet in TF)
I think there already exist Adamax and Nadam optimizers in `tf.contrib`: https://github.com/tensorflow/tensorflow/blob/60c56a4103e9f9239b9b7b104a846619dc4ecc20/tensorflow/contrib/opt/python/training/adamax.py#L32
```python
Adadelta(clip_norm=0.)
Adadelta(clip_value=0.)
```
If gradient clipping is merged into the optimizer, when will it be called? Currently, gradient clipping is done between `compute_gradients` and `apply_gradients`. In a distributed implementation this leads to clipping on each tower before gradient reduction, which is inconsistent with the non-distributed case. Currently, both `SyncReplicasOptimizer` and `MirroredStrategy` have this problem. Is it possible to insert the gradient clipping logic into `apply_gradients`, before doing the gradient reduction?
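To illustrate the inconsistency being described, here is a toy sketch (plain TF 1.x, no distribution strategy; the two scaled losses stand in for per-tower losses) of clipping before the gradient reduction:

```python
import tensorflow as tf  # TF 1.x graph mode

w = tf.Variable([3.0, 4.0])
# Toy stand-ins for the per-tower losses computed on each replica.
tower_losses = [tf.reduce_sum(tf.square(w)) * s for s in (1.0, 10.0)]
opt = tf.train.GradientDescentOptimizer(0.1)

tower_grads = []
for loss in tower_losses:
    (g, v), = opt.compute_gradients(loss, var_list=[w])
    # Clipping happens here, per tower, before any cross-tower reduction...
    tower_grads.append(tf.clip_by_norm(g, 1.0))

# ...so averaging the clipped gradients is not the same as clipping the
# averaged gradient, which is what the non-distributed setup would do.
avg_grad = tf.add_n(tower_grads) / len(tower_grads)
train_op = opt.apply_gradients([(avg_grad, w)])
```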
@omoindrot did you get any answer? Seems we already lost that battle: it drives me crazy that they decided to add the keras prefix everywhere. There are so many neat decisions in these RFC proposals; this is obviously not one of them. The keras namespace is like a virus, and this is the first and most important problem with the namespace: it spreads! Please, please, please, at least consider using more common/generic names. Otherwise I am literally just waiting for the day I am forced to write `tf.keras.variable` and `tf.keras.placeholder`.
@PatWie: I tried to fight for… I completely agree with you; I don't understand why they keep pushing stuff to…
Kind of ironic that a library designed for user experience is now causing this annoyance by branding its name into one of its backends. If the rationale is that it is a hat tip to the interface that made TF immensely more appealing, I get it to some degree… but I think the community should at least get an answer as to why.
cc @karmel
For 2.0, I have been erring on the side of removing duplication across the API. This was a big problem in the past, and "there are five ways of doing X" was a very valid criticism of TensorFlow. This was the case even where modules started out as exact equivalents of each other. In the 2.0 API, wherever possible, we have only a single place for functionality of any given type. For instance, there should only be one place for metrics. Since we do want to offer a complete Keras (i.e., tf.keras should contain all of Keras), this is the natural place for things to live. We can add module aliases, basically adding statements like…
IMHO, "five ways of doing X" is bad, mainly because the five ways are all slightly different in functionalities. So this argument does not actually apply to aliases. |
@martinwicke: I understand the need for clarity and removing duplication; however, it doesn't feel normal to let Keras prevail over TensorFlow. You say you want to offer the full Keras experience, but what about users switching from TensorFlow 1.x to TensorFlow 2.0? A user used to writing estimators with… It's weird to be changing the TensorFlow API for the needs of Keras. I agree with @ppwwyyxx that an alias to…
@ppwwyyxx Five ways to import the same optimizer are still five ways to import the same optimizer. I cannot say I liked that there were (and still are) three ways of importing…
User Experience A

User: I want to use a convolution layer, that is probably… If there is only one option, why do I need to specify which one I want to use?

User Experience B

…

Further, regarding semver, do not read it as: …

It will break all existing code and documentation, and will eradicate the helpfulness of the tutorials spread across the internet. (I only care about the first point, but the others seem valid as well.) This would be OK if the entire implementation were changing, but currently it is really just a rebranding, and to be clear, a bad one: it will be a longer and less generic name. The usage of keras is flawed in itself. When using Keras in combination with TensorFlow: … In the past, I had a smile when I looked at the implementation of PyTorch, but this input-output specification also seems to be conquering TF. The smile is gone. Hopefully, TF 3.0 will remove this kind of code duplication. This reminds me of empty…
I've created a Reddit discussion to draw attention to this pull request.
If the intent is to unify all optimizers into a single location, then in my view it makes little sense for that location to fall anywhere other than… From a branding perspective, why should Keras (a backend-agnostic high-level API) take priority over TensorFlow within TF itself? If the interest were truly to create a unified API, then…
No, no, no, please do not do this. I wasn't a fan of the integration of Keras into TF (would we want pandas to be merged into NumPy?), but it was fine as long as it was kept within a namespace. Don't compromise the entire low-level TF API by shoehorning in the high-level one.
In my experience with TensorFlow, high-level APIs such as Keras have been really nice for quickly prototyping things, at the price of some inflexibility, and I'm all for unification. But if that kind of API becomes the sole way of doing things, per @PatWie's comment, then it just adds yet another level of fragmentation to documentation and existing codebases. That kind of move seems like it would cause the very problems that Keras' creation was originally intended to solve.
`tf.optimizers` for the low-level implementations, `tf.keras.optimizers` for wrapped versions that provide a lot of defaults that make sense in a Keras model, given that all the other layers in that model will use a lot of defaults based on currently understood best practices encoded in Keras. When I am building low-level code in TensorFlow, it is because I want higher control over performance or because I am doing something novel and the current best practices are not necessarily applicable. And with respect, at that point I would like everyone to just get out of the way and let me control it. If I am doing Keras, yes, please make my life easier in X, Y, and Z ways. If I am using TensorFlow just for TensorFlow, go away and let me just use TensorFlow.
Hmmm, I think this has gotten a little away from the topic. This RFC will unify the optimizers into one location (a core idea of TF 2.0), and for the most part all of the functionality of the old optimizers is still there (and more, such as serialization). You can still inherit from the base optimizer and create a new optimizer with any "low level" changes you want; the discussion before was about why to have it under the tf.keras namespace instead of tf.optimizers. The advantage that I see is that the user would understand that these optimizer implementations are the same (almost the same?) as they would find in Keras. The downside is having to write out… Regarding updating code from 1.x, I'm pretty sure there will be a converter util that can pretty easily replace the old optimizers with tf.keras.optimizers. I still favor tf.optimizers, but some of these posts seem to be a bit misguided.
As a long-time active user of… However, I would say it is the opposite when it comes to…
The comment period ended 20 days ago (closing the thread now). As explained in the document, this proposal does not "replace TensorFlow optimizers with Keras optimizers" (a perspective which seems unnecessarily adversarial, since the Keras API spec is just the high-level interface of TensorFlow 2.0; it would be a bit like opposing "TensorFlow" and "tf.train"). This proposal simply takes the existing TensorFlow optimizers, improves and simplifies their signatures in a few minor places, and removes redundant sets of optimizers so that there is a single optimizer API. This is a very conservative change that results in a clear improvement of the overall optimizer API (simpler, unified, more feature-complete, more user-friendly). As Martin mentioned, if the cognitive overhead of typing out:

```python
from tensorflow.keras import optimizers
optimizers.Adam(1e-4)
```

instead of

```python
from tensorflow import train
train.AdamOptimizer(1e-4)
```

is too much, we could e.g. alias…

Also: please no adversarial brigading.
(Reopening this as it will need merging.)
Thank you to everyone who commented, and on the Reddit thread. We have heard the comments and various team members are reviewing them. Please note that this PR isn't the only avenue for communication: you can discuss the general direction of TensorFlow on the discuss@tensorflow.org forum - https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss
Yes, thank you for the feedback! @JossWhittle it is correct, there is often lower-level functionality that is accessible in TF, usually via… I do want to make sure that we clean up the divergent (and in the worst cases, just slightly different) modules providing essentially the same functionality, and this proposal is a step in that direction, unifying the currently slightly different… I am fine with keeping module aliases, such that the following modules will continue to exist:…
@martinwicke I don't think anyone argued against module consolidation; the issue is with the namespace being inconsistent with what is offered (and causing major BC issues for no good reason).
Unfortunately, the attention from Reddit started an emoji party and added too much noise to the real concerns listed above, and I regret that it gives the impression of a biased trial.
Please note that this is the very first definite statement in that direction (which is inconsistent with facts like tensorflow/tensorflow@8efd178). I appreciate the definite promise not to break too much code as a very positive outcome, compared to previous claims reducing this issue to a problem of "cognitive overhead". Obviously, refusing to keep the naming generic and neutral has been exactly the cause of the last part of this discussion.
@PatWie Please do note that this is not a backwards compatibility concern, because there is currently no…
Because… Other frameworks have the free choice to copy the best design and evolve their APIs.
@fchollet: I mostly agree with you on the details of this RFC, the changes to the optimizers, and the fact that we want API consolidation. But this is not the point; I think you are missing the forest for the trees. People are not complaining about breaking changes to… In any case, this is a discussion worth having, and there should be a more appropriate place to hold it (maybe in a separate issue here in…
I also strongly support API consolidation. However, the current trend seems to lead to a fragmented and confusing API, partly due to a non-semantic namespace (one reason to use… That would be for the best of the framework, and it is nothing new; other frameworks are already on that track.
Hey folks, please ensure that follow-up discussion not directly relevant to the content of the RFC is directed to the discuss@ list, or a GitHub issue, as appropriate. We strongly appreciate everyone's energetic interest; please continue the discussion civilly and in good faith.
Can we make sure the unified optimizers do not face the same memory leak issues as in tensorflow/tensorflow#24047? E.g. the optimizer permanently stores some reference to the…
Guys, don't worry too much. It's just that there are too many Keras users who don't want to install Keras, only TensorFlow. Instead of `import tensorflow as tf`, we should write `import tensorflow.compat.v1 as tf`. And do not forget to turn off eager execution. Brace yourself, the TensorFlow winter is coming!
I don't think Keras is the real reason for 2.0; it's PyTorch's success with eager execution. I think the API simplification and unification is good. If anything, I think this is the TensorFlow spring.
Folks, please keep the scope of your comments limited to responding to the technical details of this RFC. Thank you! The comment period has already closed, so I am going to lock this discussion now.
The feedback phase will be open for two weeks, until 2018-10-30.

## Optimizer unification in TensorFlow 2.0

### Summary

Keras has its own set of optimizers, living in `tf.keras.optimizers`. TensorFlow also has its own set of optimizers, living in `tf.train`. We will merge them into a single set of optimizers, based on the existing TensorFlow optimizers but with some added features. These optimizers will replace the Keras optimizers. In the process, there will be some signature changes.

This RFC describes all planned API changes.