
Weights of Inner Optimizers Not Saved #2094

Closed
BinyanHu opened this issue Aug 15, 2020 · 9 comments · Fixed by #2126
Labels: bug (Something isn't working), optimizers

Comments

BinyanHu commented Aug 15, 2020

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04 & Windows 10
  • TensorFlow version and how it was installed (source or binary): 2.3.0 from source
  • TensorFlow-Addons version and how it was installed (source or binary): 0.11.1 from source
  • Python version: 3.7
  • Is GPU used? (yes/no): yes

Describe the bug

Resuming a training process requires restoring the optimizer state so that training continues right from the previous state without any loss of accuracy. Currently, the Keras model-saving interface keras.Model.save_weights checkpoints both the network parameters and the optimizer weights. However, when an optimizer is wrapped inside another, the inner optimizer's weights are not saved by this means.

For example, when I was trying to use the Ranger optimizer, which is constructed by wrapping RAdam with Lookahead:

import tensorflow_addons as tfa

optimizer = tfa.optimizers.Lookahead(
    tfa.optimizers.RectifiedAdam()
)

I noticed a performance drop when resuming training. I found that the weights of the inner RAdam were not saved into the checkpoint. (I checked the .index file in the checkpoint folder and there are no variable names like "m" and "v", only "slow", which belongs to Lookahead.) Therefore, after loading the weights from file and restarting fitting, the weights of RAdam are randomly reinitialized. This could be because the weights of the inner optimizer are not automatically tracked.
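
For reference, one quick way to confirm which slot variables actually made it into a checkpoint is to list its saved variable names (a sketch added for illustration, not part of the original report; "checkpoints" is a placeholder directory):

import tensorflow as tf

# List every variable stored in the checkpoint written by Model.save_weights.
# With Ranger, only Lookahead's "slow" slots show up; the inner RAdam's "m"
# and "v" slots are missing, which is the behavior described above.
ckpt_path = tf.train.latest_checkpoint("checkpoints")  # placeholder directory
for name, shape in tf.train.list_variables(ckpt_path):
    print(name, shape)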

Experiments

I trained two LeNets on the FashionMNIST dataset. All configurations are identical except for the optimizers. Both training runs were interrupted in the middle and then resumed.

[Figure: TensorBoard curves. Blue: Ranger (Lookahead+RAdam), orange: RAdam.]

Note the "bump" in the Ranger curve caused by the reinitialization of the RAdam weights. Apparently, the weights of the inner optimizer are not saved correctly.
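
A minimal reproduction of this setup might look roughly like the following (a sketch with a small dense model standing in for LeNet; it is not the original experiment code):

import tensorflow as tf
import tensorflow_addons as tfa

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tfa.optimizers.Lookahead(tfa.optimizers.RectifiedAdam()),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train / 255.0

# First run: train for a while, then checkpoint network and optimizer weights.
model = build_model()
model.fit(x_train, y_train, epochs=2)
model.save_weights("./ranger_ckpt")

# "Resumed" run: rebuild, restore, and keep training. The bump shows up here
# because the inner RAdam slots ("m", "v") were never written to the checkpoint.
model = build_model()
model.load_weights("./ranger_ckpt")
model.fit(x_train, y_train, epochs=2)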

bhack (Contributor) commented Aug 15, 2020

Can you prepare a minimal PR with a new test to cover your case?

bhack (Contributor) commented Aug 15, 2020

So that we could check if it is similar to #1911

BinyanHu (Author) commented:

So that we could check if it is similar to #1911

Our issues are similar, but the real problem is the missing weights of the inner RAdam optimizer.

First, I reran my program with status.assert_consumed(), and the errors are as follows:

AssertionError:
Unresolved object in checkpoint (root).optimizer.iter: attributes {
  name: "VARIABLE_VALUE"
  full_name: "iter"
  checkpoint_key: "optimizer/iter/.ATTRIBUTES/VARIABLE_VALUE"
}

Same as #1911. This is because the variable iter has not yet been created by the time we load the weights. If we call assert_consumed after fitting the model, the error goes away. The subsequent warnings in that issue, about the "slow" slots of all the network parameters not being used, arise merely because non-training mode does not require loading the optimizer states, which is not a problem. In all, calling assert_consumed right after loading weights does not reveal the problem.

Second, learning rate warmup could help RAdam re-accumulate the mean and variance statistics with small steps rather than "messing up" the network weights in the first few steps after resuming. This can, to some extent, alleviate the effect of the missing RAdam weights, but it is definitely not the correct solution.
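
For instance, RectifiedAdam's built-in warmup parameters could be used for that workaround (a sketch only; the step counts are illustrative, and this does not fix the underlying tracking problem):

import tensorflow_addons as tfa

# Linearly warm up the learning rate over the first 10% of total_steps so the
# re-initialized RAdam statistics are rebuilt with small updates after resuming.
inner = tfa.optimizers.RectifiedAdam(
    learning_rate=1e-3,
    total_steps=10000,      # illustrative
    warmup_proportion=0.1,  # illustrative
    min_lr=1e-5,
)
optimizer = tfa.optimizers.Lookahead(inner)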

Plus, I just checked the sizes of the checkpoint files: Ranger is 3381 KB and RAdam is 5070 KB. With the extra "slow" slot, the Ranger checkpoint should not be smaller, which indicates that the weights of RAdam are missing.

I think the reason is evident here. If a PR is still needed, how should the test be conducted? Would saving and loading a model with a Lookahead-wrapped optimizer that has slots be enough to demonstrate the problem?

bhack (Contributor) commented Aug 15, 2020

The Lookahead test currently has no serialization test.
So I think you can add a small one and let it fail in https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/tests/lookahead_test.py
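
Such a test might look roughly like the following (a sketch only: tmp_path is the standard pytest fixture, and the substring checks assume the slot names appear in the checkpoint keys the way the report describes; the exact key layout may need adjusting):

import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa


def test_checkpoint_serializes_inner_optimizer_slots(tmp_path):
    x = np.random.standard_normal((8, 4)).astype(np.float32)
    y = np.random.standard_normal((8, 1)).astype(np.float32)

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(
        optimizer=tfa.optimizers.Lookahead(tfa.optimizers.RectifiedAdam()),
        loss="mse",
    )
    model.fit(x, y, epochs=1, verbose=0)

    prefix = str(tmp_path / "ckpt")
    model.save_weights(prefix)

    # The inner RAdam slots ("m", "v") should be saved next to Lookahead's "slow" slot.
    names = [name for name, _ in tf.train.list_variables(prefix)]
    assert any("/slow/" in n for n in names)
    assert any("/m/" in n for n in names)  # fails before the fix
    assert any("/v/" in n for n in names)  # fails before the fix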

bhack (Contributor) commented Aug 15, 2020

Check if some of the original author's tests could be useful: https://github.com/CyberZHG/keras-lookahead/blob/master/tests/test_optimizers.py

bhack (Contributor) commented Aug 15, 2020

/cc @CyberZHG

bhack (Contributor) commented Aug 15, 2020

Also check that you are recovering custom objects on load, e.g. custom_objects={'RAdam': RAdam, 'Lookahead': Lookahead}.
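
For example, restoring a full saved model might look like this (a sketch; the file name is a placeholder, and in tfa the inner class is RectifiedAdam rather than RAdam):

import tensorflow as tf
from tensorflow_addons.optimizers import Lookahead, RectifiedAdam

# Only relevant when restoring a full saved model (save_weights/load_weights does
# not need it): register the custom optimizer classes so deserialization finds them.
model = tf.keras.models.load_model(
    "ranger_model.h5",  # placeholder path
    custom_objects={"RectifiedAdam": RectifiedAdam, "Lookahead": Lookahead},
)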

WindQAQ (Member) commented Aug 16, 2020

Hi @BinyanHu, thanks for investigating this. Can you provide a minimal code snippet to reproduce the issue, e.g. the way you save the model? Thank you!

WindQAQ added the bug (Something isn't working) and optimizers labels on Aug 16, 2020
AakashKumarNain (Member) commented:

AssertionError:
Unresolved object in checkpoint (root).optimizer.iter: attributes {
  name: "VARIABLE_VALUE"
  full_name: "iter"
  checkpoint_key: "optimizer/iter/.ATTRIBUTES/VARIABLE_VALUE"
}

I think this is because the value you pass to your optimizer is a plain float, which gives this warning. You can use tf.Variable() for that.
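
i.e. something along these lines (a sketch of that suggestion; whether it silences the iter warning in this exact setup is not verified here):

import tensorflow as tf
import tensorflow_addons as tfa

# Pass the learning rate as a tf.Variable instead of a plain Python float.
lr = tf.Variable(1e-3, trainable=False)
optimizer = tfa.optimizers.Lookahead(tfa.optimizers.RectifiedAdam(learning_rate=lr))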

On the other hand, I feel this is the real issue here:

Plus, I just checked the sizes of the checkpoint files: Ranger is 3381 KB and RAdam is 5070 KB. With the extra "slow" slot, the Ranger checkpoint should not be smaller, which indicates that the weights of RAdam are missing.

bhack added a commit that referenced this issue Aug 27, 2020
bhack mentioned this issue Aug 27, 2020
bhack linked a pull request Aug 27, 2020 that will close this issue
WindQAQ pushed a commit that referenced this issue Sep 1, 2020
* Update lookahead.py

Initial fix of #2094 #2102

* Fix linting

* Resolve name conflict with mixed precision

* Track baseline optimizer in avg
jrruijli pushed a commit to jrruijli/addons that referenced this issue Dec 23, 2020