
NAdam different between TF1.15 and TF2.3 #766

Closed
JackTemaki opened this issue Nov 24, 2021 · 15 comments · Fixed by #768

@JackTemaki
Collaborator

JackTemaki commented Nov 24, 2021

I am not posting the full information here yet because I still need to collect some more details, but this should serve as a starting point for the issue:

Summary of important facts:

  • Nearly latest RETURNN (I started the tests on 9 November, so the RETURNN checkout is from 9 November)
  • Both versions run with CUDA 10.1 on the same machine(s)
  • 6-layer BLSTM (NativeLstm2) without any extras
  • No SpecAugment, only L2 and dropout
  • CTC loss from a RASR process, 139 output labels
  • NAdam optimizer
  • Training is starting from a fixed initial state

What happens:

  • With TF1.15, the model starts to converge noticeably after around 500 steps
  • With TF2.3, no convergence happens within the first 3000 steps
  • With TF1.15, the LSTM gradients are low after an initial peak, but go up again shortly after
  • With TF2.3, the LSTM gradients stay low
  • With TF1.15, the L2 constraint increases
  • With TF2.3, the L2 constraint decreases

I repeated this experiment many times with different learning rates and batch sizes, and also tried step-based warmup, but it always resulted in the same behavior.

TF1:
[screenshots: TF1_grads, TF1_loss]

TF2:
[screenshots: TF2_grads, TF2_loss]

@albertz
Member

albertz commented Nov 24, 2021

The gradient norms you mean?

When you make sure to start from the same model (e.g. via task="initialize_model"), you could verify whether it is really exactly the same or not (at least in the first batch).
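
For reference, a minimal sketch (not from this issue; the checkpoint prefixes are placeholders) of how such a check could look, comparing every variable of the two initial checkpoints:

import numpy as np
import tensorflow as tf

def compare_checkpoints(ckpt_a, ckpt_b):
    # Compare all variables of two TF checkpoints (arguments are checkpoint prefixes).
    reader_a = tf.train.load_checkpoint(ckpt_a)
    reader_b = tf.train.load_checkpoint(ckpt_b)
    for name in sorted(reader_a.get_variable_to_shape_map()):
        a = reader_a.get_tensor(name)
        b = reader_b.get_tensor(name)
        print("%s  %s" % ("ok  " if np.array_equal(a, b) else "DIFF", name))

# hypothetical checkpoint prefixes of the two initialized models
compare_checkpoints("tf1_run/model.init", "tf2_run/model.init")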

@JackTemaki
Collaborator Author

> The gradient norms you mean?

If the norm is different, the gradient is different. I just displayed the norm here; of course I looked at all the values.

> When you make sure to start from the same model (e.g. via task="initialize_model"), you could verify whether it is really exactly the same or not (at least in the first batch).

Ah sorry, I did not mention this. Yes, this is the case: I start from the same initialized model. But the reason for that was rather to rule out the possibility that the initialization changed between TF1.15 and TF2.3 for any reason.

@albertz
Member

albertz commented Nov 24, 2021

But is the gradient still the same (within some threshold) for the first mini batch? I.e. the difference accumulates slowly over time? Or is there some specific later step where it is clearly different, and from there on it becomes different? Or is it already different in the first mini batch? And where exactly is it different? Already the error signal from the loss? Or at what layer does it become different in backprop?

@JackTemaki
Collaborator Author

JackTemaki commented Nov 25, 2021

> But is the gradient still the same (within some threshold) for the first mini batch? I.e. the difference accumulates slowly over time? Or is there some specific later step where it is clearly different, and from there on it becomes different? Or is it already different in the first mini batch? And where exactly is it different? Already the error signal from the loss? Or at what layer does it become different in backprop?

The training logs show that the error signal is identical (the first minibatch shows the same loss, so before the first gradient everything is identical), and for the second minibatch there is already a noticeable difference.

I will try to come up with a toy task (artificial data, small network) that reproduces this.

For the last question I would need to produce shorter logs; you cannot see the raw values properly in TensorBoard as it is now, so I cannot tell whether in the first step the gradient differs only in e.g. the last LSTM layer but not in the linear transformation before the softmax.

@albertz
Member

albertz commented Nov 25, 2021

The same loss does not mean that the gradients are identical. Did you directly check the gradients (not just the norm)?

@albertz
Member

albertz commented Nov 25, 2021

I would not just look at TensorBoard. I would write a small script which dumps the gradients to some file, maybe after each mini batch, maybe also other information, and then investigate that by hand.
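
As an illustration, a minimal sketch of such a dump script for a plain TF2/Keras setup (model, dataset and loss_fn are placeholders here, not the RETURNN/RASR setup from this issue):

import re
import numpy as np
import tensorflow as tf

def dump_gradients(model, dataset, loss_fn, num_steps, out_prefix):
    # Dump the loss and the gradient of every trainable variable for the
    # first num_steps mini-batches, one .npz file per step.
    for step, (x, y) in enumerate(dataset.take(num_steps)):
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            loss = tf.reduce_mean(loss_fn(y, logits))
        grads = tape.gradient(loss, model.trainable_variables)
        arrays = {"loss": loss.numpy()}
        for var, grad in zip(model.trainable_variables, grads):
            key = re.sub(r"[^0-9A-Za-z_]", "_", var.name)  # sanitize npz keys
            arrays[key] = grad.numpy()
        np.savez("%s_step%03d.npz" % (out_prefix, step), **arrays)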

@JackTemaki
Collaborator Author

JackTemaki commented Nov 25, 2021

> The same loss does not mean that the gradients are identical. Did you directly check the gradients (not just the norm)?

It does not, and I did not say that. I said that the first entry (the first minibatch, before any update) shows identical loss, which means that the loss computation itself is not flawed.

The second entry shows diverging loss, so the first update was already different. To see where exactly it differs, I would indeed need to dump the gradients.

@albertz
Member

albertz commented Nov 25, 2021

How different is the loss in the second step? Or how much do the gradients really differ in the first step?

I would assume that the first gradients are still all very similar (up to some threshold) and the differences just accumulate over time.

This is then maybe not even a bug. Just different behavior in different TF versions, and you have bad luck here that this different behavior leads to convergence in one case but not in the other.

However, since this is now a problem for you, you need to investigate further what exactly differs in the behavior and understand it better. Maybe replicate the old behavior somehow, or find another way.

@albertz
Member

albertz commented Nov 25, 2021

If the first gradients are already different, that's good. Because that means that it will be easy to debug. Then you should check at what point it becomes different. E.g. already at the loss gradient, or at some layer?
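
A per-variable comparison of two such gradient dumps (file names are hypothetical, following the dump sketch further above) could then look like this:

import numpy as np

a = np.load("tf1_grads_step000.npz")
b = np.load("tf2_grads_step000.npz")
for name in sorted(set(a.files) & set(b.files)):
    diff = np.max(np.abs(a[name] - b[name]))
    print("%-60s max abs diff: %.3e" % (name, diff))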

[Two comments by @JackTemaki have been minimized.]

@albertz
Member

albertz commented Nov 25, 2021

Ah interesting.

Can you disable Adam to check that Adam or some Adam-specific behavior is not causing this? Just use standard SGD.

I'm not sure if this is maybe still within numerical fluctuations...

You can easily directly check the gradient of the loss then (w.r.t. the logits).

Maybe TF accumulates them differently between TF 1.15 and TF 2.3. For the bias, the gradient is accumulated over the time frames (hopefully correctly discarding the padded frames as well).
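
For illustration, a sketch of checking the error signal directly, i.e. the gradient of the loss w.r.t. the logits, and of the masked accumulation over time that the bias gradient corresponds to (plain TF2; the CTC call and shapes are illustrative and not taken from the RASR-based pipeline in this issue):

import tensorflow as tf

def loss_grad_wrt_logits(logits, labels, logit_lengths, label_lengths):
    # Gradient of a CTC loss w.r.t. the logits, independent of any optimizer.
    # logits: [batch, time, num_labels], labels: [batch, max_label_len] (dense).
    with tf.GradientTape() as tape:
        tape.watch(logits)
        loss = tf.nn.ctc_loss(
            labels=labels, logits=logits,
            label_length=label_lengths, logit_length=logit_lengths,
            logits_time_major=False, blank_index=-1)
        total_loss = tf.reduce_sum(loss)
    return tape.gradient(total_loss, logits)  # error signal, same shape as logits

def bias_grad_from_error_signal(err, logit_lengths):
    # For a final linear layer, the bias gradient is the error signal summed
    # over batch and time, with padded frames masked out.
    mask = tf.sequence_mask(logit_lengths, maxlen=tf.shape(err)[1], dtype=err.dtype)
    return tf.reduce_sum(err * mask[:, :, None], axis=[0, 1])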

@JackTemaki
Collaborator Author

Okay, I did something wrong while managing my debug configs. The cause is most likely that the NAdam implementation changed between TF1.15 and TF2.3.

@JackTemaki
Collaborator Author

So yes, with TF2.3 the model also converges nicely when switching to Adam. I had already seen the NAdam discrepancy yesterday, but I thought I had switched to Adam to exclude exactly this mismatch.

The difference is the following implementation change:
Native CUDA kernel in TF1.15 (tensorflow/tensorflow/core/kernels/training_ops_gpu.cu.cc):

template <typename T>
__global__ void ApplyAdamKernel(int32 data_dim, T* var, T* m, T* v,
                                const T* const beta1_power_,
                                const T* const beta2_power_, const T* const lr_,
                                const T* const beta1_, const T* const beta2_,
                                const T* const epsilon_, const T* grad,
                                bool use_nesterov) {
  eigen_assert(blockDim.y == 1);
  eigen_assert(blockDim.z == 1);
  eigen_assert(gridDim.y == 1);
  eigen_assert(gridDim.z == 1);

  const T mul_factor = (*lr_) * sqrt(static_cast<T>(1.0) - (*beta2_power_)) /
                       (static_cast<T>(1.0) - (*beta1_power_));
  const T epsilon = (*epsilon_);
  const T beta1 = (*beta1_);
  const T one_minus_beta1 = static_cast<T>(1.0) - (beta1);
  const T one_minus_beta2 = static_cast<T>(1.0) - (*beta2_);
  const int32 stripe = gridDim.x * blockDim.x;

  for (int32 i = blockIdx.x * blockDim.x + threadIdx.x; i < data_dim;
       i += stripe) {
    auto m_i = m[i];
    auto g_i = grad[i];
    auto v_i = v[i];

    m_i += one_minus_beta1 * (g_i - m_i);
    v_i += one_minus_beta2 * (g_i * g_i - v_i);
    if (use_nesterov) {
      var[i] -= mul_factor * (m_i * beta1 + one_minus_beta1 * g_i) /
                (epsilon + sqrt(v_i));
    } else {
      var[i] -= mul_factor * m_i / (epsilon + sqrt(v_i));
    }

    m[i] = m_i;
    v[i] = v_i;
  }
}

Code in TF2.3 / Keras (keras/optimizer_v2/nadam.py):

  def _resource_apply_dense(self, grad, var, apply_state=None):
    var_device, var_dtype = var.device, var.dtype.base_dtype
    coefficients = ((apply_state or {}).get((var_device, var_dtype))
                    or self._fallback_apply_state(var_device, var_dtype))

    m = self.get_slot(var, 'm')
    v = self.get_slot(var, 'v')

    g_prime = grad / coefficients['one_minus_m_schedule_new']
    m_t = (coefficients['beta_1_t'] * m +
           coefficients['one_minus_beta_1_t'] * grad)
    m_t = tf.compat.v1.assign(m, m_t, use_locking=self._use_locking)
    m_t_prime = m_t / coefficients['one_minus_m_schedule_next']
    v_t = (coefficients['beta_2_t'] * v +
           coefficients['one_minus_beta_2_t'] * tf.square(grad))
    v_t = tf.compat.v1.assign(v, v_t, use_locking=self._use_locking)
    v_t_prime = v_t / coefficients['v_t_prime_denominator']
    m_t_bar = (coefficients['one_minus_m_t'] * g_prime +
               coefficients['m_t_1'] * m_t_prime)
    var_t = var - coefficients['lr_t'] * m_t_bar / (
        tf.sqrt(v_t_prime) + coefficients['epsilon'])
    return tf.compat.v1.assign(var, var_t, use_locking=self._use_locking).op

The differences are (see the sketch below):

  • different computation of the LR / multiplication scale (mul_factor)
  • numerically different computation of m_t / m_i and v_t / v_i
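
To make the difference concrete, here is a NumPy transcription of the two per-element update rules quoted above (a sketch only; all coefficients are passed in explicitly, and the Keras momentum-schedule coefficients such as one_minus_m_schedule_new come from Nadam._prepare_local and have no counterpart in the TF1.15 kernel):

import numpy as np

def tf1_apply_adam_nesterov(var, m, v, g, lr, beta1, beta2, eps,
                            beta1_power, beta2_power):
    # TF1.15 ApplyAdamKernel with use_nesterov=True: both bias corrections are
    # folded into mul_factor, which is applied to the uncorrected m and v.
    mul_factor = lr * np.sqrt(1.0 - beta2_power) / (1.0 - beta1_power)
    m = m + (1.0 - beta1) * (g - m)      # m = beta1 * m + (1 - beta1) * g
    v = v + (1.0 - beta2) * (g * g - v)  # v = beta2 * v + (1 - beta2) * g^2
    var = var - mul_factor * (beta1 * m + (1.0 - beta1) * g) / (eps + np.sqrt(v))
    return var, m, v

def tf2_keras_nadam(var, m, v, g, lr, beta1, beta2, eps,
                    one_minus_m_schedule_new, one_minus_m_schedule_next,
                    one_minus_m_t, m_t_1, v_t_prime_denominator):
    # keras/optimizer_v2/nadam.py _resource_apply_dense: m and v get their own
    # corrections (m via the momentum schedule, v via v_t_prime_denominator).
    g_prime = g / one_minus_m_schedule_new
    m = beta1 * m + (1.0 - beta1) * g
    m_t_prime = m / one_minus_m_schedule_next
    v = beta2 * v + (1.0 - beta2) * g * g
    v_t_prime = v / v_t_prime_denominator
    m_t_bar = one_minus_m_t * g_prime + m_t_1 * m_t_prime
    var = var - lr * m_t_bar / (np.sqrt(v_t_prime) + eps)
    return var, m, v

Even in the very first step these two rules generally do not produce the same update, which would match the observation above that the losses already diverge at the second mini-batch.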

@albertz changed the title from "Different Gradients when switching from TF1.15 to TF2.3" to "NAdam different between TF1.15 and TF2.3" on Nov 25, 2021
@albertz
Member

albertz commented Nov 25, 2021

I reported the problem for TF here: tensorflow/tensorflow#53204

albertz added a commit that referenced this issue Nov 25, 2021
The behavior is slightly different to the Keras Nadam optimizer.

Fix #766.
tensorflow/tensorflow#53204