Resume training from checkpoint result in NaN? #87

Approximetal · 2020-09-23T10:17:51Z

If I resume training by loading a checkpoint, NaN values appear in the forward pass.
They first show up at this line:
output = self.w_2(F.relu(self.w_1(output)))
but they sometimes appear in other places as well.
Here are the tensor values before and after this line:

pos_ffn: before tensor([[[ 8.4082e-01, -1.4385e+00, -1.0504e-01,  ...,  7.0752e-01,
          -1.2129e+00,  7.7100e-01],
         [ 1.0547e+00, -1.8789e+00, -2.2241e-01,  ..., -4.3579e-01,
           2.4097e-01,  9.2139e-01],
         [-2.6123e-01, -2.6914e+00, -1.7651e-01,  ...,  5.9961e+00,
          -1.1836e+00,  1.0342e+00],
         ...,
         [ 5.3833e-02, -1.3711e+00,  1.1494e+00,  ...,  1.4092e+00,
          -7.7295e-01,  3.0957e+00],
         [-3.6963e-01, -1.0107e+00, -1.1016e+00,  ..., -1.1743e-01,
          -5.3125e-01,  1.1270e+00],
         [-6.2744e-01, -6.1230e-01,  6.5527e-01,  ...,  1.2744e+00,
          -5.0439e-01,  1.5063e-01]],

        [[ 2.3364e-01, -5.9277e-01, -4.2285e-01,  ..., -2.7808e-01,
          -5.5908e-01,  2.1426e+00],
         [ 1.1465e+00, -1.4453e+00, -2.9248e-01,  ..., -2.5269e-02,
           5.4053e-01,  7.4902e-01],
         [ 3.4326e-01, -1.2061e+00, -8.5400e-01,  ..., -6.9238e-01,
           3.3618e-01, -8.5388e-02],
         ...,
         [-0.0000e+00, -0.0000e+00, -0.0000e+00,  ..., -0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00, -0.0000e+00,  ..., -0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00, -0.0000e+00,  ..., -0.0000e+00,
           0.0000e+00, -0.0000e+00]],

        [[-4.7607e-02, -3.5669e-01, -2.1680e-01,  ...,  3.5059e-01,
          -7.7637e-01,  8.0225e-01],
         [-1.3892e-01, -1.9568e-01, -1.7261e-01,  ...,  1.2422e+00,
          -3.2471e-01,  1.4746e+00],
         [-7.5439e-02, -1.4395e+00, -8.2812e-01,  ...,  3.7168e+00,
          -2.3560e-02,  8.5449e-02],
         ...,
         [-0.0000e+00,  0.0000e+00, -0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [-0.0000e+00, -0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00, -0.0000e+00],
         [ 0.0000e+00, -0.0000e+00, -0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00, -0.0000e+00]],

        [[-2.4414e-04, -2.3193e-03,  3.2837e-01,  ...,  1.4004e+00,
          -6.9434e-01,  2.2578e+00],
         [ 4.5605e-01, -1.4121e+00,  8.1104e-01,  ...,  1.1855e+00,
          -5.8447e-01,  1.4521e+00],
         [-3.2275e-01, -9.7461e-01,  2.0630e-01,  ...,  2.1460e-01,
          -6.7432e-01,  2.7500e+00],
         ...,
         [-0.0000e+00, -0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00, -0.0000e+00, -0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00, -0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00, -0.0000e+00]]], device='cuda:0', dtype=torch.float16,
       grad_fn=<MulBackward0>)

pos_ffn: after w1, w2 tensor([[[-5.7184e+04,        -inf,        -inf,  ...,        -inf,
          -6.5152e+04, -6.1216e+04],
         [ 6.3760e+03,  1.0024e+04,  1.1752e+04,  ...,  8.5680e+03,
           5.6360e+03,  5.5480e+03],
         [-1.4960e+04, -1.8672e+04, -2.1344e+04,  ..., -2.4176e+04,
          -1.6688e+04, -1.5432e+04],
         ...,
         [ 2.1760e+04,  1.9536e+04,  3.3152e+04,  ...,  2.2080e+04,
           2.1488e+04,  2.0080e+04],
         [ 2.9792e+04,  3.4048e+04,  4.5632e+04,  ...,  4.6624e+04,
           4.0672e+04,  4.0096e+04],
         [ 5.9040e+03,  7.7200e+02,  2.2304e+04,  ..., -5.2680e+03,
           4.4000e+03,  9.1360e+03]],

        [[-5.3216e+04,        -inf,        -inf,  ..., -1.6350e+02,
          -1.6350e+02, -1.6350e+02],
         [ 6.4600e+03,  7.5320e+03,  1.1776e+04,  ...,  1.6156e+01,
           1.6156e+01,  1.6156e+01],
         [-1.1384e+04, -2.8704e+04, -2.7088e+04,  ..., -5.2375e+01,
          -5.2375e+01, -5.2375e+01],
         ...,
         [ 2.4736e+04,  2.2448e+04,  2.1840e+04,  ...,  7.9500e+01,
           7.9500e+01,  7.9500e+01],
         [ 2.3728e+04,  2.9952e+04,  4.0864e+04,  ...,  9.1500e+01,
           9.1500e+01,  9.1500e+01],
         [ 9.2000e+03, -3.2140e+03,  6.8120e+03,  ...,  3.4031e+01,
           3.4031e+01,  3.4031e+01]],

        [[-6.2880e+04,        -inf,        -inf,  ..., -1.6350e+02,
          -1.6350e+02, -1.6350e+02],
         [ 1.1512e+04,  1.2672e+04,  1.4088e+04,  ...,  1.6156e+01,
           1.6156e+01,  1.6156e+01],
         [-2.2048e+04, -3.3664e+04, -2.5120e+04,  ..., -5.2375e+01,
          -5.2375e+01, -5.2375e+01],
         ...,
         [ 1.3288e+04,  1.4744e+04,  3.3376e+04,  ...,  7.9500e+01,
           7.9500e+01,  7.9500e+01],
         [ 3.0048e+04,  2.9840e+04,  4.2688e+04,  ...,  9.1500e+01,
           9.1500e+01,  9.1500e+01],
         [ 1.1912e+04, -3.1400e+03,  2.3280e+04,  ...,  3.4031e+01,
           3.4031e+01,  3.4031e+01]],

        [[-6.3872e+04, -6.1344e+04,        -inf,  ..., -1.6350e+02,
          -1.6350e+02, -1.6350e+02],
         [ 8.3840e+03,  6.2280e+03,  1.4944e+04,  ...,  1.6156e+01,
           1.6156e+01,  1.6156e+01],
         [-1.9136e+04, -1.0264e+04, -2.0544e+04,  ..., -5.2375e+01,
          -5.2375e+01, -5.2375e+01],
         ...,
         [ 2.5040e+04,  2.1520e+04,  1.7504e+04,  ...,  7.9500e+01,
           7.9500e+01,  7.9500e+01],
         [ 2.6064e+04,  3.1792e+04,  4.3488e+04,  ...,  9.1500e+01,
           9.1500e+01,  9.1500e+01],
         [ 1.0384e+04,  2.0110e+03,  1.0344e+04,  ...,  3.4031e+01,
           3.4031e+01,  3.4031e+01]]], device='cuda:0', dtype=torch.float16,
       grad_fn=<SqueezeBackward1>)
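
The magnitudes in the second dump (around 6e4) sit close to 65504, the largest finite float16 value, so the -inf entries are consistent with an overflow in half precision. Since the problem also shows up in other places, a forward hook can report which submodule first produces a non-finite output. The following is only a minimal sketch: `ToyFFN` and `install_nonfinite_hooks` are hypothetical names, not part of this repository, and the toy runs in float32 on CPU with deliberately exaggerated weights so that it overflows without needing a GPU or APEX.

```python
import torch
import torch.nn as nn


def install_nonfinite_hooks(model):
    """Record the names of submodules whose forward output contains inf or NaN.

    Hypothetical helper for illustration; returns the list that will be filled
    during forward passes and the hook handles (so the hooks can be removed).
    """
    offenders = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                offenders.append(name)
        return hook

    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
        if name  # skip the root module, whose name is the empty string
    ]
    return offenders, handles


class ToyFFN(nn.Module):
    """Stand-in for the position-wise feed-forward block (assumed shape only)."""

    def __init__(self):
        super().__init__()
        self.w_1 = nn.Linear(4, 8)
        self.w_2 = nn.Linear(8, 4)

    def forward(self, x):
        return self.w_2(torch.relu(self.w_1(x)))


model = ToyFFN()
# Exaggerated weights: w_1 stays finite in float32, w_2 overflows to inf.
for layer in (model.w_1, model.w_2):
    nn.init.constant_(layer.weight, 1e20)
    nn.init.zeros_(layer.bias)

offenders, handles = install_nonfinite_hooks(model)
with torch.no_grad():
    model(torch.ones(2, 4))

for handle in handles:
    handle.remove()

print("largest finite float16:", torch.finfo(torch.float16).max)
print("first non-finite output in:", offenders[0] if offenders else None)
```

In a real run the hooks would be installed on the actual model right after the checkpoint is loaded, and the first name recorded indicates where to look.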

Approximetal · 2020-09-23T11:03:51Z

This appears to be a PyTorch and APEX issue:
NVIDIA/apex#651

Approximetal changed the title ~~Continue training from checkpoint result in NaN?~~ Resume training from checkpoint result in NaN? Sep 23, 2020

Approximetal closed this as completed Sep 23, 2020
