With the Adam and Adadelta optimizers, when the model is close to convergence, the accuracy often drops suddenly to 0 and the perplexity becomes NaN, as shown below:
Epoch 3, 251750/348124; acc: 70.47; ppl: 3.77; 3911 tok/s; lr: 0.0010000; 717152.5 s elapsed
Epoch 3, 251800/348124; acc: 71.91; ppl: 3.53; 3796 tok/s; lr: 0.0010000; 717190.5 s elapsed
Epoch 3, 251850/348124; acc: 71.03; ppl: 3.58; 3752 tok/s; lr: 0.0010000; 717227.2 s elapsed
Epoch 3, 251900/348124; acc: 69.85; ppl: 3.86; 3830 tok/s; lr: 0.0010000; 717266.6 s elapsed
Epoch 3, 251950/348124; acc: 70.55; ppl: 3.73; 3930 tok/s; lr: 0.0010000; 717302.3 s elapsed
Epoch 3, 252000/348124; acc: 69.78; ppl: 4.03; 3912 tok/s; lr: 0.0010000; 717340.9 s elapsed
Epoch 3, 252050/348124; acc: 69.01; ppl: 4.18; 2699 tok/s; lr: 0.0010000; 717392.5 s elapsed
Epoch 3, 252100/348124; acc: 70.09; ppl: 3.90; 3935 tok/s; lr: 0.0010000; 717429.4 s elapsed
Epoch 3, 252150/348124; acc: 69.48; ppl: 4.18; 3758 tok/s; lr: 0.0010000; 717463.5 s elapsed
Epoch 3, 252200/348124; acc: 26.95; ppl: nan; 3753 tok/s; lr: 0.0010000; 717506.3 s elapsed
Epoch 3, 252250/348124; acc: 0.00; ppl: nan; 3925 tok/s; lr: 0.0010000; 717546.5 s elapsed
Epoch 3, 252300/348124; acc: 0.00; ppl: nan; 3822 tok/s; lr: 0.0010000; 717584.6 s elapsed
Epoch 3, 252350/348124; acc: 0.00; ppl: nan; 3813 tok/s; lr: 0.0010000; 717622.8 s elapsed
Epoch 3, 252400/348124; acc: 0.00; ppl: nan; 3677 tok/s; lr: 0.0010000; 717661.0 s elapsed
Epoch 3, 252450/348124; acc: 0.00; ppl: nan; 3999 tok/s; lr: 0.0010000; 717699.2 s elapsed
Epoch 3, 252500/348124; acc: 0.00; ppl: nan; 3939 tok/s; lr: 0.0010000; 717738.1 s elapsed
Epoch 3, 252550/348124; acc: 0.00; ppl: nan; 3872 tok/s; lr: 0.0010000; 717771.3 s elapsed
The code I ran is OpenNMT-py on a large dataset of 16M parallel sentences (United Nations Parallel Corpus v1.0). I have observed this with Adam and Adadelta, both of which involve a division in their update rule; so far I have not seen it with SGD. I suggest that the developers check for division by zero in the Adam and Adadelta optimizers, and probably in others as well.
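To make the suspicion concrete, here is a minimal plain-Python sketch of a single scalar Adam update with two guards added. It is not the OpenNMT-py or PyTorch implementation; the function name `adam_step` and the guards are hypothetical, and the defaults are only the commonly cited Adam hyperparameters. It shows where the division sits and how one could fail fast instead of letting a NaN propagate into the weights.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update; returns (new_param, new_m, new_v).

    Hypothetical helper for illustration only. `t` is the 1-based step count.
    """
    # Guard 1: a NaN/inf gradient poisons m and v permanently, so stop here.
    if not math.isfinite(grad):
        raise FloatingPointError(f"non-finite gradient at step {t}: {grad}")

    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # Bias correction for the zero-initialised averages.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # This is the division in question: eps keeps the denominator positive.
    denom = math.sqrt(v_hat) + eps
    # Guard 2: only reachable if eps is 0 and the squared-gradient average is 0.
    if denom == 0.0:
        raise ZeroDivisionError(f"Adam denominator is zero at step {t}")

    return param - lr * m_hat / denom, m, v


if __name__ == "__main__":
    p, m, v = 1.0, 0.0, 0.0
    for step, g in enumerate([1.0, 0.5, 0.0], start=1):
        p, m, v = adam_step(p, g, m, v, step)
        print(step, p)
```

In this sketch the denominator can only reach zero when `eps` is 0, so if the optimizer in use has a positive epsilon, the first guard (a non-finite gradient or loss arriving from the forward/backward pass) may be the more useful thing to check.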