FloatingPointError: Minimum loss scale reached (0.0001). #1529

Closed
KelleyYin opened this issue Dec 20, 2019 · 1 comment

@KelleyYin commented Dec 20, 2019

🐛 Bug

Hi, guys.
I am running into the same issue as #515.
I have tried a few things, such as reducing the learning rate and increasing the batch size, but none of them solved the problem.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

My training command is as follows:

export CUDA_VISIBLE_DEVICES=4,5,6,7
python train.py $data_bin \
      -s zh -t en \
      --lr 0.0005 --min-lr 1e-09 \
      --weight-decay 0 --clip-norm 0.0 \
      --dropout 0.3 \
      --max-tokens 30000 \
      --arch transformer \
      --optimizer adam --adam-betas '(0.9, 0.98)' \
      --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
      --warmup-updates 4000 \
      --ddp-backend=no_c10d \
      --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
      --save-dir $checkpoints \
      --fp16 

Training log:

| epoch 028 | loss 3.976 | nll_loss 2.204 | ppl 4.61 | wps 279460 | ups 3 | wpb 110529.673 | bsz 3780.211 | num_updates 9165 | lr 0.000330319 | gnorm 0.635 | clip 0.000 | oom 0.000 | loss_scale 0.000 | wall 4043 | train_wall 3606
| epoch 028 | valid on 'valid' subset | loss 9.329 | nll_loss 8.088 | ppl 272.04 | num_updates 9165 | best_loss 9.11817
| saved checkpoint ./checkpoints/LDC_zh-en_32k_fp16/checkpoint28.pt (epoch 28 @ 9165 updates) (writing took 7.268606901168823 seconds)
| WARNING: overflow detected, setting loss scale to: 0.0001220703125
| epoch 029 | loss 3.979 | nll_loss 2.176 | ppl 4.52 | wps 280188 | ups 3 | wpb 110521.492 | bsz 3796.456 | num_updates 9492 | lr 0.00032458 | gnorm 0.652 | clip 0.000 | oom 0.000 | loss_scale 0.000 | wall 4189 | train_wall 3734
| epoch 029 | valid on 'valid' subset | loss 9.580 | nll_loss 8.300 | ppl 315.24 | num_updates 9492 | best_loss 9.11817
| saved checkpoint ./checkpoints/LDC_zh-en_32k_fp16/checkpoint29.pt (epoch 29 @ 9492 updates) (writing took 7.078469276428223 seconds)
Traceback (most recent call last):
  File "train.py", line 337, in <module>
    cli_main()
  File "train.py", line 329, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/duantea/anaconda3/envs/torch1.2cuda10/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/duantea/anaconda3/envs/torch1.2cuda10/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/duantea/anaconda3/envs/torch1.2cuda10/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/duantea/mmyin/fairseq/train.py", line 296, in distributed_main
    main(args, init_distributed=True)
  File "/home/duantea/mmyin/fairseq/train.py", line 86, in main
    train(args, trainer, task, epoch_itr)
  File "/home/duantea/mmyin/fairseq/train.py", line 127, in train
    log_output = trainer.train_step(samples)
  File "/home/duantea/mmyin/fairseq/fairseq/trainer.py", line 437, in train_step
    grad_norm = self.optimizer.clip_grad_norm(self.args.clip_norm)
  File "/home/duantea/mmyin/fairseq/fairseq/optim/fp16_optimizer.py", line 146, in clip_grad_norm
    ).format(self.min_loss_scale))
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.


Environment

  • fairseq Version (e.g., 1.0 or master): master
  • PyTorch Version (e.g., 1.0): v1.3
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA-v10.1
  • GPU models and configuration: V100-32G
  • Any other relevant information:

@myleott (Contributor) commented Dec 20, 2019

The loss is overflowing repeatedly, which causes batches to be thrown away. fairseq eventually terminates training so that you don't waste computation indefinitely.
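
For context, here is a minimal sketch of how dynamic loss scaling behaves (a simplified illustration, not fairseq's actual FP16Optimizer code), assuming the default --fp16-init-scale of 128 and the default --min-loss-scale of 0.0001: each overflowing batch is discarded and the scale is halved, so a long run of overflows walks the scale down to the 0.0001220703125 seen in your log and then past the minimum.

    # Simplified sketch of dynamic loss scaling (assumed defaults:
    # --fp16-init-scale 128, --min-loss-scale 0.0001).
    loss_scale = 128.0
    min_loss_scale = 1e-4

    def on_overflow():
        """A batch produced inf/nan gradients: the batch is dropped and the scale is halved."""
        global loss_scale
        loss_scale /= 2
        if loss_scale < min_loss_scale:
            raise FloatingPointError(
                "Minimum loss scale reached ({}).".format(min_loss_scale)
            )

    # 20 consecutive overflows give 128 / 2**20 == 0.0001220703125 (the value in
    # the log above); one more overflow drops below 1e-4 and training aborts.
    try:
        for _ in range(21):
            on_overflow()
    except FloatingPointError as e:
        print(e)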

There are a few options:

  1. --fp16-scale-tolerance=0.25: Allow some tolerance before decreasing the loss scale. This setting allows one out of every four updates to overflow before the loss scale is lowered. I'd recommend trying this first (see the example command after this list).
  2. --min-loss-scale=0.5: Prevent the loss scale from going below a certain value (in this case 0.5). Note that this could waste a lot of computation -- we may throw away a lot of batches due to overflow and not make any progress on training.
  3. Further decrease the learning rate.
  4. Switch to FP32 training.
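
For example, option 1 applied to the command from the report would look roughly like this (only the last flag is new; --min-loss-scale=0.5 from option 2 could be appended the same way):

    python train.py $data_bin \
          -s zh -t en \
          --lr 0.0005 --min-lr 1e-09 \
          --weight-decay 0 --clip-norm 0.0 \
          --dropout 0.3 \
          --max-tokens 30000 \
          --arch transformer \
          --optimizer adam --adam-betas '(0.9, 0.98)' \
          --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
          --warmup-updates 4000 \
          --ddp-backend=no_c10d \
          --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
          --save-dir $checkpoints \
          --fp16 \
          --fp16-scale-tolerance=0.25
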
@myleott closed this Dec 20, 2019