Describe the bug
Making this issue mostly for myself, to track and document the problem and to point others to it.
The LibriSpeech conformer-transducer recipe fails to train with the stock parameters. The loss remains high, and there are non-finite loss warnings every few batches (though not on every batch).
The NaNs seem to appear even in the forward pass. This might be an upstream bug, but we don't support another RNN-T loss that can be made to work with AMD ROCm (yet).
Only tested with fp16 so far. bf16 cannot be tested since the torchaudio transducer does not support it.
The NaN loss seems to originate from the transducer loss, though that is not necessarily the direct culprit.
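To check whether the loss kernel alone misbehaves, here is a minimal probe one could run (a sketch, not taken from the recipe; batch/time/label/vocab sizes are arbitrary, and it assumes torchaudio's rnnt_loss accepts fp16 logits, which is the configuration being tested here):

```python
# Minimal self-contained probe: does torchaudio's RNN-T loss alone go
# non-finite in fp16 on a ROCm device? ROCm GPUs appear under the "cuda"
# device type in PyTorch. All sizes below are made up for illustration.
import torch
from torchaudio.functional import rnnt_loss

device = "cuda"
B, T, U, V = 4, 200, 30, 1024  # batch, time steps, target length, vocab size

logits = torch.randn(
    B, T, U + 1, V, device=device, dtype=torch.float16, requires_grad=True
)
targets = torch.randint(1, V, (B, U), dtype=torch.int32, device=device)
logit_lengths = torch.full((B,), T, dtype=torch.int32, device=device)
target_lengths = torch.full((B,), U, dtype=torch.int32, device=device)

loss = rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
print("loss finite:", torch.isfinite(loss).item())

# Also check the backward pass, since the NaNs may appear in either direction.
loss.backward()
print("grads finite:", torch.isfinite(logits.grad).all().item())
```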
The non-finite loss doesn't seem to reproduce:

- on the LibriSpeech conformer-large CTC recipe;
- with fp32, which seems okay so far, although oddly there are still a few "Patience not yet exhausted." messages (see the sketch below for the kind of guard that emits them).
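For context, that message comes from a non-finite-loss guard in the training loop. An illustrative sketch of the general pattern (names are mine, not SpeechBrain's actual API): batches with non-finite losses are skipped until a patience budget runs out, so these messages indicate occasional non-finite losses even when training otherwise proceeds.

```python
# Illustrative sketch (hypothetical names, not SpeechBrain's API) of a guard
# that produces "Patience not yet exhausted." warnings: non-finite batch
# losses are skipped until the patience budget is spent.
import math

class NonFiniteGuard:
    def __init__(self, patience: int = 3):
        self.patience = patience
        self.count = 0

    def check(self, loss_value: float) -> bool:
        """Return True if this batch's loss is usable for an optimizer step."""
        if math.isfinite(loss_value):
            return True
        self.count += 1
        if self.count <= self.patience:
            print("Patience not yet exhausted.")  # skip batch, keep training
            return False
        raise ValueError("Loss is not finite and patience is exhausted.")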
Expected behaviour
The model should train successfully.
To Reproduce
Train the model on an MI250X (e.g. on Adastra) with fp16 and stock parameters.
Environment Details
Latest develop.
Reproduces on 1x and 8x MI250X GCDs. Running on CINES' Adastra.
Custom-compiled torchaudio from the ROCm/audio release/2.1_add_rnnt branch (I can provide instructions if anyone is interested).
Relevant Log Output
No response
Additional Context
No response