
AMD ROCm: Conformer-transducer diverges #2551

Closed · asumagic opened this issue May 21, 2024 · 2 comments

Labels: bug (Something isn't working)

Comments

asumagic (Collaborator) commented May 21, 2024

Describe the bug

Opening this issue mostly for myself, to track and document the problem and to point others to it.

The LibriSpeech conformer-transducer recipe fails to train with the stock parameters. The loss remains high, and there are non-finite loss warnings every few batches (though not every batch).

The NaNs seem to appear even in the forward pass. This might be an upstream bug, but we do not currently support an alternative transducer loss that can be made to work with AMD ROCm (yet).

I have only tested with fp16 so far; bf16 cannot be tested since the torchaudio transducer loss does not support it.

The NaN loss seems to originate from the transducer loss, but that loss is not necessarily the direct culprit.
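
For what it's worth, this is the kind of minimal standalone check I have in mind to see whether the torchaudio transducer loss kernel itself misbehaves on this ROCm build. Everything here (shapes, the blank index, the random inputs) is made up for illustration and is not taken from the recipe:

```python
import torch
import torchaudio

device = torch.device("cuda")  # one MI250X GCD through ROCm/HIP
B, T, U, V = 8, 200, 25, 1000  # batch, encoder frames, target length, vocab size

# fp32 leaf tensor so we can inspect the gradient flowing back from the fp16 loss.
logits = torch.randn(B, T, U + 1, V, device=device, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32, device=device)  # blank (0) excluded
logit_lengths = torch.full((B,), T, dtype=torch.int32, device=device)
target_lengths = torch.full((B,), U, dtype=torch.int32, device=device)

# Mimic fp16 training: the joint network output reaches the loss in half precision.
loss = torchaudio.functional.rnnt_loss(
    logits.half(), targets, logit_lengths, target_lengths, blank=0
)
loss.backward()

print("loss finite:", torch.isfinite(loss).item())
print("non-finite grad entries:", (~torch.isfinite(logits.grad)).sum().item())
```

With random inputs this is unlikely to blow up on its own; the point is to have a small harness for bisecting between the RNN-T kernel and the rest of the forward pass.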

The non-finite loss doesn't seem to reproduce:

  • On the LibriSpeech conformer-large CTC recipe
  • With fp32: training seems okay so far, though oddly there are still a few "Patience not yet exhausted." messages (SpeechBrain's warning for skipped non-finite-loss batches).

Expected behaviour

The model should train successfully.

To Reproduce

Train the model on an MI250X (e.g. on Adastra) with fp16 and the stock parameters.

Environment Details

Latest develop.

Reproduces on 1x and 8x MI250X GCDs. Running on CINES' Adastra.

Custom-compiled torchaudio from ROCm/audio's release/2.1_add_rnnt branch (I can provide instructions if anyone is interested).

Relevant Log Output

No response

Additional Context

No response

asumagic added the bug label on May 21, 2024
asumagic (Collaborator, Author) commented:

If this doesn't reproduce with CTC alone (still need to check), one option would be to use #1465 and to fork k2's fast_rnnt with HIPify support.

Maybe this is related to the issues we were encountering in #2533, but I doubt it.

asumagic (Collaborator, Author) commented:

I am currently poking around the pruned RNN-T loss and have managed to get something to start converging, so this is not a catastrophic issue.
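
For context, by pruned RNN-T loss I mean the two-pass scheme from k2's fast_rnnt mentioned above: a cheap "simple" loss over low-rank projections yields pruning bounds, and the full joint network plus the exact loss are only evaluated inside those bounds. A rough sketch of how the pieces fit together, with made-up shapes, projections, and a stand-in joiner (none of this is the recipe's actual code):

```python
import torch
import torch.nn as nn
import fast_rnnt

B, T, U, D, V = 8, 200, 25, 512, 1000  # batch, frames, target length, model dim, vocab size
blank_id, s_range = 0, 5               # s_range: symbols kept per frame after pruning

am = torch.randn(B, T, D)              # stand-in encoder output
lm = torch.randn(B, U + 1, D)          # stand-in prediction-network output
symbols = torch.randint(1, V, (B, U))  # targets, blank (0) excluded
boundary = torch.tensor([[0, 0, U, T]] * B, dtype=torch.int64)  # [0, 0, num_symbols, num_frames]

simple_am_proj = nn.Linear(D, V)       # hypothetical projections to vocab size for the simple loss
simple_lm_proj = nn.Linear(D, V)
joiner_proj = nn.Linear(D, V)          # stand-in joint network

# Pass 1: cheap "simple" loss whose gradients indicate which (t, u) cells matter.
simple_loss, (px_grad, py_grad) = fast_rnnt.rnnt_loss_simple(
    lm=simple_lm_proj(lm), am=simple_am_proj(am), symbols=symbols,
    termination_symbol=blank_id, boundary=boundary,
    reduction="sum", return_grad=True,
)

# Derive pruning bounds from those gradients.
ranges = fast_rnnt.get_rnnt_prune_ranges(
    px_grad=px_grad, py_grad=py_grad, boundary=boundary, s_range=s_range
)

# Pass 2: run the joiner and the exact loss only inside the pruned region.
am_pruned, lm_pruned = fast_rnnt.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)
logits = joiner_proj(am_pruned + lm_pruned)  # (B, T, s_range, V)
pruned_loss = fast_rnnt.rnnt_loss_pruned(
    logits=logits, symbols=symbols, ranges=ranges,
    termination_symbol=blank_id, boundary=boundary, reduction="sum",
)

loss = 0.5 * simple_loss + pruned_loss  # illustrative weighting, not the recipe's
```

Because the exact loss only touches a narrow band of (t, u) cells, the joint tensor stays far smaller than the full (B, T, U+1, V) lattice, which also changes the numerics compared to the torchaudio loss.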

asumagic closed this as not planned on May 30, 2024.