
AMD ROCm: Conformer-transducer diverges #2551

Closed · asumagic opened this issue May 21, 2024 · 2 comments

Labels: bug (Something isn't working)

Comments

asumagic (Collaborator) commented May 21, 2024

Describe the bug

Opening this issue mostly for myself, to track and document the problem and to point others to it.

The LibriSpeech conformer-transducer recipe fails to train with the stock parameters. The loss remains high, and there are non-finite loss warnings every few batches (though not every batch).

The NaNs seem to appear even in the forward pass. This might be an upstream bug, but we do not currently support an alternative transducer loss that can be made to work with AMD ROCm (yet).

I have only tested with fp16 so far; bf16 cannot be tested since the torchaudio transducer loss does not support it.

The NaN loss seems to originate from the transducer loss, but that loss is not necessarily the direct culprit.
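
For what it's worth, this is the kind of minimal standalone check I have in mind to see whether the torchaudio transducer loss kernel itself misbehaves on this ROCm build. Everything here (shapes, the blank index, the random inputs) is made up for illustration and is not taken from the recipe:

```python
import torch
import torchaudio

device = torch.device("cuda")  # one MI250X GCD through ROCm/HIP
B, T, U, V = 8, 200, 25, 1000  # batch, encoder frames, target length, vocab size

# fp32 leaf tensor so we can inspect the gradient flowing back from the fp16 loss.
logits = torch.randn(B, T, U + 1, V, device=device, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32, device=device)  # blank (0) excluded
logit_lengths = torch.full((B,), T, dtype=torch.int32, device=device)
target_lengths = torch.full((B,), U, dtype=torch.int32, device=device)

# Mimic fp16 training: the joint network output reaches the loss in half precision.
loss = torchaudio.functional.rnnt_loss(
    logits.half(), targets, logit_lengths, target_lengths, blank=0
)
loss.backward()

print("loss finite:", torch.isfinite(loss).item())
print("non-finite grad entries:", (~torch.isfinite(logits.grad)).sum().item())
```

With random inputs this is unlikely to blow up on its own; the point is to have a small harness for bisecting between the RNN-T kernel and the rest of the forward pass.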

The non-finite loss doesn't seem to reproduce:

  • On the LibriSpeech conformer-large CTC recipe
  • With fp32: training seems okay so far, though oddly there are still a few "Patience not yet exhausted." messages (SpeechBrain's warning for skipped non-finite-loss batches).

Expected behaviour

The model should train successfully.

To Reproduce

Train the model on an MI250X (e.g. on Adastra) with fp16 and the stock parameters.

Environment Details

Latest develop.

Reproduces on 1x and 8x MI250X GCDs. Running on CINES' Adastra.

Custom-compiled torchaudio from ROCm/audio's release/2.1_add_rnnt branch (I can provide instructions if anyone is interested).

Relevant Log Output

No response

Additional Context

No response

asumagic added the bug label on May 21, 2024
asumagic (Collaborator, Author) commented:

If this doesn't reproduce with CTC alone (still need to check), one option would be to use #1465 and to fork k2's fast_rnnt with HIPify support.

Maybe this is related to the issues we were encountering in #2533, but I doubt it.

asumagic (Collaborator, Author) commented:

I am currently poking around the pruned RNN-T loss and have managed to get something to start converging, so this is not a catastrophic issue.
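
For context, by pruned RNN-T loss I mean the two-pass scheme from k2's fast_rnnt mentioned above: a cheap "simple" loss over low-rank projections yields pruning bounds, and the full joint network plus the exact loss are only evaluated inside those bounds. A rough sketch of how the pieces fit together, with made-up shapes, projections, and a stand-in joiner (none of this is the recipe's actual code):

```python
import torch
import torch.nn as nn
import fast_rnnt

B, T, U, D, V = 8, 200, 25, 512, 1000  # batch, frames, target length, model dim, vocab size
blank_id, s_range = 0, 5               # s_range: symbols kept per frame after pruning

am = torch.randn(B, T, D)              # stand-in encoder output
lm = torch.randn(B, U + 1, D)          # stand-in prediction-network output
symbols = torch.randint(1, V, (B, U))  # targets, blank (0) excluded
boundary = torch.tensor([[0, 0, U, T]] * B, dtype=torch.int64)  # [0, 0, num_symbols, num_frames]

simple_am_proj = nn.Linear(D, V)       # hypothetical projections to vocab size for the simple loss
simple_lm_proj = nn.Linear(D, V)
joiner_proj = nn.Linear(D, V)          # stand-in joint network

# Pass 1: cheap "simple" loss whose gradients indicate which (t, u) cells matter.
simple_loss, (px_grad, py_grad) = fast_rnnt.rnnt_loss_simple(
    lm=simple_lm_proj(lm), am=simple_am_proj(am), symbols=symbols,
    termination_symbol=blank_id, boundary=boundary,
    reduction="sum", return_grad=True,
)

# Derive pruning bounds from those gradients.
ranges = fast_rnnt.get_rnnt_prune_ranges(
    px_grad=px_grad, py_grad=py_grad, boundary=boundary, s_range=s_range
)

# Pass 2: run the joiner and the exact loss only inside the pruned region.
am_pruned, lm_pruned = fast_rnnt.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)
logits = joiner_proj(am_pruned + lm_pruned)  # (B, T, s_range, V)
pruned_loss = fast_rnnt.rnnt_loss_pruned(
    logits=logits, symbols=symbols, ranges=ranges,
    termination_symbol=blank_id, boundary=boundary, reduction="sum",
)

loss = 0.5 * simple_loss + pruned_loss  # illustrative weighting, not the recipe's
```

Because the exact loss only touches a narrow band of (t, u) cells, the joint tensor stays far smaller than the full (B, T, U+1, V) lattice, which also changes the numerics compared to the torchaudio loss.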

asumagic closed this as not planned on May 30, 2024.