
ctc_loss problem when running on multiple GPUs #38

Closed

gmh8000 opened this issue Jul 8, 2021 · 3 comments

Comments


gmh8000 commented Jul 8, 2021

Hi, I didn't hit any bugs when training the AV model on a single GPU, but when I try to train AV on multiple GPUs, ctc_loss raises the following error:

Traceback (most recent call last):
  File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/train.py", line 162, in <module>
    main()
  File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/train.py", line 111, in main
    trainingLoss, trainingCER, trainingWER = train(model, trainLoader, optimizer, loss_function, device, trainParams)
  File "/home/cca01/work2020/guominghao/Deep_AV_ASR/audio_visual/utils/general.py", line 74, in train
    loss = loss_function(outputBatch, targetBatch, inputLenBatch, targetLenBatch)
  File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 1295, in forward
    self.zero_infinity)
  File "/home/cca01/work2020/guominghao/.conda/envs/Deep_AV_PYTORCH/lib/python3.6/site-packages/torch/nn/functional.py", line 1767, in ctc_loss
    zero_infinity)
RuntimeError: Expected tensor to have size at least 660 at dimension 1, but got size 1474 for argument #2 'targets' (while checking arguments for ctc_loss_gpu)

Have you ever encountered this problem?
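For context on what the error is checking: torch.nn.CTCLoss expects log-probabilities shaped (T, N, C), i.e. time-major with the batch on dimension 1, while targets is batch-major. A minimal sketch of the expected shapes (the sizes T, N, C, S here are made-up illustration values, not from the repo):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration: T time steps, N batch items,
# C classes (including the blank), S target length per item.
T, N, C, S = 50, 4, 40, 10

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# log_probs must be (T, N, C): the batch lives on dimension 1.
log_probs = torch.randn(T, N, C).log_softmax(2)

# targets is (N, S) with labels in [1, C) since 0 is the blank.
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

If the batch ends up split along the wrong axis (e.g. the time axis of a (T, N, C) tensor), the per-GPU chunks no longer line up with targets, which matches the size mismatch in the traceback above.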

@gmh8000 gmh8000 closed this as completed Jul 8, 2021
smeetrs (Owner) commented Jul 8, 2021

Kindly share the solution as it may help others.

@gmh8000 gmh8000 reopened this Jul 20, 2021
gmh8000 (Author) commented Jul 20, 2021

I fixed it by making DataParallel split the batch along the second dimension, which is where the batch sits in the time-major tensors, e.g. model = torch.nn.DataParallel(model, device_ids=[5,7], dim=1)
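A minimal sketch of that fix, assuming a model whose inputs and outputs are time-major (T, N, features) as the ctc_loss usage implies (the TimeMajorModel class and its sizes are hypothetical, not from the repo). DataParallel scatters inputs along dim=0 by default, which here would slice the time axis across GPUs; passing dim=1 makes it scatter along the batch axis instead:

```python
import torch

class TimeMajorModel(torch.nn.Module):
    """Hypothetical model taking and returning time-major tensors (T, N, *)."""

    def __init__(self, features=64, classes=40):
        super().__init__()
        self.proj = torch.nn.Linear(features, classes)

    def forward(self, x):           # x: (T, N, features)
        return self.proj(x)         # -> (T, N, classes)

model = TimeMajorModel()
if torch.cuda.device_count() > 1:
    # The issue used device_ids=[5, 7]; the key part is dim=1 so the
    # scatter/gather happens on the batch dimension, not the time axis.
    model = torch.nn.DataParallel(model, dim=1)

out = model(torch.randn(10, 2, 64))  # out: (10, 2, 40)
```

With dim=0 (the default), each replica would receive a chunk of time steps for the full batch, so the gathered output lengths no longer match the lengths passed to ctc_loss.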

@gmh8000 gmh8000 closed this as completed Jul 20, 2021
smeetrs (Owner) commented Jul 20, 2021

Thanks for sharing!
