multi-gpu-dataparallel-cls.py fails with inf loss #4
Comments
I don't know what the cause is either.
Thanks, I'll keep looking into it and will post here if I solve it.
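Setup: 4 GPUs, Ubuntu
Background: all the other training scripts in this project run and reproduce successfully, so the framework environment should be fine. Only the script below fails.
Command: CUDA_VISIBLE_DEVICES=0,1 python multi-gpu-dataparallel-cls.py
Error output: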
warnings.warn('Was asked to gather along dimension 0, but all '
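What I have tried so far:
I printed the results of these two lines, `logits, label = self.on_step(batch_data)` and `loss = self.criterion(logits, label)`, then computed the loss by hand from the printed values and confirmed that it really is inf.
I then ran the same input through the model on a single GPU (data parallelism disabled), and the loss was normal.
Finally, I ran the same input through the model on two GPUs (data parallelism enabled), and the loss was inf again.
I still haven't been able to find the cause. Any pointers would be appreciated.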
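For anyone repeating the hand check above, here is a minimal sketch of one way to do it in plain Python. It assumes the criterion is a standard cross-entropy, which this issue does not confirm, and the logits below are made-up placeholders rather than values from the actual run. The idea is to first check whether the printed logits already contain a non-finite entry, and only then recompute the loss with a numerically stable log-sum-exp, so it is clear whether the inf comes from the model output or from the loss computation.

```python
import math


def find_non_finite(values):
    """Return the indices of entries that are NaN or +/-inf."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def cross_entropy(logits, label):
    """Cross-entropy for one sample, using a stable log-sum-exp.

    Assumes `logits` are finite and `label` is a valid class index.
    """
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# Hypothetical values standing in for one row of printed logits.
single_gpu_logits = [2.0, -1.0, 0.5]
two_gpu_logits = [2.0, float("inf"), 0.5]
label = 0

for name, row in [("single GPU", single_gpu_logits), ("two GPUs", two_gpu_logits)]:
    bad = find_non_finite(row)
    if bad:
        # The inf is already in the model output, so the loss is not the culprit.
        print(f"{name}: non-finite logits at indices {bad}")
    else:
        print(f"{name}: loss = {cross_entropy(row, label)}")
```

If the two-GPU logits are already non-finite, the problem is upstream of the criterion (in the forward pass of one replica). If they are finite but the loss is still inf, the loss computation itself would be the next thing to inspect.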