Multi-GPU parallel training error #44
Comments
Aha, so that's why the old version worked and the new one doesn't. Also, the old version could use multiple cards, but apart from the first card running at full load, the others all seemed to sit idle.
Yes, I'm still optimizing it. The first version of the parallelism wasn't elegant either; I'll keep revising!
Sent from my iPhone

On Mar 31, 2023, at 11:23, chenyiwan ***@***.***> wrote:
1. In the old version, the model could train on multiple GPUs, but with the new code it no longer runs in parallel. I'm still debugging,
okok
I've added single-machine multi-GPU training code; the link is here: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/Chatglm6b_ModelParallel — code for training the chatglm6b model with single-machine, multi-GPU model parallelism.
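A minimal sketch of the model-parallel idea behind that repository (this is not the repository's actual code; the two-stage split, layer sizes, and device names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Toy model parallelism: the first half lives on dev0, the
    second half on dev1, and forward() moves activations between them."""
    def __init__(self, dev0="cpu", dev1="cpu"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage0 = nn.Linear(16, 32).to(dev0)
        self.stage1 = nn.Linear(32, 8).to(dev1)

    def forward(self, x):
        x = torch.relu(self.stage0(x.to(self.dev0)))
        # hand the activation over to the second device
        return self.stage1(x.to(self.dev1))

if torch.cuda.device_count() >= 2:
    model = TwoStageModel("cuda:0", "cuda:1")
else:
    model = TwoStageModel()  # CPU fallback, for illustration only

out = model(torch.randn(4, 16))
print(out.shape)
```

Unlike DataParallel (which replicates the whole model on every card), this style splits the model itself, so each card holds only part of the weights; the trade-off is that the stages run sequentially per batch.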
I'm using 8 P40 cards here and told the training program to use cards 1, 2, 3, and 4.
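One common way to restrict a training script to cards 1–4 (an assumption about how the selection was done here; the variable must be set before CUDA is first initialized, i.e. before any `.cuda()` call or, most safely, before importing torch):

```python
import os

# After this, only physical cards 1-4 are visible to the process,
# and they are renumbered 0-3 inside the script.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4"

import torch
print(torch.cuda.device_count())  # 4 on the 8-card P40 machine
```

Setting it in the shell (`CUDA_VISIBLE_DEVICES=1,2,3,4 python train.py`) is equivalent and avoids ordering pitfalls entirely.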
```
RuntimeError: Caught RuntimeError in replica 0 on device 0.
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling
cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```
On the second run it hangs right after loading the model: no error is reported, the process cannot be killed, and it even affected the text-generation app running on card 0. The only way out was a reboot.
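The CUBLAS failure above occurs in a half-precision GEMM (`CUDA_R_16F`), which is consistent with the P40's very limited FP16 support (compute capability 6.1). A hedged sketch of choosing a dtype per card before loading the model (the sm_70 threshold and the float32 fallback are my assumptions, not a confirmed fix for this issue):

```python
import torch

def pick_dtype(device_index: int = 0) -> torch.dtype:
    """Use float16 only on cards with solid FP16 GEMM support
    (Volta / sm_70 and newer); otherwise fall back to float32."""
    if not torch.cuda.is_available():
        return torch.float32
    major, minor = torch.cuda.get_device_capability(device_index)
    return torch.float16 if (major, minor) >= (7, 0) else torch.float32

# e.g. model = model.to(dtype=pick_dtype())  # a P40 (sm_61) gets float32
```

Running in float32 doubles memory use, so on 24 GB P40s this may also require a smaller batch size or gradient checkpointing.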