
Multi-GPU parallel training error #44

Closed

cywjava opened this issue Mar 30, 2023 · 5 comments



cywjava commented Mar 30, 2023

I'm using 8 P40 cards here and told the training program to use cards 1, 2, 3, and 4.
RuntimeError: Caught RuntimeError in replica 0 on device 0.

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

On the second run it hangs right after loading the model and makes no further progress, with no error reported. The process can't be killed either, and it also affected the text-generation application running on card 0, so the only option at that point was to reboot.
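For context, restricting a training process to specific cards is usually done with CUDA_VISIBLE_DEVICES before CUDA is initialized; a minimal sketch (not from this repo) of keeping card 0 free for the generation service:

```python
import os

# Must be set before torch initializes CUDA, otherwise it has no effect.
# Card 0 stays reserved for the text-generation service mentioned above.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4"

import torch

# Within this process the physical cards 1-4 are re-indexed as cuda:0..cuda:3,
# so "device 0" in a traceback would then refer to physical card 1.
print(torch.cuda.device_count())  # expected output: 4
```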

yuanzhoulvpi2017 (Owner) commented:

1. In the old version the model could train on multiple GPUs, but with the new version of the code it can no longer run in parallel. I'm still debugging the bug.

cywjava (Author) commented Mar 31, 2023

> 1. In the old version the model could train on multiple GPUs, but with the new version of the code it can no longer run in parallel. I'm still debugging the bug.

Aha, that explains why your earlier code worked and the current one doesn't. Also, the earlier version did run on multiple cards, but apart from the first card running at full load, the others all seemed to be sitting idle.

yuanzhoulvpi2017 (Owner) commented Mar 31, 2023 via email

Chenzongchao commented:

okok

yuanzhoulvpi2017 (Owner) commented:

I've added single-machine multi-GPU training code; the link is here: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/Chatglm6b_ModelParallel

Code for training the chatglm6b model on a single machine with multiple GPUs using model parallelism.
By combining the LoRA algorithm, fp16 precision, and gradient checkpointing, it runs very comfortably on two T4 GPUs with a text length of 1024 and batch_size=4 (each card has at most 16 GB of memory, but in practice card 1 used 8 GB and card 2 used 11 GB), and the batch size could even be increased further.
