Cannot run training program with CUDA 10.2 #220

t13m · 2020-01-23T13:15:36Z

Hi, I was trying to run Eesen in NVIDIA's Docker container, and it failed.

The container has CUDA 10.2 in it. Eesen compiles, but invoking "train-ctc-parallel" crashes with the following logs:

LOG (train-ctc-parallel:DisableCaching():cuda-device.cc:731) Disabling caching of GPU memory.
LOG (train-ctc-parallel:SetUpdateAlgorithm():net.cc:483) Selecting SGD with momentum as optimization algorithm.
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 0
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 1
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 2
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 3
LOG (train-ctc-parallel:SetTrainMode():net.cc:408) Setting TrainMode for layer 4
add-deltas ark:- ark:-
copy-feats scp:exp/train_char_l5_c320/train_local.scp ark:-
LOG (train-ctc-parallel:main():train-ctc-parallel.cc:133) TRAINING STARTED
ERROR (train-ctc-parallel:AddVecToRows():cuda-matrix.cc:541) cudaError_t 209 : "no kernel image is available for execution on the device" returned from 'cudaGetLastError()'
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe gunzip -c exp/train_char_l5_c320/labels.tr.gz| had nonzero return status 13
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/train_char_l5_c320/train_local.scp ark:- | add-deltas ark:- ark:- | had nonzero return status 36096
ERROR (train-ctc-parallel:AddVecToRows():cuda-matrix.cc:541) cudaError_t 209 : "no kernel image is available for execution on the device" returned from 'cudaGetLastError()'
[stack trace: ]
eesen::KaldiGetStackTraceabi:cxx11
eesen::KaldiErrorMessage::~KaldiErrorMessage()
eesen::CuMatrixBase::AddVecToRows(float, eesen::CuVectorBase const&, float)
eesen::BiLstmParallel::PropagateFncVanillaPassForward(eesen::CuMatrixBase const&, int, int)
eesen::BiLstmParallel::PropagateFnc(eesen::CuMatrixBase const&, eesen::CuMatrixBase)
eesen::Layer::Propagate(eesen::CuMatrixBase const&, eesen::CuMatrix)
eesen::Net::Propagate(eesen::CuMatrixBase const&, eesen::CuMatrix*)
train-ctc-parallel(main+0x148d) [0x5583f00fe692]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f385afb9b97]
train-ctc-parallel(_start+0x2a) [0x5583f00fb44a]

Is there any workaround for this? I don't know much about CUDA. I tried adding "-gencode arch=compute_{70,72,75},code={70,72,75}" to gpucompute/Makefile, but it still crashes.
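
For reference, cudaError_t 209 is cudaErrorNoKernelImageForDevice, which usually means the binary contains no kernel compiled for the GPU's compute capability. nvcc normally takes one `-gencode arch=compute_XX,code=sm_XX` pair per target architecture, and a brace list like the one quoted above may not expand that way inside a Makefile. The sketch below only builds the flag strings. The helper name `gencode_flags` is hypothetical, the capability values are examples, and whether gpucompute/Makefile passes such flags through to nvcc is an assumption not confirmed in this thread.

```python
def gencode_flags(capabilities, ptx_fallback=True):
    """Build nvcc -gencode flags for compute capabilities such as ["7.0", "7.5"].

    Hypothetical helper for illustration; it is not part of Eesen or CUDA.
    """
    # "7.5" -> "75"; drop duplicates and sort numerically.
    archs = sorted({cap.replace(".", "") for cap in capabilities}, key=int)

    # One pair per architecture: native machine code (SASS) for that GPU.
    flags = [f"-gencode arch=compute_{a},code=sm_{a}" for a in archs]

    # Optionally embed PTX for the newest architecture so that newer GPUs
    # can JIT-compile the kernels at load time.
    if ptx_fallback and archs:
        newest = archs[-1]
        flags.append(f"-gencode arch=compute_{newest},code=compute_{newest}")
    return flags


if __name__ == "__main__":
    # Example values only; use the compute capability of the actual GPU.
    print(" ".join(gencode_flags(["7.0", "7.5"])))
```

The GPU's compute capability has to be looked up separately (for example in NVIDIA's published list of GPUs), and all object files need to be rebuilt after the flags change.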

liyongze · 2021-07-15T08:49:14Z

Could you tell me how you fixed this problem? I ran into the same issue.

t13m · 2021-08-21T07:59:11Z

Hi, I didn't manage to get it working. I ran my experiments on the CPU instead.

liyongze · 2021-08-23T11:39:53Z

Thanks for your reply!

t13m closed this as completed Jan 30, 2020
