multi-gpu training error #18

Closed
jwnsu opened this issue Jun 3, 2018 · 3 comments
jwnsu commented Jun 3, 2018

I'm trying examples/voc/res101-9s-600-rfcn-cascade. Single-GPU training is OK with some GPU ids (e.g. GPU id 1 works, but id 2 does not); however, when training with 2 GPUs, I quickly get the following error:

F0602 18:15:13.552311 13690 decode_bbox_layer.cpp:110] Check failed: keep_num > 0 (0 vs. 0)
*** Check failure stack trace: ***
F0602 18:15:13.553015 13740 decode_bbox_layer.cpp:110] Check failed: keep_num > 0 (0 vs. 0)
*** Check failure stack trace: ***
    @     0x7f1b5745b5cd  google::LogMessage::Fail()
    @     0x7f1b5745b5cd  google::LogMessage::Fail()
    @     0x7f1b5745d433  google::LogMessage::SendToLog()
    @     0x7f1b5745d433  google::LogMessage::SendToLog()
    @     0x7f1b5745b15b  google::LogMessage::Flush()
    @     0x7f1b5745b15b  google::LogMessage::Flush()
    @     0x7f1b5745de1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f1b5745de1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f1b57b9de37  caffe::DecodeBBoxLayer<>::Forward_cpu()
    @     0x7f1b57b9de37  caffe::DecodeBBoxLayer<>::Forward_cpu()
    @     0x7f1b57d245e7  caffe::Net<>::ForwardFromTo()
    @     0x7f1b57d245e7  caffe::Net<>::ForwardFromTo()
    @     0x7f1b57d24977  caffe::Net<>::Forward()
    @     0x7f1b57d1d878  caffe::Solver<>::Step()
    @     0x7f1b57d24977  caffe::Net<>::Forward()
    @     0x7f1b57d09d5e  caffe::Worker<>::InternalThreadEntry()
    @     0x7f1b57d1d878  caffe::Solver<>::Step()
    @     0x7f1b57b2c535  caffe::InternalThread::entry()
    @     0x7f1b57d1e39a  caffe::Solver<>::Solve()
    @     0x7f1b57b2d3fe  boost::detail::thread_data<>::run()
    @     0x7f1b493865d5  (unknown)
    @     0x7f1b57d0891c  caffe::NCCL<>::Run()
    @           0x411522  train()
    @           0x40c2eb  main
    @     0x7f1b560a36ba  start_thread
    @     0x7f1b55cf2830  __libc_start_main
    @           0x40d089  _start
    @     0x7f1b55dd941d  clone
    @              (nil)  (unknown)

COCO models seem to be fine.
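For context on the fatal line: "Check failed: keep_num > 0 (0 vs. 0)" is the message glog prints when a CHECK_GT(keep_num, 0) fails, i.e. when a worker's DecodeBBoxLayer keeps zero boxes for its batch, at which point the whole process aborts. Below is a minimal, self-contained sketch (not the repository's actual code; only keep_num and the decode_bbox_layer.cpp:110 location come from the log above) showing the abort and a softer guard one could try locally to skip a degenerate batch instead of crashing:

    // build: g++ sketch.cpp -lglog
    #include <glog/logging.h>

    int main(int argc, char** argv) {
      google::InitGoogleLogging(argv[0]);

      // Hypothetical value: pretend this GPU's batch decoded zero valid boxes.
      int keep_num = 0;

      // Equivalent of the fatal check reported at decode_bbox_layer.cpp:110.
      // Uncommenting it reproduces "Check failed: keep_num > 0 (0 vs. 0)".
      // CHECK_GT(keep_num, 0);

      // A softer guard (a local workaround assumption, not an official fix):
      // warn and skip the batch instead of aborting the whole training run.
      if (keep_num <= 0) {
        LOG(WARNING) << "keep_num == 0, skipping this batch instead of aborting";
        return 0;
      }
      return 0;
    }

Whether simply skipping such a batch is safe depends on how the surrounding training loop consumes the layer's output, so this is only an illustration of where the abort comes from.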


jwnsu commented Jun 3, 2018

Duplicate of #3.

jwnsu closed this as completed Jun 3, 2018
pyupcgithub commented:

Have you solved this problem? Did you lower the learning rate?

pyupcgithub commented:

@jwnsu
