multi-gpu training error #18

Closed
jwnsu opened this issue Jun 3, 2018 · 3 comments
jwnsu commented Jun 3, 2018

I'm trying examples/voc/res101-9s-600-rfcn-cascade. Single-GPU training is OK with some GPU ids (e.g. GPU id 1 works, but id 2 does not); however, when training with 2 GPUs, I quickly get the following error:

F0602 18:15:13.552311 13690 decode_bbox_layer.cpp:110] Check failed: keep_num > 0 (0 vs. 0)
*** Check failure stack trace: ***
F0602 18:15:13.553015 13740 decode_bbox_layer.cpp:110] Check failed: keep_num > 0 (0 vs. 0)
*** Check failure stack trace: ***
    @     0x7f1b5745b5cd  google::LogMessage::Fail()
    @     0x7f1b5745b5cd  google::LogMessage::Fail()
    @     0x7f1b5745d433  google::LogMessage::SendToLog()
    @     0x7f1b5745d433  google::LogMessage::SendToLog()
    @     0x7f1b5745b15b  google::LogMessage::Flush()
    @     0x7f1b5745b15b  google::LogMessage::Flush()
    @     0x7f1b5745de1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f1b5745de1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f1b57b9de37  caffe::DecodeBBoxLayer<>::Forward_cpu()
    @     0x7f1b57b9de37  caffe::DecodeBBoxLayer<>::Forward_cpu()
    @     0x7f1b57d245e7  caffe::Net<>::ForwardFromTo()
    @     0x7f1b57d245e7  caffe::Net<>::ForwardFromTo()
    @     0x7f1b57d24977  caffe::Net<>::Forward()
    @     0x7f1b57d1d878  caffe::Solver<>::Step()
    @     0x7f1b57d24977  caffe::Net<>::Forward()
    @     0x7f1b57d09d5e  caffe::Worker<>::InternalThreadEntry()
    @     0x7f1b57d1d878  caffe::Solver<>::Step()
    @     0x7f1b57b2c535  caffe::InternalThread::entry()
    @     0x7f1b57d1e39a  caffe::Solver<>::Solve()
    @     0x7f1b57b2d3fe  boost::detail::thread_data<>::run()
    @     0x7f1b493865d5  (unknown)
    @     0x7f1b57d0891c  caffe::NCCL<>::Run()
    @           0x411522  train()
    @           0x40c2eb  main
    @     0x7f1b560a36ba  start_thread
    @     0x7f1b55cf2830  __libc_start_main
    @           0x40d089  _start
    @     0x7f1b55dd941d  clone
    @              (nil)  (unknown)

COCO models seem to be fine.
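For context on the fatal line: "Check failed: keep_num > 0 (0 vs. 0)" is the message glog prints when a CHECK_GT(keep_num, 0) fails, i.e. when a worker's DecodeBBoxLayer keeps zero boxes for its batch, at which point the whole process aborts. Below is a minimal, self-contained sketch (not the repository's actual code; only keep_num and the decode_bbox_layer.cpp:110 location come from the log above) showing the abort and a softer guard one could try locally to skip a degenerate batch instead of crashing:

    // build: g++ sketch.cpp -lglog
    #include <glog/logging.h>

    int main(int argc, char** argv) {
      google::InitGoogleLogging(argv[0]);

      // Hypothetical value: pretend this GPU's batch decoded zero valid boxes.
      int keep_num = 0;

      // Equivalent of the fatal check reported at decode_bbox_layer.cpp:110.
      // Uncommenting it reproduces "Check failed: keep_num > 0 (0 vs. 0)".
      // CHECK_GT(keep_num, 0);

      // A softer guard (a local workaround assumption, not an official fix):
      // warn and skip the batch instead of aborting the whole training run.
      if (keep_num <= 0) {
        LOG(WARNING) << "keep_num == 0, skipping this batch instead of aborting";
        return 0;
      }
      return 0;
    }

Whether simply skipping such a batch is safe depends on how the surrounding training loop consumes the layer's output, so this is only an illustration of where the abort comes from.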


jwnsu commented Jun 3, 2018

Duplicate of #3.

jwnsu closed this as completed Jun 3, 2018
pyupcgithub commented:

Have you solved this problem? Did you lower the learning rate?

pyupcgithub commented:

@jwnsu
