libprotobuf error causes crash in the middle of training #16

Feynman27 · 2017-09-21T23:42:46Z

I'm running into a strange error. During training (after several thousand iterations), my training script crashes with the following error:

libprotobuf FATAL google/protobuf/wire_format.cc:830] CHECK failed: (output->ByteCount()) == (expected_endpoint): : Protocol message serialized to a size different from what was originally expected.  Perhaps it was modified by another thread during serialization?
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: (output->ByteCount()) == (expected_endpoint): : Protocol message serialized to a size different from what was originally expected.  Perhaps it was modified by another thread during serialization?
Command terminated by signal 6

It doesn't look like I'm running out of memory on my GPU. Could this be a result of writing too much information to TensorBoard?

ruotianluo · 2017-09-21T23:58:57Z

I ran into this too. It doesn't always happen for me, but it does happen occasionally. Currently I just ignore it, but it would be good to investigate it at some point.

Feynman27 · 2017-09-22T00:02:36Z

It's happening to me roughly every 10k iterations after 130k iterations. Before this, it only happened every 70k iterations or so. Do you have any idea what may be causing it? I'm also getting a bad_alloc error, so something funny is happening with the memory.
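
Since bad_alloc usually points at a failed host (CPU) memory allocation rather than GPU memory, one way to check whether host memory keeps growing is to log the process's peak resident set size every so often. The following is only a rough sketch using the Python standard library (Unix only); `peak_rss_mb` and `log_memory` are hypothetical helper names, not part of the repository.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(iteration, every=1000, out=print):
    """Report peak host memory every `every` iterations; return True if reported."""
    if iteration % every != 0:
        return False
    out("iter %d: peak RSS %.1f MiB" % (iteration, peak_rss_mb()))
    return True
```

Calling `log_memory(iteration)` inside the training loop and watching whether the reported value climbs steadily over tens of thousands of iterations would hint at a host-side leak, though it would not by itself say where the leak is.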

ruotianluo · 2017-09-22T01:14:04Z

For now, just replace the line at https://github.com/ruotianluo/pytorch-faster-rcnn/blob/master/lib/model/train_val.py#L252
with
if False:
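
A slightly less invasive variant of the same workaround is to gate that branch behind a switch instead of hard-coding `if False:`. The sketch below assumes the linked line is the time-based condition that decides when to write TensorBoard summaries, which this thread does not spell out; `ENABLE_SUMMARIES`, `SUMMARY_INTERVAL`, and `should_write_summary` are hypothetical names for illustration, not the repository's API.

```python
# Hypothetical switch; set to True to re-enable periodic summaries.
ENABLE_SUMMARIES = False
# Illustrative interval in seconds between summary writes.
SUMMARY_INTERVAL = 180


def should_write_summary(now, last_summary_time,
                         enabled=ENABLE_SUMMARIES,
                         interval=SUMMARY_INTERVAL):
    """Return True only if summaries are enabled and the interval has elapsed."""
    return enabled and (now - last_summary_time > interval)
```

With `ENABLE_SUMMARIES = False` this behaves like `if False:`, and flipping the switch restores the periodic summary path without editing the training loop again.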

Feynman27 · 2017-09-22T01:15:25Z

Yeah, I just commented it out. I'll post if I discover a more permanent solution.

Feynman27 changed the title ~~libprotobuf error causes crash in the middle if training~~ libprotobuf error causes crash in the middle of training Sep 21, 2017
