Poor results #4

philokey · 2017-09-07T03:46:34Z

Hi,
I am using a TITAN X, Ubuntu 16.04, CUDA 8, and cuDNN 5, and I trained the model as described in the README. However, I cannot get the results mentioned in the README. I have two problems.

At the beginning of training, the loss sometimes becomes nan. It seems that something is wrong in the initialization, but I cannot find the bug in the code.

iter: 20 / 110000, total loss: nan
 >>> rpn_loss_cls: 0.691554
 >>> rpn_loss_box: 0.019584
 >>> loss_cls: nan
 >>> loss_box: nan
 >>> lr: 0.001000
speed: 0.382s / iter

When the losses are not nan, I get poor results: Mean AP = 0.6542 or Mean AP = 0.5809. I trained on VOC 2007+2012 trainval and tested on VOC 2007 using the default config.

Can you help me?

ruotianluo · 2017-09-07T03:47:46Z

Are you using the latest master? Which network are you using: vgg or res101?

philokey · 2017-09-07T03:58:32Z

@ruotianluo Yes. I am using the latest master (commit 7aa7cda). My network is res101.

ruotianluo · 2017-09-07T03:59:55Z

This is really weird.
Can you try downloading the pretrained model and testing it?

philokey · 2017-09-07T04:02:31Z

OK, I will try it. By the way, I get nan at the beginning of training with a probability of about 0.5.

ruotianluo · 2017-09-07T04:04:01Z

I have run this many times, and this has never happened to me.

philokey · 2017-09-07T04:22:17Z

I tested your pretrained model and got Mean AP = 0.7834 with the default config.

ruotianluo · 2017-09-07T04:26:48Z

This number sounds reasonable (the numbers haven't been updated), although it is a little different from what I got.

ruotianluo · 2017-09-07T04:32:30Z

I verified that the master branch is the same as the code I'm running.

philokey · 2017-09-07T05:01:02Z

It's strange. Can you tell me about your environment?

ruotianluo · 2017-09-07T05:02:10Z

TITANX PASCAL, python 2.7, cuda8.0, cudnn6.0, ubuntu16.04

ruotianluo · 2017-09-07T05:09:38Z

Can you check the intermediate results when the nan emerges?

philokey · 2017-09-07T05:20:41Z

For example? I have now upgraded to cuDNN 6.0 and still encounter nan.

ruotianluo · 2017-09-07T05:22:42Z

    # RCNN, class loss
    cls_score = self._predictions["cls_score"]
    # nan is never equal to itself, so this fails as soon as cls_score contains a nan
    assert (cls_score.data == cls_score.data).all()
    label = self._proposal_targets["labels"].view(-1)
    cross_entropy = F.cross_entropy(cls_score.view(-1, self._num_classes), label)

Add an assert like this; the AssertionError will be triggered when the data contains nan.
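
The comparison works because nan is the only floating-point value that is not equal to itself. Here is a minimal plain-Python sketch of the same idea, assuming the individual loss terms are available as floats. `find_non_finite` is a hypothetical helper for illustration, not part of this repository. In recent PyTorch versions, `torch.isnan(tensor).any()` should express the same check directly on a tensor.

```python
import math


def find_non_finite(losses):
    """Return the names of loss terms whose value is nan or +/-inf.

    `losses` is a hypothetical dict mapping a loss name to a Python float.
    """
    return [name for name, value in losses.items() if not math.isfinite(value)]


# nan compares unequal to itself, which is what `x == x` relies on.
nan = float("nan")
assert nan != nan

# Values copied from the first training log in this thread.
losses = {
    "rpn_loss_cls": 0.691554,
    "rpn_loss_box": 0.019584,
    "loss_cls": nan,
    "loss_box": nan,
}

print(find_non_finite(losses))  # ['loss_cls', 'loss_box']
```

Checking each term separately, rather than only the total, shows which branch of the network produced the first nan.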

philokey · 2017-09-07T05:26:26Z

It throws an AssertionError.

Loading initial model weights from data/imagenet_weights/res101.pth
Loaded.
iter: 20 / 110000, total loss: 63182.523438
 >>> rpn_loss_cls: 0.000000
 >>> rpn_loss_box: 3896.581299
 >>> loss_cls: 58923.082031
 >>> loss_box: 362.857910
 >>> lr: 0.001000
speed: 0.385s / iter
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 138, in <module>
    max_iters=args.max_iters)
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/model/train_val.py", line 348, in train_net
    sw.train_model(max_iters)
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/model/train_val.py", line 265, in train_model
    self.net.train_step(blobs, self.optimizer)
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 437, in train_step
    self.forward(blobs['data'], blobs['im_info'], blobs['gt_boxes'])
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 386, in forward
    self._add_losses() # compute losses
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 216, in _add_losses
    assert (cls_score.data == cls_score.data).all()
AssertionError

ruotianluo · 2017-09-07T06:54:25Z

Which pretrained resnet model did you download?

philokey · 2017-09-07T07:07:10Z

The recommended one.
https://download.pytorch.org/models/resnet101-5d3b4d8f.pth

ruotianluo · 2017-09-07T07:07:59Z

That's the reason. Please follow the instructions.

https://drive.google.com/open?id=0B7fNdx_jAqhtaXZ4aWppWV96czg
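
One way to check which checkpoint actually ended up on disk is to hash the file. This sketch assumes the suffix in `resnet101-5d3b4d8f.pth` is the first eight hex digits of the file's SHA-256 digest, which is the naming convention PyTorch documents for its hosted weights. Both helper names are hypothetical.

```python
import hashlib


def sha256_prefix(path, digits=8, chunk_size=1 << 20):
    """Return the first `digits` hex characters of the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a large checkpoint is not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:digits]


def looks_like_torchvision_resnet101(path):
    """True if the file matches the hash suffix of the URL quoted in this thread.

    Assumption: "5d3b4d8f" (from resnet101-5d3b4d8f.pth) is a SHA-256 prefix.
    """
    return sha256_prefix(path) == "5d3b4d8f"
```

For example, calling `looks_like_torchvision_resnet101("data/imagenet_weights/res101.pth")` on the path shown in the log above would indicate whether the weights came from download.pytorch.org rather than from the link in the instructions.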

philokey · 2017-09-07T07:16:33Z

Thanks a lot.

ruotianluo changed the title ~~Pool results~~ Poor results Sep 7, 2017

ruotianluo closed this as completed Sep 7, 2017

ruotianluo reopened this Sep 7, 2017

ruotianluo closed this as completed Sep 7, 2017
