Poor results #4

Closed
philokey opened this issue Sep 7, 2017 · 18 comments
philokey commented Sep 7, 2017

Hi,
I am using a TITAN X, Ubuntu 16.04, CUDA 8, and cuDNN 5, and I trained the model as described in the README. However, I cannot reproduce the results reported in the README. I have two problems.

  1. At the beginning of training, the loss sometimes becomes nan. It seems that something is wrong with the initialization, but I cannot find the bug in the code.
iter: 20 / 110000, total loss: nan
 >>> rpn_loss_cls: 0.691554
 >>> rpn_loss_box: 0.019584
 >>> loss_cls: nan
 >>> loss_box: nan
 >>> lr: 0.001000
speed: 0.382s / iter
  2. When the losses are not nan, I get poor results: Mean AP = 0.6542 or Mean AP = 0.5809. I train on VOC 2007+2012 trainval and test on VOC 2007 using the default config.

Can you help me?

@ruotianluo
Owner

Are you using the latest master? Which network are you using: vgg or res101?

@philokey
Author

philokey commented Sep 7, 2017

@ruotianluo Yes. I am using the latest master (commit 7aa7cda). My network is res101.

@ruotianluo
Owner

This is really weird.
Can you try downloading the pretrained model and testing it?

@philokey
Author

philokey commented Sep 7, 2017

OK, I will try it. By the way, I get nan at the beginning of training with a probability of about 0.5.

@ruotianluo
Owner

I have run this many times, and this has never happened to me.

@philokey
Author

philokey commented Sep 7, 2017

I tested your pretrained model and got Mean AP = 0.7834 with the default config.

@ruotianluo
Owner

This number sounds reasonable (the numbers haven't been updated), although it is a little different from what I got.

@ruotianluo
Owner

I verified that the master branch is the same as the code I'm running.

@philokey
Author

philokey commented Sep 7, 2017

It's strange. Can you tell me about your environment?

@ruotianluo
Owner

TITANX PASCAL, python 2.7, cuda8.0, cudnn6.0, ubuntu16.04

@ruotianluo
Owner

Can you check the intermediate results when the nan appears?

@philokey
Author

philokey commented Sep 7, 2017

For example? I have now upgraded to cuDNN 6.0 and still encounter nan.

@ruotianluo
Owner

ruotianluo commented Sep 7, 2017

    # RCNN, class loss
    cls_score = self._predictions["cls_score"]
    # NaN is not equal to itself, so this assert fails if cls_score contains a NaN
    assert (cls_score.data == cls_score.data).all()
    label = self._proposal_targets["labels"].view(-1)
    cross_entropy = F.cross_entropy(cls_score.view(-1, self._num_classes), label)

Add an assert like this; an AssertionError will be raised when the data contains nan.
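
The assert above relies on NaN being the only floating-point value that is not equal to itself. A minimal plain-Python sketch of the same idiom follows; the helper names `first_nan_index` and `all_finite` are hypothetical and not part of the repository. Note that the self-equality test does not catch infinities, so a stricter finiteness check is shown next to it.

```python
import math


def first_nan_index(values):
    """Return the index of the first NaN in `values`, or None if there is none."""
    for i, v in enumerate(values):
        if v != v:  # only NaN compares unequal to itself
            return i
    return None


def all_finite(values):
    """Stricter check: True only if no value is NaN or +/-inf."""
    return all(math.isfinite(v) for v in values)


scores = [0.69, float("nan"), 0.02]
print(first_nan_index(scores))            # index of the NaN entry
print(first_nan_index([1.0, 2.0]))        # no NaN present
print(all_finite([1.0, float("inf")]))    # inf passes x == x but is not finite
```

On tensors, the same comparison is applied elementwise, which is what `(cls_score.data == cls_score.data).all()` does.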

@philokey
Author

philokey commented Sep 7, 2017

It throws an AssertionError.

Loading initial model weights from data/imagenet_weights/res101.pth
Loaded.
iter: 20 / 110000, total loss: 63182.523438
 >>> rpn_loss_cls: 0.000000
 >>> rpn_loss_box: 3896.581299
 >>> loss_cls: 58923.082031
 >>> loss_box: 362.857910
 >>> lr: 0.001000
speed: 0.385s / iter
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 138, in <module>
    max_iters=args.max_iters)
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/model/train_val.py", line 348, in train_net
    sw.train_model(max_iters)
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/model/train_val.py", line 265, in train_model
    self.net.train_step(blobs, self.optimizer)
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 437, in train_step
    self.forward(blobs['data'], blobs['im_info'], blobs['gt_boxes'])
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 386, in forward
    self._add_losses() # compute losses
  File "/home/hezheqi/Project/pytorch-faster-rcnn/tools/../lib/nets/network.py", line 216, in _add_losses
    assert (cls_score.data == cls_score.data).all()
AssertionError
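
The log above already shows a total loss of 63182.523438 at iteration 20, before the AssertionError is raised, so the divergence could be flagged earlier than the first nan. A rough sketch of such a guard is below; the function name `check_losses` and the threshold `max_loss=1e4` are hypothetical choices for illustration, not part of the repository.

```python
import math


def check_losses(losses, max_loss=1e4):
    """Return a list of human-readable problems found in a dict of loss values.

    `max_loss` is an arbitrary illustrative threshold, not a recommended value.
    """
    problems = []
    for name, value in losses.items():
        if not math.isfinite(value):
            problems.append(f"{name} is not finite ({value})")
        elif abs(value) > max_loss:
            problems.append(f"{name} exceeds {max_loss:g} ({value:g})")
    return problems


# Loss values copied from the iteration-20 log line above.
iter20 = {
    "rpn_loss_cls": 0.000000,
    "rpn_loss_box": 3896.581299,
    "loss_cls": 58923.082031,
    "loss_box": 362.857910,
}
for problem in check_losses(iter20):
    print(problem)
```

Calling a check like this after each logged iteration would report `loss_cls` here, pointing at the classification head before any value turns into nan.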

ruotianluo changed the title from "Pool results" to "Poor results" on Sep 7, 2017
@ruotianluo
Owner

Which pretrained resnet model did you download?

ruotianluo reopened this on Sep 7, 2017
@philokey
Author

philokey commented Sep 7, 2017

The recommended one.
https://download.pytorch.org/models/resnet101-5d3b4d8f.pth

@ruotianluo
Owner

ruotianluo commented Sep 7, 2017

That's the reason. Please follow the instructions and use this pretrained model instead:

https://drive.google.com/open?id=0B7fNdx_jAqhtaXZ4aWppWV96czg
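
Since the cause in this thread was the wrong weights file, it can help to confirm which file actually sits at `data/imagenet_weights/res101.pth`. The sketch below assumes that the hex suffix in the torchvision file name (`5d3b4d8f` in `resnet101-5d3b4d8f.pth`) is the leading part of the file's SHA-256 digest, which is the usual convention for PyTorch model downloads but is worth verifying. The helper names are hypothetical.

```python
import hashlib
import re


def sha256_prefix(path, length=8, chunk_size=1 << 20):
    """Return the first `length` hex characters of the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large checkpoint files are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def prefix_from_name(filename):
    """Extract the hex suffix from names like 'resnet101-5d3b4d8f.pth'."""
    match = re.search(r"-([a-f0-9]+)\.pth$", filename)
    return match.group(1) if match else None


def matches_named_download(path, reference_name):
    """True if the file at `path` hashes to the prefix embedded in `reference_name`."""
    expected = prefix_from_name(reference_name)
    return expected is not None and sha256_prefix(path, len(expected)) == expected
```

Under that assumption, `matches_named_download("data/imagenet_weights/res101.pth", "resnet101-5d3b4d8f.pth")` returning `True` would suggest that the torchvision file was copied into place rather than the model linked above.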

@philokey
Author

philokey commented Sep 7, 2017

Thanks a lot.
