
bbox_transform.py:48: RuntimeWarning: overflow encountered in exp ... #65

Closed
ZhengRui opened this issue Jan 14, 2016 · 35 comments

@ZhengRui

In proposal_layer.py's forward() function, when I print out bbox_deltas.min() and bbox_deltas.max(), at some point they suddenly become large, causing an overflow and a core dump. Here is the log:

-0.843478 0.695785
-1.53431 1.09048
-2.39332 1.81395
-2.74009 1.98957
-0.368922 0.236118
-0.707322 0.23115
I0115 01:02:23.016412 31390 solver.cpp:242] Iteration 40, loss = 1.8799
I0115 01:02:23.016444 31390 solver.cpp:258]     Train net output #0: loss_bbox = 0.040359 (* 1 = 0.040359 loss)
I0115 01:02:23.016451 31390 solver.cpp:258]     Train net output #1: loss_cls = 0.240918 (* 1 = 0.240918 loss)
I0115 01:02:23.016456 31390 solver.cpp:258]     Train net output #2: rpn_cls_loss = 0.585994 (* 1 = 0.585994 loss)
I0115 01:02:23.016461 31390 solver.cpp:258]     Train net output #3: rpn_loss_bbox = 0.883768 (* 1 = 0.883768 loss)
I0115 01:02:23.016472 31390 solver.cpp:571] Iteration 40, lr = 0.001
-5.10243 4.21369
-5.00325 3.99131
-6.54417 1.98163
-8.08618 2.3706
-8.88115 1.3943
-3.91337 0.64184
-1.92415 2.32623
-1.19933 1.09659
-4.12942 3.17897
-4.96536 3.46139
-2.6374 1.79074
-2.49792 1.71725
-17.6217 14.487
-22.151 18.5744
-21.8844 17.797
-16.2733 13.0881
-1.50187 1.95901
-1.43967 1.50989
-14.1043 28.3903
-9.91849 19.1581
-31.6936 8.46099
-27.262 7.1318
-28.2349 125.76
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
  pred_w = np.exp(dw) * widths[:, np.newaxis]
-26.7032 118.203
-741.881 505.883
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
  pred_h = np.exp(dh) * heights[:, np.newaxis]
-692.391 472.156
-9.47346e+25 1.02213e+26
-6.35599e+25 6.85811e+25
nan nan
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:176: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/faster_rcnn_end2end_handdet.sh: line 39: 31390 Floating point exception(core dumped

Can anyone help figure out what the problem could be?
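
(For anyone who wants to reproduce this logging: a minimal sketch, assuming the stock lib/rpn/proposal_layer.py where forward() reads the regression deltas from bottom[1].)

    # inside ProposalLayer.forward(), right after the input blobs are read
    bbox_deltas = bottom[1].data
    print(bbox_deltas.min(), bbox_deltas.max())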

@pyoguy

pyoguy commented Jan 15, 2016

I have the same issue here.
In my case, it didn't happen when I used CUDA toolkit 6.5, but after changing to 7.5 it happens for every training case. I re-installed the OS (Ubuntu 14.04) and all the other libraries again, but it didn't go away...

@ZhengRui

After many tries and tests, I found it is related to RNG_SEED. For my dataset, using the VGG16 pre-trained model with the default RNG_SEED = 3 always leads to instability of dw and dh (tw and th in the paper).

Symptom: if you print out dw.max() and dh.max(), at some point between iterations 20 and 40 they become larger than 10; in the following iterations they can oscillate between a huge value (which can terminate the program as shown above) and almost 0.

Using the default random seed, I still had a very low chance of getting through the whole training process successfully, even after many tries. If dw.max() and dh.max() remain smaller than 2 or 3 in the first 200 iterations, then the training has very likely passed the dangerous zone and is already on the right track. After changing the random seed to another value like 17, almost every try runs properly. And the default value 3 works perfectly for the ZF net.

So my conclusion is: try changing the RNG_SEED value if you hit the same problem, and watch the maximum values of dw and dh; if training goes for more than 400 iterations without an overflow, then it is going to be fine.
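
(A minimal sketch of the dw/dh monitoring described above, assuming the stock lib/fast_rcnn/bbox_transform.py where dw and dh are sliced out of the deltas array inside bbox_transform_inv.)

    dw = deltas[:, 2::4]
    dh = deltas[:, 3::4]
    # values creeping above ~10 are the early sign of divergence
    print(dw.max(), dh.max())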

Btw, I tried changing the batch size and RPN batch size in the yml file; after caffe loaded, it did show the batch-size values I set, but the GPU memory usage seems the same. Is that normal?

@MenglaiWang

@ZhengRui I have met a similar problem too. Following your instructions, I printed out dw.max() and dh.max(); they come to around 3 and 4 respectively, but then they suddenly become NaN without any gradual change. It is confusing!
Also, after changing the random seed to another value like 17, it ran properly once, but the next time it failed. Larger random seeds seem to run into the same situation. So I have difficulty understanding the function of RNG_SEED. Does it impact the speed of convergence or something else? Looking forward to your reply. Thank you!

@ZhengRui

RNG_SEED is just the seed for the random number generator; you can change it to whatever number works. So change it to some number with which you have a lower chance of hitting this issue. I don't think it impacts the speed of convergence or anything else; it's just a random seed.

@LiberiFatali

I got this issue, too. Since my training images are cropped tightly around the target objects, I suspect that bounding boxes close to the image size cause this problem. I haven't found a solution yet.

@ZhengRui

You can try to pad zeros around your training images before feeding them to the network, so that the ratio of your target objects to the image is smaller. The network always rescales your input images to around 600x1000 or 1000x600 in the first step, so simply downsizing images won't work; you have to do padding.

@RyanLiuNtust

I also encountered this problem recently; here is my suggestion for tracking down where the bug comes from. Most importantly, I found this bug appears when erroneous RoI bounding-box values (e.g. 65535) are loaded into the db, caused by an incorrect annotation-loading implementation.
In my annotation-loading method I copied the Pascal loader to read the XML files, and that method subtracts 1 from all bbox values (x1, y1, x2, y2). So wherever a value is 0 (unsigned int), subtracting 1 wraps it around to 65535.
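
(A minimal sketch of the wrap-around described above, plus a guarded conversion; the uint16 dtype is an assumption about how the loader stores coordinates.)

    import numpy as np

    boxes = np.array([0, 10, 50, 80], dtype=np.uint16)
    print(boxes - 1)   # the 0 wraps around to 65535

    # safer: cast to float first and clamp at 0 when converting 1-based to 0-based
    boxes0 = np.maximum(boxes.astype(np.float32) - 1, 0)
    print(boxes0)      # [ 0.  9. 49. 79.]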

@neuleaf

neuleaf commented May 22, 2016

Has anyone solved this? I tried @RyanLiuNtust's and @ZhengRui's methods, but they don't work for me.
I got this error:

/faster-rcnn-py/tools/../lib/fast_rcnn/bbox_transform.py:50: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
faster-rcnn-py/tools/../lib/rpn/proposal_layer.py:176: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]

I printed out ws and hs, and they turn out to be NaN.
Could anyone help?

@azamattokhtaev

azamattokhtaev commented May 31, 2016

A possible solution is to decrease the base learning rate in solver.prototxt, as recommended here: http://caffe.berkeleyvision.org/tutorial/solver.html
Just change base_lr: 0.001 to base_lr: 0.0001.

Note also that the above settings are merely guidelines, and they’re definitely not guaranteed to be optimal (or even work at all!) in every situation. If learning diverges (e.g., you start to see very large or NaN or inf loss values or outputs), try dropping the base_lr (e.g., base_lr: 0.001) and re-training, repeating this until you find a base_lr value that works.

I tried changing the base_lr value and the NaN values have now disappeared.

@cyberdecker

A question... @ZhengRui what do you mean by "You can try to pad zeros around your training images"? Can you give a quick example? Thanks!

@ZhengRui

ZhengRui commented Jul 20, 2016

*********
***obj***
*********

to

*********************
*********************
*********obj*********
*********************
*********************

@vikiboy

vikiboy commented Jul 21, 2016

Is there a caffe argument to zero-pad all the images?

@ZhengRui

I am not sure whether caffe has it now, but I did it myself as a data-augmentation step before everything else:

import numpy as np

def paddingzeros(im, desMin, desMax):
    """Zero-pad `im` (H x W x C) so its shorter side is at least desMin and
    its longer side is at least desMax, keeping the original image centered."""
    if im.shape[0] <= im.shape[1]:
        if im.shape[0] < desMin:
            im = np.lib.pad(im, (((desMin - im.shape[0]) // 2, (desMin + 1 - im.shape[0]) // 2), (0, 0), (0, 0)), 'constant')
        if im.shape[1] < desMax:
            im = np.lib.pad(im, ((0, 0), ((desMax - im.shape[1]) // 2, (desMax + 1 - im.shape[1]) // 2), (0, 0)), 'constant')

    if im.shape[0] > im.shape[1]:
        if im.shape[0] < desMax:
            im = np.lib.pad(im, (((desMax - im.shape[0]) // 2, (desMax + 1 - im.shape[0]) // 2), (0, 0), (0, 0)), 'constant')
        if im.shape[1] < desMin:
            im = np.lib.pad(im, ((0, 0), ((desMin - im.shape[1]) // 2, (desMin + 1 - im.shape[1]) // 2), (0, 0)), 'constant')

    print('after padding: ', im.shape)
    return im


im = paddingzeros(im, 600, 1000)

@vikiboy

vikiboy commented Jul 22, 2016

Doing this would also require changing the values of the bounding boxes in the annotation XML files as well, right?

@ZhengRui

ZhengRui commented Jul 22, 2016

Yes, you also have to shift the bounding-box annotations in the padded images by adding the offsets in the width and height directions.
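
(A sketch of that offset step, matching the symmetric padding in the paddingzeros function above; boxes is assumed to be an N x 4 array of [x1, y1, x2, y2].)

    def shift_boxes(boxes, old_shape, new_shape):
        # padding added on the top and left sides by the symmetric zero-padding
        off_y = (new_shape[0] - old_shape[0]) // 2
        off_x = (new_shape[1] - old_shape[1]) // 2
        boxes = boxes.copy()
        boxes[:, [0, 2]] += off_x   # shift x1 and x2
        boxes[:, [1, 3]] += off_y   # shift y1 and y2
        return boxes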

@hgaiser

hgaiser commented Aug 2, 2016

Can someone explain the intuition behind this issue? Is the backpropagation algorithm oscillating around the optimum with increasingly bigger steps, eventually causing overflows in Python?

If so, are these understandings correct:

  • A different starting point (i.e. changing RNG_SEED) might make it optimize correctly.
  • A lower learning rate should prevent it from starting to oscillate out of control, as suggested by @azamattokhtaev.

I'm trying to get a better understanding of what is going wrong here :).

@assafmus

For me it was @RyanLiuNtust's method that worked: I fixed the ground-truth XML files so the coordinates are 1-based.

@fbi0817

fbi0817 commented Oct 17, 2016

@neuleaf Have you solved the problem? I am hitting the same problem as you; my output is also NaN. I have checked the outputs of bottom[0], bottom[1], and bottom[2] in proposal_layer.py, and only bottom[2] has values. I have also tried @azamattokhtaev's solution, but it doesn't help.

@zwyzwy

zwyzwy commented Oct 25, 2016

Where do I change the RNG_SEED value?
@ZhengRui

@hgaiser

hgaiser commented Oct 25, 2016

For me the issue was resolved when lowering the learning rate in solver.prototxt.

The RNG_SEED can be changed in the config file.
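
For example (a sketch assuming the stock py-faster-rcnn config module, where the default __C.RNG_SEED = 3 lives in lib/fast_rcnn/config.py and is consumed at the start of training):

    from fast_rcnn.config import cfg
    cfg.RNG_SEED = 17   # or set "RNG_SEED: 17" in your experiments/cfgs/*.yml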

@fernandorovai

fernandorovai commented Dec 6, 2016

@fbi0817 @ZhengRui Did you solve the problem? I am facing the same thing. I tried reducing the learning rate but made no progress. Suddenly my loss turns to NaN and I get the warning "overflow encountered in exp" (around iteration 6000). I am using Pascal VOC 2007. Could you help me, please?

@DeepDriving

@ZhengRui @fernandorovai Have you solved the problem? I am also running into the same issue.

@abhiML

abhiML commented Jun 26, 2017

Hey, I changed the lr to 0.00001, removed the minus 1 while reading the annotations, and changed the RNG seed to 17. I still keep getting the error.

@jiangwqcooler

I also encountered this problem; changing to RNG_SEED = 4 solved it for me.

@skyuuka

skyuuka commented Nov 24, 2017

I have this problem when running on my own dataset. Things to check are:

  • incorrect bboxes => better to add some checks to the load_annotation function (see the sketch below)
  • learning rate too large => reduce the learning rate
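
A sketch of such a check (names are illustrative; adapt it to whatever your load_annotation returns):

    import numpy as np

    def check_boxes(boxes, im_width, im_height):
        """boxes: N x 4 array of 0-based [x1, y1, x2, y2] pixel coordinates."""
        boxes = np.asarray(boxes, dtype=np.float64)
        assert (boxes[:, :2] >= 0).all(), 'negative or underflowed coordinate'
        assert (boxes[:, 2] > boxes[:, 0]).all() and (boxes[:, 3] > boxes[:, 1]).all(), 'degenerate box'
        assert (boxes[:, 2] < im_width).all() and (boxes[:, 3] < im_height).all(), 'box outside image'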

@ml930310

ml930310 commented Dec 6, 2017

@skyuuka Regarding "better to add some checks to the load_annotation function": how is your function written?

@AIML

AIML commented Jan 22, 2018

Since bbox_transform_inv is just used to draw the bboxes on the image, it has nothing to do with training. Reducing the learning rate can avoid the occurrence of NaN.

@meetps

meetps commented Feb 19, 2018

The NaN in the loss is caused by dw and dh exploding to extremely high values in the initial stages of training when using a large learning rate (lr > 0.01).

The new Detectron code released has a fix for this.

Just update your config file to have a line:

__C.BBOX_XFORM_CLIP = np.log(1000. / 16.)

and add these lines just before pred_ctr_x is computed in your bbox_transform.py:

    # Prevent sending too large values into np.exp()
    dw = np.minimum(dw, cfg.BBOX_XFORM_CLIP)
    dh = np.minimum(dh, cfg.BBOX_XFORM_CLIP)

It works with any RNG_SEED and very high learning rates (lr = 0.01)
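
Put together, a condensed sketch of the clipped transform (it follows the stock bbox_transform_inv; cfg.BBOX_XFORM_CLIP comes from the config line above, and only the two np.minimum lines are new):

    def bbox_transform_inv(boxes, deltas):
        widths = boxes[:, 2] - boxes[:, 0] + 1.0
        heights = boxes[:, 3] - boxes[:, 1] + 1.0
        ctr_x = boxes[:, 0] + 0.5 * widths
        ctr_y = boxes[:, 1] + 0.5 * heights

        dx = deltas[:, 0::4]
        dy = deltas[:, 1::4]
        dw = deltas[:, 2::4]
        dh = deltas[:, 3::4]

        # Prevent sending too large values into np.exp()
        dw = np.minimum(dw, cfg.BBOX_XFORM_CLIP)
        dh = np.minimum(dh, cfg.BBOX_XFORM_CLIP)

        pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
        pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
        pred_w = np.exp(dw) * widths[:, np.newaxis]
        pred_h = np.exp(dh) * heights[:, np.newaxis]
        # ... assembling pred_boxes from the predicted centers/sizes is unchanged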

@zqdeepbluesky

zqdeepbluesky commented Mar 12, 2018

@ZhengRui @pyoguy @MenglaiWang @LiberiFatali @RyanLiuNtust
Hi guys, when I train FPN on my own dataset, I get the same error:

I0312 16:25:25.883342 2983 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
/home/zq/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Floating point exception (core dumped)

I tried changing the lr from 0.001 to 0.0001, but it didn't work. I also changed RNG_SEED, and that didn't work either.
I don't know how to solve it. Please help me, thanks so much!

@kingchenchina

@meetshah1995 In CPU-only mode, after applying your solution, the problem still exists, though there are no NaN values:
ws [1. 1. 1. ... 1. 1. 1.] hs [1. 1. 1. ... 1. 1. 1.] min_size 25.600000381469727 keep []
experiments/scripts/faster_rcnn_end2end.sh: line 58: 67407 Floating point exception(core dumped) ./tools/train_net.py --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt --weights data/imagenet_models/${NET}.v2.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/faster_rcnn_end2end.yml ${EXTRA_ARGS}

@ygren

ygren commented Mar 24, 2018

@MenglaiWang I used multi-GPU mode with your solution; the problem still exists.

@zchrissirhcz

zchrissirhcz commented Apr 8, 2018

@kingchenchina Seeing so many 1.s, the problem is very likely that the generated proposals are all just one pixel in size. proposal_layer.py then calls _filter_boxes(), which leaves no proposals at all. The empty set of proposals is then used as the rois blob, and its reshape gives the floating point exception.

You can use the following strategy to avoid empty proposals:

        keep = _filter_boxes(proposals, min_size * im_info[2])
        if len(keep) != 0:
            proposals = proposals[keep, :]
            scores = scores[keep]

But then the loss becomes NaN, so turning down the learning rate would be a better approach.
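
(An alternative sketch, not from this thread's patches: instead of skipping the filter when it would remove everything, fall back to the single highest-scoring proposal so the rois blob is never empty; variable names follow proposal_layer.py.)

    keep = _filter_boxes(proposals, min_size * im_info[2])
    if len(keep) == 0:
        keep = [int(scores.argmax())]   # keep at least the top-scoring proposal
    proposals = proposals[keep, :]
    scores = scores[keep]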

@niuniu111

I have met a similar problem too. Following your instructions, I printed out dw.max() and dh.max(); they come to around 7489.9507 and 11519.379 respectively. I can't understand why the numbers are so large. I hope someone can give us some advice.

@st20080675

st20080675 commented Oct 17, 2019

I solved my 'Floating point exception (core dumped)' by modifying the function is_valid inside filter_roidb in da-faster-rcnn-master/lib/fast_rcnn/train.py:

def filter_roidb(roidb):
    """Remove roidb entries that have no usable RoIs."""

    def is_valid(entry):
        # Valid images have:
        #   (1) At least one foreground RoI OR
        #   (2) At least one background RoI
        overlaps = entry['max_overlaps']
        # added to handle empty boxes, see https://github.com/rbgirshick/py-faster-rcnn/issues/159
        not_empty = np.zeros(len(entry['max_overlaps']), dtype=bool)
        cur_boxes = entry['boxes']
        for i in range(len(not_empty)):
            if (cur_boxes[i][2] - cur_boxes[i][0] > 1 and cur_boxes[i][3] - cur_boxes[i][1] > 1):
                not_empty[i] = True

        # find boxes with sufficient overlap
        fg_inds = np.where(overlaps >= cfg.TRAIN.FG_THRESH)[0]
        # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
        bg_inds = np.where((overlaps < cfg.TRAIN.BG_THRESH_HI) &
                           (overlaps >= cfg.TRAIN.BG_THRESH_LO) & not_empty)[0]

        # image is only valid if such boxes exist
        valid = len(fg_inds) > 0 or len(bg_inds) > 0

        return valid
