
bbox_transform.py:48: RuntimeWarning: overflow encountered in exp ... #65

Closed
ZhengRui opened this issue Jan 14, 2016 · 35 comments

@ZhengRui

In proposal_layer.py's forward() function, when I print out bbox_deltas.min() and bbox_deltas.max(), at some point they suddenly become large, causing an overflow and a core dump. Here is the log:

-0.843478 0.695785
-1.53431 1.09048
-2.39332 1.81395
-2.74009 1.98957
-0.368922 0.236118
-0.707322 0.23115
I0115 01:02:23.016412 31390 solver.cpp:242] Iteration 40, loss = 1.8799
I0115 01:02:23.016444 31390 solver.cpp:258]     Train net output #0: loss_bbox = 0.040359 (* 1 = 0.040359 loss)
I0115 01:02:23.016451 31390 solver.cpp:258]     Train net output #1: loss_cls = 0.240918 (* 1 = 0.240918 loss)
I0115 01:02:23.016456 31390 solver.cpp:258]     Train net output #2: rpn_cls_loss = 0.585994 (* 1 = 0.585994 loss)
I0115 01:02:23.016461 31390 solver.cpp:258]     Train net output #3: rpn_loss_bbox = 0.883768 (* 1 = 0.883768 loss)
I0115 01:02:23.016472 31390 solver.cpp:571] Iteration 40, lr = 0.001
-5.10243 4.21369
-5.00325 3.99131
-6.54417 1.98163
-8.08618 2.3706
-8.88115 1.3943
-3.91337 0.64184
-1.92415 2.32623
-1.19933 1.09659
-4.12942 3.17897
-4.96536 3.46139
-2.6374 1.79074
-2.49792 1.71725
-17.6217 14.487
-22.151 18.5744
-21.8844 17.797
-16.2733 13.0881
-1.50187 1.95901
-1.43967 1.50989
-14.1043 28.3903
-9.91849 19.1581
-31.6936 8.46099
-27.262 7.1318
-28.2349 125.76
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
  pred_w = np.exp(dw) * widths[:, np.newaxis]
-26.7032 118.203
-741.881 505.883
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
  pred_h = np.exp(dh) * heights[:, np.newaxis]
-692.391 472.156
-9.47346e+25 1.02213e+26
-6.35599e+25 6.85811e+25
nan nan
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:176: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/faster_rcnn_end2end_handdet.sh: line 39: 31390 Floating point exception(core dumped

Can anyone help figure out what the problem could be?
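
(For anyone who wants to reproduce this logging: a minimal sketch, assuming the stock lib/rpn/proposal_layer.py where forward() reads the regression deltas from bottom[1].)

    # inside ProposalLayer.forward(), right after the input blobs are read
    bbox_deltas = bottom[1].data
    print(bbox_deltas.min(), bbox_deltas.max())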

@pyoguy

pyoguy commented Jan 15, 2016

I have the same issue here.
In my case, it didn't happen when I used CUDA toolkit 6.5, but after changing to 7.5 it happens for every training case. I re-installed the OS (Ubuntu 14.04) and all the other libraries again, but it didn't go away...

@ZhengRui

After many tries and tests, I found it is related to RNG_SEED. For my dataset, using the VGG16 pre-trained model with the default RNG_SEED = 3 always leads to instability of dw and dh (tw and th in the paper).

Symptom: if you print out dw.max() and dh.max(), at some point between iterations 20 and 40 they become larger than 10; in the following iterations they can oscillate between a huge value (which can terminate the program as shown above) and almost 0.

Using the default random seed, I still had a very low chance of getting through the whole training process successfully, even after many tries. If dw.max() and dh.max() remain smaller than 2 or 3 in the first 200 iterations, then the training has very likely passed the dangerous zone and is already on the right track. After changing the random seed to another value like 17, almost every try runs properly. And the default value 3 works perfectly for the ZF net.

So my conclusion is: try changing the RNG_SEED value if you hit the same problem, and watch the maximum values of dw and dh; if training goes for more than 400 iterations without an overflow, then it is going to be fine.
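
(A minimal sketch of the dw/dh monitoring described above, assuming the stock lib/fast_rcnn/bbox_transform.py where dw and dh are sliced out of the deltas array inside bbox_transform_inv.)

    dw = deltas[:, 2::4]
    dh = deltas[:, 3::4]
    # values creeping above ~10 are the early sign of divergence
    print(dw.max(), dh.max())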

Btw, I tried changing the batch size and RPN batch size in the yml file; after caffe loaded, it did show the batch-size values I set, but the GPU memory usage seems the same. Is that normal?

@MenglaiWang

@ZhengRui I have met a similar problem too. Following your instructions, I printed out dw.max() and dh.max(); they come to around 3 and 4 respectively, but then they suddenly become NaN without any gradual change. It is confusing!
Also, after changing the random seed to another value like 17, it ran properly once, but the next time it failed. Larger random seeds seem to run into the same situation. So I have difficulty understanding the function of RNG_SEED. Does it impact the speed of convergence or something else? Looking forward to your reply. Thank you!

@ZhengRui

RNG_SEED is just the seed for the random number generator; you can change it to whatever number works. So change it to some number with which you have a lower chance of hitting this issue. I don't think it impacts the speed of convergence or anything else; it's just a random seed.

@LiberiFatali

I got this issue, too. Since my training images are cropped tightly around the target objects, I suspect that bounding boxes close to the image size cause this problem. I haven't found a solution yet.

@ZhengRui

You can try to pad zeros around your training images before feeding them to the network, so that the ratio of your target objects to the image is smaller. The network always rescales your input images to around 600x1000 or 1000x600 in the first step, so simply downsizing images won't work; you have to do padding.

@RyanLiuNtust

I also encountered this problem recently; here is my suggestion for tracking down where the bug comes from. Most importantly, I found this bug appears when erroneous RoI bounding-box values (e.g. 65535) are loaded into the db, caused by an incorrect annotation-loading implementation.
In my annotation-loading method I copied the Pascal loader to read the XML files, and that method subtracts 1 from all bbox values (x1, y1, x2, y2). So wherever a value is 0 (unsigned int), subtracting 1 wraps it around to 65535.
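
(A minimal sketch of the wrap-around described above, plus a guarded conversion; the uint16 dtype is an assumption about how the loader stores coordinates.)

    import numpy as np

    boxes = np.array([0, 10, 50, 80], dtype=np.uint16)
    print(boxes - 1)   # the 0 wraps around to 65535

    # safer: cast to float first and clamp at 0 when converting 1-based to 0-based
    boxes0 = np.maximum(boxes.astype(np.float32) - 1, 0)
    print(boxes0)      # [ 0.  9. 49. 79.]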

@neuleaf

neuleaf commented May 22, 2016

Has anyone solved this? I tried @RyanLiuNtust's and @ZhengRui's methods, but they don't work for me.
I got this error:

/faster-rcnn-py/tools/../lib/fast_rcnn/bbox_transform.py:50: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
faster-rcnn-py/tools/../lib/rpn/proposal_layer.py:176: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]

I printed out ws and hs, and they turn out to be NaN.
Could anyone help?

@azamattokhtaev

azamattokhtaev commented May 31, 2016

A possible solution is to decrease the base learning rate in solver.prototxt, as recommended here: http://caffe.berkeleyvision.org/tutorial/solver.html
Just change base_lr: 0.001 to base_lr: 0.0001.

Note also that the above settings are merely guidelines, and they’re definitely not guaranteed to be optimal (or even work at all!) in every situation. If learning diverges (e.g., you start to see very large or NaN or inf loss values or outputs), try dropping the base_lr (e.g., base_lr: 0.001) and re-training, repeating this until you find a base_lr value that works.

I tried changing the base_lr value and the NaN values have now disappeared.

@cyberdecker

A question... @ZhengRui what do you mean by "You can try to pad zeros around your training images"? Can you give a quick example? Thanks!

@ZhengRui

ZhengRui commented Jul 20, 2016

*********
***obj***
*********

to

*********************
*********************
*********obj*********
*********************
*********************

@vikiboy

vikiboy commented Jul 21, 2016

Is there a caffe argument to zero-pad all the images?

@ZhengRui

I am not sure whether caffe has it now, but I did it myself as a data-augmentation step before everything else:

import numpy as np

def paddingzeros(im, desMin, desMax):
    """Zero-pad `im` (H x W x C) so its shorter side is at least desMin and
    its longer side is at least desMax, keeping the original image centered."""
    if im.shape[0] <= im.shape[1]:
        if im.shape[0] < desMin:
            im = np.lib.pad(im, (((desMin - im.shape[0]) // 2, (desMin + 1 - im.shape[0]) // 2), (0, 0), (0, 0)), 'constant')
        if im.shape[1] < desMax:
            im = np.lib.pad(im, ((0, 0), ((desMax - im.shape[1]) // 2, (desMax + 1 - im.shape[1]) // 2), (0, 0)), 'constant')

    if im.shape[0] > im.shape[1]:
        if im.shape[0] < desMax:
            im = np.lib.pad(im, (((desMax - im.shape[0]) // 2, (desMax + 1 - im.shape[0]) // 2), (0, 0), (0, 0)), 'constant')
        if im.shape[1] < desMin:
            im = np.lib.pad(im, ((0, 0), ((desMin - im.shape[1]) // 2, (desMin + 1 - im.shape[1]) // 2), (0, 0)), 'constant')

    print('after padding: ', im.shape)
    return im


im = paddingzeros(im, 600, 1000)

@vikiboy

vikiboy commented Jul 22, 2016

Doing this would also require changing the values of the bounding boxes in the annotation XML files as well, right?

@ZhengRui

ZhengRui commented Jul 22, 2016

Yes, you also have to shift the bounding-box annotations in the padded images by adding the offsets in the width and height directions.
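
(A sketch of that offset step, matching the symmetric padding in the paddingzeros function above; boxes is assumed to be an N x 4 array of [x1, y1, x2, y2].)

    def shift_boxes(boxes, old_shape, new_shape):
        # padding added on the top and left sides by the symmetric zero-padding
        off_y = (new_shape[0] - old_shape[0]) // 2
        off_x = (new_shape[1] - old_shape[1]) // 2
        boxes = boxes.copy()
        boxes[:, [0, 2]] += off_x   # shift x1 and x2
        boxes[:, [1, 3]] += off_y   # shift y1 and y2
        return boxes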

@hgaiser

hgaiser commented Aug 2, 2016

Can someone explain the intuition behind this issue? Is the backpropagation algorithm oscillating around the optimum with increasingly bigger steps, eventually causing overflows in Python?

If so, are these understandings correct:

  • A different starting point (i.e. changing RNG_SEED) might make it optimize correctly.
  • A lower learning rate should prevent it from starting to oscillate out of control, as suggested by @azamattokhtaev.

I'm trying to get a better understanding of what is going wrong here :).

@assafmus

For me it was @RyanLiuNtust's method that worked: I fixed the ground-truth XML files so the coordinates are 1-based.

@fbi0817

fbi0817 commented Oct 17, 2016

@neuleaf Have you solved the problem? I am hitting the same problem as you; my output is also NaN. I have checked the outputs of bottom[0], bottom[1], and bottom[2] in proposal_layer.py, and only bottom[2] has values. I have also tried @azamattokhtaev's solution, but it doesn't help.

@zwyzwy

zwyzwy commented Oct 25, 2016

Where do I change the RNG_SEED value?
@ZhengRui

@hgaiser

hgaiser commented Oct 25, 2016

For me the issue was resolved when lowering the learning rate in solver.prototxt.

The RNG_SEED can be changed in the config file.
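
For example (a sketch assuming the stock py-faster-rcnn config module, where the default __C.RNG_SEED = 3 lives in lib/fast_rcnn/config.py and is consumed at the start of training):

    from fast_rcnn.config import cfg
    cfg.RNG_SEED = 17   # or set "RNG_SEED: 17" in your experiments/cfgs/*.yml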

@fernandorovai

fernandorovai commented Dec 6, 2016

@fbi0817 @ZhengRui Did you solve the problem? I am facing the same thing. I tried reducing the learning rate but made no progress. Suddenly my loss turns to NaN and I get the warning "overflow encountered in exp" (around iteration 6000). I am using Pascal VOC 2007. Could you help me, please?

@DeepDriving

@ZhengRui @fernandorovai Have you solved the problem? I am also running into the same issue.

@abhiML

abhiML commented Jun 26, 2017

Hey, I changed the lr to 0.00001, removed the minus 1 while reading the annotations, and changed the RNG seed to 17. I still keep getting the error.

@jiangwqcooler

I also encountered this problem; changing to RNG_SEED = 4 solved it for me.

@skyuuka

skyuuka commented Nov 24, 2017

I have this problem when running on my own dataset. Things to check are:

  • incorrect bboxes => better to add some checks to the load_annotation function (see the sketch below)
  • learning rate too large => reduce the learning rate
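
A sketch of such a check (names are illustrative; adapt it to whatever your load_annotation returns):

    import numpy as np

    def check_boxes(boxes, im_width, im_height):
        """boxes: N x 4 array of 0-based [x1, y1, x2, y2] pixel coordinates."""
        boxes = np.asarray(boxes, dtype=np.float64)
        assert (boxes[:, :2] >= 0).all(), 'negative or underflowed coordinate'
        assert (boxes[:, 2] > boxes[:, 0]).all() and (boxes[:, 3] > boxes[:, 1]).all(), 'degenerate box'
        assert (boxes[:, 2] < im_width).all() and (boxes[:, 3] < im_height).all(), 'box outside image'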

@ml930310

ml930310 commented Dec 6, 2017

@skyuuka Regarding "better to add some checks to the load_annotation function": how is your function written?

@AIML

AIML commented Jan 22, 2018

Since bbox_transform_inv is just used to draw the bboxes on the image, it has nothing to do with training. Reducing the learning rate can avoid the occurrence of NaN.

@meetps

meetps commented Feb 19, 2018

The NaN in the loss is caused by dw and dh exploding to extremely high values in the initial stages of training when using a large learning rate (lr > 0.01).

The new Detectron code released has a fix for this.

Just update your config file to have a line:

__C.BBOX_XFORM_CLIP = np.log(1000. / 16.)

and add these lines just before pred_ctr_x is computed in your bbox_transform.py:

    # Prevent sending too large values into np.exp()
    dw = np.minimum(dw, cfg.BBOX_XFORM_CLIP)
    dh = np.minimum(dh, cfg.BBOX_XFORM_CLIP)

It works with any RNG_SEED and very high learning rates (lr = 0.01)
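
Put together, a condensed sketch of the clipped transform (it follows the stock bbox_transform_inv; cfg.BBOX_XFORM_CLIP comes from the config line above, and only the two np.minimum lines are new):

    def bbox_transform_inv(boxes, deltas):
        widths = boxes[:, 2] - boxes[:, 0] + 1.0
        heights = boxes[:, 3] - boxes[:, 1] + 1.0
        ctr_x = boxes[:, 0] + 0.5 * widths
        ctr_y = boxes[:, 1] + 0.5 * heights

        dx = deltas[:, 0::4]
        dy = deltas[:, 1::4]
        dw = deltas[:, 2::4]
        dh = deltas[:, 3::4]

        # Prevent sending too large values into np.exp()
        dw = np.minimum(dw, cfg.BBOX_XFORM_CLIP)
        dh = np.minimum(dh, cfg.BBOX_XFORM_CLIP)

        pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
        pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
        pred_w = np.exp(dw) * widths[:, np.newaxis]
        pred_h = np.exp(dh) * heights[:, np.newaxis]
        # ... assembling pred_boxes from the predicted centers/sizes is unchanged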

@zqdeepbluesky

zqdeepbluesky commented Mar 12, 2018

@ZhengRui @pyoguy @MenglaiWang @LiberiFatali @RyanLiuNtust
Hi guys, when I train FPN on my own dataset, I get the same error:

I0312 16:25:25.883342 2983 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
/home/zq/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Floating point exception (core dumped)

I tried changing the lr from 0.001 to 0.0001, but it didn't work. I also changed RNG_SEED, and that didn't work either.
I don't know how to solve it. Please help me, thanks so much!

@kingchenchina

@meetshah1995 In CPU-only mode, after applying your solution, the problem still exists, though there are no NaN values:
ws [1. 1. 1. ... 1. 1. 1.] hs [1. 1. 1. ... 1. 1. 1.] min_size 25.600000381469727 keep []
experiments/scripts/faster_rcnn_end2end.sh: line 58: 67407 Floating point exception(core dumped) ./tools/train_net.py --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt --weights data/imagenet_models/${NET}.v2.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/faster_rcnn_end2end.yml ${EXTRA_ARGS}

@ygren

ygren commented Mar 24, 2018

@MenglaiWang I used multi-GPU mode with your solution; the problem still exists.

@zchrissirhcz

zchrissirhcz commented Apr 8, 2018

@kingchenchina Seeing so many 1.s, the problem is very likely that the generated proposals are all just one pixel in size. proposal_layer.py then calls _filter_boxes(), which leaves no proposals at all. The empty set of proposals is then used as the rois blob, and its reshape gives the floating point exception.

You can use the following strategy to avoid empty proposals:

        keep = _filter_boxes(proposals, min_size * im_info[2])
        if len(keep) != 0:
            proposals = proposals[keep, :]
            scores = scores[keep]

But then the loss becomes NaN, so turning down the learning rate would be a better approach.
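
(An alternative sketch, not from this thread's patches: instead of skipping the filter when it would remove everything, fall back to the single highest-scoring proposal so the rois blob is never empty; variable names follow proposal_layer.py.)

    keep = _filter_boxes(proposals, min_size * im_info[2])
    if len(keep) == 0:
        keep = [int(scores.argmax())]   # keep at least the top-scoring proposal
    proposals = proposals[keep, :]
    scores = scores[keep]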

@niuniu111

I have met a similar problem too. Following your instructions, I printed out dw.max() and dh.max(); they come to around 7489.9507 and 11519.379 respectively. I can't understand why the numbers are so large. I hope someone can give us some advice.

@st20080675

st20080675 commented Oct 17, 2019

I solved my 'Floating point exception (core dumped)' by modifying the function is_valid inside filter_roidb in da-faster-rcnn-master/lib/fast_rcnn/train.py:

def filter_roidb(roidb):
    """Remove roidb entries that have no usable RoIs."""

    def is_valid(entry):
        # Valid images have:
        #   (1) At least one foreground RoI OR
        #   (2) At least one background RoI
        overlaps = entry['max_overlaps']
        # added to handle empty boxes, see https://github.com/rbgirshick/py-faster-rcnn/issues/159
        not_empty = np.zeros(len(entry['max_overlaps']), dtype=bool)
        cur_boxes = entry['boxes']
        for i in range(len(not_empty)):
            if (cur_boxes[i][2] - cur_boxes[i][0] > 1 and cur_boxes[i][3] - cur_boxes[i][1] > 1):
                not_empty[i] = True

        # find boxes with sufficient overlap
        fg_inds = np.where(overlaps >= cfg.TRAIN.FG_THRESH)[0]
        # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
        bg_inds = np.where((overlaps < cfg.TRAIN.BG_THRESH_HI) &
                           (overlaps >= cfg.TRAIN.BG_THRESH_LO) & not_empty)[0]

        # image is only valid if such boxes exist
        valid = len(fg_inds) > 0 or len(bg_inds) > 0

        return valid
